An introduction to Artificial Neural Network Optimizers (Part 1)



Introduction: mathematical representation of Artificial Neural Networks

As Machine Learning and Artificial Intelligence have become increasingly popular within the business environment, researchers and machine learning practitioners have found themselves in a constant search for better performing algorithms. In addition, the abundance of cheap computational power, in combination with the widespread availability of high-quality datasets, has fostered the development of computationally expensive machine learning algorithms such as Artificial Neural Networks (ANNs). As the name suggests, these networks are inspired by the biological neurons found within the human brain and consist of several individually connected neurons (Figure 1).

 


Figure 1: Abstract representation of interconnected neurons within a Deep Artificial Neural Network (DANN)

These neurons can be grouped into three separate parts, commonly referred to as layers: the input layer, the hidden layer(s), and the output layer. Networks with only one hidden layer are commonly referred to as ‘Shallow Artificial Neural Networks’, whereas networks with multiple hidden layers are referred to as ‘Deep Artificial Neural Networks’ (Figure 1).

Mathematically, a shallow, fully connected Neural Network can be represented by means of the following mathematical model (Eq. 1):

y = \sum_{j=1}^{n_h} v_j \, \sigma\!\left( w_j^\top x + \beta_j \right) \qquad \text{(Eq. 1)}

With 𝑥 representing the neural network's input data, 𝛽, 𝑤, and 𝑣 representing the network's optimization parameters, 𝜎 being the network's activation function, 𝑛_h being the number of neurons in the hidden layer, and 𝑦 being the network's output.
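
To make this model concrete, the sketch below implements the forward pass of Eq. 1 in NumPy. It is a minimal illustration assuming a sigmoid activation for 𝜎 and interpreting 𝑤 and 𝛽 as the hidden-layer weights and biases and 𝑣 as the output weights; the function and variable names are illustrative and not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    """One common choice for the activation function sigma in Eq. 1."""
    return 1.0 / (1.0 + np.exp(-z))

def shallow_forward(x, W, beta, v):
    """Forward pass of a shallow, fully connected network (Eq. 1).

    x    : input vector, shape (n_in,)
    W    : hidden-layer weights, shape (n_h, n_in)
    beta : hidden-layer biases, shape (n_h,)
    v    : output weights, shape (n_h,)
    """
    hidden = sigmoid(W @ x + beta)   # activations of the n_h hidden neurons
    return v @ hidden                # the scalar network output y

# Example: a network with 3 inputs and n_h = 4 hidden neurons, randomly initialized
rng = np.random.default_rng(0)
y = shallow_forward(rng.normal(size=3), rng.normal(size=(4, 3)),
                    rng.normal(size=4), rng.normal(size=4))
print(y)
```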

Mathematical optimization of Artificial Neural Networks

Artificial Neural Networks are not off-the-shelf algorithms. Each individual network, be it a simple one-layer model or a complex deep learning model, needs to be trained in order to complete the task that it was designed to carry out. During this procedure, which is often referred to as ‘training’ your model, the optimization parameters of the neural network are adjusted in such a way that the model becomes increasingly better at the task to be completed. Whereas most of the time an analytical software package takes care of the whole training procedure, what is going on behind the scenes is the search for the solution of a mathematical optimization problem.

For all optimization problems, the goal is to find the optimal parameter solution among all possible parameter combinations. Visually, this comes down to finding the global optimum, which can be either a minimum or a maximum, on the optimization surface.

For convex optimization problems, there is only one (global) minimum, making the search for the optimal solution rather fast and straightforward (Figure 2a). However, for more complex optimization problems the optimization surface becomes non-convex, thereby introducing the concept of local optima, which significantly complicates the search for the global, overall optimum (Figure 2b).


Figure 2: Graphical representation of the optimization surface of a convex (left, a) and a non-convex (right, b) optimization problem

In the case of Artificial Neural Networks, the optimization surface is represented by the loss surface. That is, during network training, the goal of the optimization process is to find the network parameters that are associated with the minimum loss value and therefore the minimum prediction error.

As one would expect, the optimization of the neural network parameters takes place on a non-convex optimization surface (Figure 2b). Hence, there is a strong need for optimization solvers, usually referred to as optimizers, which can find the optimal parameter solution in an efficient and timely manner. In the following sections, I will discuss the most important optimization algorithms that are available for neural network training.

Optimizers for Artificial Neural Network optimization

As discussed in the previous sections, finding the optimal solution of a non-convex optimization problem is a lot harder than it sounds. The nature of this complexity has given rise to the development of a wide range of different optimization algorithms, all of which can find a (sub-)optimal parameter solution in polynomial time. In this article, we will discuss the three basic families of algorithms: Gradient Descent (and its variants), Momentum, and Nesterov Accelerated Gradient.

1. Gradient Descent and variations

Regular Gradient Descent

One of the most used (if not the most used) optimization algorithms within the field of Machine Learning is Gradient Descent (GD), together with its many variations. In Gradient Descent, the network parameters are initialized in a random fashion and are then optimized by moving them along the gradient of the loss function L(θ), with Θ representing the parameter space of the neural network and θ ∈ Θ representing a point within that parameter space.

The gradient is a vector field (Eq. 2) comprised of the partial derivatives of L at the current point θ, and represents the direction in which the value of the loss function increases most rapidly. Conversely, the negative of the gradient represents the direction in which the loss function decreases most rapidly:

\nabla_\theta L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_n} \right]^\top \qquad \text{(Eq. 2)}

Thus, by moving along the negative gradient, the Gradient Descent algorithm ensures that the value of the loss function decreases at every update step, causing the algorithm to converge to a (local) minimum after a finite number of iterations (Eq. 3).

\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t) \qquad \text{(Eq. 3)}

In addition, the Gradient Descent algorithm takes an additional input parameter called the learning rate (η). This is a tunable hyperparameter which allows one to control the time until convergence: higher learning rates result in a faster convergence time but make the solution more unstable, whereas lower learning rates result in slower convergence times but more stable solutions.

The reason behind the popularity of regular Gradient Descent is rather straightforward: it is easy to understand, easy to implement, and does not require very advanced computations. However, when used on non-convex optimization problems, the Gradient Descent algorithm might get stuck in a local minimum, resulting in a sub-optimal solution. In addition, Gradient Descent only updates its weights from θ_t to θ_{t+1} after the neural network has been exposed to the entire training set, causing convergence times to increase drastically when dealing with large datasets.
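
As an illustration, here is a minimal NumPy sketch of the update rule in Eq. 3, applied to a simple convex toy loss; the function names and the toy loss are assumptions made for the example, not part of any library.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
    """Plain Gradient Descent (Eq. 3): theta <- theta - lr * grad L(theta).

    grad_fn : gradient of the loss computed over the FULL training set
    theta0  : randomly initialized parameter vector
    lr      : learning rate eta
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad_fn(theta)  # step along the negative gradient
    return theta

# Example: minimise the convex loss L(theta) = ||theta||^2, whose gradient is 2 * theta
print(gradient_descent(lambda th: 2 * th, theta0=[3.0, -2.0]))
```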

Stochastic Gradient Descent

In order to cater for the disadvantages of regular Gradient Descent, researchers came up with a slight variant which they denoted Stochastic Gradient Descent (SGD). Instead of updating the network weights after the network has been exposed to the entire dataset, Stochastic Gradient Descent updates the network's parameters every time it is exposed to a single new instance (Eq. 4). Given a theoretical dataset of 100 samples, Stochastic Gradient Descent would therefore already have executed 100 parameter updates by the time regular Gradient Descent executes its first parameter update.

\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t; x_i, y_i) \qquad \text{(Eq. 4)}

with (x_i, y_i) denoting a single training instance.

The advantages of this variant are manifold. First, Stochastic Gradient Descent drastically reduces the time needed to reach convergence due to the frequent parameter updates that are carried out. Second, due to the high frequency of parameter updates and the random ordering of training set instances, the optimization search will span a larger area of the loss surface, causing new minima, and possibly the global minimum, to be found. However, whereas the latter is usually considered an advantage, it might also cause the optimizer to shoot away from the global optimum towards a suboptimal solution.
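
A minimal sketch of Stochastic Gradient Descent, assuming a per-sample gradient function and a small toy least-squares problem (the names and data are illustrative):

```python
import numpy as np

def sgd(grad_fn, theta0, X, y, lr=0.01, n_epochs=50, seed=0):
    """Stochastic Gradient Descent (Eq. 4): one parameter update per training instance."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):              # visit the samples in random order
            theta = theta - lr * grad_fn(theta, X[i], y[i])
    return theta

# Example: per-sample least-squares gradient, 2 * x_i * (x_i . theta - y_i)
X, y = np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0])
print(sgd(lambda th, xi, yi: 2 * xi * (xi @ th - yi), theta0=[0.0], X=X, y=y))
```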

Mini-batch Gradient Descent

To solve the issues related to both regular Gradient Descent and Stochastic Gradient Descent, researchers came up with a hybrid approach called Mini-batch Gradient Descent (MBGD). Before Mini-batch Gradient Descent can start, the training set is divided into 𝑖 batches with the same number of samples. The network's parameters are only updated after it has been exposed to a full batch of instances (Eq. 5), therefore resulting in a higher update frequency compared to regular Gradient Descent and a lower update frequency compared to Stochastic Gradient Descent.

\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t; B_i) \qquad \text{(Eq. 5)}

with B_i denoting the i-th mini-batch of training instances.

Whereas Mini-batch Gradient Descent can already be considered a quite performant optimization algorithm, it still has a few disadvantages, such as getting stuck in local minima and the need to select an appropriate learning rate. The following two optimization approaches, Momentum and Nesterov Momentum, have been designed to avoid exactly that.
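
The same idea can be sketched for Mini-batch Gradient Descent, where the update of Eq. 5 is applied once per batch; the batch size, toy loss, and function names below are again illustrative assumptions:

```python
import numpy as np

def minibatch_gd(grad_fn, theta0, X, y, lr=0.05, batch_size=2, n_epochs=50, seed=0):
    """Mini-batch Gradient Descent (Eq. 5): one parameter update per mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)                      # reshuffle the training set each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]     # indices of the current mini-batch B_i
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta

# Example: mean least-squares gradient over a mini-batch
grad = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(Xb)
X, y = np.array([[1.0], [2.0], [3.0], [4.0]]), np.array([2.0, 4.0, 6.0, 8.0])
print(minibatch_gd(grad, theta0=[0.0], X=X, y=y))
```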

2. Momentum

Momentum optimization algorithms are an alternative to the Gradient Descent-based approaches discussed in the section above. Initially, momentum optimization methods were introduced to reduce the optimization step variance that is associated with rapidly converging optimization algorithms such as Stochastic Gradient Descent. Such algorithms implement frequent parameter updates in order to reach convergence faster but have the possibility to overshoot the global optimum — leading to suboptimal parameter selection.

Momentum-based algorithms reduce this risk by introducing a new term which represents the discounted accumulation of gradients during previous update steps (Eq. 6, Eq. 7). The momentum parameter γ, a hyperparameter that affects the training procedure, determines how much information from the previous update steps is taken into consideration during the current update step, and usually equals 0.80–0.90.

v_{t+1} = \gamma \, v_t + \eta \, \nabla_\theta L(\theta_t) \qquad \text{(Eq. 6)}

\theta_{t+1} = \theta_t - v_{t+1} \qquad \text{(Eq. 7)}

with v_t denoting the accumulated (discounted) gradient term.

The advantages of using Momentum strategies are quite straightforward: by taking previous parameter updates into consideration, the current parameter update is less prone to the high variance and oscillations caused by outliers in the training dataset. However, a major disadvantage associated with momentum strategies is that they introduce a new hyperparameter, γ, which needs to be tuned and selected manually.
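
A minimal sketch of the momentum update of Eq. 6 and Eq. 7, with the momentum hyperparameter γ set to 0.9 (the toy loss and function names are assumptions for illustration):

```python
import numpy as np

def momentum_gd(grad_fn, theta0, lr=0.1, gamma=0.9, n_iters=200):
    """Gradient Descent with Momentum (Eq. 6 and Eq. 7)."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)                     # discounted accumulation of past gradients
    for _ in range(n_iters):
        velocity = gamma * velocity + lr * grad_fn(theta)   # Eq. 6
        theta = theta - velocity                            # Eq. 7
    return theta

# Example: the accumulated velocity smooths the updates on L(theta) = ||theta||^2
print(momentum_gd(lambda th: 2 * th, theta0=[3.0, -2.0]))
```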

3. Nesterov Accelerated Gradient

Nesterov Accelerated Gradient (NAG) poses an improvement to regular momentum strategies by also looking at future parameter values, therefore categorizing it as a look-ahead method. Since we know that the momentum term will be used to update the parameters, one can make an approximation of what the future parameter values will look like:

\tilde{\theta}_t \approx \theta_t - \gamma \, v_t \qquad \text{(Eq. 8)}

By evaluating the gradient at this approximated parameter combination instead of at the current one, the optimizer slows down as it comes closer to a local optimum and does not overshoot it. However, NAG still relies on the momentum hyperparameter used in regular momentum strategies, which requires tuning and manual selection.
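
A minimal sketch of Nesterov Accelerated Gradient, which differs from the momentum sketch above only in that the gradient is evaluated at the looked-ahead parameters of Eq. 8 (the names and toy loss are again illustrative):

```python
import numpy as np

def nesterov_gd(grad_fn, theta0, lr=0.1, gamma=0.9, n_iters=200):
    """Nesterov Accelerated Gradient: gradient evaluated at the looked-ahead parameters (Eq. 8)."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(n_iters):
        lookahead = theta - gamma * velocity                 # approximate future parameters (Eq. 8)
        velocity = gamma * velocity + lr * grad_fn(lookahead)
        theta = theta - velocity
    return theta

# Example: same quadratic loss; the look-ahead reduces overshooting near the minimum
print(nesterov_gd(lambda th: 2 * th, theta0=[3.0, -2.0]))
```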

Conclusion

A wide range of different neural network optimizers exists, each with its own series of advantages and disadvantages. In this article, I gave a quick introduction to why we need neural network optimizers and what an optimization problem is. In addition, I introduced five of the basic optimization algorithms that are commonly used for neural network parameter optimization. Next to these more basic algorithms, a series of more advanced optimization methods exists, including the Adagrad, RMSProp, and Adam optimizers. These optimization techniques will be discussed in part two of this series on neural network optimization techniques.
