What are Optimizers?
Optimizers are specialized algorithms that adjust a neural network’s parameters throughout training to improve the model’s performance.
How do they improve model performance?
They update the parameters (weights and biases) of the neurons until the error between the output of the neural network and the actual output is minimal, i.e.
they find the set of parameter values at which the cost function is minimum.
So, how do they work?
Let’s say ‘Θ’ is the set of parameters, consisting of two subsets, the weights and the biases, such that:
$$
\Theta = (W, B) \text{ where } W = (w_1, w_2, \dots, w_n) \text{ and } B = (b_1, b_2, \dots, b_n)
$$
For every set of parameters there is a corresponding value of the cost J(Θ), since the cost is a differentiable function of the parameters Θ. The goal is to move along the cost curve in order to ‘converge’ at a point (called the ‘minimum’) whose coordinates, i.e. the parameters, yield a predicted output with the least possible difference from the expected output.
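For concreteness, one common choice of cost (used here only as an assumed example, not prescribed above) is the mean squared error between the predicted outputs and the expected outputs over m training examples:
$$
J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
$$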

Imagine you’re somewhere on this curve. To obtain the optimum set of parameters, you need to change their values so that J(Θ) reaches its minimum.
Optimizers reach this optimum by iterating over two steps:
- Gradient calculation: computing the slope at the current point to determine the direction of the minimum. For instance, if the current value of Θ lies in the left half of a parabola-shaped cost curve, the gradient points towards the minimum on the right, so the value of Θ is increased; if it lies in the right half, the value of Θ is decreased. (Gradient Descent)
- Determining the learning rate: the learning rate tells the algorithm how large a change in the parameters is needed to move towards the minimum. (fixed Learning Rate)
These two techniques are fundamental to finding an optimum set of parameters. Although they form the core of every optimizer, plain gradient descent with a fixed learning rate has been outperformed by methods that help a model converge faster, especially on large datasets.
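As a minimal sketch of this basic loop, the code below applies the two steps to a toy one-dimensional cost J(θ) = θ², whose gradient is known in closed form; the cost, the starting point, and the learning-rate value are all illustrative assumptions, not part of any particular library.

```python
# Plain gradient descent on a toy cost J(theta) = theta**2.
# Step 1: compute the gradient; step 2: take a step scaled by a fixed learning rate.

def gradient(theta):
    # dJ/dtheta for J(theta) = theta**2
    return 2.0 * theta

theta = 5.0          # arbitrary starting point on the curve
learning_rate = 0.1  # fixed step size (illustrative value)

for step in range(50):
    grad = gradient(theta)                 # step 1: gradient calculation
    theta = theta - learning_rate * grad   # step 2: move against the slope

print(theta)  # approaches the minimum at theta = 0
```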
Better optimization methods:
- Stochastic Gradient Descent (SGD): In contrast to Gradient Descent, which uses the entire dataset for every parameter update, SGD computes the gradient and updates the parameters on randomly selected batches of data. Because it does not use the whole dataset and picks the data points at random in each iteration, every update of the parameters can have high variance. SGD has a lower computation cost per update than GD, but it can still be noisy. (A sketch of the update appears after the pros and cons below.)
Advantages
• Frequent updates of parameters, hence converges in less time.
• Can escape local minima and reach the global minimum.
Disadvantages
• High variance in parameter updates.
• May overshoot even after reaching the global minimum.
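Here is a minimal sketch of a mini-batch SGD update for a linear model trained with mean squared error; the synthetic data, the batch size, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus noise (illustrative only)
X = rng.uniform(-1, 1, size=200)
y = 3 * X + 2 + 0.1 * rng.standard_normal(200)

w, b = 0.0, 0.0          # parameters, i.e. Theta = (w, b)
learning_rate = 0.1
batch_size = 16

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random batch
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # Gradients of the MSE loss computed on this batch only
    grad_w = 2 * np.mean(err * xb)
    grad_b = 2 * np.mean(err)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # drifts towards roughly 3 and 2, with some noise
```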
- SGD with Momentum: To counter the noisy convergence behavior of SGD, momentum was introduced. It blends the previous parameter updates with the current one, which makes convergence faster and smoother even when the current updates are small. (A sketch of the update follows the pros and cons below.)
Advantages
• Less variance and noise than SGD
• Converges faster than SGD
Disadvantages
• May overshoot the global minimum and settle for suboptimal values if the momentum is too high.
• Adds one more hyperparameter to fine-tune (the momentum coefficient).
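Below is a sketch of the momentum update on the same kind of toy cost used earlier; `beta` stands for the extra momentum hyperparameter mentioned above, and all values are illustrative assumptions.

```python
# SGD with momentum, sketched on a toy cost J(theta) = theta**2.

def gradient(theta):
    return 2.0 * theta

theta = 5.0
learning_rate = 0.1
beta = 0.9       # momentum coefficient (the extra hyperparameter to tune)
velocity = 0.0   # running blend of past updates

for step in range(100):
    grad = gradient(theta)
    velocity = beta * velocity + learning_rate * grad  # mix past updates with the current one
    theta = theta - velocity

print(theta)  # approaches the minimum at theta = 0, with some oscillation
```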
- Adagrad (Adaptive Gradient): Up to this point, we focused on finding the direction in which the model’s parameters should move to minimize the loss function, using a constant learning rate. The problem with this approach is that a learning rate that is too high or too low directly affects the speed of convergence.
Adagrad introduces an adaptive learning rate that changes inversely with the accumulated gradient: if the gradient is large (a steep slope), the learning rate, and hence the step, should be small so that the minimum is not missed; but if the slope is almost flat, the steps need to be bigger so that the loss function still changes noticeably with each parameter update. (A sketch of the update appears after the pros and cons below.)
Advantages
• Automatically adjusts the learning rate for each parameter.
• No need to manually tune the learning rate.
Disadvantages
• Computationally expensive
• Learning rate decays over time: the model can stop learning altogether if the learning rate becomes too small.
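A sketch of a per-parameter Adagrad-style update is shown below; the toy cost, the epsilon term, and the learning-rate value are illustrative assumptions.

```python
import numpy as np

def gradient(theta):
    # Gradient of a toy cost J(theta) = sum(theta**2)
    return 2.0 * theta

theta = np.array([5.0, -3.0])       # each parameter gets its own effective step size
learning_rate = 0.5
eps = 1e-8                          # avoids division by zero
grad_sq_sum = np.zeros_like(theta)  # accumulated squared gradients

for step in range(200):
    grad = gradient(theta)
    grad_sq_sum += grad ** 2                          # keeps growing over time
    adapted_lr = learning_rate / (np.sqrt(grad_sq_sum) + eps)
    theta = theta - adapted_lr * grad                 # steep directions take smaller steps

print(theta)  # moves towards 0; progress slows as the accumulated sum grows
```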
- RMSProp: RMSProp fixes the learning-rate decay problem of Adagrad by using a decaying average of squared gradients instead of their running sum. This prevents the learning rate from shrinking too quickly, allowing the algorithm to keep learning effectively. (A sketch of the update follows the pros and cons below.)
Advantages
• Automatically adjusts the learning rate based on recent gradients.
• Can converge faster than Adagrad
Disadvantages
• Requires tuning of decay rate parameter.
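Finally, a sketch of the RMSProp update; `decay` is the decay-rate hyperparameter noted in the disadvantages, and the toy cost and values are illustrative assumptions.

```python
import numpy as np

def gradient(theta):
    # Gradient of a toy cost J(theta) = sum(theta**2)
    return 2.0 * theta

theta = np.array([5.0, -3.0])
learning_rate = 0.1
decay = 0.9            # decay rate of the running average (needs tuning)
eps = 1e-8
avg_sq_grad = np.zeros_like(theta)

for step in range(200):
    grad = gradient(theta)
    # Decaying average of squared gradients instead of Adagrad's ever-growing sum
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    theta = theta - learning_rate * grad / (np.sqrt(avg_sq_grad) + eps)

print(theta)  # reaches the neighborhood of the minimum at 0
```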