# Stochastic optimization methods in deep learning

In many situations, a problem can be reduced to a minimization problem in which only incomplete information is available about the function we want to minimize. Two questions are then intertwined: which algorithms approximate the minimizer effectively while keeping the computation time reasonable, and what is the impact of the imprecision in our knowledge of the function? The first is an optimization question, the second a question of stochasticity. Gradient descent methods, recursive methods that update the estimate of the minimizer by moving along the direction of steepest descent, turn out to be relatively robust in a stochastic framework. This is the idea developed by Robbins and Monro in 1951. An essential point for such methods is the choice of the sequence of step sizes across iterations. The first proposals used step-size sequences whose sum diverges but whose sum of squares converges, in order to strike an appropriate compromise between bias and variance. However, using an idea of Polyak and Ruppert, one can take much larger steps, which do not satisfy the second condition: a recently proposed averaged algorithm achieves an optimal rate with a constant step size, in a Euclidean space.

We recall some fundamental notions of statistical learning, in particular the prediction framework: we seek to predict a variable of interest Y ∈ Y from an explanatory variable X ∈ X, given a sample of n observations, assumed independent and identically distributed with law P.

#### Definition 1

We call a predictor any measurable map g from X to Y. The set of all such maps is denoted S.

We would expect g(Xn+1) to be a "good predictor" of Yn+1. To make this notion of "good" precise, we need to define a contrast.

#### Definition 2

We call a contrast any function

l : S × (X × Y) → R

(g, (x, y)) ↦ l(g, (x, y)).

We also define a loss function:

#### Definition 3

The loss function associated with a contrast l is the expectation of the contrast:

Pl : S → R

g ↦ E[l(g, (X, Y))].

We call the Bayes predictor the best predictor with respect to the loss function: s∗ = arg mins∈S Pl(s). Our goal is to determine a predictor whose performance is as close as possible to that of the Bayes predictor. We can briefly cite a few fundamental examples:

- Regression: in this case Y = R and Y = η(X) + ε, with η(X) = E[Y | X] the regression function. We can then consider the least-squares contrast: l(g, (x, y)) = (g(x) − y)². In this context, the Bayes predictor is the regression function.

- Binary classification: Y = {0, 1}, with the 0-1 contrast: l(g, (x, y)) = 1g(x)≠y. The Bayes predictor is then s∗(X) = 1η(X)≥1/2.

How can we effectively determine, from the observations, a predictor whose performance is as close as possible to that of the Bayes predictor?
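The regression example above can be checked numerically. Below is a minimal Monte Carlo sketch under a hypothetical model (Y = 2X + noise, so η(x) = 2x): the regression function achieves the smallest least-squares risk, close to the noise variance, while any other predictor does worse.

```python
import random

# Hypothetical setup: Y = eta(X) + epsilon with eta(x) = 2x and
# Gaussian noise of standard deviation 0.5, so the Bayes predictor for
# the least-squares contrast is eta and its risk is 0.25.
random.seed(0)

def eta(x):
    return 2.0 * x

def risk(g, n=100_000):
    """Estimate Pl(g) = E[(g(X) - Y)^2] by Monte Carlo."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        y = eta(x) + random.gauss(0.0, 0.5)
        total += (g(x) - y) ** 2
    return total / n

print(risk(eta))           # close to the noise variance 0.25
print(risk(lambda x: x))   # larger: a suboptimal predictor
```
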

## Convexity

A fundamental point in optimization is the convexity of the function we seek to optimize. The function Pl is generally convex (it is whenever the contrast is convex in its first variable). This is not systematic: the 0-1 contrast is not convex! It is nevertheless possible to convexify the risk, typically by using a convex surrogate contrast satisfying suitable conditions. For this reason, in what follows, we will always assume the contrast function to be convex.
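A small numerical illustration of this point, using the margin formulation with labels ±1 (an illustrative convention, not prescribed by the text): the 0-1 loss violates midpoint convexity in the score, while a convex surrogate such as the logistic loss satisfies it.

```python
import math

# 0-1 loss on a real-valued score s, predicting the label sign(s).
def zero_one(s, y):
    return 0.0 if (s >= 0) == (y > 0) else 1.0

# A convex surrogate: the logistic loss (one possible choice of convex contrast).
def logistic(s, y):
    return math.log(1.0 + math.exp(-y * s))

# Midpoint convexity fails for the 0-1 loss: with y = +1, at s = -1 the loss
# is 1 and at s = 0.5 it is 0, but at the midpoint s = -0.25 it is 1 > 0.5.
print(zero_one(-0.25, 1), 0.5 * (zero_one(-1, 1) + zero_one(0.5, 1)))
# The logistic loss satisfies the inequality at the same points.
print(logistic(-0.25, 1), 0.5 * (logistic(-1, 1) + logistic(0.5, 1)))
```
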

Gradient descent algorithms, initially introduced by Cauchy in 1847, are iterative algorithms that proceed by successive improvements to approach a minimizer of a differentiable or sub-differentiable function defined on a Euclidean space E or a Hilbert space H. The fundamental idea is to follow, at each step, the direction of steepest descent, which is exactly the opposite of the gradient.

To minimize a differentiable function f on H, the algorithm is therefore expressed in general in the form:

- Initialization: choose a starting point θ0 ∈ H

- Iterate: given θk, compute the gradient ∇f(θk) and set θk+1 := θk − γk∇f(θk).
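The two steps above can be sketched in a few lines, here on a hypothetical one-dimensional quadratic f(θ) = (θ − 3)², whose gradient is 2(θ − 3):

```python
# Minimal gradient descent sketch for a smooth convex function
# (hypothetical example; the minimizer is theta = 3).

def grad_f(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0           # initialization: theta_0
gamma = 0.1           # constant step size gamma_k
for _ in range(200):  # iterate theta_{k+1} = theta_k - gamma_k * grad f(theta_k)
    theta -= gamma * grad_f(theta)

print(theta)  # converges to the minimizer 3
```
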

These algorithms exhibit convergence rates that generally depend on the hypotheses on the function to be minimized (strongly convex or not) and on the choice of the step-size sequence (γk).

## Stochastic optimization

In the following, we will sometimes be interested in linear predictors: gθ(x) = ⟨θ, x⟩. We will therefore write gθ or θ interchangeably.

In the stochastic framework, we do not have direct access to the gradient of the function we are trying to minimize: since we do not know the distribution of (X, Y), we do not know Pl. We therefore set up the following algorithm, called the "stochastic gradient algorithm":

- Initialization: choose a starting point θ0 ∈ H

- Iterate: given θk, compute an unbiased estimator ψk of the gradient ∇f(θk) and set θk+1 := θk − γkψk.
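The stochastic iteration above can be sketched on a hypothetical one-dimensional linear model Y = 1.5 X + noise: each observation yields an unbiased gradient estimate ψk = 2(θx − y)x of the gradient of Pl(θ) = E[(θX − Y)²], and the step sizes follow the Robbins–Monro conditions (divergent sum, summable squares).

```python
import random

# Hypothetical 1-D stochastic approximation: the algorithm never sees the
# law of (X, Y), only one sample per iteration. True parameter: 1.5.
random.seed(1)

theta = 0.0
for k in range(1, 50_001):
    x = random.gauss(0.0, 1.0)
    y = 1.5 * x + random.gauss(0.0, 0.1)  # unknown to the algorithm
    psi = 2.0 * (theta * x - y) * x       # unbiased estimator of the gradient
    gamma = 1.0 / (k + 10)                # sum diverges, sum of squares converges
    theta -= gamma * psi

print(theta)  # close to 1.5
```
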

Note: it is important to note that this algorithm can be applied to both approaches mentioned above: to the minimization of the penalized empirical risk as well as to stochastic approximation. The crucial point is to be able to exhibit an unbiased estimator of the gradient: in what follows, the function f above is sometimes Pn l + pen, sometimes Pl.

In deep learning, the objective function we seek to minimize is often non-convex and non-smooth. Convergence of gradient descent to the global minimum is therefore not guaranteed, and convergence even to a local minimum can be extremely slow.

One solution to this problem is the stochastic gradient descent algorithm. The idea of the approach is to minimize a function that can be written as a sum of differentiable functions. The process is carried out iteratively on batches of data drawn at random, and each objective function minimized in this way is an approximation of the overall objective function. Stochastic Gradient Descent (SGD) addresses many of the problems encountered in deep learning.
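The mini-batch version can be sketched as follows, on a hypothetical least-squares problem over a finite dataset (slope 2): each step averages the gradient over a random batch, which approximates the full empirical objective.

```python
import random

# Hypothetical 1-D dataset: y = 2x + small noise, 100 points.
random.seed(2)
data = [(i / 50.0, 2.0 * (i / 50.0) + random.gauss(0.0, 0.05)) for i in range(100)]

def batch_grad(theta, batch):
    """Average gradient of (theta*x - y)^2 over the batch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

theta, lr = 0.0, 0.2
for _ in range(500):
    batch = random.sample(data, 10)      # batch drawn at random
    theta -= lr * batch_grad(theta, batch)

print(theta)  # close to the true slope 2
```
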

### Momentum SGD:

The main objective of the method called Momentum is to speed up the gradient descent process by adding a velocity vector to the initial update:

vt+1 ← µ vt − η ∇J(θt),

θt+1 ← θt + vt+1.

The vector vt+1 is computed at the start of each iteration and represents the updated velocity of a "ball rolling down a slope." The velocity accumulates across iterations, hence the introduction of a hyperparameter µ to damp the velocity when reaching a flat surface. A good strategy can be to adjust µ over the course of training.
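The two momentum updates above can be sketched directly, again on a hypothetical quadratic objective J(θ) = (θ − 3)²:

```python
# Momentum SGD sketch following the updates above:
#   v_{t+1} = mu * v_t - eta * grad J(theta_t)
#   theta_{t+1} = theta_t + v_{t+1}

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
mu, eta = 0.9, 0.01  # mu damps the accumulated velocity
for _ in range(500):
    v = mu * v - eta * grad_J(theta)
    theta = theta + v

print(theta)  # approaches the minimizer 3
```
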

In 1983, Nesterov proposed a modification of the Momentum method and showed that his algorithm has better theoretical convergence for the optimization of convex functions. The approach became very popular due to its performance in practice compared to the classic method. The main difference between Nesterov's method and the momentum method is that the latter first computes the gradient at the current location θt before taking a step in the direction of the accumulated velocity, whereas Nesterov momentum first takes a look-ahead step to obtain an approximation of the updated parameter, denoted θ̃t+1, and then corrects this step by computing the gradient at that location. A step of Nesterov's momentum is described by:

θ̃t+1 ← θt + µ vt,

vt+1 ← µ vt − η ∇J(θ̃t+1),

θt+1 ← θt + vt+1.

There are several variants of the gradient descent algorithm. In the following, we present three further methods, closely related to those already described.
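Before turning to these variants, the Nesterov look-ahead step can be sketched as follows, on the same hypothetical quadratic J(θ) = (θ − 3)² used above:

```python
# Nesterov momentum sketch: evaluate the gradient at the look-ahead point
# theta_t + mu * v_t rather than at theta_t.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
mu, eta = 0.9, 0.01
for _ in range(500):
    lookahead = theta + mu * v            # approximation of the updated parameter
    v = mu * v - eta * grad_J(lookahead)  # corrected velocity update
    theta = theta + v

print(theta)  # approaches the minimizer 3
```
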

### Polyak Averaging:

This method was proposed and studied by Polyak. The idea is to replace the parameter θt by the temporal (running) average of its values, computed from the updates obtained by gradient descent.

### Adagrad Algorithm:

The principle of this method, proposed in 2011, is to make the learning rate adapt to the parameters so that it adjusts automatically, depending on the "sparsity" of the parameters. Adagrad gradually lowers the learning rate, but not in the same way for all the parameters: dimensions with a steeper slope see their rate lowered faster than gently sloping ones. More formally, the step is described by:

Gt+1 ← Gt + ∇J(θt)² (element-wise),

θt+1 ← θt − η ∇J(θt) / (√Gt+1 + ε).

### RMSProp Algorithm:

The RMSProp algorithm automatically adjusts the learning rate of each parameter, like Adagrad. However, it only accumulates the gradients from recent iterations, using a sliding (exponentially weighted) average. Here, δ(∇t)i is the sliding root mean square of the gradient. Dividing each coordinate of the gradient of the objective function by the root of this sliding mean square (i.e., its amplitude) improves convergence.
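The per-parameter scaling can be sketched as follows, on a hypothetical ill-conditioned quadratic J(a, b) = 50a² + 0.5b² whose two coordinates have very different slopes; Adagrad would use the same structure but accumulate the full sum of squared gradients instead of a sliding average.

```python
import math

# RMSProp sketch: each coordinate is divided by the root of a sliding
# mean of its squared gradients, so steep and flat directions progress
# at comparable speeds.

def grad_J(a, b):
    return (100.0 * a, 1.0 * b)

theta = [1.0, 1.0]
accum = [0.0, 0.0]               # sliding mean of squared gradients
rho, eta, eps = 0.9, 0.01, 1e-8  # decay rate, learning rate, precision
for _ in range(1000):
    g = grad_J(*theta)
    for i in range(2):
        accum[i] = rho * accum[i] + (1 - rho) * g[i] ** 2
        theta[i] -= eta * g[i] / (math.sqrt(accum[i]) + eps)

print(theta)  # both coordinates approach 0 at comparable speeds
```
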

### Adam Algorithm:

Adam is one of the most recent and efficient algorithms for gradient descent optimization. The principle is the same as for Adagrad and RMSProp: it automatically adapts the learning rate of each parameter. Its particularity is to compute (mt, vt), "adaptive estimates of the moments." It can therefore be seen as a generalization of the Adagrad algorithm. Here mt is the first moment of the gradient (the mean), vt is its second moment (uncentered variance), and ε is a precision parameter; its default value in the popular deep-learning framework Caffe is 10⁻⁸. The parameters β1 and β2 are used to compute running averages of the moments mt and vt, respectively.
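The moment estimates described above, including the standard bias corrections, can be sketched on the same hypothetical quadratic J(θ) = (θ − 3)² used earlier (hyperparameter values are illustrative):

```python
import math

# Adam sketch: running averages m_t (first moment) and v_t (second,
# uncentered moment) of the gradient, bias-corrected, then a
# per-parameter step scaled by the root of the second moment.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
m, v = 0.0, 0.0
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(1, 5001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g       # first moment (mean)
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)          # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # approaches the minimizer 3
```
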

Author: Vicki Lezama
