
Stochastic optimization methods in deep learning?

Many problems can be reduced to minimizing a function about which only incomplete information is available. Two questions are then intertwined: which algorithm should be used to approximate the minimum of the function efficiently while keeping the computation time reasonable, and what is the impact of our imprecise knowledge of the function? The first is an optimization question, the second a question of stochasticity. Gradient descent methods, recursive methods that update the estimate of the minimizer by moving along the direction of steepest descent, turn out to be relatively robust in a stochastic framework. This is the idea developed by Robbins and Monro in 1951.

An essential point for such methods is the choice of the sequence of step sizes $(\gamma_k)$ used in successive iterations. The first proposals use step sequences whose sum diverges but whose sum of squares converges, that is $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, in order to strike an appropriate compromise between bias and variance. However, using an idea from Polyak and Ruppert (averaging the iterates), we can use much larger steps, which do not satisfy the second condition. A recently proposed algorithm achieves an optimal rate with a constant step sequence, in a Euclidean space.

We recall some fundamental notions of statistical learning, in particular the prediction framework: we seek to predict a variable of interest $Y \in \mathcal{Y}$ from an explanatory variable $X \in \mathcal{X}$, given a sample of $n$ observations.

The observations $(X_i, Y_i)_{1 \le i \le n}$ are independent and identically distributed with law $P$.

Definition 1:

We call a predictor any measurable map $g$ from $\mathcal{X}$ to $\mathcal{Y}$. The set of all such maps is denoted by $S$.

We would like $g(X_{n+1})$ to be a “good predictor” of $Y_{n+1}$. To define what "good" means, we need to define a contrast.

Definition 2:

We call contrast any function

$$\ell : S \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}, \qquad (g, (x, y)) \mapsto \ell(g, (x, y)).$$

We also define a loss function:

Definition 3:

The loss function associated with the contrast $\ell$ is the expectation of the contrast:

$$P\ell : S \to \mathbb{R}, \qquad g \mapsto \mathbb{E}[\ell(g, (X, Y))].$$

We call Bayes predictor the best predictor with respect to the loss function: $s^* = \arg\min_{s \in S} P\ell(s)$. Our goal is to determine a predictor whose performance is as close as possible to that of the Bayes predictor. We can briefly cite a few fundamental examples:

- Regression: in this case $\mathcal{Y} = \mathbb{R}$ and $Y = \eta(X) + \varepsilon$, with $\eta(X) = \mathbb{E}[Y \mid X]$ the regression function. We can then consider the least-squares contrast $\ell(g, (x, y)) = (g(x) - y)^2$. In this context, the Bayes predictor is the regression function.

- Binary classification: $\mathcal{Y} = \{0, 1\}$, with the 0-1 contrast $\ell(g, (x, y)) = \mathbf{1}_{g(x) \neq y}$. The Bayes predictor is then $s^*(X) = \mathbf{1}_{\eta(X) \geq 1/2}$.

How can we effectively determine, from the observations, a predictor whose performance is as close as possible to that of the Bayes predictor?
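To make the two contrasts above concrete, here is a minimal sketch that evaluates them on a tiny sample and averages them, which gives an empirical counterpart of the loss. The samples, the two predictors and the function names are illustrative assumptions, not part of the original text.

```python
def squared_contrast(g, x, y):
    """Least-squares contrast: (g(x) - y)^2."""
    return (g(x) - y) ** 2


def zero_one_contrast(g, x, y):
    """0-1 contrast: 1 if the prediction differs from y, else 0."""
    return 1 if g(x) != y else 0


def empirical_risk(contrast, g, sample):
    """Average of the contrast over a sample of (x, y) pairs."""
    return sum(contrast(g, x, y) for x, y in sample) / len(sample)


# Hypothetical regression sample and predictor g(x) = x.
sample_reg = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
risk_reg = empirical_risk(squared_contrast, lambda x: x, sample_reg)

# Hypothetical classification sample and threshold predictor.
sample_clf = [(0.2, 0), (0.7, 1), (0.4, 1), (0.9, 1)]
risk_clf = empirical_risk(zero_one_contrast, lambda x: 1 if x >= 0.5 else 0, sample_clf)

print(risk_reg, risk_clf)
```

The empirical average stands in for the expectation in the definition of the loss, since the law of $(X, Y)$ is unknown in practice.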


A fundamental point in optimization is the convexity of the function we seek to optimize. The function $P\ell$ is often convex (it is as soon as the contrast is convex in its first variable). This is not always the case: the 0-1 contrast is not convex! It is nevertheless possible to convexify the risk, typically by using a convex contrast that satisfies suitable conditions. For this reason, from now on we always assume the contrast to be convex.

Gradient descent algorithms

Gradient descent algorithms, initially introduced by Cauchy in 1847, are iterative algorithms that proceed by successive improvements to approach a minimizer of a differentiable or sub-differentiable function defined on a Euclidean space $E$ or a Hilbert space $H$. The fundamental idea is to follow, at each step, the direction of steepest descent, which is the direction opposite to the gradient.

To minimize a differentiable function $f$ on $H$, the algorithm is generally written in the form:

- Initialization: choose a starting point $\theta_0 \in H$.

- Iteration: given $\theta_k$, compute the gradient $\nabla f(\theta_k)$ and set $\theta_{k+1} := \theta_k - \gamma_k \nabla f(\theta_k)$.

The convergence rates of these algorithms generally depend on the assumptions made on the function to be minimized (strongly convex or not) and on the choice of the step sequence $(\gamma_k)_k$.
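The initialization and iteration steps of gradient descent can be sketched as follows. This is a minimal illustration in plain Python, assuming a hypothetical quadratic objective and a constant step; the function names and the values chosen are assumptions made for the example.

```python
def gradient_descent(grad, theta0, step, n_iter):
    """Iterate theta_{k+1} = theta_k - gamma_k * grad(theta_k)."""
    theta = list(theta0)
    for k in range(n_iter):
        g = grad(theta)
        gamma = step(k)  # step size gamma_k for iteration k
        theta = [t - gamma * gi for t, gi in zip(theta, g)]
    return theta


def grad_f(theta):
    """Gradient of the hypothetical f(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]


# Constant step sequence gamma_k = 0.1, starting from the origin.
theta_hat = gradient_descent(grad_f, [0.0, 0.0], step=lambda k: 0.1, n_iter=200)
print(theta_hat)  # approaches the minimizer (3, -1)
```

Passing the step as a function of `k` keeps the sketch compatible with decreasing step sequences as well as constant ones.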

Stochastic optimization

Stochastic gradient descent

In what follows, we will sometimes be interested in linear predictors: $g_\theta(x) = \langle \theta, x \rangle$. We will therefore write $g_\theta$ or $\theta$ interchangeably.

In the stochastic framework, we do not have direct access to the gradient of the function we are trying to minimize: since we do not know the distribution of $(X, Y)$, we do not know $P\ell$. We therefore set up the following algorithm, called the stochastic gradient algorithm:

- Initialization: choose a starting point $\theta_0 \in H$.

- Iteration: given $\theta_k$, compute an unbiased estimator $\psi_k$ of the gradient $\nabla f(\theta_k)$ and set $\theta_{k+1} := \theta_k - \gamma_k \psi_k$.

Note: this algorithm can be applied in two settings: the minimization of the penalized empirical risk, and stochastic approximation. The crucial point is to be able to exhibit an unbiased estimator of the gradient: in what follows, the function $f$ above is sometimes $P_n\ell + \mathrm{pen}$, sometimes $P\ell$.
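As an illustration of the stochastic gradient algorithm, the sketch below minimizes the empirical least-squares risk of a linear predictor. Drawing one observation uniformly at random and taking the gradient of its contrast gives an unbiased estimator of the gradient of the empirical risk. The synthetic noiseless data, the step sequence $\gamma_k = 0.2 / (1 + k)^{0.6}$ (divergent sum, convergent sum of squares) and all names are assumptions made for this example.

```python
import random


def sgd_least_squares(sample, theta0, step, n_iter, seed=0):
    """Stochastic gradient for the empirical risk of g_theta(x) = <theta, x>."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(n_iter):
        x, y = rng.choice(sample)  # one observation drawn uniformly at random
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        # psi_k: gradient of (<theta, x> - y)^2, unbiased for the empirical-risk gradient
        psi = [2.0 * residual * xi for xi in x]
        gamma = step(k)
        theta = [t - gamma * p for t, p in zip(theta, psi)]
    return theta


# Hypothetical noiseless sample generated from theta_true = (2, -1).
data_rng = random.Random(42)
theta_true = [2.0, -1.0]
sample = []
for _ in range(50):
    x = [data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0)]
    y = sum(t * xi for t, xi in zip(theta_true, x))
    sample.append((x, y))

theta_hat = sgd_least_squares(
    sample, [0.0, 0.0], step=lambda k: 0.2 / (1 + k) ** 0.6, n_iter=20000
)
print(theta_hat)  # close to theta_true
```

Replacing the single observation with a small random batch and averaging the per-observation gradients gives the mini-batch variant commonly used in deep learning.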

In deep learning, the objective function that we seek to minimize is often non-convex and non-smooth. The convergence of gradient descent towards the global minimum is therefore not guaranteed, and even convergence towards a local minimum can be extremely slow.

One solution to this problem is to use the stochastic gradient descent algorithm. The idea is to minimize a function that can be written as a sum of differentiable functions. The minimization is carried out iteratively on batches of data drawn at random; the objective function evaluated on each batch is an approximation of the overall objective function. The following equation describes this method:

$$\theta_{t+1} = \theta_t - \eta \, \nabla J_{B_t}(\theta_t),$$

where $J_{B_t}$ is the objective function restricted to the batch $B_t$ drawn at iteration $t$, and $\eta$ is the learning rate.


Stochastic gradient descent (SGD) and its variants handle most of the optimization problems encountered in deep learning.

Momentum SGD: 

The main objective of the momentum method is to speed up gradient descent by adding a velocity vector to the initial update:

$$v_{t+1} \leftarrow \mu v_t - \eta \nabla J(\theta_t),$$

$$\theta_{t+1} \leftarrow \theta_t + v_{t+1}.$$

Here $J$ is the objective function and $\eta$ the learning rate. The vector $v_{t+1}$ is computed at the start of each iteration and represents the updated velocity of a "ball rolling down a slope." The velocity accumulates over the iterations, hence the introduction of a hyper-parameter $\mu$ to dampen the velocity when a flat region is reached. A good strategy can be to adjust $\mu$ according to the stage of training.
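The two momentum updates translate almost line by line into code. The sketch below is a minimal illustration, assuming a full (non-stochastic) gradient and a hypothetical quadratic objective; the values of `eta` and `mu` are arbitrary choices for the example.

```python
def momentum_descent(grad, theta0, eta, mu, n_iter):
    """v <- mu * v - eta * grad(theta); theta <- theta + v."""
    theta = list(theta0)
    v = [0.0] * len(theta)  # velocity starts at rest
    for _ in range(n_iter):
        g = grad(theta)  # gradient at the current location
        v = [mu * vi - eta * gi for vi, gi in zip(v, g)]
        theta = [t + vi for t, vi in zip(theta, v)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]


theta_hat = momentum_descent(grad_j, [5.0, 5.0], eta=0.05, mu=0.9, n_iter=500)
print(theta_hat)  # approaches the minimizer (0, 0)
```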

Nesterov accelerated gradient descent:

In 1983, Nesterov proposed a modification of the momentum method and showed that his algorithm has a better theoretical convergence rate for the optimization of convex functions. This approach became very popular due to its performance in practice compared to the classic method. The main difference between the two is that the momentum method starts by computing the gradient at the current location $\theta_t$ before taking a step in the direction of the accumulated velocity, while Nesterov momentum first takes a step along the accumulated velocity to obtain an approximation of the updated parameter, denoted by $\tilde\theta_{t+1}$, and then corrects this step by computing the gradient at that location. A step of Nesterov's momentum is described by:

$$\tilde\theta_{t+1} = \theta_t + \mu v_t,$$

$$v_{t+1} \leftarrow \mu v_t - \eta \nabla J(\tilde\theta_{t+1}),$$

$$\theta_{t+1} \leftarrow \theta_t + v_{t+1}.$$
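The only change with respect to classical momentum is where the gradient is evaluated. The sketch below is a minimal illustration under the same assumptions as a plain momentum example (full gradient, hypothetical quadratic objective, arbitrary `eta` and `mu`).

```python
def nesterov_descent(grad, theta0, eta, mu, n_iter):
    """Momentum step with the gradient evaluated at the look-ahead point."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(n_iter):
        # Approximation of the updated parameter: step along the accumulated velocity.
        lookahead = [t + mu * vi for t, vi in zip(theta, v)]
        g = grad(lookahead)  # gradient at the look-ahead location, not at theta
        v = [mu * vi - eta * gi for vi, gi in zip(v, g)]
        theta = [t + vi for t, vi in zip(theta, v)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]


theta_hat = nesterov_descent(grad_j, [5.0, 5.0], eta=0.05, mu=0.9, n_iter=500)
print(theta_hat)  # approaches the minimizer (0, 0)
```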

Gradient Descent Extensions

There are several variants of the gradient descent algorithm. In what follows, we present a few methods that are closely related to those presented above:

Averaged gradient descent:

This method was proposed and studied by Polyak. The idea is to replace the parameter $\theta_t$ by the running average of its successive values, computed from the updates produced by gradient descent:

$$\bar\theta_t = \frac{1}{t} \sum_{k=1}^{t} \theta_k.$$
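Averaging can be sketched by keeping a running mean alongside the usual iterate. The example below is a minimal one-dimensional illustration, assuming a hypothetical objective $(\theta - 1)^2$ whose gradient is observed with zero-mean Gaussian noise, and a constant step; all names and values are assumptions.

```python
import random


def averaged_sgd(noisy_grad, theta0, gamma, n_iter):
    """Return the last iterate and the running average of the iterates."""
    theta = theta0
    theta_bar = 0.0
    for k in range(1, n_iter + 1):
        theta = theta - gamma * noisy_grad(theta)
        # Incremental mean: after k steps, theta_bar = (theta_1 + ... + theta_k) / k.
        theta_bar += (theta - theta_bar) / k
    return theta, theta_bar


noise_rng = random.Random(0)


def noisy_grad(theta):
    """Gradient of (theta - 1)^2 plus zero-mean Gaussian noise."""
    return 2.0 * (theta - 1.0) + noise_rng.gauss(0.0, 1.0)


theta_last, theta_avg = averaged_sgd(noisy_grad, 0.0, gamma=0.1, n_iter=5000)
print(theta_last, theta_avg)
```

With a constant step, the last iterate keeps fluctuating around the minimizer, while the average typically fluctuates much less; this is what allows larger steps to be used.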


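Adagrad:

The principle of this method, proposed in 2011, is to make the learning rate adapt to the parameters so that it adjusts automatically, depending on the "sparsity" of the parameters. Adagrad gradually lowers the learning rate, but not in the same way for all parameters: dimensions with a steeper slope see their rate lowered faster than gently sloping ones. More formally, the step for coordinate $i$ is described by:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \, \nabla_i J(\theta_t), \qquad G_{t,i} = \sum_{s=0}^{t} \big(\nabla_i J(\theta_s)\big)^2,$$

where $\epsilon$ is a small constant that avoids division by zero.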
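A minimal sketch of the Adagrad step, in its usual form where each coordinate divides the learning rate by the square root of its accumulated squared gradients. The quadratic objective, `eta` and `eps` are assumptions chosen for the example.

```python
import math


def adagrad(grad, theta0, eta, n_iter, eps=1e-8):
    """Per-coordinate step eta / sqrt(G_i + eps), with G_i the sum of squared gradients."""
    theta = list(theta0)
    accum = [0.0] * len(theta)  # G_i: running sum of squared gradients per coordinate
    for _ in range(n_iter):
        g = grad(theta)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        theta = [t - eta / math.sqrt(a + eps) * gi for t, a, gi in zip(theta, accum, g)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]


theta_hat = adagrad(grad_j, [5.0, 5.0], eta=1.0, n_iter=500)
print(theta_hat)  # approaches the minimizer (0, 0)
```

On this example the steep coordinate (curvature 10) accumulates larger squared gradients, so its effective learning rate drops faster than that of the gently sloping one.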


RMSProp Algorithm: 

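RMSProp automatically adjusts the learning rate for each parameter, like Adagrad. However, it only accumulates the gradients from recent iterations. For this, it uses a moving average of the squared gradients, with a decay rate $\rho \in (0, 1)$:

$$E[\nabla^2]_{t,i} = \rho \, E[\nabla^2]_{t-1,i} + (1 - \rho) \big(\nabla_i J(\theta_t)\big)^2, \qquad \delta(\nabla_t)_i = \sqrt{E[\nabla^2]_{t,i} + \epsilon},$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\delta(\nabla_t)_i} \, \nabla_i J(\theta_t).$$

Here, $\delta(\nabla_t)_i$ is the moving root mean square of the $i$-th component of the gradient. Dividing the gradient of the objective function by this root mean square (i.e., its amplitude) improves convergence.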
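A minimal sketch of RMSProp in its commonly used form: an exponential moving average of the squared gradients replaces Adagrad's full sum. The decay rate `rho`, the learning rate, `eps` and the quadratic objective are assumptions chosen for the example.

```python
import math


def rmsprop(grad, theta0, eta, rho, n_iter, eps=1e-8):
    """Divide each gradient coordinate by the moving root mean square of its recent values."""
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)  # moving average of squared gradients per coordinate
    for _ in range(n_iter):
        g = grad(theta)
        avg_sq = [rho * a + (1.0 - rho) * gi * gi for a, gi in zip(avg_sq, g)]
        theta = [t - eta * gi / math.sqrt(a + eps) for t, a, gi in zip(theta, avg_sq, g)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]


theta_hat = rmsprop(grad_j, [1.0, -1.0], eta=0.01, rho=0.9, n_iter=2000)
print(theta_hat)  # ends near the minimizer (0, 0)
```

Because the gradient is normalized by its own amplitude, each step has a size of the order of `eta`; with a constant `eta` the iterates therefore settle in a small neighborhood of the minimizer rather than converging exactly.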


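Adam:

Adam is one of the more recent and efficient algorithms for gradient-based optimization. The principle is the same as for Adagrad and RMSProp: it automatically adapts the learning rate for each parameter. Its particularity is to compute "adaptive estimates of moments" $(m_t, v_t)$. It can therefore be seen as a generalization of the Adagrad algorithm:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta_t), \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) \big(\nabla J(\theta_t)\big)^2,$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

with $m_0 = v_0 = 0$, $t \geq 1$, and squares, square roots and divisions taken coordinate-wise.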


Here $m_t$ is the first moment of the gradient (the mean) and $v_t$ is its second moment (the uncentered variance); $\epsilon$ is a precision parameter. Its default value in Caffe, a popular deep learning tool for CNNs, is $10^{-8}$. The parameters $\beta_1$ and $\beta_2$ control the running averages of the moments $m_t$ and $v_t$, respectively.
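A minimal sketch of Adam with bias-corrected moment estimates, in its usual form. The defaults `beta1=0.9` and `beta2=0.999` are the commonly used values and are assumptions here, as are the learning rate and the quadratic objective.

```python
import math


def adam(grad, theta0, eta, n_iter, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: running averages of the gradient (m) and of its square (v), with bias correction."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate
    v = [0.0] * len(theta)  # second (uncentered) moment estimate
    for t in range(1, n_iter + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - eta * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def grad_j(theta):
    """Gradient of the hypothetical J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]


theta_hat = adam(grad_j, [1.0, -1.0], eta=0.01, n_iter=5000)
print(theta_hat)  # ends near the minimizer (0, 0)
```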

Author: Vicki Lezama
