Second order optimization methods for SGD convergence?

It seems that, especially for deep learning, very simple first-order methods for optimizing SGD convergence, like Adam, dominate - nice overview: http://ruder.io/optimizing-gradient-descent/

They trace only a single direction, discarding information about the remaining ones, and they do not try to estimate the distance to a nearby extremum - which is suggested by the evolution of the gradient (it goes to 0 at an extremum) and could help with the crucial choice of step size.
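
For intuition, here is a minimal 1D sketch (my own toy illustration, not taken from any particular method): if the loss is locally a parabola f(θ) ≈ (λ/2)(θ − p)² + c, then its gradient is g(θ) = λ(θ − p), so two (position, gradient) pairs already give an estimate of both the curvature λ and the extremum position p - i.e. a Newton-like step size.

```python
import numpy as np

def estimate_extremum_1d(x0, x1, g0, g1):
    """Toy sketch: assume the loss is locally f(x) ~ 0.5*lam*(x - p)**2 + c,
    so the gradient is g(x) = lam*(x - p).  Two (position, gradient) pairs
    then determine both the curvature lam and the extremum position p."""
    lam = (g1 - g0) / (x1 - x0)   # finite-difference estimate of the curvature
    p = x1 - g1 / lam             # g/lam is exactly the distance to the extremum
    return lam, p

# Example on f(x) = 2*(x - 3)**2, whose gradient is g(x) = 4*(x - 3)
g = lambda x: 4.0 * (x - 3.0)
lam, p = estimate_extremum_1d(0.0, 0.5, g(0.0), g(0.5))
print(lam, p)  # -> 4.0 3.0 : recovered curvature and location of the minimum
```

With noisy stochastic gradients one would of course average such estimates over many steps instead of relying on two points - which is what the sketch further below tries to do.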

Both of these missed opportunities could be exploited by second-order methods, which try to locally model a parabola in multiple directions simultaneously (not all of them, just a few) - e.g. near a saddle, attracting in some directions and repulsing in the others. Here are some:

But first-order methods still dominate (?). I have heard opinions that second-order methods just don't work for deep learning (?).

There are mainly 3 challenges (any more?): inverting the Hessian, the stochasticity of gradients, and handling saddles. All of them could be resolved by locally modelling the loss as parabolas in a few promising directions (the approach I would like to use): update this parametrization based on the calculated gradients, then perform the proper step based on it. This way the extrema are directly part of the updated parametrization - no Hessian inversion is needed; the slow evolution of the parametrization allows statistical trends to be accumulated from the gradients; and we can model both curvatures near saddles, correspondingly attracting or repulsing, with strength depending on the modelled distance.
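
To make this concrete, here is a rough, hypothetical sketch of one way to realize it (all names, constants and design choices below are my own assumptions, not a tested optimizer): keep a few fixed unit directions, maintain exponential moving averages of the projected position and directional gradient along each, fit g ≈ λ(x − p) per direction by an online regression, then attract toward the modelled extremum where λ > 0 and repulse from it where λ < 0, with step size proportional to the modelled distance - no Hessian is ever formed or inverted.

```python
import numpy as np

class FewDirectionParabolaModel:
    """Rough sketch: model the loss along a few fixed unit directions as
    parabolas, updated online from stochastic gradients.

    For each direction v we track exponential moving averages of the
    projected position x = v.theta and directional gradient g = v.grad,
    and fit g ~ lam * (x - p): lam is the modelled curvature, p the
    modelled extremum position along that direction.
    """

    def __init__(self, directions, beta=0.99, lr=0.5, eps=1e-8):
        self.V = np.asarray(directions, dtype=float)   # (k, d), unit-norm rows
        self.beta, self.lr, self.eps = beta, lr, eps
        k = self.V.shape[0]
        # EMA accumulators per direction: E[x], E[g], E[x*x], E[x*g]
        self.mx = np.zeros(k)
        self.mg = np.zeros(k)
        self.mxx = np.zeros(k)
        self.mxg = np.zeros(k)

    def step(self, theta, grad):
        x = self.V @ theta          # projected positions, shape (k,)
        g = self.V @ grad           # directional gradients, shape (k,)
        b = self.beta
        self.mx = b * self.mx + (1 - b) * x
        self.mg = b * self.mg + (1 - b) * g
        self.mxx = b * self.mxx + (1 - b) * x * x
        self.mxg = b * self.mxg + (1 - b) * x * g

        # EMA-based linear regression of g against x (no bias correction,
        # just a sketch): slope = curvature lam, zero of the fit = extremum p.
        var = self.mxx - self.mx ** 2
        cov = self.mxg - self.mx * self.mg
        lam = cov / (var + self.eps)
        p = self.mx - self.mg * np.sign(lam) / (np.abs(lam) + self.eps)

        # Attract toward p where curvature is positive, repulse where it is
        # negative, with strength proportional to the modelled distance.
        move = -self.lr * np.sign(lam) * (x - p)
        return theta + self.V.T @ move   # update only inside the k-dim subspace
```

The directions themselves could e.g. be chosen from recent gradient statistics and refreshed slowly; here they are simply fixed for readability, and the remaining part of the parameter space could be handled by a plain first-order update.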

Should we go toward second-order methods for deep learning?

Why is it so difficult to make them more successful than simple first-order methods - could we identify these challenges and resolve them?

As there are many ways to realize second-order methods, which seems the most promising?

Update: an overview of methods, including 2nd-order ones: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
