Second order optimization methods for SGD convergence?

It seems that, especially for deep learning, very simple first-order methods for optimizing SGD convergence, like Adam, dominate - nice overview: http://ruder.io/optimizing-gradient-descent/

They trace only a single direction, discarding information about the remaining ones, and they do not try to estimate the distance to a nearby extremum - which is suggested by the evolution of the gradient (it goes to 0 at an extremum) and could help with the crucial choice of step size.
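
For intuition, here is a minimal 1D sketch (my own toy illustration, not taken from any particular method): if the loss is locally a parabola f(θ) ≈ (λ/2)(θ − p)² + c, then its gradient is g(θ) = λ(θ − p), so two (position, gradient) pairs already give an estimate of both the curvature λ and the extremum position p - i.e. a Newton-like step size.

```python
import numpy as np

def estimate_extremum_1d(x0, x1, g0, g1):
    """Toy sketch: assume the loss is locally f(x) ~ 0.5*lam*(x - p)**2 + c,
    so the gradient is g(x) = lam*(x - p).  Two (position, gradient) pairs
    then determine both the curvature lam and the extremum position p."""
    lam = (g1 - g0) / (x1 - x0)   # finite-difference estimate of the curvature
    p = x1 - g1 / lam             # g/lam is exactly the distance to the extremum
    return lam, p

# Example on f(x) = 2*(x - 3)**2, whose gradient is g(x) = 4*(x - 3)
g = lambda x: 4.0 * (x - 3.0)
lam, p = estimate_extremum_1d(0.0, 0.5, g(0.0), g(0.5))
print(lam, p)  # -> 4.0 3.0 : recovered curvature and location of the minimum
```

With noisy stochastic gradients one would of course average such estimates over many steps instead of relying on two points - which is what the sketch further below tries to do.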

Both of these missed opportunities could be exploited by second-order methods, which try to locally model a parabola in multiple directions simultaneously (not all of them, just a few) - e.g. near a saddle, attracting in some directions and repulsing in the others. Here are some:

But first-order methods still dominate (?). I have heard opinions that second-order methods just don't work for deep learning (?).

There are mainly 3 challenges (any more?): inverting the Hessian, the stochasticity of gradients, and handling saddles. All of them could be resolved by locally modelling the loss as parabolas in a few promising directions (the approach I would like to use): update this parametrization based on the calculated gradients, then perform the proper step based on it. This way the extrema are directly part of the updated parametrization - no Hessian inversion is needed; the slow evolution of the parametrization allows statistical trends to be accumulated from the gradients; and we can model both curvatures near saddles, correspondingly attracting or repulsing, with strength depending on the modelled distance.
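
To make this concrete, here is a rough, hypothetical sketch of one way to realize it (all names, constants and design choices below are my own assumptions, not a tested optimizer): keep a few fixed unit directions, maintain exponential moving averages of the projected position and directional gradient along each, fit g ≈ λ(x − p) per direction by an online regression, then attract toward the modelled extremum where λ > 0 and repulse from it where λ < 0, with step size proportional to the modelled distance - no Hessian is ever formed or inverted.

```python
import numpy as np

class FewDirectionParabolaModel:
    """Rough sketch: model the loss along a few fixed unit directions as
    parabolas, updated online from stochastic gradients.

    For each direction v we track exponential moving averages of the
    projected position x = v.theta and directional gradient g = v.grad,
    and fit g ~ lam * (x - p): lam is the modelled curvature, p the
    modelled extremum position along that direction.
    """

    def __init__(self, directions, beta=0.99, lr=0.5, eps=1e-8):
        self.V = np.asarray(directions, dtype=float)   # (k, d), unit-norm rows
        self.beta, self.lr, self.eps = beta, lr, eps
        k = self.V.shape[0]
        # EMA accumulators per direction: E[x], E[g], E[x*x], E[x*g]
        self.mx = np.zeros(k)
        self.mg = np.zeros(k)
        self.mxx = np.zeros(k)
        self.mxg = np.zeros(k)

    def step(self, theta, grad):
        x = self.V @ theta          # projected positions, shape (k,)
        g = self.V @ grad           # directional gradients, shape (k,)
        b = self.beta
        self.mx = b * self.mx + (1 - b) * x
        self.mg = b * self.mg + (1 - b) * g
        self.mxx = b * self.mxx + (1 - b) * x * x
        self.mxg = b * self.mxg + (1 - b) * x * g

        # EMA-based linear regression of g against x (no bias correction,
        # just a sketch): slope = curvature lam, zero of the fit = extremum p.
        var = self.mxx - self.mx ** 2
        cov = self.mxg - self.mx * self.mg
        lam = cov / (var + self.eps)
        p = self.mx - self.mg * np.sign(lam) / (np.abs(lam) + self.eps)

        # Attract toward p where curvature is positive, repulse where it is
        # negative, with strength proportional to the modelled distance.
        move = -self.lr * np.sign(lam) * (x - p)
        return theta + self.V.T @ move   # update only inside the k-dim subspace
```

The directions themselves could e.g. be chosen from recent gradient statistics and refreshed slowly; here they are simply fixed for readability, and the remaining part of the parameter space could be handled by a plain first-order update.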

Should we go toward second-order methods for deep learning?

Why is it so difficult to make them more successful than simple first-order methods - could we identify these challenges and resolve them?

As there are many ways to realize second-order methods, which seems the most promising?

Update: an overview of methods, including 2nd-order ones: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
