Papers with Optimization during inference?

Do you folks have any papers that involve “optimization during inference”? (Biologically plausible or otherwise).

I have a couple of ideas here and am curious about relevant work. I recently found this one from OpenAI last year: “Concept Learning with Energy-Based Models”.

Does AlphaZero count? It does tree search every move, is that “optimization during inference”? Or what does “optimization during inference” mean? Is it a synonym of “planning / foresight”, or if not how do they relate?

Maybe this is too broad to be useful? Like asking “Got ML papers that use weights?” 😅
In the video, Yannic commented on the idea at 27:48 https://youtu.be/Cs_j-oNwGgg?t=1668

I am looking for an outer (gradient descent) optimization loop that iterates over data points and backpropagates through an inner optimization loop, where the inner loop takes several small steps to arrive at the prediction for a single data point. Each weight would be used multiple times to predict one data point.

  • HTM, I believe, uses each weight once per data point: NO
  • Standard deep learning: uses each weight only once per data point (though across N layers, so you could argue it counts): NO
  • Energy-based models: can run gradient descent for several iterations just to make one prediction (the above paper does a variation of this): YES
  • Meta-reinforcement learning: there is an outer-outer loop across tasks, with traditional gradient descent inside each task, but no loop smaller than a task: NO
  • “Neural Ordinary Differential Equations” paper: backpropagates through an ODE solver that takes several steps for each data point/prediction: YES
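To make the outer-loop / inner-loop structure above concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration, not taken from any of the papers: the scalar energy E(y) = 0.5 * (y - w*x)^2, the names `inner_loop`, `grad_w` and `train`, and the step sizes. The inner loop runs a few gradient steps on y to make one prediction, reusing the single weight w at every step, and the outer loop is ordinary SGD that backpropagates through those unrolled steps by hand.

```python
# Toy sketch: an outer SGD loop that backpropagates through an inner
# gradient-descent loop. All names and hyperparameters are illustrative.

def inner_loop(w, x, steps=5, alpha=0.3):
    """Predict y by descending the energy E(y) = 0.5 * (y - w*x)**2 from y = 0.

    The same weight w is reused at every inner step.
    """
    y = 0.0
    for _ in range(steps):
        y = y - alpha * (y - w * x)   # dE/dy = y - w*x
    return y

def grad_w(w, x, target, steps=5, alpha=0.3):
    """Gradient of the outer loss 0.5 * (y_K - target)**2 with respect to w,
    computed by walking backwards through the unrolled inner steps.

    Each inner step is linear in y and w, so the reverse pass does not need
    the stored intermediate y values (a general model would need them).
    """
    y_final = inner_loop(w, x, steps, alpha)
    g_y = y_final - target            # dLoss/dy_K
    g_w = 0.0
    for _ in range(steps):            # reverse pass, one inner step at a time
        g_w += g_y * alpha * x        # direct dependence of this step on w
        g_y *= (1.0 - alpha)          # dependence through the previous y
    return g_w

def train(data, w=0.0, lr=0.1, epochs=200):
    """Outer loop: plain SGD over (x, target) pairs."""
    for _ in range(epochs):
        for x, target in data:
            w -= lr * grad_w(w, x, target)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated by target = 2 * x
w = train(data)
```

Note that the learned w ends up at about 2.4 rather than 2: five inner steps only get part of the way to w*x, so the outer loop learns a larger weight to compensate. The weight is being trained for the truncated inner optimizer, not for the energy minimum itself.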

Does AlphaZero count? […] Is it a synonym of “planning / foresight”, or if not how do they relate?

Yes, I think that counts. I’ll take a look, thanks! Would you consider “planning/foresight” a specific modeling technique or just an idea?

@maxerbubba, what you described is still too broad. The whole family of attention-based models (transformers) could be viewed as “SGD outer loop + multistep optimization inner loop”, because they compute a similarity metric against other datapoints and use it, in addition to applying the regular model weights, to produce a prediction. So I guess you could call attention “optimization during inference”.
A slightly different idea is used in Capsule Networks (https://arxiv.org/pdf/1710.09829.pdf), where a small inner optimization loop (routing-by-agreement) runs in addition to SGD to compute the right path between layers for any given datapoint.
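Here is a rough plain-Python sketch of that inner routing loop, as I understand the routing procedure in the linked paper. It is simplified: the prediction vectors `u_hat` are passed in directly rather than produced by learned transformation matrices, and the names `squash`, `softmax` and `route` plus the toy inputs are my own. Note that the loop is an iterative agreement update rather than literal gradient descent.

```python
import math

def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    norm_sq = sum(v * v for v in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * v for v in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is lower capsule i's predicted vector for upper capsule j.

    Returns (upper capsule outputs v, coupling coefficients c).
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]       # routing logits, start uniform
    for _ in range(iterations):
        c = [softmax(row) for row in b]            # each lower capsule splits its vote
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        for i in range(n_in):                      # reward agreement with the output
            for j in range(n_out):
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Three lower capsules, two upper capsules, 2-d vectors (toy values).
# All three agree about upper capsule 0 and disagree about upper capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
```

With these inputs, every lower capsule ends up sending most of its vote to upper capsule 0, whose output vector is also the longer of the two.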