Predictions augmented by past memories of the near future

TL;DR: Instead of predicting the input at the next time step from the recent past context alone, concatenate the past context with future contexts retrieved from an associative memory, using the current context as the search key.

If the TL;DR above didn’t faze you, here are some details.

We have the following elements:

  1. Some sort of algorithmic next-step predictor; let’s use “RNN” as a generic term. It could mean an (H)TM, a generative LLM, an Echo State Network, etc., or even an “actual” RNN, LSTM, GRU, or whatever else.

  2. Some form of associative memory, which has been mentioned pretty often here; it could also be a vector database or vector memory. In general it is a tool capable of performing nearest neighbor search, or approximate k-NN search, over a database of vectors.

The technique described below is about how to use (2), the associative memory, to improve (1), the RNN.
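To pin down what (2) needs to expose, here is a minimal sketch in Python. The class name, the dense float keys and the dot-product similarity are illustrative assumptions, not a specific implementation:

```python
import numpy as np

class AssociativeMemory:
    """Append-only k-NN store: context vectors paired with the time step they were seen at."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.timestamps = []

    def write(self, key, t):
        """Remember that context vector `key` occurred at time step `t`."""
        self.keys = np.vstack([self.keys, key.astype(np.float32)])
        self.timestamps.append(t)

    def query(self, key, k=3):
        """Return the time steps of the k stored keys most similar to `key`."""
        if not self.timestamps:
            return []
        scores = self.keys @ key.astype(np.float32)   # dot-product similarity
        top = np.argsort(scores)[::-1][:k]
        return [self.timestamps[i] for i in top]
```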

All RNNs have a limited-size context which is a summary of the past. It is called the “hidden state” in RNNs, “cell activations” in HTM, or the “context window” in generative transformers.

A common problem across all of them is that in order to improve predictions, the context needs to get bigger, because it has to encode the clues needed to figure out the future. The longer the relevant past, the more clues and the more possible interactions between them, so a “larger” context is needed, and that slows everything down quadratically: an RNN’s weight count is proportional to the square of the hidden state size, and attention cost in transformers has a similar issue.

However, predicting the token at time t becomes much easier if, besides the past context (t-N, …, t-1), we also had the future context (t, …, t+N) available, which sounds kind of absurd since the future context hasn’t happened yet.

And here is where (2), the associative memory, comes to help. The next best thing to knowing the near future is knowing what the futures were in contexts similar to the current one.

In short, use (2) with the current context embedding as a query to retrieve, say, the three “past moments” that look most similar to the current one. Since each retrieved moment comes with an exact pointer in time, we can assemble three condensed “potential future contexts” and feed the predictor network not only the “recent past” context but also a short list of “what the future might look like” contexts.
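Put together, one step of the loop might look like the sketch below; `summarize`, `predictor`, `k` and `horizon` are placeholders for whatever context builder and prediction network are actually used, the point is only the shape of the data flow:

```python
def predict_next(inputs, t, memory, summarize, predictor, k=3, horizon=8):
    """Predict the input at time t from the recent past plus retrieved
    'potential future' contexts (all callables are placeholders)."""
    # 1. Summarize the recent past into a fixed-size context vector.
    past_context = summarize(inputs[max(0, t - horizon):t])

    # 2. Ask the associative memory when the context last looked like this.
    similar_times = memory.query(past_context, k=k)

    # 3. For each similar past moment, summarize what actually followed it.
    potential_futures = [
        summarize(inputs[s:s + horizon])
        for s in similar_times
        if s + horizon <= t          # only futures that have already happened
    ]

    # 4. Predict from the real past plus the retrieved "what the future
    #    might look like" contexts.
    prediction = predictor(past_context, potential_futures)

    # 5. Store the current context so later steps can retrieve this moment.
    memory.write(past_context, t)
    return prediction
```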

Biological plausibility? Well, we do seem to have a means of making longer-term predictions. I don’t think about the next letter I want to type, nor the next word; it’s a larger (yet less exact) idea of what I want to say.
So I don’t know; hypotheses about brains having some sort of associative memory capability keep popping up, and this is just a hint at how we might use it.


I agree. I also happened to brainstorm something like your idea recently. It sounds plausible from a functional standpoint. Not sure about biological plausibility, but I would be surprised if the brain doesn’t have similar capabilities.


The way I see biological plausibility is that one should avoid using it strictly (as a veto criterion) when making engineering choices. One basic and undisputed reason is the difference in hardware substrate. Biological solutions were born not by attending to the holy principle of biological plausibility, but by inventing and testing within the currently available framework.

So any design choice that remotely resembles what we (speculatively) believe biology actually does has a fair enough basis for tinkering with.


For whoever is interested in experimenting with the above idea, I can help with two relatively cheap techniques:

  1. For the “context” I experimented with different types of reservoir networks, some of which have nice, controllable properties. I will have to discuss these separately; the main idea is that, similarly to the Spatial Pooler, which compresses spatial features from a large representation space into a smaller but relevant SDR, a reservoir can be used to compress a recent stream of input SDRs into a “summary” SDR that is both cheap to compute and consistently exhibits correlations over a certain interval (see the reservoir sketch after this list).
    The reason reservoirs stayed out of mainstream DL is that one can’t backpropagate through them, but in HTM there is no such restriction.

  2. For the associative memory I experimented with a so-called SDR ID Map, which can be useful as a short-term memory (STM) over the ~100k most recent states, with a write-or-read speed of roughly 4k ops/sec for 50/4000 SDRs (a rough sketch of such an index follows after this list).
    A couple of reasons it is useful as an STM:

  • Unlike ANN indexing, it simply appends new records to memory; there are no explicit deletes or index rebuilds, and “useful” tokens can easily be refreshed, which increases their retrieval likelihood.
    Yet unlike fixed context sizes, refreshed tokens can linger in short-term memory for much longer than a couple hundred thousand time steps.
  • Unlike a vector database (which would also support random reads/updates), it is significantly faster.
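For point 1, here is a minimal sketch of the reservoir idea in the echo-state style: a fixed, untrained random network that leakily integrates a stream of input SDRs, with the top-k most active units read out as the summary SDR. The sizes, leak rate and scaling below are illustrative, not necessarily what was used in the experiments mentioned above:

```python
import numpy as np

class SDRReservoir:
    """Fixed random (untrained) leaky reservoir that compresses a stream of
    input SDRs into a single summary SDR."""
    def __init__(self, input_size=2048, reservoir_size=4000, active_bits=50,
                 leak=0.9, seed=0):
        rng = np.random.default_rng(seed)
        # Random projections, fixed at construction time and never trained.
        self.w_in = rng.standard_normal((reservoir_size, input_size)).astype(np.float32) * 0.1
        # Recurrent weights scaled so the spectral radius is roughly 0.9
        # (the usual echo-state condition for a fading memory of the input stream).
        self.w_rec = (rng.standard_normal((reservoir_size, reservoir_size)).astype(np.float32)
                      * 0.9 / np.sqrt(reservoir_size))
        self.leak = leak
        self.active_bits = active_bits
        self.state = np.zeros(reservoir_size, dtype=np.float32)

    def step(self, sdr_indices):
        """Feed one input SDR, given as the indices of its active bits."""
        x = np.zeros(self.w_in.shape[1], dtype=np.float32)
        x[sdr_indices] = 1.0
        pre = self.w_in @ x + self.w_rec @ np.tanh(self.state)
        # Leaky integration: the state is a decaying summary of recent inputs.
        self.state = self.leak * self.state + (1.0 - self.leak) * pre

    def summary_sdr(self):
        """Current summary as an SDR: indices of the top-k most active units."""
        return np.sort(np.argsort(self.state)[-self.active_bits:])
```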
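And for point 2, a rough sketch of an append-only SDR index: one posting list per bit, retrieval by overlap counting. This is only a guess at the general shape of such a structure, not the actual SDR ID Map implementation:

```python
import numpy as np

class SimpleSDRIndex:
    """Append-only inverted index over SDRs: no deletes, no rebuilds."""
    def __init__(self):
        self.postings = {}   # bit index -> list of record ids containing that bit
        self.count = 0       # number of stored SDRs (record ids are 0..count-1)

    def write(self, sdr_indices):
        """Append an SDR (iterable of active bit indices); returns its record id."""
        rec_id = self.count
        for b in sdr_indices:
            self.postings.setdefault(b, []).append(rec_id)
        self.count += 1
        return rec_id

    def query(self, sdr_indices, k=3):
        """Return the ids of the k stored SDRs with the largest bit overlap."""
        if self.count == 0:
            return []
        overlap = np.zeros(self.count, dtype=np.int32)
        for b in sdr_indices:
            for rec in self.postings.get(b, ()):
                overlap[rec] += 1
        return np.argsort(overlap)[::-1][:k].tolist()
```

Refreshing a “useful” state could be approximated here by simply writing its SDR again under a new id; whether that matches the actual refresh mechanics is a guess.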