Associative Memory via Predictive Coding

I read this paper on associative memory where they obtained impressive results by using predictive coding. Apparently the mechanism is biologically plausible.
https://arxiv.org/abs/2109.08063v1
I have to say they explain themselves rather poorly, and it is difficult to grasp entirely what they are doing.
Anyway, it seems they have gone further than I managed to get with locality sensitive hashing based associative memory, where you use a random projection followed by a non-linear function (e.g. binarization) to produce approximately linearly independent vectors whose element values are densely distributed in magnitude terms, followed by a weighted-sum readout layer.
Linear independence allows maximum storage capacity in weighted sums.
A non-sparse distribution allows the lowest-noise recall (see the variance equation for linear combinations of random variables).
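For reference, the variance equation in question, for uncorrelated random variables $X_i$ combined with weights $w_i$:

$$\operatorname{Var}\Big(\sum_i w_i X_i\Big) = \sum_i w_i^2 \operatorname{Var}(X_i)$$

With dense ±1 hash outputs, the dot-product signal for a stored pattern grows with the full dimension $d$, while by the central limit theorem the accumulated cross-talk noise grows only like $\sqrt{d}$.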
Example code:
https://editor.p5js.org/siobhan.491/sketches/k7UePTA4H
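To make the scheme concrete, here is a minimal numpy sketch of what I mean (not the paper's method; the sizes and the least-squares readout training are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                                  # dimension, number of stored pairs

P = rng.standard_normal((d, d)) / np.sqrt(d)   # fixed random projection

def lsh(x):
    return np.sign(P @ x)                      # -1/+1 binarization -> dense hash

keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
H = np.array([lsh(k) for k in keys])           # hashed keys, ~linearly independent

W = np.linalg.lstsq(H, values, rcond=None)[0]  # weighted-sum readout layer

print(np.allclose(lsh(keys[3]) @ W, values[3]))  # exact recall while n < d -> True
```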
If I can understand the paper better perhaps I can improve my code. :honeybee:

3 Likes

Consider emailing the first author, he's a grad student and would probably love to discuss it with you (Tommaso Salvatori: tommaso.salvatori@cs.ox.ac.uk).

1 Like

Perhaps, perhaps, perhaps. Like the song.
Anyway, it jogged me into thinking up a couple of new ideas for random-projection-based associative memory.
A: The Walsh Hadamard transform leaves vector length unchanged, and you can use that to avoid setting a learning rate in certain circumstances (see the sketch after point B). You can train at full power, though you may not always want to, for example if you have noisy data.
B: In certain arrangements you only need to do one cheap random projection, which you can reuse many times. Then the speed of operation, when coded properly, is dominated by system memory bandwidth.
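A quick sketch of both points (my own toy code, with assumed sizes): the orthonormal-scaled fast Walsh Hadamard transform leaves vector length unchanged, and a fixed random sign flip in front of it gives the kind of cheap, endlessly reusable random projection I mean in B:

```python
import numpy as np

def fwht(x):
    # Fast Walsh Hadamard transform; len(x) must be a power of 2.
    x = x.astype(float)
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)      # 1/sqrt(n) scaling keeps the vector length unchanged

rng = np.random.default_rng(0)
flip = rng.choice([-1.0, 1.0], 64)   # one fixed random sign pattern, reused forever

def project(v):
    return fwht(flip * v)            # cheap random projection, O(n log n), no big matrix

v = rng.standard_normal(64)
print(np.linalg.norm(v), np.linalg.norm(project(v)))   # the norms match
```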

It comes with some size restrictions etc., but nothing really terrible.
There probably are some other negative aspects, but I haven't thought about that fully yet, since I've just written the code:

Mouse Click to select.
Press 1 to start and stop training.
Press 0 to start again.
Edit:
https://editor.p5js.org/seanhaddps/sketches/N2ZssccE3
View:
https://preview.p5js.org/seanhaddps/present/N2ZssccE3

I think the main use of associative memory might be to give a controller neural network an external memory bank allowing it to store and retrieve information at will.
Of course, learning how to use memory is a very difficult proposition for a neural network. Traditional RAM presents impossible needle-in-a-haystack problems of relocating what was written before. So the tolerance for addressing errors shown by associative memory makes it seem like a good option.
https://deepmind.com/research/publications/2019/neural-turing-machines

Metaphor.

'Kay. :pineapple:

I think of classical associative memory as pattern completion; part is provided as the key and the final product is the response. This is a simple spatial process. The input and output are all part of the same neural network system - there is no separate AM sub-system; it is built right into the fabric of the NN system.

When you extend this pattern matching/completion to include temporal content you have predictive coding.

The degree of temporal extent can vary. HTM matches to a single time step, as defined by the update cycle of the system. This HTM prediction is updated at each time step and can include many possible pattern matches. The failure to match any of the stored patterns is considered a valuable indicator of surprise and, IMHO, has been the area of greatest utility of HTM so far.

A pure Markov chain can predict much further into the future. Text auto-completion would be a good example of this in action. The pattern completes to several possible futures, which are offered as the system output. Generally, the tree of offerings is pruned based on some sort of statistical weighting. The pattern completion may become more specific as more of the input pattern is offered.
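As a toy illustration of that auto-completion idea (my own sketch, not any particular production system):

```python
from collections import defaultdict, Counter

# First-order Markov chain over words: store transition counts, then
# complete a prompt by repeatedly taking the most likely next word
# (i.e. the tree of offerings pruned to its top-weighted branch).
def train(tokens):
    model = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        model[a][b] += 1
    return model

def complete(model, word, steps=3):
    out = [word]
    for _ in range(steps):
        nxt = model.get(out[-1])
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return out

m = train("the cat sat on the mat the cat ran".split())
print(complete(m, "the"))   # -> ['the', 'cat', 'sat', 'on']
```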

3 Likes

If you do the correct pre-processing, weighted sums can be used in error-correction mode, which allows approximate pattern completion.
The requirements are non-sparse, linearly independent vector inputs to the weighted sums, and that you store fewer than the maximum possible number of associations. The variance equation for linear combinations of random variables shows under which circumstances weighted sums give a noise-contraction response.
If you use weighted sums to make a vector-to-vector associative memory, the error correction provided is biased toward the weight vectors of the weighted sums, which makes things a little tricky.
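A small numeric illustration of the noise-contraction claim, under my own choice of setup (auto-associative storage, dense ±1 keys, a least-squares readout, and far fewer patterns than dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 20                                    # well under the capacity limit
keys = np.sign(rng.standard_normal((n, d)))       # dense -1/+1, ~linearly independent
W = np.linalg.lstsq(keys, keys, rcond=None)[0]    # auto-associative readout weights

flips = rng.random(d) < 0.15                      # corrupt 15% of one key's elements
noisy = np.where(flips, -keys[0], keys[0])

recalled = np.sign(noisy @ W)                     # weighted sums, then binarization
print("errors in: ", int(flips.sum()))                    # ~38 of 256
print("errors out:", int((recalled != keys[0]).sum()))    # typically 0 here
```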

Are there any previous attempts at an associative memory, or (even better) a very large database for SDRs? @SeanOConnor - for the associative memory you're working on, do you have some performance vs. capacity metrics you aim for? I mean in terms of how many patterns it can store and how many queries it can answer per second?

I have this feeling that an associative memory should scale very well with storage, which is cheaper and more power-efficient than scaling with computing power, which is now the trend with transformers.

As some research suggests there is some functional equivalence between associative memory and attention, one question that arises is: what is the price/performance tradeoff?

Just as a back-of-the-envelope estimate, a cluster of 100 x 1TByte SSD drives (100TB in total, i.e. 100T parameters at one byte each) would arguably match the brain's capacity of storing 100T parameters, while the largest transformer models struggle with two orders of magnitude fewer.

Of course there remain lots of open questions, like

  • how would one index/optimize the data in order to maximize query performance
  • what performance would one get from e.g. a 100-node cloud; would that be sufficient to…?
  • then what? How would one design, then implement, "intelligence" on top of a sufficiently large associative memory database?

Sorry for the long message - Microsoft's new 500B-parameter model is trained on a ~$100M machine (560 DGX A100 nodes priced at $199k each).

I don't have the resources to hyper-scale the associative memory I experiment with. However, you could scale up using SSD drives, with locality sensitive hashing to select between blocks of weight parameters. There are many possible ways of arranging that weight-parameter selection.
Anyway I have improved the base algorithm I am currently using.
Reddit comments I made:
"This type of associative memory is like a single fully connected layer using ReLU.

Except the concept of ReLU has been extended from a function to a switching scheme.

It uses double the normal number of weights.

The training algorithm is extremely simple but hardly known I would say.

Edit: https://editor.p5js.org/seanhaddps/sketches/TFR8xcQ6y

View: https://preview.p5js.org/seanhaddps/present/TFR8xcQ6y "
And:
" If you take out the random projection and the vector length (magnitude) normalization then you have an interesting ā€˜Beyond ReLUā€™ neural network layer. The order though is non-linear function (or actually scheme) and then weights.

It also meshes well with CPU cache and memory pre-fetch behavior."

It is the vector length normalization that allows the especially simple weight updates that only involve adding or subtracting error terms from the weights.
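As a sketch of what I mean (shapes and details simplified for illustration; this is not the exact linked code):

```python
import numpy as np

# Each input element switches between one of two weight rows by its sign
# ("ReLU as a switching scheme"), so the layer holds double the usual
# number of weights. With the input normalized to unit length, the update
# w_i += x_i * error moves the output exactly onto the target in one step
# (since the sum of x_i^2 is 1), so no learning rate is needed.
rng = np.random.default_rng(0)
d_in, d_out = 32, 8
w_pos = np.zeros((d_in, d_out))        # weights used where x_i >= 0
w_neg = np.zeros((d_in, d_out))        # weights used where x_i < 0

def forward(x):
    w = np.where(x[:, None] >= 0, w_pos, w_neg)   # per-element weight selection
    return x @ w

def train_step(x, target):
    x = x / np.linalg.norm(x)          # vector length normalization
    mask = x >= 0
    err = target - forward(x)
    upd = np.outer(x, err)             # add/subtract scaled error terms
    w_pos[mask] += upd[mask]
    w_neg[~mask] += upd[~mask]

x = rng.standard_normal(d_in)
t = rng.standard_normal(d_out)
train_step(x, t)
print(np.allclose(forward(x / np.linalg.norm(x)), t))   # one-step exact fit -> True
```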

1 Like

SSDs can cope with around 10-100k IOPS with a 4k page size (excess IO = wasted bandwidth), so the whole issue of scale is more to do with random-IO bandwidth retrieval efficiency (hit ratio), not processing or data volume per se. 1,000 SSDs, even at 100k IOPS, is still potentially only 100m IO/second with no caching, and maybe 1-10bn with a high hit ratio and local caching. That 100m IO is also 400GB/sec AND assumes read-only (no saving state or altering parameters with learning).

I have implemented an experimental associative memory in the form of a sort of rolled-out RNN. The last run was with around 8bn elements (Wikipedia and another data set), which represents a few million inputs and about 20bn parameters - effectively a super-sparse RNN. This takes up about 400GB spread across 18 servers (non-GPU) on a 1+20gb network. The performance is only about 5bn updates (read + write word associates) per second, but at typical timing for speech (word-sparsity factoring) it works in real time OK. The system can scale up just by adding plain cheap nodes - no GPUs involved. This also uses f32 math, with no bit-pattern losses or approximations.

This is also based on machines that are 10 years old, so I could easily hardware-scale this up by around 100x, but I think (maybe foolishly) that I have enough resources at the moment to get somewhere reasonable. Efficiency vs. chance. I also don't think we need anywhere near 100T parameters (for a comparable IQ result), because computers can implement what biology can't, or is inefficient at representing.

Giving an AGI a memory dump of the internet (100's of T parameters) without any "structure" will result in a whole heap of trouble, and would also be completely traumatic if constrained to the equivalent of a human brain's worth of cortical columns. "structure" in this regard is a whole separate topic.

The 18-machine cluster I have has an equivalent memory bandwidth of just over 1TB/sec, but that is also at 64-bit (more efficient on sparse data).

That said, my experiments are still in caveman land, as I have realised the sparse memory (long-term memory) might/actually needs hippocampal pre-processing for it to make sense and work in conjunction with an HTM structure in a fully scalable manner, before predictive feedback/response with the PFC. The surprise factor in learning is different for a computer (more efficient and accurate) compared to the biological process, which has a probabilistic chance of requiring a pre-existing connection capability (sleep is a separate temporal process). This inherently creates a different bias to that in a human.

Incrementally, our long-term memory remembers more complex concepts rather than the details of the memory every time. This is where HTM plays a feedback role in long-term associative memory structure. The two are different but tied together via the hippocampus.

We see something strange and remember a whole bunch of weird stuff, and then when someone tells us what it was we go "ah, so that's a xxxx" and from then on just remember the xxxx. We still remember the first random set of stuff but no longer see it in the future, just the xxxx. This linkage and structuring is where the hippocampus plays part of its job.

The referenced paper would seem to be more of a plain, isolated associative memory. Good for some tasks and a bit of a step in the right direction, but not necessarily where I think we need to be for AGI.

1 Like

This all makes me want to mess around with drive controllers and their firmware…

Imagine that we could make our input space fit into a single page of memory, then distribute that to arrays of drives with custom firmware. The pages on those drives would be bitmap overlaps to the input space.

CPU would send input encoding bits to the drives with a custom instruction/protocol.

If someone were to hack a drive to do the following, that would be neat:

  1. receive a page of bits (hopefully stored in fast cache/RAM)
  2. compare (bitwise-AND) pages across the drive (potentially in parallel across different flash chips in the drive), receiving overlap scores + page ID

I imagine scaling this basic concept up would allow for the formation of all the data structures and basic operations of HTM, so that minimal management would be required from the actual CPUs except for the final classification/regression step, as well as random administrative tasks.
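A toy software model of what each drive would compute (names and sizes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
PAGE_BITS = 4096 * 8                             # one 4K page as a bit vector
pages = rng.random((1000, PAGE_BITS)) < 0.02     # stored sparse bitmaps

query = pages[42] & (rng.random(PAGE_BITS) < 0.8)   # partial/noisy query page

scores = (pages & query).sum(axis=1)             # bitwise-AND overlap per page
best = int(np.argmax(scores))
print(best, int(scores[best]))                   # page 42 wins on overlap score
```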

Just a random thought.

1 Like

Wow, this gets interesting, thanks

I didn't mention it, but what I understand as an associative memory for SDRs would be able to:

  • Recall a previously entered SDR when the same SDR is used for query (obviously)
  • As "associative" implies, even when queried with a significantly smaller subset of a previously learned SDR, it should respond with the right match. That is great, yet potentially… overwhelming, e.g. when querying the database with the SDR encoding the word "the" itself.
  • What would be even greater - when queried with a superposition of several patterns, to respond with each of those individual patterns.
  • I wouldn't mind if "performance" in recalling a particular pattern increased with repeated exposure to that pattern and recency of exposure. That would make it biologically realistic - not that biologism is a virtue in itself, but that's what I would expect from an intelligent peer: to prioritize both recent and thoroughly studied old patterns.

@SeanOConnor I am trying to understand your code. It seems simple enough to attempt to translate into python, but I don't even get what I am expected to do and what I am seeing. Just an explanatory story, besides what to do with the mouse and which keys to press, would help a lot.
Trying to figure out the code… you might want to consider python/numpy for this kind of experiment; it is great at simplifying the nitty-gritty array/vector operations.

@BrainVx your story is awesome. Yeah, maybe SSDs aren't the right choice, unless… I don't know… either it operates with large, dense patterns, or, more likely, there is some extra "code" or info/structure behind each learned pattern, e.g. lists of links to "previous frequent contexts". But for a cluster of machines implementing an associative memory engine only, SSDs are probably both excessively large and excessively slow.

@MaxLee that would be cool, but designing new hardware is way over my head. Plus, just scanning the whole SSD for each query won't work; that's what indexes are for (think of search engines).
But the idea of pairing lots of flash/nvme "cells" with many specialised compute units on a chip might have some (possibly huge) potential. Like some sort of GPU/TPU with a high-bandwidth, few-TByte nvme inside, so it can run at least inference on GPT3-scale models? That would be cool. I bet someone filed a patent on that. Or not :slight_smile:

No need to design new hardware; I'm thinking of just taking off-the-shelf hardware and loading my own code onto its internal controller.

Implementing your own firmware for a controller still ends up constrained by the individual drive's bandwidth, and would only give a relatively small performance gain in the context of bit processing on large volumes of data. Logic functions are one-cycle operations for CPUs and generally have to wait for the data to arrive, so the CPU would sit idle most of the time, especially with a high thread count. Far cheaper and quicker to implement is to just stack up loads of cheap kit with a few 10's of GB each and UDP tasks on a 1gb link (on the same switch backbone to prevent packet loss). 48 nodes, then stack switches to get to 96 or more nodes on a packet-loss-free basis. 100 nodes with 30GB of data each can be bit-scanned in half a second - 6TB/sec for the cluster with old kit, plug and play.

If you have the ability to write firmware for a controller, then you should really be looking at FPGAs, as these can provide some really esoteric options with much, much higher bandwidth and parallel, clock-synchronous, multi-stage bus processing similar to a GPU's performance.

Writing firmware for someone else's hardware is a painful, torturous and long process, especially when you discover weird timing bugs in the hardware implementation or other design flaws. Hardware design flaws are only corrected if the original firmware finds them or, in the case of Intel, when some hacker discovers a flaw.

Recall bias (priming) for associative memory requires a temporal state in the memory, whether the specific memory is recalled or not (concept priming), and it also has an effect on memories in close temporal proximity (temporal priming). This is where I realised a lot of the memory needs to be read and then written back, which is where SSDs just don't have the write endurance (IOPS aside).

To me the word "the" is irrelevant to all occurrences (no need to search the whole memory), and that is why any associative memory needs a temporal biasing element. "the" basically says give priority / more weight to temporal recall precedence.

I also think/wonder if the word "the" and some others may only exist in utterance form just to expand the compressed memory representation/structure, in order to allow effective communication (mirror activation) with another brain. The brain in theory could create synapse links that remove the need for "the" and just create an access ambiguity to a particular memory, which triggers "the" to represent past and present at the same time.

To me it makes sense when trying to fit the hippocampus, HTM, PFC and long-term memory processes all together as a whole rather than in isolation.

Maybe I have lost the plot… lol.

I guess this is the part that I'd try to move to the firmware itself, rather than shipping it back to the CPU to execute… basically, each drive would be doing its own bitwise-AND operations, only reporting back the winning page locations. This would be distributed computing where some basic computation is brought as close to the storage medium as we can get it, in a potentially highly scalable (inexpensive) Just-a-Bunch-Of-Disks setup.

Absolutely. I've been there and done that… the tradeoff is that I'd rather use their silicon and hardware if I can, rather than become a full device manufacturer. The firmware would require time as capital, while the latter would require large amounts of cold hard cash… I'd rather write firmware and make naive, potentially slower-than-needed timing assumptions. It's also not impossible to read existing firmware and reverse-engineer some of the more difficult timing aspects.

I did some work with this a couple of years back, but again, scalability and expense become an issue. FPGAs are neat for proving concepts and creating custom esoteric functions in hardware, but they're also expensive, power-hungry little shits that come with all their own issues (just thinking about their licensing issues makes me sick). If I wanted scalable, I'd prefer to use very likely the same chip (with minor revisions over time) already inside a drive, one that has been battle-tested by some large drive OEM in millions of devices.

I'd be okay with an idle CPU, as keeping it 'busy' wouldn't be the goal. Instead I'd be going for 10's of drives, each hosting thousands of columns, each doing lookup comparisons and minor updates to their bitmaps as directed by the CPU's cores… I argue this would actually be more in line with what happens in our own brains, with their distributed parallel processing.

1 Like

That's a highly refined skill you describe. The only problem with the approach is that it demands the controller scan a query pattern across the entire content of the drive. It makes no sense to make internal SSD bandwidth orders of magnitude larger than the already pretty fast 1-2GByte/s that new M2 SSDs achieve, so any single query demanding a linear scan of the entire drive would last a couple of minutes.

That's what indexing schemes are for - to avoid searching by comparing every block on the drive with the requested pattern.

But hope is not lost: if the controller you can write firmware for resembles a generic 32-bit processor with a few hundred kilobytes of RAM to spare, then it could implement (the heaviest part of) the indexed search algorithm itself.
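A minimal sketch of that kind of index for sparse bit patterns (my own illustration): file each pattern under every one of its active bits, so a query only touches the buckets for its own bits instead of scanning everything:

```python
from collections import defaultdict, Counter

index = defaultdict(set)           # bit position -> ids of patterns containing it
patterns = {0: {3, 17, 99}, 1: {3, 42, 512}, 2: {7, 17, 512}}
for pid, bits in patterns.items():
    for b in bits:
        index[b].add(pid)

def query(bits):
    votes = Counter()
    for b in bits:                 # visit only the buckets for the query's bits
        votes.update(index[b])
    return votes.most_common(3)    # candidate ids ranked by overlap

print(query({3, 17}))              # -> [(0, 2), (1, 1), (2, 1)]
```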

2 Likes

Yes! Such a mechanism exists in the presynaptic terminals:

Synaptic Theory of Working Memory
Gianluigi Mongillo et al.
Science 319, 1543 (2008). DOI: 10.1126/science.1150769

I'm kind of retired from the subject; I'm not in explain mode. However, I do have a 40-page booklet on the fast Walsh Hadamard transform (=$) where some of the details are gone into.
I think it's probably best to look at a simpler associative memory system first: one where you take a locality sensitive hash of the input data, ideally using -1,+1 binarization of the hash outputs, and then add a linear readout layer.
What guarantees of linear independence is the LSH giving you? (= capacity)
What is the variance equation for linear combinations of random variables telling you about the level of recall noise?
The central limit theorem is also involved.
As to the code I gave above, I haven't really quantified how well it does; it's just something that occurred to me while looking at neural networks that extend the concept of ReLU a bit, and it seems like a computational shortcut, but who knows?
I have low personal motivation to look into it further. However, I would like to find some type of memory system or algorithm that can sensibly augment artificial neural networks.