I wrote this about a weight matrix being a simple linear associative memory, with ReLU functions acting as gates, resulting in context-based sub-selection of the weights in the weight matrix.
So then you are really getting soft context selection if the width of the neural network is large enough to allow statistics to take over.
And the context to select can be learned by the prior layer.
https://archive.org/details/gated-linear-associative-memory
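The gating view can be sketched directly: the sign pattern of the pre-activations acts as the context, and that context sub-selects rows of the weight matrix. A minimal sketch (sizes are arbitrary, and the names `gate` and `W_selected` are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # weight matrix = linear associative memory
x = rng.standard_normal(8)

# Standard ReLU layer
h = np.maximum(W @ x, 0.0)

# Same computation viewed as gating: the sign pattern of Wx is the "context",
# and it selects a sub-matrix of W (rows zeroed where the gate is closed).
gate = (W @ x > 0).astype(float)   # context bits, one per neuron
W_selected = W * gate[:, None]     # context-based sub-selection of rows
h_gated = W_selected @ x

assert np.allclose(h, h_gated)
```

The point of the identity is that each distinct gate pattern defines its own linear map, so a ReLU layer is a context-indexed family of linear associative memories.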
So what happens in a deep neural network then?
https://archive.org/details/extending-associative-memory-with-gating-in-deep-networks
While true, it should have some raw performance and capacity tests.
Capacity: how many patterns can a given “memory shape” store before it overflows?
Performance: how long does it take to “memorize” them?
One rudimentary test would be to force an MLP to learn a random (X, Y) dataset and see at how many points it is no longer able to overfit. Overfitting in this context is “good”, because it means the network can recall every learned pattern.
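A minimal sketch of that capacity test, with one simplifying assumption: instead of full backprop training, the hidden layer is frozen random ReLU features and only the readout is fit by least squares (the linear-associative-memory part). All sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def memorization_error(n_points, dim=8, width=64):
    """Random ReLU features + least-squares readout: training MSE on
    n_points random (X, Y) pairs, i.e. can the net memorize them?"""
    X = rng.standard_normal((n_points, dim))
    Y = rng.standard_normal(n_points)
    W1 = rng.standard_normal((dim, width)) / np.sqrt(dim)
    H = np.maximum(X @ W1, 0.0)                     # frozen random ReLU features
    w2, *_ = np.linalg.lstsq(H, Y, rcond=None)      # fit the readout exactly
    return float(np.mean((H @ w2 - Y) ** 2))

# Below capacity (fewer points than hidden units) it overfits perfectly ...
mse_below = memorization_error(32)
# ... while far above capacity it cannot drive the training error to zero.
mse_above = memorization_error(512)
```

Sweeping `n_points` upward and watching where the training error lifts off zero gives the overflow point; wall-clock time per fit gives the crude performance number.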
You do have to look into the details of simple linear associative memory. There are a lot of edge cases (e.g. linear independence) that are messy to investigate.
People then tend to back away from getting clarity.
For real-world data a lot of the edge cases simply don’t occur. You can pragmatically say: I don’t have to deal with this edge case, or it is statistically rare.
Anyway, if you say gating is context, then the state space of possible contexts is massive. For a network of width 256 the context space is 2²⁵⁶. Obviously there are not going to be 2²⁵⁶ training examples, so only some of the possible contexts will occur. And you would expect clustering of those contexts (Hamming-distance-wise.)
I’ll have a think about it.
I think interesting results can be obtained by using the negative of a median threshold as the biases in the classical X.dot(weights) + biases. Weights is a random matrix, so the above performs a random projection. Each output value splits the inputs in half (a 0-or-1 bit). I did this with MNIST with only 10 bits of output (a 784×10 random-matrix dot product + a median split for each of the 10 resulting values), which produced a 10-bit number as a result, meaning there are 1024 possible bins. Instead of spreading the 70000 digits into a uniform ~70 digits/bin, they’re very imbalanced: the least populated bin contains only 4 weird-looking digits, while the most populated ones collected hundreds of digits with a strong bias towards a very specific shape. The intuition is that chaining the same simple split might lead to a very cheap perceptive memory. But not by chaining another block for depth, which blindly adds layer after layer (as in NNs), but by branching.
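A self-contained sketch of that median-threshold binning. Assumption: MNIST is replaced by a synthetic clustered dataset (a mixture of Gaussians standing in for digit classes) so the example needs no download; the bin imbalance effect is the same in kind, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

dim, n_bits = 32, 10
# 8 clusters standing in for digit classes, 500 points each
centers = rng.standard_normal((8, dim)) * 3.0
X = np.concatenate([c + rng.standard_normal((500, dim)) for c in centers])

W = rng.standard_normal((dim, n_bits))   # random projection matrix
proj = X @ W
biases = -np.median(proj, axis=0)        # negative of the per-output median
bits = (proj + biases > 0).astype(int)   # each output becomes one bit

# Pack the 10 bits into a bin index (0..1023) and count bin populations
bins = bits @ (1 << np.arange(n_bits))
counts = np.bincount(bins, minlength=1 << n_bits)

# Each bit individually splits the data in half, but jointly the bins are
# far from uniform for clustered data: a few bins collect hundreds of points.
```

Printing `counts[counts > 0]` shows the imbalance: most bins are empty or near-empty while a handful dominate, one per dense region of the input.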
ReLU is a poor stepping stone to extending weight selection (gating) ideas in neural networks.
Two-sided ReLU, where there are 2 outputs, one active for negative inputs and one active for positive inputs, is a better basic concept. That means doubling the number of weights in the next layer.
Then you can think about replacing, say, a group of 4 (two-sided) ReLUs with a pool of 16 blocks of 4 weights, and using the 4 bits of ReLU state to select between the 16 blocks instead.
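A sketch of both pieces. One assumption flagged up front: the source doesn't say exactly how a selected block of 4 weights is applied, so applying it elementwise to the group's pre-activations is one plausible reading, not the author's definitive construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def two_sided_relu(z):
    # Two outputs per input: one active for positive, one for negative inputs.
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

z = rng.standard_normal(4)        # a group of 4 pre-activations
h = two_sided_relu(z)             # 8 outputs -> doubles next-layer weight count

# Block-selection variant: the 4 sign bits of the group index into a
# pool of 16 blocks of 4 weights (elementwise application is assumed).
bits = (z > 0).astype(int)
index = int(bits @ (1 << np.arange(4)))   # 0..15
pool = rng.standard_normal((16, 4))       # 16 candidate blocks of 4 weights
out = pool[index] * z                     # apply the selected block
```

Note that exactly one output of each two-sided pair is active, so no information about the sign is discarded the way plain ReLU discards it.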
You are right about the initial random-projection behavior, and a binarized random projection is a locality-sensitive hash, where a small change in the input only changes a few bits of the output.
If you use that for gating a weight matrix in the next layer, what you have is a hash table that looks up a linear association. Each gating state looks up a different linear association (a synthetic matrix, or constructed matrix.) And nearby gating states would tend to give similar synthetic matrices (the locality-sensitive aspect.)
There is a whole lot of weight sharing between the different synthetic matrices which becomes apparent during training and has to be resolved during training.
That could be good for generalization because the training algorithm has to pick out what consistently makes sense rather than simple memorization.
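Both the hash-table-of-linear-associations picture and the weight sharing between nearby synthetic matrices can be seen in a few lines (a minimal sketch; the split into a separate hashing matrix `W_hash` and gated matrix `W` is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
W_hash = rng.standard_normal((d, d))   # random projection used as the LSH
W = rng.standard_normal((d, d))        # weight matrix being gated

def synthetic_matrix(x):
    # Gating state = binarized random projection (LSH) of the input;
    # it sub-selects rows of W, yielding a "synthetic" matrix.
    bits = W_hash @ x > 0
    return W * bits[:, None], bits

x = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # small input change

M1, b1 = synthetic_matrix(x)
M2, b2 = synthetic_matrix(x_near)

hamming = int(np.sum(b1 != b2))   # few bits flip: locality sensitivity
shared = int(np.sum(b1 == b2))    # rows shared between the two matrices
```

The rows where the gating bits agree are literally the same weights in both synthetic matrices, which is the weight sharing that training has to resolve.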
This viewpoint allows you to construct a very wide variety of neural network variants.
You could for example have a chain of weight matrices Wₛₙ…Wₛ₂Wₛ₁x, where each sᵢ is a choice between one of 2 matrices based on a locality-sensitive hash of that matrix’s input.
I’m not saying that would be a good neural network; however, it would undoubtedly be trainable, and it would be one of the most basic neural networks you could create.
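The forward pass of that chain is tiny. A hedged sketch, with one assumption: each layer's 1-bit hash is taken as the sign of a single random hyperplane projection of that layer's input:

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth = 16, 4
# Per step: a pair of candidate matrices and one hashing hyperplane
pairs = rng.standard_normal((depth, 2, d, d)) / np.sqrt(d)
hash_vecs = rng.standard_normal((depth, d))

def forward(x):
    choices = []
    for i in range(depth):
        s = int(hash_vecs[i] @ x > 0)   # s_i: 1-bit LSH of this matrix's input
        x = pairs[i, s] @ x             # apply W_{s_i}
        choices.append(s)
    return x, choices

x = rng.standard_normal(d)
y, choices = forward(x)
```

With n steps the chain realizes one of 2ⁿ possible matrix products, selected input-dependently, which is the sense in which it is a (very coarse) synthetic-matrix network.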
Maybe this is more readable:
https://archive.org/details/report-on-gating-based-extensions-of-re-lu-neural-networks
A seemingly important aspect is reinforcement of gating decisions:
https://archive.org/details/self-reinforcing-gating-via-directional-alignment
I think whatever ideas you have for extending neural networks have to fit in with that.
Basically gated lookup of a linear mapping is indirection. And the gating aspect induces a certain amount of quantization in the possible mappings. There are other schemes you could think of where that lookup could be much finer grained. I’ll have a think about it.
Basically, by indirection I mean you are looking up something and then multiplying something else by that something.
Ie. Looking up a weight matrix and then multiplying the layer input vector by the weight matrix. Though the two things are merged together in a ReLU layer.
Conceptually you can separate them though.
It sounds like you could replace the roughly quantized gating based lookup with actual associative memory (AM) lookup of a weight matrix. That would give very fine quantization.
I’m not sure in practice you would need such fine quantization.
The other possible advantage of AM is that you can build very fast, very high capacity AM with various tricks.
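One way to see the fine-quantization point: replace the hard hash lookup with a soft associative recall, where the recalled weight matrix is a similarity-weighted blend of stored matrices. This is a sketch under assumptions (softmax-style recall, arbitrary sizes, an illustrative sharpness parameter `beta`), not a claim about which AM scheme would actually be used:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_keys = 8, 32
keys = rng.standard_normal((n_keys, d))                  # stored keys
mats = rng.standard_normal((n_keys, d, d)) / np.sqrt(d)  # stored weight matrices

def am_lookup(x, beta=4.0):
    # Soft associative recall: similarity-weighted blend of the stored
    # matrices. The recalled matrix varies continuously with x, i.e. the
    # quantization of the mapping is as fine as you like.
    scores = beta * (keys @ x)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.tensordot(w, mats, axes=1)   # blended (d, d) matrix

x = rng.standard_normal(d)
M = am_lookup(x)
y = M @ x
```

Raising `beta` hardens the recall back toward the nearest stored matrix, so the gated-lookup scheme sits at one end of a spectrum this parameter traces out.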
I’ll look into it.
I created this introduction type thing:
https://archive.org/details/synthetic-matrix-neural-network