I wrote this about a weight matrix being a simple linear associative memory, with ReLU functions acting as gates, resulting in context-based sub-selection of the weights in the weight matrix.
So then you are really getting soft context selection if the width of the neural network is large enough to allow statistics to take over.
And the context to select can be learned by the prior layer.
https://archive.org/details/gated-linear-associative-memory
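The gating view can be sketched directly: the sign pattern of the pre-activations acts as the context, and that context sub-selects rows of the weight matrix. A minimal sketch (sizes are arbitrary, and the names `gate` and `W_selected` are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # weight matrix = linear associative memory
x = rng.standard_normal(8)

# Standard ReLU layer
h = np.maximum(W @ x, 0.0)

# Same computation viewed as gating: the sign pattern of Wx is the "context",
# and it selects a sub-matrix of W (rows zeroed where the gate is closed).
gate = (W @ x > 0).astype(float)   # context bits, one per neuron
W_selected = W * gate[:, None]     # context-based sub-selection of rows
h_gated = W_selected @ x

assert np.allclose(h, h_gated)
```

The point of the identity is that each distinct gate pattern defines its own linear map, so a ReLU layer is a context-indexed family of linear associative memories.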
So what happens in a deep neural network then?
https://archive.org/details/extending-associative-memory-with-gating-in-deep-networks
While true, it should have some raw performance and capacity tests.
Capacity: how many patterns can a given “memory shape” store before it overflows?
Performance: how long does it take to “memorize” them?
One rudimentary test would be to force an MLP to learn a random (X, Y) dataset and see at how many points it is no longer able to overfit. Overfitting in this context is “good”, because it means the network can recall every learned pattern.
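A minimal sketch of that capacity test, with one simplifying assumption: instead of full backprop training, the hidden layer is frozen random ReLU features and only the readout is fit by least squares (the linear-associative-memory part). All sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def memorization_error(n_points, dim=8, width=64):
    """Random ReLU features + least-squares readout: training MSE on
    n_points random (X, Y) pairs, i.e. can the net memorize them?"""
    X = rng.standard_normal((n_points, dim))
    Y = rng.standard_normal(n_points)
    W1 = rng.standard_normal((dim, width)) / np.sqrt(dim)
    H = np.maximum(X @ W1, 0.0)                     # frozen random ReLU features
    w2, *_ = np.linalg.lstsq(H, Y, rcond=None)      # fit the readout exactly
    return float(np.mean((H @ w2 - Y) ** 2))

# Below capacity (fewer points than hidden units) it overfits perfectly ...
mse_below = memorization_error(32)
# ... while far above capacity it cannot drive the training error to zero.
mse_above = memorization_error(512)
```

Sweeping `n_points` upward and watching where the training error lifts off zero gives the overflow point; wall-clock time per fit gives the crude performance number.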
You do have to look into the details of simple linear associative memory. There are a lot of edge cases (e.g. linear independence) that are messy to investigate.
People then tend to back away from getting clarity.
For real-world data a lot of the edge cases simply don’t occur. You can pragmatically say: I don’t have to deal with this edge case, or it is statistically rare.
Anyway, if you say gating is context, then the state space of possible contexts is massive. For a network of width 256 the context space is 2²⁵⁶. Obviously there are not going to be 2²⁵⁶ training examples, so only some of the possible contexts will occur. And you would expect clustering of those contexts (Hamming-distance-wise.)
I’ll have a think about it.
I think interesting results can be obtained by using the negative of a median threshold as the biases in the classical X.dot(weights) + biases. Weights is a random matrix, so the above performs a random projection. Each output value splits the inputs in half (a 0-or-1 bit). I did this with MNIST with only 10 bits of output (a 784×10 random-matrix dot product + a median split for each of the 10 resulting values), which produced a 10-bit number as a result, meaning there are 1024 possible bins. Instead of spreading the 70000 digits into a uniform ~70 digits/bin, they’re very imbalanced: the least populated bin contains only 4 weird-looking digits, while the most populated ones collected hundreds of digits with a strong bias towards a very specific shape. The intuition is that chaining the same simple split might lead to a very cheap perceptive memory. But not by chaining another block for depth, which blindly adds layer after layer (as in NNs), but by branching.
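A self-contained sketch of that median-threshold binning. Assumption: MNIST is replaced by a synthetic clustered dataset (a mixture of Gaussians standing in for digit classes) so the example needs no download; the bin imbalance effect is the same in kind, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

dim, n_bits = 32, 10
# 8 clusters standing in for digit classes, 500 points each
centers = rng.standard_normal((8, dim)) * 3.0
X = np.concatenate([c + rng.standard_normal((500, dim)) for c in centers])

W = rng.standard_normal((dim, n_bits))   # random projection matrix
proj = X @ W
biases = -np.median(proj, axis=0)        # negative of the per-output median
bits = (proj + biases > 0).astype(int)   # each output becomes one bit

# Pack the 10 bits into a bin index (0..1023) and count bin populations
bins = bits @ (1 << np.arange(n_bits))
counts = np.bincount(bins, minlength=1 << n_bits)

# Each bit individually splits the data in half, but jointly the bins are
# far from uniform for clustered data: a few bins collect hundreds of points.
```

Printing `counts[counts > 0]` shows the imbalance: most bins are empty or near-empty while a handful dominate, one per dense region of the input.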
ReLU is a poor stepping stone to extending weight selection (gating) ideas in neural networks.
Two-sided ReLU, where there are 2 outputs, one active for negative inputs and one active for positive inputs, is a better basic concept. That means doubling the number of weights in the next layer.
Then you can think about replacing, say, a group of 4 (two-sided) ReLUs with a pool of 16 blocks of 4 weights, and using the 4 bits of ReLU state to select between the 16 blocks instead.
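A sketch of both pieces. One assumption flagged up front: the source doesn't say exactly how a selected block of 4 weights is applied, so applying it elementwise to the group's pre-activations is one plausible reading, not the author's definitive construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def two_sided_relu(z):
    # Two outputs per input: one active for positive, one for negative inputs.
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

z = rng.standard_normal(4)        # a group of 4 pre-activations
h = two_sided_relu(z)             # 8 outputs -> doubles next-layer weight count

# Block-selection variant: the 4 sign bits of the group index into a
# pool of 16 blocks of 4 weights (elementwise application is assumed).
bits = (z > 0).astype(int)
index = int(bits @ (1 << np.arange(4)))   # 0..15
pool = rng.standard_normal((16, 4))       # 16 candidate blocks of 4 weights
out = pool[index] * z                     # apply the selected block
```

Note that exactly one output of each two-sided pair is active, so no information about the sign is discarded the way plain ReLU discards it.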
You are right about the initial random-projection behavior, and a binarized random projection is a locality-sensitive hash, where a small change in the input only changes a few bits of the output.
If you use that for gating a weight matrix in the next layer, what you have is a hash table that looks up a linear association. Each gating state looks up a different linear association (a synthetic matrix, or constructed matrix.) And nearby gating states would tend to give similar synthetic matrices (the locality-sensitive aspect.)
There is a whole lot of weight sharing between the different synthetic matrices which becomes apparent during training and has to be resolved during training.
That could be good for generalization because the training algorithm has to pick out what consistently makes sense rather than simple memorization.
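Both the hash-table-of-linear-associations picture and the weight sharing between nearby synthetic matrices can be seen in a few lines (a minimal sketch; the split into a separate hashing matrix `W_hash` and gated matrix `W` is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
W_hash = rng.standard_normal((d, d))   # random projection used as the LSH
W = rng.standard_normal((d, d))        # weight matrix being gated

def synthetic_matrix(x):
    # Gating state = binarized random projection (LSH) of the input;
    # it sub-selects rows of W, yielding a "synthetic" matrix.
    bits = W_hash @ x > 0
    return W * bits[:, None], bits

x = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # small input change

M1, b1 = synthetic_matrix(x)
M2, b2 = synthetic_matrix(x_near)

hamming = int(np.sum(b1 != b2))   # few bits flip: locality sensitivity
shared = int(np.sum(b1 == b2))    # rows shared between the two matrices
```

The rows where the gating bits agree are literally the same weights in both synthetic matrices, which is the weight sharing that training has to resolve.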
This viewpoint allows you to construct a very wide variety of neural network variants.
You could for example have a chain of weight matrices Wₛₙ…Wₛ₂Wₛ₁x, where each sᵢ is a choice between one of 2 matrices based on a locality-sensitive hash of that matrix’s input.
I’m not saying that would be a good neural network; however, it would undoubtedly be trainable, and it would be one of the most basic neural networks you could create.
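The forward pass of that chain is tiny. A hedged sketch, with one assumption: each layer's 1-bit hash is taken as the sign of a single random hyperplane projection of that layer's input:

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth = 16, 4
# Per step: a pair of candidate matrices and one hashing hyperplane
pairs = rng.standard_normal((depth, 2, d, d)) / np.sqrt(d)
hash_vecs = rng.standard_normal((depth, d))

def forward(x):
    choices = []
    for i in range(depth):
        s = int(hash_vecs[i] @ x > 0)   # s_i: 1-bit LSH of this matrix's input
        x = pairs[i, s] @ x             # apply W_{s_i}
        choices.append(s)
    return x, choices

x = rng.standard_normal(d)
y, choices = forward(x)
```

With n steps the chain realizes one of 2ⁿ possible matrix products, selected input-dependently, which is the sense in which it is a (very coarse) synthetic-matrix network.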
Maybe this is more readable:
https://archive.org/details/report-on-gating-based-extensions-of-re-lu-neural-networks
A seemingly important aspect is reinforcement of gating decisions:
https://archive.org/details/self-reinforcing-gating-via-directional-alignment
I think whatever ideas you have for extending neural networks have to fit in with that.
Basically gated lookup of a linear mapping is indirection. And the gating aspect induces a certain amount of quantization in the possible mappings. There are other schemes you could think of where that lookup could be much finer grained. I’ll have a think about it.
Basically, by indirection I mean you are looking up something and then multiplying something else by that something.
Ie. Looking up a weight matrix and then multiplying the layer input vector by the weight matrix. Though the two things are merged together in a ReLU layer.
Conceptually you can separate them though.
It sounds like you could replace the roughly quantized gating based lookup with actual associative memory (AM) lookup of a weight matrix. That would give very fine quantization.
I’m not sure in practice you would need such fine quantization.
The other possible advantage of AM is that you can build very fast, very high capacity AM with various tricks.
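One way to see the fine-quantization point: replace the hard hash lookup with a soft associative recall, where the recalled weight matrix is a similarity-weighted blend of stored matrices. This is a sketch under assumptions (softmax-style recall, arbitrary sizes, an illustrative sharpness parameter `beta`), not a claim about which AM scheme would actually be used:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n_keys = 8, 32
keys = rng.standard_normal((n_keys, d))                  # stored keys
mats = rng.standard_normal((n_keys, d, d)) / np.sqrt(d)  # stored weight matrices

def am_lookup(x, beta=4.0):
    # Soft associative recall: similarity-weighted blend of the stored
    # matrices. The recalled matrix varies continuously with x, i.e. the
    # quantization of the mapping is as fine as you like.
    scores = beta * (keys @ x)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.tensordot(w, mats, axes=1)   # blended (d, d) matrix

x = rng.standard_normal(d)
M = am_lookup(x)
y = M @ x
```

Raising `beta` hardens the recall back toward the nearest stored matrix, so the gated-lookup scheme sits at one end of a spectrum this parameter traces out.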
I’ll look into it.
I created this introduction type thing:
https://archive.org/details/synthetic-matrix-neural-network