If you take the hash of some input to get bipolar outputs then multiplying those +1/-1 values by weights and summing allows you to store values with a high degree of independence. For a given binary input and target value you get an error. You divide the error by the number of number of bipolar values and then you simply correct each of the weights using the error value and taking account of the bipolar sign. That gives a full correction to get the correct target output. In higher dimensional space most vectors are orthogonal. For a different binary input the adjustments you made to the weights will not align at all. In fact they will sum (cancel out , almost) to Gaussian noise by the central limit theorem. The value you previously stored for a different input will now be contaminated by a slight amount of Gaussian noise which you can correct for. This will now introduce an even smaller amount of Gaussian noise on the value for the other input. Iterating back and forth will get rid of the noise entirely for both binary inputs.
I suppose then there are 2 ways to train a single layer of a deep neural network. If you just alter the weights then for every input every output will be different. However if you view a single layer as an associative memory then you can alter the response of the layer for a single input only, its behavior for different inputs being only marginally affected. This should avoid issues like catastrophic forgetting. It would also suggest that altering the response of the layer in such a mode should involve distinct pattern search not probing around with Gaussian noise. You would need to morph back-propagation so that it looked more like Hooke and Jeeves, or actually use Hooke and Jeeves or a similar pattern search to optimize the deep network associative memory layer responses.
There were a lot of papers in the 1970’s and 1980’s about associative memory. It turns out that it is actually very easy to do. I’ve written some code in FreeBasic which should be easily translatable to C. If you use excess memory (giving an over determined solution) you will get a sort of error correction effect. `
The number of elements in the vector must be a power of 2 (16,32,64…) because of the wht function.
I guess of interest in relation to SDR is Lernmatrix:
Anyway this paper says you can get exponential storage storage of patterns with a similar scheme to the Basic code above (not Lernmatrix), however their final output is binarized to +1/-1.
It seems the binarized output case is special, I don’t think you can get around it directly if you want both the increase of storage capacity and have say a floating point output. I’m still figuring things out.
I guess there are some tricks you can do like learning the responses of a (locality sensitive) hash with an associative memory.
Then where you normally would use the hash, instead use the associative memory (as an error correcting hash!)
You could even start off with a normal (non locality sensitive) hash where a change of a single bit in the input causes a completely different output. Learn its responses on a training set. Then when you replace that hash with the associative memory you actually get locality sensitive error correcting hash behavior.
And since the associative memory uses hashing as a key component things can become very self referential…
I tried out the idea of learning hashes a bit in some experiments. I was able to greatly reduce the amount of crosstalk between memories. Which is very useful depending on what you want to do exactly. It also seems I am getting error correction without drift to unwanted attractor states caused by cross talk, but more experiments are needed.
I had the memory learn (input,hashed input) vector pairs. The hash being non locality sensitive. Then I had it learn (hashed input,target) vector pairs.
In some other cases crosstalk is a good thing as it is a form of generalization, I should say.
I’ll recap the associative memory system.
You hash an input vector to get a vector of values +1 or -1.
You multiply each of the plus or minus 1 values by a weight and sum the result. To store a (input,x) pair you process the input to get an output value which will be just a sum of the weights multiplied randomly by +1 or -1 (because the hash randomizes inputs). It likely won’t be exactly zero but be low in value and have a bell curve probability distribution (Gaussian.)
You then adjust each of the weights by small amount, just enough to get exactly the x value you want. When you recall a previously stored value all the weights will have been slightly damaged by the new pair you stored. However the damage will sum to nearly zero because it is multiplied by a completely different set of plus and minus 1s. You can then go and fix the slight error induced.
I hope that makes sense.
It looks like you can get some good error correction if you use enough memory.
Linux AMD64 code:
I did the code in the Processing language (nearly Java.) I felt the example needed more weight memory than I would like. I presume that is a price you pay for binarization with small systems.
Usually scaling up goes against you, the unexpected thing with neural systems is that scaling up can go with you. I was thinking about recalling sparse patterns with associative memory. That could allow you to store/recall far more patterns than recalling dense patterns for the same amount of weight memory.
I guess you can train the sort of sparse associative memory suggested by Albus the same way as I have in my associative memory code.
By – I don’t know what to call it, forced Gaussian updating??
Two changes you could make are using locality sensitive hashing (LSH) instead of the more brittle plain hashing originally proposed. And have that LHS produce 0,+1 or -1 as outputs. With the +1’s and -1’s being sparse. What does that buy you mathematically??? I’ll think about it today.
And since the Albus CMAC is rather like a Bloom filter (BF) I guess you can create a real valued output BF. With the extra benefit that by using excess memory you can have an error correction effect (to the degree that averaging repetitions helps.) I showed code before for a locality sensitive BF. I suppose you can put forward a framework where all these ideas interweave and interconnect. The math though must be different for simple 0,1 binarization and bipolar binarization -1,+1.
You would kind of imagine that sparse associative memory would exhibit less crosstalk than dense associative memory.