I’ve collected together most of what I’ve done with associative memory:
https://github.com/S6Regen/Associative-Memory-and-Self-Organizing-Maps-Experiments
I’ll see if I can connect it to ideas about self-organizing maps, so that even in the face of excess data it can pick up and hold on strongly to the useful information rather than being overwhelmed and outputting weak, averaged responses.
The code is in Java.
Hi Sean, two quick questions:
- Would it be possible to add some documentation in the form of an overview?
- If you use the SSD as RAM, aren’t you worried about the limited number of writes an SSD can perform?
It seems self-organizing maps (SOM) learn, to a first approximation, the manifold the input data resides on. And maybe the learned vectors are uniformly distributed on that manifold, though very non-uniformly in the actual basis space. That could be useful for Probabilistic Programming, which seems to be a new emerging area in machine learning.
I’ll have to read more about it.
I’m sure you can enact competitive learning similar to SOM with several associative memory (AM) systems converging through non-linear learning. It should be possible to fill the AMs to their maximum capacity, avoiding the dilution that comes from trying to learn everything equally with linear learning.
I showed a locality sensitive hashing AM that enables you to use the entire RAM of your system while still operating at reasonable speed. In fact you could give over an entire SSD drive to such an AM and still get reasonable speed, or one SSD for each weight vector dimension. In practical terms you could then put 100 1 TB SSDs on a single USB hub and have a massive AM. Who would ever do such an experiment!
I guess you could get 10000 block reads and writes per second with an SSD. Once the addressing setup is done on the SSD (disk latency) it can transmit the entire block of data very quickly. At around 100 to 200 block accesses per training example, it should still be possible to train say 50 to 100 examples per second.
I did try it once with a 10 ms latency (spinning rust) hard drive and it was okay, but I didn’t fully explore the matter because I didn’t want to wreck the HD. That drive later failed of old age. Now I just run Linux off a cheap USB stick.
I’ll provide some documentation later when I’ve done more work.
A locality sensitive hash (LSH) maps the input data to some particular random set of bits. Because it is locality sensitive a small change in the input will only flip the state of a few of the output bits. The output bits can be viewed as bipolar values -1 and 1.
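As a concrete illustration, here is a minimal sketch of one common way to build such a hash, random sign projections. The repository may construct its hashes differently; the class name, sizes and seed here are just for this example.

```java
import java.util.Random;

// Minimal sketch of a locality sensitive hash by sign of random projections.
// A small change in the input only changes a few of the dot-product signs,
// so only a few of the bipolar output values flip.
public class SignRandomProjectionLSH {
    private final float[][] projections; // one random vector per output bit

    public SignRandomProjectionLSH(int inputDim, int numBits, long seed) {
        Random rng = new Random(seed);
        projections = new float[numBits][inputDim];
        for (int i = 0; i < numBits; i++) {
            for (int j = 0; j < inputDim; j++) {
                projections[i][j] = (float) rng.nextGaussian();
            }
        }
    }

    // Returns the hash as bipolar values (-1 or +1), one per output bit.
    public float[] hash(float[] input) {
        float[] bits = new float[projections.length];
        for (int i = 0; i < projections.length; i++) {
            float dot = 0f;
            for (int j = 0; j < input.length; j++) {
                dot += projections[i][j] * input[j];
            }
            bits[i] = dot >= 0f ? 1f : -1f;
        }
        return bits;
    }
}
```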
To get the output of the AM, the input is LSHed, each bipolar value is multiplied by a weight, and the products are summed to give the output value.
To train the AM, the output for a particular input is calculated and the error relative to the wanted output is found by subtraction. This error is divided by the number of weights, and each weight is updated by that small fraction, signed according to its corresponding bipolar value.
Obviously when next recalled this will result in exactly the correct wanted output.
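Here is a minimal sketch of that recall and update step in Java, building on the hash sketch above. The class and method names are illustrative, not taken from the repository.

```java
// Minimal sketch of the AM recall and training described above, built on the
// SignRandomProjectionLSH sketch. The number of weights equals the number of
// hash bits.
public class LshAssociativeMemory {
    private final SignRandomProjectionLSH lsh;
    private final float[] weights;

    public LshAssociativeMemory(SignRandomProjectionLSH lsh, int numBits) {
        this.lsh = lsh;
        this.weights = new float[numBits]; // one weight per hash bit, initially zero
    }

    // Recall: hash the input and take the weighted sum of the bipolar values.
    public float recall(float[] input) {
        float[] bits = lsh.hash(input);
        float sum = 0f;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * bits[i];
        }
        return sum;
    }

    // Train: spread the output error equally over all the weights, signed by
    // the corresponding bipolar value. Since each bit squared is 1, an
    // immediate recall of the same input now returns exactly the target.
    public void train(float[] input, float target) {
        float[] bits = lsh.hash(input);
        float output = 0f;
        for (int i = 0; i < weights.length; i++) {
            output += weights[i] * bits[i];
        }
        float errorPerWeight = (target - output) / weights.length;
        for (int i = 0; i < weights.length; i++) {
            weights[i] += errorPerWeight * bits[i];
        }
    }
}
```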
What then is the effect on previously stored <vector, scalar> pairs?
Each weight will have been ‘damaged’ by the small update. Assuming the bipolar values of a previously stored pair are relatively uncorrelated with those of the just-updated example, those weight errors tend to cancel out, because they are randomly added and subtracted. Concretely, the change in the earlier pair’s recall is the per-weight update times the dot product of two roughly uncorrelated bipolar vectors, which has magnitude on the order of sqrt(n) for n weights, so the residue is just a small Gaussian noise term of order error/sqrt(n), by the central limit theorem.
That is, each previously stored pair is ‘damaged’ by a small Gaussian noise term as a result of storing or updating a new pair.
Repeatedly training over a small enough set of pairs, using an AM with enough memory, drives the Gaussian noise terms to zero. That is probably an O(n^2) process, compared to the O(n^3) process of solving the simultaneous linear equations by Gaussian elimination. If you use a good excess of memory it is a lot faster than O(n^2).
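A hypothetical loop for that repeated training, reusing the sketch classes above (the method name and the error printout are mine, just to make the convergence visible):

```java
// Repeated-training loop over a fixed set of <vector, scalar> pairs. Each pass
// re-stores every pair, which knocks the interference noise from the other
// pairs back down; with a good excess of memory the recall errors shrink fast.
public static void trainRepeatedly(LshAssociativeMemory am,
                                   float[][] inputs, float[] targets,
                                   int passes) {
    for (int pass = 0; pass < passes; pass++) {
        float maxError = 0f;
        for (int i = 0; i < inputs.length; i++) {
            maxError = Math.max(maxError, Math.abs(targets[i] - am.recall(inputs[i])));
            am.train(inputs[i], targets[i]);
        }
        System.out.println("pass " + pass + " worst error before re-storing: " + maxError);
    }
}
```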
You can use a double LSH process to further push things for use with vast memory.
I got a big improvement in the error-correcting associative memory by changing from a binary (bipolar) threshold step function (-1, 1) to a soft version of it, the signed square root: y = sqrt(x) for x >= 0, and y = -sqrt(-x) for x < 0.
That function upon repeated application has 2 fixed point attractors -1 and 1.
There is almost certainly a better function to be found that would allow for the optimal recovery of signal from noise, but I’ll leave that for a paid computer scientist to find. The code is in the file AMEC2.java
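For reference, the signed square root described above as a standalone Java method (the method name and placement are mine, not necessarily how AMEC2.java organizes it):

```java
// Signed square root: y = sqrt(x) for x >= 0 and y = -sqrt(-x) for x < 0.
// Under repeated application any nonzero value is driven toward -1 or +1,
// the two fixed-point attractors (0 is a repelling fixed point).
public static float signedSqrt(float x) {
    return x >= 0f ? (float) Math.sqrt(x) : -(float) Math.sqrt(-x);
}
```

For example, repeated application sends 0.25 -> 0.5 -> 0.707 -> 0.841 -> ... toward 1.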
I also have these neural networks I’m working on that aim to find more controlled decision regions than conventional neural networks:
https://github.com/S6Regen/Black-Swan
Each layer is actually an “extreme learning machine” that in some sense allows more than fully connected behavior at the layer level. Also there are always back connections to the input. The activation function is “square of”, which works very well with the evolution-driven training, perhaps because it is sparsity inducing.
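Just to make the description concrete, here is a very rough guess in Java at what one such layer might compute; the wiring, names and shapes are assumptions on my part, and the actual construction is in the Black-Swan code.

```java
// Rough sketch of one layer as described: a weighted combination of the
// previous layer's output concatenated with the original network input (the
// "back connections to the input"), passed through a square activation.
public static float[] squareLayer(float[][] weights, float[] previous, float[] networkInput) {
    int inDim = previous.length + networkInput.length;
    float[] combined = new float[inDim];
    System.arraycopy(previous, 0, combined, 0, previous.length);
    System.arraycopy(networkInput, 0, combined, previous.length, networkInput.length);

    float[] out = new float[weights.length];
    for (int i = 0; i < weights.length; i++) {
        float z = 0f;
        for (int j = 0; j < inDim; j++) {
            z += weights[i][j] * combined[j];
        }
        out[i] = z * z; // "square of" activation
    }
    return out;
}
```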