Basically they are doing an optical (quasi-)random projection and then a digital readout layer on a conventional computer.
You could call it a two-layer associative memory: one fixed random layer followed by a readout layer.
LightOn had a similar product which they abandoned, but the company has continued by switching to other things.
Probably one reason for abandoning the product is that you can get the same result in software: fixed-pattern (quasi-)random sign flips combined with the fast Walsh Hadamard transform do the same job as the optical diffusion/diffraction.
It is interesting they are getting results comparable to deep neural networks using associative memory.
At least the researchers seem to understand some aspects of the weighted sum.
For some reason you can’t just go and read about the characteristics of information storage in the weighted sum.
There is no storage in an ELM; it is simply a one-hidden-layer perceptron.
It randomly initializes the input weights and biases of the hidden layer and then analytically determines the output weights in a single step.
That step is very compute- and memory-intensive: long ago, when I tried it on MNIST with a relatively large hidden layer, it failed on my laptop.
Analytical Output Weights:
The output weights connecting the hidden layer to the output layer are then calculated using a Moore-Penrose pseudo-inverse, a linear algebra operation.
And it also does not outperform a back-propagated MLP of the same shape, which in turn is mediocre compared with e.g. CNNs.
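For reference, that analytic step is just a least-squares solve: with hidden-layer activation matrix H (one row per training example) and target matrix T, the output weights are B = H⁺T, which for full column rank H equals (HᵀH)⁻¹HᵀT; that is the large, memory-hungry linear algebra operation that failed on the laptop above.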
Somewhat more successful randomized networks are the recurrent ones, like reservoir networks or liquid state machines.
There are 2 active layers. One is a fixed, never-changing, randomly connected layer (which can be implemented optically or with mixed-up fast transforms).
The randomly connected layer should typically include a non-linear activation function.
Then there is a read-out layer which typically is fully linear.
If you look into the mathematics of linear layers (weighted sums), there is a statistical advantage in giving them fully dispersed inputs rather than sparse ones, in terms of recall error variance.
The read-out layer may be trained with a Moore-Penrose pseudo-inverse; I personally use SGD.
The author of the original extreme learning machine paper points out that keeping the magnitude of the read-out weights as low as possible improves generalisation, and the Moore-Penrose pseudo-inverse helps with that because it yields the minimum-norm least-squares solution.
This is where you would want to get into the minutia of information storage in the weighted sum.
And isn’t it funny how differently associative memory and neural networks are evaluated?
Associative memory is judged by error correction.
Neural networks are judged by generalisation.
And honestly that’s the wrong approach, especially as you can assign an error correcting code as an intermediate target to an AM or NN: get the AM to recall the error correcting code, correct the code, and then have a second AM map from the corrected code to the required output (a toy sketch follows below).
This is where I am not sure what Hopfield’s real contribution was. Or did he just send AM researchers into the Minotaur’s maze?
It might be better to judge AM and NNs on the same footing.
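As a toy sketch of that intermediate-target idea (hypothetical class and names, nothing from any paper): give each class a fixed random ±1 codeword, let an AM recall a noisy version of it, and do the error correction by nearest-codeword decoding before mapping to the final output. The noisy vector below just stands in for whatever an AM would actually recall.
public class CodewordDecode {
    public static void main(String[] args) {
        int classes = 10, codeLen = 64;
        java.util.Random rng = new java.util.Random(42);
        float[][] code = new float[classes][codeLen]; // one random +/-1 codeword per class
        for (int c = 0; c < classes; c++)
            for (int i = 0; i < codeLen; i++)
                code[c][i] = rng.nextBoolean() ? 1f : -1f;
        // Stand-in for an AM recall: codeword of class 3 plus noise.
        float[] noisy = new float[codeLen];
        for (int i = 0; i < codeLen; i++)
            noisy[i] = code[3][i] + 0.8f * (float) rng.nextGaussian();
        // "Error correction" = pick the codeword with the highest correlation.
        int best = 0;
        float bestDot = Float.NEGATIVE_INFINITY;
        for (int c = 0; c < classes; c++) {
            float dot = 0f;
            for (int i = 0; i < codeLen; i++) dot += noisy[i] * code[c][i];
            if (dot > bestDot) { bestDot = dot; best = c; }
        }
        System.out.println("Decoded class: " + best); // almost always 3
        // A second AM (or a simple table) would then map the clean codeword to the required output.
    }
}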
public class AMChannelG {
final int vecLen;
final float[] wts;
final float[] work;
final float rate;
public AMChannelG(int vecLen,float rate) {
this.vecLen=vecLen;
this.rate=rate/vecLen;
wts=new float[2*vecLen];
work=new float[vecLen];
}
float recall(float[] input) {
int r=0x9E3779B9; //random number seed
for(int i=0; i<vecLen; i++) {
work[i]=r<0? -input[i]:input[i];
r*=0x93D765DD; // MCG random number generator
}
whtN(work); //fast Walsh Hadamard transform
for(int i=0; i<vecLen; i++) {
work[i]=r<0? -work[i]:work[i];
r*=0x93D765DD;
}
whtN(work); // random projection of input in work
float result=0f;
for(int i=0; i<vecLen; i++) {
float x=work[i]; //choose weight according to sign
result+=x*wts[i*2+(x<0f? 1:0)]; //weight choice
} //is the non-linear activation function
return result;
}
public void train(float target,float[] input) {
float e=(target-recall(input))*rate;
for(int i=0; i<vecLen; i++) {
float x=work[i];
wts[i*2+(x<0f? 1:0)]+=e*x;
}
}
static void whtN(float[] vec) {
int n = vec.length;
int hs = 1;
while (hs < n) {
int i = 0;
while (i < n) {
final int j = i + hs;
while (i < j) {
float a = vec[i];
float b = vec[i + hs];
vec[i] = a + b;
vec[i + hs] = a - b;
i += 1;
}
i += hs;
}
hs += hs;
}
float scale = 1f / (float) Math.sqrt(n);
for (int i = 0; i < n; i++) {
vec[i]*=scale;
}
}
public static void main(String[] args) {
AMChannelG amcg=new AMChannelG(256,1.3f);
float[][] example=new float[256][256];
System.out.println("Training example recall.");
for(int i=0; i<65536; i++) {
example[i>>8][i&255]=2f*(float)Math.random()-1f;
}
for(int i=0; i<1000*256; i++) { //1000 epochs
amcg.train(1f-(i&2),example[i&255]); // examples cycle 0..255, targets go +1,+1,-1,-1,...
}
for(int i=0; i<256; i++) {
System.out.println("Recall: "+amcg.recall(example[i])+" Target: "+(1-(i&2)));
}
System.out.println("Random input recall.");
float[] r=new float[256];
for(int i=0; i<10; i++) {
for(int j=0; j<256; j++) {
r[j]=2f*(float)Math.random()-1f;
}
System.out.println("Recall for random input: "+amcg.recall(r));
}
System.out.println("Positive example to negative example recall");
for(int i=0; i<11; i++) {
float proportion=0.1f*i;
for(int j=0; j<256; j++) {
r[j]=(1f-proportion)*example[0][j]+proportion*example[2][j];
}
System.out.println("Recall: "+amcg.recall(r));
}
}
}
First you do a random projection (RP) of the input. Then, for each output value of the RP, you choose between 2 weights depending on its sign, which is a non-linear operation.
Then you take the weighted sum of the lot using the chosen weights, which is the linear output layer.
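Written out, each term x*wts[i*2+(x<0f? 1:0)] is exactly a two-weight CReLU unit: it contributes w⁺·max(x,0) + w⁻·min(x,0), with w⁺ = wts[2i] and w⁻ = wts[2i+1].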
I also have a slightly simpler version that unfortunately doesn’t train well with stochastic gradient descent (SGD).
Eventually it will train up okay, but at about the same time cost as directly solving the simultaneous equations by standard linear algebra, O(n³).
public class AMChannelB {
final int vecLen;
final float[] wts;
final float[] work;
final float rate;
public AMChannelB(int vecLen,float rate) {
this.vecLen=vecLen;
this.rate=rate/vecLen;
wts=new float[2*vecLen];
work=new float[vecLen];
}
float recall(float[] input) {
int r=0x9E3779B9; //random number seed
for(int i=0; i<vecLen; i++) {
work[i]=r<0? -input[i]:input[i];
r*=0x93D765DD; // MCG random number generator
}
whtN(work); //fast Walsh Hadamard transform
for(int i=0; i<vecLen; i++) {
work[i]=r<0? -work[i]:work[i];
r*=0x93D765DD;
}
whtN(work); //fast Walsh Hadamard transform
float result=0f;
for(int i=0; i<vecLen; i++) {
float x=work[i]; //choose weight according to sign
result+=wts[i*2+(x<0f? 1:0)]; //weight choice
} //is the non-linear activation function
return result;
}
public void train(float target,float[] input) {
float e=(target-recall(input))*rate;
for(int i=0; i<vecLen; i++) {
float x=work[i];
wts[i*2+(x<0f? 1:0)]+=e;
}
}
static void whtN(float[] vec) {
int n = vec.length;
int hs = 1;
while (hs < n) {
int i = 0;
while (i < n) {
final int j = i + hs;
while (i < j) {
float a = vec[i];
float b = vec[i + hs];
vec[i] = a + b;
vec[i + hs] = a - b;
i += 1;
}
i += hs;
}
hs += hs;
}
float scale = 1f / (float) Math.sqrt(n);
for (int i = 0; i < n; i++) {
vec[i]*=scale;
}
}
public static void main(String[] args) {
AMChannelB amcb=new AMChannelB(256,0.5f);
float[][] example=new float[256][256];
System.out.println("Training example recall.");
for(int i=0; i<65536; i++) {
example[i>>8][i&255]=2f*(float)Math.random()-1f;
}
for(int i=0; i<300000*256; i++) { //300000 epochs
amcb.train(1f-(i&2),example[i&255]);
}
for(int i=0; i<256; i++) {
System.out.println("Recall: "+amcb.recall(example[i])+" Target: "+(1-(i&2)));
}
System.out.println("Random input recall.");
float[] r=new float[256];
for(int i=0; i<10; i++) {
for(int j=0; j<256; j++) {
r[j]=2f*(float)Math.random()-1f;
}
System.out.println("Recall for random input: "+amcb.recall(r));
}
System.out.println("Positive example to negative example recall");
for(int i=0; i<11; i++) {
float proportion=0.1f*i;
for(int j=0; j<256; j++) {
r[j]=(1f-proportion)*example[0][j]+proportion*example[2][j];
}
System.out.println("Recall: "+amcb.recall(r));
}
}
}
I wasn’t expecting the training to be so difficult. Things might improve with increased size.
The first one though seems to do well with SGD.
In both examples you have to choose an integer power of 2 for the vector length because of the fast transform (shorter inputs can simply be zero-padded up to the next power of 2).
That makes sense to me, we don’t need NNs to just memorize, that’s trivial? Error correction does not form a generalized representation, it’s a simple majority vote over multiple supposedly identical representations?
The exploratory stance I am taking at the moment is that deep neural networks are just hierarchical associative memory (whether that is actually true or not.)
The idea then is to increase the capacity of simple associative memory to the point where the hierarchy becomes redundant.
Admittedly it is only a hobby idea and the current systems architecture neural network papers…you might as well feed a donkey strawberries for all that I appreciate them.
I have fun doing little things though. For instance, the reason the second associative memory option is difficult to train must be extreme SGD variance problems. It kind of makes sense.
So maybe I will try some different training method, and learn in the process.
The activation function of the first variant destroys only a modicum of information about the input (information-wise you can nearly exactly invert it), at least before the final summing operation.
The second version basically has a binary threshold activation function, allowing through only one bit of information about the input per function. Concretely, the only difference in the code is that AMChannelG accumulates x*wts[i*2+(x<0f? 1:0)], so the magnitude of x passes through, while AMChannelB accumulates wts[i*2+(x<0f? 1:0)], so only the sign of x gets through.
I am fully aware of information flow through effects in multi-layer neural networks. It is a prime design feature I take into account.
I am kind of surprised to find it an important factor in associative memory as well, especially radically altering training behaviour.
That explains why binary activation functions worked well for me in associative memory using SGD where multiple parallel layers are combined. Giving basically n-bit information pass through rather than 1 bit pass through.
Okay, cool that I sorted that out for myself. I doubt anyone else has the slightest clue what I am talking about.
You would think you could just solve the simultaneous equations involved in, say, binary threshold AM or (C)ReLU associative memory, and the results would be more or less the same.
They have the same exact recall capacity (with some minimal assumptions) for example.
In practice they are wildly different in performance and training behaviour.
Current neural networks have only moderate information loss per layer, e.g. a standard dense ReLU layer.
That is even though the ReLU function itself is a real information stopper, being completely off about 50% of the time. However, there are multiple connected alternative information pathways because of the high density of weights.
The end result is a very poor understanding of information flow in neural networks, even by top researchers, quite simply because it isn’t apparent; it is quite highly veiled.
And yet in some cases you do get to see.
A quest then is to lower the information loss through activation functions like ReLU or the 2-sided CReLU.
Even 2-sided CReLU will lose 1 bit of information (the sign bit), because it basically ABS’s its input value if the 2 weights involved have different signs.
In my mini java collections there is actually a way around the CReLU bit loss, and that is by making the switching decision in CReLU (nearly completely) independent of the input value to the function: a random projection solution.
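A concrete case: with w⁺ = 1 and w⁻ = -1 the unit computes max(x,0) - min(x,0) = |x|, so the sign of x is unrecoverable from the output; that is the lost bit.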
I’ll look into the matter some more, perhaps write some code, and see if my opinions accord with reality.
If not, I’ll think about it some more; if so, great.
So I used 2 consecutive random projections: one for the switching decisions of the CReLU (2-sided ReLU), and a second random projection, computed from the first, as the data path. The 2 RP vectors should end up almost orthogonal in high-dimensional space.
The individual elements are only slightly correlated.
This should reduce information loss through the CReLU activation functions.
Then it’s up to paid researchers to see if there is anything to it.
public class AMChannelSW {
final int vecLen;
float[] wts;
final float[] switchRP;
final float[] dataRP;
final float rate;
public AMChannelSW(int vecLen,float rate) {
this.vecLen=vecLen;
this.rate=rate/vecLen;
wts=new float[2*vecLen];
switchRP=new float[vecLen];
dataRP=new float[vecLen];
}
float recall(float[] input) {
int r=0x9E3779B9; //random number seed
for(int i=0; i<vecLen; i++) {
switchRP[i]=r<0? -input[i]:input[i];
r*=0x93D765DD; // MCG random number generator
}
whtN(switchRP); //fast Walsh Hadamard transform
for(int i=0; i<vecLen; i++) {
switchRP[i]=r<0? -switchRP[i]:switchRP[i];
r*=0x93D765DD;
}
whtN(switchRP); //switch random projection
for(int i=0; i<vecLen; i++) {
dataRP[i]=r<0? -switchRP[i]:switchRP[i];
r*=0x93D765DD;
}
whtN(dataRP);
for(int i=0; i<vecLen; i++) {
dataRP[i]=r<0? -dataRP[i]:dataRP[i];
r*=0x93D765DD;
}
whtN(dataRP); //data random projection
float result=0f;
for(int i=0; i<vecLen; i++) {
float s=switchRP[i];
float d=dataRP[i];
result+=d*wts[2*i+(s<0f? 1:0)];
}
return result;
}
public void train(float target,float[] input) {
float e=(target-recall(input))*rate/vecLen; // note: rate was already divided by vecLen in the constructor, so the effective step here is rate/vecLen squared
for(int i=0; i<vecLen; i++) {
float s=switchRP[i];
float d=dataRP[i];
wts[2*i+(s<0f? 1:0)]+=d*e;
}
}
static void whtN(float[] vec) {
int n = vec.length;
int hs = 1;
while (hs < n) {
int i = 0;
while (i < n) {
final int j = i + hs;
while (i < j) {
float a = vec[i];
float b = vec[i + hs];
vec[i] = a + b;
vec[i + hs] = a - b;
i += 1;
}
i += hs;
}
hs += hs;
}
float scale = 1f / (float) Math.sqrt(n);
for (int i = 0; i < n; i++) {
vec[i]*=scale;
}
}
public static void main(String[] args) {
AMChannelSW am=new AMChannelSW(256,1.3f);
float[][] example=new float[256][256];
System.out.println("Training example recall.");
for(int i=0; i<65536; i++) {
example[i>>8][i&255]=2f*(float)Math.random()-1f;
}
for(int i=0; i<30000;i++) { //30000 epochs
for(int j=0;j<256;j++){
am.train(1f-(j&2),example[j]);
}
}
for(int i=0; i<256; i++) {
System.out.println("Recall: "+am.recall(example[i])+" Target: "+(1-(i&2)));
}
System.out.println("Random input recall.");
float[] r=new float[256];
for(int i=0; i<10; i++) {
for(int j=0; j<256; j++) {
r[j]=2f*(float)Math.random()-1f;
}
System.out.println("Recall for random input: "+am.recall(r));
}
System.out.println("Positive example to negative example recall");
for(int i=0; i<11; i++) {
float proportion=0.1f*i;
for(int j=0; j<256; j++) {
r[j]=(1f-proportion)*example[0][j]+proportion*example[2][j];
}
System.out.println("Recall: "+am.recall(r));
}
}
}
By this logic SDRs have a very low information content. And yet (2000 choose 40) is large enough to hold everything we need: roughly 10^84 possibilities, on the order of 280 bits. I don’t think the raw data capacity is what is holding back artificial neural networks.
You need something to supply category bits to an SDR.
Perhaps a nice AM channel with binarized +1/-1 outputs that you could turn into 1 and 0.
Presupposing the channel is trained to successfully recognise some particular category of thing.
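For example, reusing AMChannelG from above, thresholding the scalar recall of any 256-element input gives one category bit per trained channel:
int bit = amcg.recall(input) >= 0f ? 1 : 0;
A bank of such channels (each would need its own fixed sign-flip seed, so the hard-coded seed would have to become a constructor parameter) could then supply as many category bits as you care to train.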
Anyway after some casual tests AMChannelG seems to win out.
For <vector,vector> associations the simple AM.java code I have in the mini java collections seems good (a vector version of AMChannelG).
I also have another variant where the training error vector is more highly spread out. Here is the core class:
public final class AMTotal {
final int vecLen;
final int density;
float rate;
final float[] wts;
final float[] surface;
final float[] workA;
final float[] workB;
final FlipMCG flipIn;
final FlipMCG flipOut;
public AMTotal(int vecLen,int density,float rate) {
this.vecLen=vecLen;
this.density=density;
this.rate=rate/density;
wts=new float[2*vecLen*density];
surface=new float[vecLen*density];
workA=new float[vecLen];
workB=new float[vecLen];
flipIn=new FlipMCG();
flipOut=new FlipMCG();
}
public void recall(float[] result,float[] input) {
java.util.Arrays.fill(result,0f);
flipIn.reset();
flipOut.reset();
flipIn.flipNext(workA,input);
WHT.whtN(workA);
for(int i=0,idx=0; i<density; i++) {
flipIn.flipNext(workA,workA);
WHT.whtN(workA);
if(result==workB) System.arraycopy(workA,0,surface,idx>>1,vecLen); // during train (result==workB) keep the projection for the weight update
for(int j=0; j<vecLen; j++,idx+=2) {
float x=workA[j];
result[j]+=x*wts[idx+(x<0f? 0:1)];
}
WHT.whtN(result);
flipOut.flipNext(result,result);
}
}
public void train(float[] target,float[] input) {
recall(workB,input);
for(int i=0;i<vecLen;i++) {
workB[i]=(target[i]-workB[i])*rate;
}
for(int i=density-1; i>=0; i--) {
flipOut.flipPrevious(workB,workB);
WHT.whtN(workB);
for(int j=0,sIdx=i*vecLen; j<vecLen; j++,sIdx++) {
float x=surface[sIdx];
wts[2*sIdx+(x<0f? 0:1)]+=x*workB[j];
}
}
}
public static void main(String[] args) {
AMTotal am=new AMTotal(256,10,1.3f);
float[][] ex=new float[8][256];
float[] recall=new float[256];
for(int k=0; k<ex.length; k++) {
for(int i=0; i<ex[k].length; i++) ex[k][i]=(float)Math.random()-0.5f;
}
for(int k=0; k<300; k++) {
for(int i=0; i<ex.length; i++) am.train(ex[i],ex[i]);
}
am.recall(recall,ex[0]);
for(int i=0; i<20; i++) {
System.out.println("Target: "+ex[0][i]+" Recalled: "+recall[i]);
}
}
}
There are so many variants though, it is not possible for a hobbyist to even scratch the surface. Let alone sort it all out.
I’ll leave it with you. A thing to do is compare CReLU with Decoupled CReLU.
I am not sure I ought to spend my time doing that, it sounds like a bad idea regarding my practical life realities.
A quick search for CReLU turned up: https://proceedings.mlr.press/v232/abbas23a/abbas23a.pdf
Random Projection of input (allowing dimensional increase or decrease).
Activation functions acting on the random projection. Providing increased separability of the input data, improved function approximation and escape from the linear learning limit (only being able to learn n responses for an n-dimensional input).
Single-shot learning of the linear read-out layer using the Moore-Penrose inverse, leading to a low-magnitude weight matrix (improved generalization and lower noise response). Or you can use SGD, which, done in certain ways, seems to be about the same as Moore-Penrose.
Can replace the square random projection matrix with zero-parameter fast transform solutions. Removes n² parameters and reduces the random projection compute from O(n²) to O(n·log₂ n).
Fast transforms are frequency selection algorithms with low computational cost. A simple and effective way to turn them into random projections is to apply a fixed random pattern of sign flips to the input of the transform.
Try finding a frequency in that! All you will get is central limit theorem Gaussian noise.
Except in the sparse case: if only one input to the transform is non-zero, fast transforms act as point-to-pattern mappings. You will get a sine or cosine pattern in the output of an FFT in such a case, even with sign flipping, since only one non-zero value is flipped.
However, the output of the transform is non-sparse in that case, and if you repeat the sign flipping and do another transform the output will be Gaussian.
If x is the input data, F is sign flipping, H is, say, the fast Walsh Hadamard transform (a computationally cheaper relative of the FFT), and RP is a random projection:
RP(x)=HFHFx
For natural data like images, which will never really be sparse, it is often enough to use:
RP(x)=HFx
If you want to be sure about things you can use:
RP(x)=HFHFHFx
It is possible to include fixed random permutations to improve the randomization; however, they don’t play well with the CPU cache and are therefore very slow.
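A minimal sketch of RP(x)=HFHFx in the same style as the code above (assumes the length is a power of 2 and reuses the whtN routine; the method name is just for illustration):
static void randomProject(float[] vec) {
    int r = 0x9E3779B9; // fixed seed -> fixed sign-flip pattern F
    for (int round = 0; round < 2; round++) {
        for (int i = 0; i < vec.length; i++) {
            if (r < 0) vec[i] = -vec[i]; // F: fixed (quasi) random sign flips
            r *= 0x93D765DD; // MCG step
        }
        whtN(vec); // H: normalized fast Walsh Hadamard transform
    }
}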
I would note that the paper you linked in the first message (with the optical first layer) isn’t a straightforward ELM, but does some extra dimension manipulation.
So first it expands the 28x28 image to a 500x500 image by simply tiling the original. This huge (250k-pixel) image is used as input to the first “layer”, which projects it to a relatively large 10k-element vector, which is then used as input to the second, readout layer.
The optical stuff implements the otherwise huge 2.5-billion-parameter first random layer.
What is interesting is that, unlike previous/normal ELMs, this one achieves pretty good accuracy, closer to CNN performance.
However, the “fast” optical computation is limited by the frame rate of the 500x500-resolution camera.
At 33 fps the whole 60k MNIST dataset takes ~30 minutes to be read in.
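Those numbers check out: 500x500 = 250,000 inputs times 10,000 outputs gives 2.5 x 10^9 random weights, and 60,000 frames at 33 fps is about 1,800 seconds, i.e. roughly 30 minutes.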
Slightly related: FastRP is a sparse random projection algorithm initially used to represent graph nodes as N-dimensional vectors, which can then be used as such in ML pipelines like k-NN classification.