Multisynapse optical network outperforms digital AI models

The optical system may be enacting a sub-random projection.
A random projection is very extreme. If you had an object in motion against a static background, a random projection would lose the large-scale regularity of its positional information.
If you reduce the randomness a bit, maybe you can get back some of that positional regularity.

My feeling is that with quasi-random (sub-random) projections you can get some convolution-like effects, which might help with translation invariance etc.

I have mentioned on this forum a few times that you can vastly increase the memory capacity of associative memories/extreme learning machines by using random projection locality sensitive hashing (of some input vector) to select weights or blocks of weights from a vast pool, and then plugging the selected weights into the extreme learning machine to act on the input vector.
That gives it a vast memory while keeping the compute cost down.

At the moment I am thinking about high density associative memory with no locality sensitive hashing: just accepting the compute cost of thousands (up to millions) of random projections of the input data, and then finding out how well such a system can generalise.

Maybe I should just try with locality sensitive hashing associative memory/extreme learning machines.
This is where a hobbyist just gets overwhelmed.


It is worth trying though.
Have a two layer network with 20000 hidden nodes.
Project the input data/image into a very large SDR, e.g. 100/20000.
Select only the corresponding 100-out-of-20000 sub-network.
Train it with whatever method you like (see the sketch below).


Here is another blog I did:
https://sciencelimelight.blogspot.com/2025/07/extreme-learning-machines-step-by-step.html

There I hint at an error-spreading mechanism you can include so that every output neuron becomes involved in solving the problems of any other particular neuron.


There are any number of ways to use locality sensitive hashing (LSH) to select weights or blocks of weights from a larger pool of weights to increase the memory capacity of extreme learning machines while keeping the compute cost low.
I chose one way in this code:
Java Mini Collection 12 : Sean O'Connor : Free Download, Borrow, and Streaming : Internet Archive

I also showed 2 ways of combining the linear layer neuron outputs, allowing them to work together as a team, so to speak:
1./ Using a simple (outward facing) random projection.
2./ Using multiple (outward facing) random projections.
The second needs more compute but gives better sharing.
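
For what it is worth, here is one possible reading of an outward facing random projection (my own sketch and interpretation, not the code from the archive): the linear layer produces a vector z of neuron outputs, a fixed sign-flip plus Walsh-Hadamard random projection mixes z into the final output y, and because that transform is orthogonal the output error maps straight back to a per-neuron error, so every neuron gets a share of every output element's error.

public class OutwardProjection {
    // Forward: y = H*D*z, where D = fixed pseudo-random sign flips and
    // H = normalized Walsh-Hadamard transform (orthogonal, length preserving).
    public static void forward(float[] z, float[] y) {
        int r = 0x9E3779B9;
        for (int i = 0; i < z.length; i++) {
            y[i] = r < 0 ? -z[i] : z[i];
            r *= 0x93D765DD;
        }
        wht(y);
    }

    // Backward: because H*D is orthogonal, the per-neuron error is
    // (H*D)^T * e = D*H*e, spreading each output element's error over all neurons.
    public static void backward(float[] e, float[] perNeuronError) {
        System.arraycopy(e, 0, perNeuronError, 0, e.length);
        wht(perNeuronError);
        int r = 0x9E3779B9; // same sign sequence as forward()
        for (int i = 0; i < perNeuronError.length; i++) {
            if (r < 0) perNeuronError[i] = -perNeuronError[i];
            r *= 0x93D765DD;
        }
    }

    // In-place normalized Walsh-Hadamard transform; length must be a power of 2.
    static void wht(float[] v) {
        int n = v.length;
        for (int hs = 1; hs < n; hs += hs)
            for (int i = 0; i < n; i += hs + hs)
                for (int j = i; j < i + hs; j++) {
                    float a = v[j], b = v[j + hs];
                    v[j] = a + b;
                    v[j + hs] = a - b;
                }
        float scale = 1f / (float) Math.sqrt(n);
        for (int i = 0; i < n; i++) v[i] *= scale;
    }
}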

The main concern about using LSH weight switching is how it might increase training time. I don’t have much information about that.

Some improvements to extreme learning machines, then, are:

1./ Replace the random layer weight matrix with fast transform random projections.
2./ Use CReLU as the activation function for the random layer. Or experiment with Decoupled ReLU.
3./ Use outward facing random projections applied to the output of the linear layer to allow neurons to work cooperatively.
4./ Use locality sensitive hashing to increase memory capacity.
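
Of these, items 1 and 4 are both illustrated by the AMChannelSelect code further down this thread (the sign flips plus whtN are the fast transform random projection, and the key-based block selection is the LSH). For item 2, a CReLU activation could look something like this sketch of mine (not code from the linked posts); it keeps both the positive and the negative part of each projected value, so the feature count doubles:

public class CReLU {
    // CReLU applied to the output h of a sign-flip plus WHT random projection,
    // like the work[] array produced by whtN() in the AMChannelSelect code posted below.
    public static float[] crelu(float[] h) {
        float[] f = new float[2 * h.length];
        for (int i = 0; i < h.length; i++) {
            f[i] = Math.max(0f, h[i]);             // positive part
            f[h.length + i] = Math.max(0f, -h[i]); // negative part
        }
        return f;
    }
}

The linear readout then needs 2*vecLen weights per output instead of vecLen.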

In certain cases, if you use locality sensitive hashing for weight selection, that can act as a non-linear activation function in itself.
Then the random layer activation functions can be avoided, or you can say f(x)=x. It could be worth looking at. Just from a few experiments, I find ELMs with linear activation function random layers take 3 times as long to train as CReLU ones.

There is a ton of work there for anyone now or in the future who wants to look at applying those concepts to ELMs.
I can’t do much myself: time constraints, compute constraints, financial…
I’ll leave it with you.


If you want to try a simple <vector,scalar> ELM with locality sensitive hashing block switching of weights:

public class AMChannelSelect {
    final int vecLen;
    final float[] wts;
    final float[] work;
    final char[] keys;
    final float rate;
    public AMChannelSelect(int vecLen,float rate) {
        this.vecLen=vecLen;
        this.rate=rate/vecLen; // step size normalized by the number of weights updated per example
        wts=new float[256*vecLen]; // 256 candidate blocks of 8 weights for each of the vecLen/8 channels
        keys=new char[vecLen/8];   // one 8-bit LSH key per channel
        work=new float[vecLen];
    }
    float recall(float[] input) {
        int r=0x9E3779B9; //random number seed
        for(int i=0; i<vecLen; i++) {
            work[i]=r<0? -input[i]:input[i];
            r*=0x93D765DD; // MCG random number generator
        }
        whtN(work); //fast Walsh Hadamard transform
        // Derive an 8-bit LSH key for each block of 8 projected values.
        // The key bits start at 8, so 2048*k+keys[k] always indexes an aligned
        // block of 8 weights out of the 256 candidate blocks per channel.
        for(int j=0,k=0; j<vecLen; j+=8,k++) {
            int a=work[j]<0f? 8:0;
            a|=work[j+1]<0f? 16:0;
            a|=work[j+2]<0f? 32:0;
            a|=work[j+3]<0f? 64:0;
            a|=work[j+4]<0f? 128:0;
            a|=work[j+5]<0f? 256:0;
            a|=work[j+6]<0f? 512:0;
            a|=work[j+7]<0f? 1024:0;
            keys[k]=(char)a;
        }
        // Second fixed random projection; the MCG continues from where the first loop left off.
        for(int i=0; i<vecLen; i++) {
            work[i]=r<0? -work[i]:work[i];
            r*=0x93D765DD;
        }
        whtN(work);
        float result=0f;
        for(int i=0,k=0; i<vecLen; i+=8,k++) {
            int idx=2048*k+keys[k]; // keys[k] selects which of the 256 weight blocks this channel uses
            result+=work[i]*wts[idx];
            result+=work[i+1]*wts[idx+1];
            result+=work[i+2]*wts[idx+2];
            result+=work[i+3]*wts[idx+3];
            result+=work[i+4]*wts[idx+4];
            result+=work[i+5]*wts[idx+5];
            result+=work[i+6]*wts[idx+6];
            result+=work[i+7]*wts[idx+7];
        }
        return result;
    }
    public void train(float target,float[] input) {
        // recall() also refreshes keys[] and work[] for this input, so they can be reused here.
        float e=(target-recall(input))*rate;
        for(int i=0,k=0; i<vecLen; i+=8,k++) {
            int idx=2048*k+keys[k];
            wts[idx]+=e*work[i];
            wts[idx+1]+=e*work[i+1];
            wts[idx+2]+=e*work[i+2];
            wts[idx+3]+=e*work[i+3];
            wts[idx+4]+=e*work[i+4];
            wts[idx+5]+=e*work[i+5];
            wts[idx+6]+=e*work[i+6];
            wts[idx+7]+=e*work[i+7];
        }
    }
    static void whtN(float[] vec) {
        int n = vec.length;
        int hs = 1;
        while (hs < n) {
            int i = 0;
            while (i < n) {
                final int j = i + hs;
                while (i < j) {
                    float a = vec[i];
                    float b = vec[i + hs];
                    vec[i] = a + b;
                    vec[i + hs] = a - b;
                    i += 1;
                }
                i += hs;
            }
            hs += hs;
        }
        float scale = 1f / (float) Math.sqrt(n);
        for (int i = 0; i < n; i++) {
            vec[i]*=scale;
        }
    }
    public static void main(String[] args) {
        AMChannelSelect amcg=new AMChannelSelect(256,1.3f);
        float[][] example=new float[512][256];
        System.out.println("Training example recall.");
        for(int i=0; i<65536*2; i++) { // fill 512 random example vectors of length 256
            example[i>>8][i&255]=2f*(float)Math.random()-1f;
        }
        for(int i=0; i<1000; i++) { // 1000 epochs
            for(int j=0; j<512; j++) {
                amcg.train(1f-(j&2), example[j]); // targets follow the pattern +1,+1,-1,-1,...
            }
        }
        for(int i=0; i<512; i++) {
            System.out.println("Recall: "+amcg.recall(example[i])+"  Target: "+(1-(i&2)));
        }
        System.out.println("Random input recall.");
        float[] r=new float[256];
        for(int i=0; i<10; i++) {
            for(int j=0; j<256; j++) {
                r[j]=2f*(float)Math.random()-1f;
            }
            System.out.println("Recall for random input: "+amcg.recall(r));
        }
        System.out.println("Positive example to negative example recall");
        for(int i=0; i<11; i++) {
            float proportion=0.1f*i;
            for(int j=0; j<256; j++) {
                r[j]=(1f-proportion)*example[0][j]+proportion*example[2][j];
            }
            System.out.println("Recall: "+amcg.recall(r));
        }
    }
}

I don’t know how the training time increases as you get near to the capacity limit of AM/ELM. Maybe it spirals out of control. I’ll test it tomorrow or when I have time.

It also looks as if, when you only want a binarized output out of such AMs, the capacity is much higher. They could potentially store a lot of <vector,boolean> associations.
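
For example (this is just my assumption of what a binarized readout would look like), a method like this added to AMChannelSelect keeps only the sign of the recalled value, matching the ±1 targets used in the code above:

    // Binarized <vector,boolean> readout: keep only the sign of the recalled value.
    boolean recallBit(float[] input) {
        return recall(input) >= 0f;
    }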


I do have a method of turning equal-significance boolean values into, say, an image (where the bits in each pixel have unequal significance, i.e. 1, 2, 4, 8…)

It’s in one of the Mini Java Collection things I did.


Also, in terms of time saving, you can pre-compute all the random projection work once, since it is fixed, and then just train the linear layer by SGD using the pre-computed random layer data.
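
A small sketch of that idea (the class and method names here are my own, and randomLayer stands in for whatever fixed projection plus activation is used): the random layer is applied to each training input exactly once, and the SGD epochs then touch only the cached features and the linear readout weights.

import java.util.function.Function;

public class PrecomputedReadoutTrainer {
    // features[j] = randomLayer(inputs[j]) is computed once, outside the epoch loop.
    public static float[] trainReadout(float[][] inputs, float[] targets,
                                       Function<float[], float[]> randomLayer,
                                       int epochs, float rate) {
        int n = inputs.length;
        float[][] features = new float[n][];
        for (int j = 0; j < n; j++) features[j] = randomLayer.apply(inputs[j]); // one-off cost
        int width = features[0].length;
        float[] w = new float[width];              // linear readout weights
        float lr = rate / width;
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int j = 0; j < n; j++) {
                float y = 0f;
                for (int i = 0; i < width; i++) y += w[i] * features[j][i];
                float e = (targets[j] - y) * lr;   // SGD step on the readout only
                for (int i = 0; i < width; i++) w[i] += e * features[j][i];
            }
        }
        return w;
    }
}

The trade-off is the memory for the cached feature matrix: at 256 floats per cached feature vector, 655360 training pairs would already need roughly 0.7 GB.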

Maybe the Moore-Penrose pseudoinverse is cheaper than SGD.

Let me ask chatGPT what the computational cost of Moore-Penrose is:

Computational Cost via SVD
For an m×n matrix (assuming m ≥ n), computing the full SVD takes:
O(mn²)

So I would say for an AM/ELM used under its maximum storage capacity, use SGD; at or over capacity you would have to try the 2 methods and see for yourself.


I trained the code “AMChannelSelect” at its capacity limit of 65536 training examples (256*vecLen) with SGD. It took a few minutes.

It sounds like that is outside the range of Moore-Penrose, which chatGPT roughly indicates as:

Rule of Thumb for Limits
1. In Practice (on a modern laptop or workstation):
For dense float64 matrices:

You can generally handle up to ~10,000 x 10,000 on 16–32 GB RAM.


2. Upper Limits on HPC or GPU:
On high-end machines with 256+ GB RAM or large GPU VRAM, it's possible to perform SVD on:

100,000 x 10,000 or larger (especially for sparse or structured matrices)

Also for <vector,vector> associations there are other reasons to use SGD.

There are at least a couple of ways (using random projections) to entangle all the output neurons together such that they all work together to produce the final vector output. There are some subtle aspects to that.

I don’t think Moore-Penrose can get everything working so well together, or you can say it can’t tie everything so well together.

I showed the 2 ways I know in mini java collection 12.

I’ll maybe try 10 times over capacity (655360 training pairs) but I think I’m going to run out of memory (to store all the training pairs) and the run time will end up 1/2 an hour to 1 hour on 1 core of a celeron CPU (lol.)

If I run out of memory I can use a synthetic reset-able training set.

If a model takes more than 2 or 3 minutes to train these days, I don’t want to run it. However I will try, for science, lol. And see how the binary storage capacity holds up.

Used over capacity the outputs will always be mixed with Gaussian noise.

I have doubts if I am using my time well.


That didn’t work very well for recall at 10 times over capacity (capacity 65536, 655360 training pairs, 3000 epochs), even just looking at binary recall:

Training example recall.
Recall: 0   -0.89636475  Target: 1
Recall: 1   0.840884  Target: 1
Recall: 2   0.2864483  Target: -1
Recall: 3   -0.79317194  Target: -1
Recall: 4   0.009755392  Target: 1
Recall: 5   -0.6196442  Target: 1
Recall: 6   0.21226698  Target: -1
Recall: 7   -0.10435073  Target: -1
Recall: 8   0.8231896  Target: 1
Recall: 9   0.51121914  Target: 1
Recall: 10   -0.08090544  Target: -1
Recall: 11   1.3906565  Target: -1
Recall: 12   0.26305285  Target: 1
Recall: 13   -0.13218075  Target: 1
Recall: 14   0.016117029  Target: -1
Recall: 15   0.33171168  Target: -1
Recall: 16   -0.5244387  Target: 1
Recall: 17   0.2491003  Target: 1
Recall: 18   -0.109991275  Target: -1
Recall: 19   -0.17930266  Target: -1
Recall: 20   -0.77553254  Target: 1
Recall: 21   -1.1435546  Target: 1
Recall: 22   0.28032562  Target: -1
Recall: 23   0.2696857  Target: -1
Recall: 24   -0.16724864  Target: 1
......
Recall: 655335   -0.82930076  Target: -1
Recall: 655336   0.5870567  Target: 1
Recall: 655337   -0.009705885  Target: 1
Recall: 655338   -0.13240032  Target: -1
Recall: 655339   -0.35593787  Target: -1
Recall: 655340   0.16603643  Target: 1
Recall: 655341   0.060148865  Target: 1
Recall: 655342   -0.62814444  Target: -1
Recall: 655343   -0.34103644  Target: -1
Recall: 655344   0.48699227  Target: 1
Recall: 655345   0.31397507  Target: 1
Recall: 655346   -0.06336384  Target: -1
Recall: 655347   -0.39793578  Target: -1
Recall: 655348   0.86369073  Target: 1
Recall: 655349   -0.03144267  Target: 1
Recall: 655350   -0.81388927  Target: -1
Recall: 655351   -0.14743353  Target: -1
Recall: 655352   0.31912792  Target: 1
Recall: 655353   -0.009526527  Target: 1
Recall: 655354   0.08939498  Target: -1
Recall: 655355   -0.4895462  Target: -1
Recall: 655356   0.3983178  Target: 1
Recall: 655357   -0.27811414  Target: 1
Recall: 655358   -0.6115337  Target: -1
Recall: 655359   -0.4846582  Target: -1
Random input recall.
Recall for random input: -1.1118413
Recall for random input: -0.52902824
Recall for random input: -0.65795
Recall for random input: 0.26418895
Recall for random input: -0.43826398
Recall for random input: 0.30504435
Recall for random input: -0.5826022
Recall for random input: -0.4846816
Recall for random input: -0.2605414
Recall for random input: -0.09474735

That took 5 hours. Experimenting with different training rates and increased epoch counts is out of my range.

Anyway, looking at the noise outputs for entirely random inputs at under-capacity (over-parameterized), at capacity, and over-capacity (under-parameterized):

Under (256 training pairs):

Recall for random input: -0.037847478
Recall for random input: -0.04747241
Recall for random input: -0.0059197014
Recall for random input: -0.047782857
Recall for random input: -0.12931326
Recall for random input: -0.016003104
Recall for random input: -0.021232385
Recall for random input: -0.013360451
Recall for random input: 0.035687983
Recall for random input: 0.021933176

Capacity (65536 training pairs):

Recall for random input: -5.9002557
Recall for random input: 3.7247772
Recall for random input: -4.377911
Recall for random input: 2.886181
Recall for random input: -6.076513
Recall for random input: -2.7542732
Recall for random input: -4.665001
Recall for random input: -1.4750128
Recall for random input: 7.0121403
Recall for random input: -3.3385997

Over-capacity (655360 training pairs):

Recall for random input: -1.1118413
Recall for random input: -0.52902824
Recall for random input: -0.65795
Recall for random input: 0.26418895
Recall for random input: -0.43826398
Recall for random input: 0.30504435
Recall for random input: -0.5826022
Recall for random input: -0.4846816
Recall for random input: -0.2605414
Recall for random input: -0.09474735

That is because the weight vector magnitude stretches to its maximum at capacity, to fit all the training examples. Over capacity, the weight vector begins to average out at a lower overall magnitude, as you can experiment with visually here:

https://sites.google.com/view/algorithmshortcuts/weighted-sum-info-storage
