Numenta Sparse Nets

To get the highest-magnitude output from a dot product, the angle between the input vector and the weight vector should be as small as possible. In the Numenta net, top-k selection must therefore be highly correlated with smallest-angle selection. At each layer, the neurons whose weights make the smallest angle with the input are being selected. That is clear (perhaps even error-correcting) pattern detection, repeated layer after layer.
In some previous Numenta paper I remember they discussed dot products between sparse input vectors and (more) dense weight vectors and the combinatorial math involved.
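Here is a minimal numpy sketch (my own toy setup, not Numenta's code) of that correlation: when the weight rows have broadly similar norms, top-k selection by activation picks largely the same neurons as smallest-angle (largest cosine) selection.

```python
# Toy demo: top-k by dot product vs. top-k by smallest angle to the input.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 64, 256, 20                                  # input dim, layer width, top-k

x = rng.normal(size=d)
W = rng.normal(size=(n, d)) * rng.uniform(0.5, 1.5, size=(n, 1))  # rows of varying norm

acts = W @ x                                           # dot products
cos = acts / (np.linalg.norm(W, axis=1) * np.linalg.norm(x))      # cosine of each angle

topk_by_dot = set(np.argsort(acts)[-k:])
topk_by_angle = set(np.argsort(cos)[-k:])              # largest cosine = smallest angle
print(len(topk_by_dot & topk_by_angle), "of", k, "selections agree")
```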

I would say the Numenta net is very good at classification. However, top-k is an information bottleneck. I doubt such a net can produce high-dimensional image responses. It might be very good for text, given the precision of its pattern detection.

A composition of several (connected) dot products can be simplified back to a single dot product. When all the top-k switch states in the Numenta net become known, it collapses to a simple matrix (a bunch of dot products with the input vector). Very likely the angles with the input vector are small, by inheritance during feedforward. That should mean the noise sensitivity is low. You could work that out using the variance equation for linear combinations of random variables, applied to dot products.
You could try k-least-angle selection, or even some more exact least-noise selection.
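To make the collapse concrete, here is a sketch (my own notation, not Numenta's) showing that once the top-k winners of every layer are fixed, the whole stack reduces to one matrix acting on the input.

```python
# Once the top-k switch states are known, the layered net is a single matrix.
import numpy as np

def topk_mask(v, k):
    m = np.zeros_like(v)
    m[np.argsort(v)[-k:]] = 1.0
    return m

rng = np.random.default_rng(1)
d, h, k = 32, 64, 10
W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(h, h))
x = rng.normal(size=d)

m1 = topk_mask(W1 @ x, k)                      # layer 1 switch states
m2 = topk_mask(W2 @ (m1 * (W1 @ x)), k)        # layer 2 switch states

W_collapsed = (m2[:, None] * W2) @ (m1[:, None] * W1)   # one equivalent matrix
y_net = m2 * (W2 @ (m1 * (W1 @ x)))
y_flat = W_collapsed @ x
print(np.allclose(y_net, y_flat))              # True: same output for this input
```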

If you want the value 1 out of a dot product that has equal noise on all inputs, you can make one input 1 and one weight 1 (and the other weights zero). That cuts out most of the noise. Or you can make all the inputs 1 and all the weights 1/d (d = dimensions). That will average the noise. Averaging the noise is better.
In both cases the angle between the input vector and the weight vector is zero. As you increase the angle toward 90 degrees, the length of the weight vector must increase to keep getting 1 out, and as a result the output will get very noisy.
Thus there are two factors that control the noise response of a dot product: whether it cuts, averages, or does something in between, and the angle.
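A tiny worked example of those two factors (my numbers, not from the post above). With independent noise of variance s2 on each input, the output noise of a dot product w·x is sum_i w_i^2 * s2, the variance of a linear combination.

```python
# "Cut" vs. "average", plus the angle penalty, for a unit-length input.
import numpy as np

d, s2 = 100, 1.0

w_cut = np.zeros(d); w_cut[0] = 1.0       # cut: one weight 1, the rest 0
w_avg = np.full(d, 1.0 / d)               # average: all weights 1/d

print(np.sum(w_cut**2) * s2)              # 1.0  -> passes one input's noise unchanged
print(np.sum(w_avg**2) * s2)              # 0.01 -> noise variance averaged down by d

# To still get 1 out at angle theta, |w| must grow as 1/cos(theta),
# so the output noise variance grows as 1/cos(theta)**2.
for theta in (0.0, np.pi / 3, 0.49 * np.pi):
    print(round(np.degrees(theta)), 1.0 / np.cos(theta) ** 2)
```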

I actually learned quite a number of things thinking about that net. Thanks.

Top-k selection is an information bottleneck. If k is 20% of the network width, 80% of the neurons in a layer are zeroed.
The net may not be ideal for generating images because of that.
Parametric ReLU is a well-understood thing in conventional NN research. I think a two-sided version of ReLU is even more useful.
A solution to the bottleneck problem might be 2-sided parametric top-k selection, where if a dot product output is in the top-k selection for its layer it is multiplied by one specific parameter, and otherwise it is multiplied by a different specific parameter.
So each neuron has 2 parameters of its own.
You can say ‘oh, but that destroys sparsity.’ However, the training algorithm can decide to set the non-selected parameter to zero if that is optimal.
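A minimal sketch of that 2-sided parametric top-k idea as I understand it (names and shapes are my assumptions, not an existing API): every neuron i owns two trainable scalars (a_i, b_i); winners are scaled by a_i, losers by b_i.

```python
import numpy as np

def two_sided_topk(acts, a, b, k):
    """acts, a, b: 1-D arrays of equal length; k: number of winners."""
    winners = np.zeros(acts.shape, dtype=bool)
    winners[np.argsort(acts)[-k:]] = True
    return np.where(winners, a * acts, b * acts)

rng = np.random.default_rng(2)
acts = rng.normal(size=8)
a = np.ones(8)            # winners pass through
b = np.full(8, 0.1)       # losers leak a little; training could drive b toward 0
print(two_sided_topk(acts, a, b, k=2))
```

If training drives b to zero you recover ordinary top-k sparsity, so nothing is lost.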

If you are not going to put convolutions before such a sparse net, you might put a random projection. Random projections can provide ‘fair’ dimension reduction or increase, in this case providing each sparse neuron a ‘fair’ sample of all the input values to the net. Likewise, at the end it is cheaper to do a final random projection than a dense final readout layer.
It’s repetitious to say that a fixed, randomly chosen pattern of sign flips before a fast (Walsh) Hadamard transform is a quick random projection, but look, I said it again. Or oops, I said it again. You can choose a sub-random pattern of sign flips for some applications. Or apply the RP algorithm twice if the input to the net is sparse itself. Also, since the Numenta net uses sparse (low-dimension) dot products, it may allow non-statistical solutions out of the reach of BP. Perhaps an evolution-based algorithm would help. :smiley_cat::smiley_cat::smiley_cat:
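For reference, a sketch of that fast random projection: a fixed pattern of random sign flips followed by a fast Walsh-Hadamard transform (the length must be a power of 2; the orthonormal scaling is my choice).

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(n log n), n a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))                 # orthonormal scaling

def random_projection(x, signs):
    return fwht(signs * x)                     # flip signs, then transform

rng = np.random.default_rng(3)
n = 1024
signs = rng.choice([-1.0, 1.0], size=n)        # chosen once, then kept fixed
x = rng.normal(size=n)
y = random_projection(x, signs)
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))   # norm preserved
```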


I am not a fan of giving any given neuron a sample of the entire input field.

Real HTM, emulating cortex, uses "topology enabled" to keep connections local. In that context the calculation and training are all local operations.

Since cortical columns happen to be the only topology known to implement human-level intelligence, it seems appropriate to try to learn how they do it.

It’s true that artificial nets have done some amazing work in central-tendency extraction, but it seems we need more to do good AI. In particular, your model needs not only pattern completion but also serial memory features. You will need to respond to counterflowing streams of information and do everything with local operations.

I agree using a prior random projection would change the character of the net.
Really, all I am providing is a list of experimental options to try.
Anyway, the entire situation with unconventional neural networks is just an exploration of switching and dot products.
Dot products:
Fully connected weighted sum.
Randomly chosen sparse weighted sums.
Pruned sparse weighted sums.
Quantized weighted sums.
Convolutions as dot products.
Fast Transforms as dot products.
Fast Random Projections as dot products.
Wavelets as dot products.

Switching:
ReLU
Parametric ReLU
Leaky ReLU
2 sided Parametric ReLU
Top-k magnitude selection (switching)
2 sided parametric top-k selection
Locality Sensitive Hash based switching
Max Pooling

Only a limited number of the combinations have been explored, because of a delayed understanding that switching is involved. In digital circuits a fixed DC voltage is involved, and that constant potential is the only thing switched. That is the most many CS people have seen of switching, hence the failure to see the analog aspect, where you could switch 110 or 220 volts AC, continuously variable, for example.
Once you get that ReLU is a switch, and that the switching action is analog and a separate thing from the switching decision, then things are a lot clearer.
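To make that concrete, here is ReLU written out explicitly as a switch (a trivial sketch): the switching decision is a comparison, while the switching action is multiplying the analog value by 1 or 0 (connect or disconnect).

```python
import numpy as np

def relu_as_switch(x):
    decision = (x >= 0).astype(x.dtype)   # binary switch state per element
    return decision * x                   # the analog signal is what gets switched

print(relu_as_switch(np.array([-2.0, -0.5, 0.0, 0.7, 3.0])))   # [0.  0.  0.  0.7 3. ]
```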

I see LightOn have a fast optical Random Projection (RP) device you can buy. You could make a neural net out of that with 2-sided Parametric ReLU (f_i(x) = a_i*x for x < 0, f_i(x) = b_i*x for x >= 0, i = 0 to m). Maybe 2LU is a name you could use for that :sweat_smile: or not.
Anyway such a net could be: RP, 2LU, RP, 2LU…RP
Training by evolution (continuous Gray code optimization).
You can also use RP for associative memory.
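Here is a sketch of that RP, 2LU, RP, 2LU…RP stack. I stand in for the optical RP with fixed random orthogonal matrices, and the slopes A and B are the per-unit 2LU parameters that evolution would adjust; all names here are my own.

```python
import numpy as np

def two_lu(x, a, b):
    # 2-sided Parametric ReLU: slope a where x < 0, slope b where x >= 0.
    return np.where(x < 0, a * x, b * x)

rng = np.random.default_rng(4)
n, layers = 16, 3
RPs = [np.linalg.qr(rng.normal(size=(n, n)))[0] for _ in range(layers + 1)]  # stand-in RPs
A = rng.uniform(0.0, 1.0, size=(layers, n))   # trainable negative-side slopes
B = rng.uniform(0.5, 1.5, size=(layers, n))   # trainable positive-side slopes

def net(x):
    x = RPs[0] @ x
    for i in range(layers):
        x = two_lu(x, A[i], B[i])
        x = RPs[i + 1] @ x
    return x

print(net(rng.normal(size=n)))
```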

There is not too much point in developing super special hardware for very fast neural networks. The inference speed on even a CPU is likely to be fine for applications.
For training there is no way to engineer the extreme memory bandwidth needed to feed training examples fast enough to special neural hardware.
I think you just use federated learning with say a large bunch of ARM boards to start with.
If some particular algorithm like Numenta’s sparse net or fast transform net became popular then CPU designers could add in a compute block for that. And later I suppose you could have a fully specialized CPU.
People always want the super greatest hardware. It is rather funny to say cheap conventional hardware will actually do fine.


We have forum members that have access to FPGA development kits where it is almost as easy to cast hardware as it is to write code on a conventional PC.

Yeah, I would guess memory bandwidth isn’t a problem for the Numenta sparse net on FPGA hardware. I think LightOn’s situation is more complex.
You can distribute back-propagation over many CPU cores but that requires a lot of network bandwidth.
Or you can distribute evolution over many CPU cores which requires only a small amount of network bandwidth.
Evolution, when I tried it, gave better generalization than BP. However, I am a non-expert at BP.
Most NN experts would reject the idea that you can evolve the weights in a large NN. However, it worked for me!!! Very likely that is because both BP and evolution can only really search the same subset of fitting solutions, ones that are statistical in nature and don’t involve super-complicated function composition: a kind of simplified mode of learning existing within neural networks.
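For what it’s worth, here is a bare-bones evolution loop, a simple (1+1) mutate-and-keep-if-better scheme standing in for the continuous Gray code method mentioned earlier, fitting a small linear model as a toy problem.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = X @ true_w

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(8)
best = loss(w)
for step in range(20000):
    trial = w + rng.normal(scale=0.05, size=8)   # mutate every weight a little
    cost = loss(trial)
    if cost < best:                              # keep only improvements
        w, best = trial, cost
print(best)                                      # approaches 0 on this toy problem
```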

I guess there are use cases for ultra-fast neural networks, like radar, where 50 million plus inferences per second would just be okay. There is even quantum radar, where an entangled pair of radio-frequency photons is created. One is sent out and one is kept, likely slowed in a metamaterial and observed. I was just thinking the number of photons you would need to send out is way less than with conventional radar, because you are not relying on reflection.
Also CERN would have use for super fast nets. I saw on phys.org there is some work being done on analog Parametric ReLU.
Anyway, I suppose one way to test whether a neural net is purely a statistical device is to remove one weight at a time, or one neuron at a time, and see if that only ever causes a small statistical change in the behavior of the net, or whether it sometimes has a big impact because it subverts a neuron that has a special meaning, function, or role to play.
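A sketch of that ablation test on a toy two-layer ReLU net (my own setup): zero one weight at a time and record how much the output moves. Mostly small, uniform impacts would suggest a purely statistical fit, while a few large outliers would suggest weights with a special role.

```python
import numpy as np

rng = np.random.default_rng(6)
d, h = 16, 32
W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(1, h))
X = rng.normal(size=(100, d))

def forward(W1, W2, X):
    return np.maximum(X @ W1.T, 0.0) @ W2.T        # ReLU hidden layer, linear readout

base = forward(W1, W2, X)
impacts = np.zeros(W1.shape)
for i in range(h):
    for j in range(d):
        W1_ablated = W1.copy()
        W1_ablated[i, j] = 0.0                     # remove a single weight
        impacts[i, j] = np.mean(np.abs(forward(W1_ablated, W2, X) - base))

print(impacts.mean(), impacts.max())
```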