ReLU neural networks as amplitude modulated dictionaries

There seem to be many vantage points from which to view ReLU neural networks.
When a ReLU neuron is on (f(x)=x), its forward weights light up a vector pattern in the next layer, and that pattern is amplitude modulated by whatever x happens to be.
That is a 1-bit dictionary lookup with amplitude modulation: each neuron projects a different lookup pattern onto the next layer, where the patterns sum together.
That suggests a more general neural network where you double the number of weight parameters per layer. When a neuron is on it projects one amplitude modulated pattern onto the next layer; when it is in the alternative state (no longer really "off") it projects a different amplitude modulated pattern onto the next layer, using the additional set of weights.
A sparse net then projects sparse amplitude modulated patterns onto the next layer. You can also double up the weights for that and actually project very different sparse patterns for each neuron's (binary) state.
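A minimal NumPy sketch of the dual-weight idea, assuming a layer of n neurons feeding m neurons in the next layer (the function name, shapes, and sign-based switching rule are my own illustration, not from the post):

```python
import numpy as np

def dual_weight_layer(x, W_on, W_off):
    """Two-state layer: each neuron amplitude-modulates one of two
    weight patterns with its (signed) pre-activation x_i.

    x     : (n,)   pre-activations of the current layer
    W_on  : (n, m) pattern each neuron projects when x_i >= 0
    W_off : (n, m) pattern each neuron projects when x_i < 0
    """
    on = x >= 0
    # Each neuron contributes x_i times its selected pattern;
    # the per-neuron contributions sum into the next layer.
    return (x * on) @ W_on + (x * ~on) @ W_off

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_on = rng.standard_normal((8, 4))
W_off = rng.standard_normal((8, 4))
y = dual_weight_layer(x, W_on, W_off)

# Setting W_off to zero recovers an ordinary ReLU layer:
assert np.allclose(dual_weight_layer(x, W_on, np.zeros_like(W_off)),
                   np.maximum(x, 0) @ W_on)
```

The W_off = 0 check makes the "ReLU as a special case" reading concrete: a standard ReLU layer is this layer with the alternative weight set forced to zero.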
For Numenta sparse nets with top-k selection you could double up using max-k, min-k selection with alternative sets of weights for both.

It should improve information flow through a network. Instead of ReLU stopping information flow approximately 50% of the time, information can flow through the alternative weights instead.
Also quite interesting is the effect on sparse nets: rather radical switching between sparse patterns becomes possible, depending on whether the input to the neuron (which is now primarily a weight-set switching device) is above or below zero.


Fast type of neural network based on that:

Interesting proposal. While it doesn’t really align well with the sparse distributed networks often discussed around here, I can see its potential for performing simple pattern encoding as a superposition of learned weight vectors.

There’s nothing special about the origin. You might want to introduce a learnable/tunable bias parameter such that f(x,b) = |x-b| (or just x-b). That allows a little more flexibility in your decision hyperplane (it does not have to pass through the origin). I believe the literature on Support Vector Machines might be useful to you for finding additional insights into this type of model.


Yeah, I think it is quite problematic with the top-k type sparsity being explored by Numenta.
I find I don’t need biasing terms because I use (sub) random projections as pre-processing for the neural networks I have experimented with.

I don’t know what exact way you could include biasing if you needed to and I’m not paid to think about it.

Anyway, it seems that the more you can prevent information loss as data flows through layer after layer of a neural network, the better the result.

I’ll leave it to other people to explore the concept.
Somebody else’s problem.

I think if you want to include a bias term you can put it in the activation function, which was f(x)=x with entrained switching of the forward connected weight vector. You just change that to f(x)=x+bi,
where bi is the bias term you want to include.
The forward connected weight vector switching would then be:

if ((x + bi) >= 0) {
    use forward connected weight vector A
} else {
    use forward connected weight vector B
}

Bias terms are very suspicious with conventional ReLU neural networks. In theory you could replace them by having one or a few input terms to the neural network set to constant non-zero values.
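The claim that biases can be replaced by constant inputs checks out in a few lines (the variable names and sizes here are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))
b = rng.standard_normal(3)
x = rng.standard_normal(5)

# Ordinary biased layer.
y_biased = x @ W + b

# Same result: append a constant-1 input and absorb the bias
# into an extra weight row. The "bias" is now just an ordinary
# weight attached to a constant non-zero input.
W_aug = np.vstack([W, b])      # (6, 3)
x_aug = np.append(x, 1.0)      # (6,)
y_const = x_aug @ W_aug

assert np.allclose(y_biased, y_const)
```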

In a Numenta top-k sparse network there are 2 states a neuron can be in.
In the top-k or not in.
To avoid information loss in the ‘not in’ case I personally would have one weight connected forward to the neuron in the same position in the next layer.
For the ‘in’ case you could have 4 or 5 forward weight connections to arbitrary neurons in the next layer.
The activation function is then just f(x)=x. What you are really doing is just switching between forward weight connections based on whether the neuron is ‘in’ or ‘not in’ the top-k.
There are many variations possible. For example, switching between 10 and 2 forward connected weights.
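A sketch of the top-k variant described above, assuming NumPy; the sparse fan-out for the 'in' case and the identity connection for the 'not in' case follow the post, but the exact construction (random sparse mask, argsort-based top-k) is my own guess at a minimal implementation:

```python
import numpy as np

def topk_switch_layer(x, W_in, W_out, k):
    """Top-k layer with two forward weight sets.

    Neurons in the top-k project through W_in; the rest project
    through W_out (here a near-identity), so 'losing' neurons still
    pass information forward instead of being silenced.
    """
    idx = np.argsort(x)[-k:]                 # indices of the k largest values
    in_topk = np.zeros_like(x, dtype=bool)
    in_topk[idx] = True
    return (x * in_topk) @ W_in + (x * ~in_topk) @ W_out

n, k = 8, 3
rng = np.random.default_rng(2)
# 'in' weights: a few connections per neuron to arbitrary targets
# (roughly 4-5 non-zeros per row on average).
W_in = rng.standard_normal((n, n)) * (rng.random((n, n)) < 5 / n)
# 'not in' weights: one weight forward to the same position.
W_out = np.eye(n)

x = rng.standard_normal(n)
y = topk_switch_layer(x, W_in, W_out, k)
```

If both weight sets are the identity, the layer passes x through unchanged, which is a quick sanity check that the two masks partition the neurons.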
I would also be very inclined to intersperse WHTs (Walsh-Hadamard transforms) between the layers as a connectionist device.
Personal opinion, don’t quote me on it. Lol.