A fast transform (the Walsh Hadamard Transform) can be used as a fully connected set of fixed weights. Used with a conventional neural network activation function, that would give you a real but completely nonadjustable neural network layer.
Something must bend. What you can do is use individually adjustable parametric activation functions.
With switching based parametric activation functions the resulting fast transform neural network is very similar to a ReLU based network, though faster to compute and using fewer parameters per layer.
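As a rough illustration of the idea, here is a minimal NumPy sketch of one such layer. The names `wht` and `switched_layer`, the two-slope form of the activation, and the order (activation first, then transform) are my assumptions for the example, not a fixed definition.

```python
import numpy as np

def wht(x):
    """Fast Walsh Hadamard transform in O(n log n); len(x) must be a power of 2.
    Scaled by 1/sqrt(n) so the transform is orthonormal and self-inverse."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        # Butterfly step: pair up blocks of size h as (a + b, a - b).
        x = x.reshape(-1, 2, h)
        x = np.stack((x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]), axis=1).reshape(n)
        h *= 2
    return x / np.sqrt(n)

def switched_layer(x, pos_slope, neg_slope):
    """Assumed layer form: a per-element two-slope switch, then the fixed WHT.
    The sign of each element selects which of its two slopes is connected."""
    y = np.where(x >= 0, pos_slope * x, neg_slope * x)
    return wht(y)

# Usage: a width-8 layer has only 2 * 8 adjustable parameters.
rng = np.random.default_rng(0)
n = 8
pos_slope = rng.standard_normal(n)
neg_slope = rng.standard_normal(n)
out = switched_layer(rng.standard_normal(n), pos_slope, neg_slope)
```

The transform itself has no adjustable parameters, so everything that can be trained sits in the two slope vectors.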
What is the commonality between so called fast transform neural networks and conventional ReLU neural networks? ReLU can be viewed as a single pole single throw switch, and the activation function in the fast transform net can be viewed as a single pole double throw switch. In either case they are connecting and disconnecting different (or potentially different) dot products. In one case the dot products (weighted sums) are internally fully adjustable, in the other only the magnitudes of a set of orthogonal dot products are adjustable. It might seem obvious which one to choose, but I wouldn’t be so hasty, because of the ‘orthogonal’ part and how variously scaled versions of one set of orthogonal dot products will interact when processed by further orthogonal dot products.
In any event for a particular input the state of each switch becomes known.
Each neuron in the net receives some particular composition of dot products of switched (connected or disconnected) dot products of switched dot products…
However the dot product of a number of dot products can be condensed right back down to a single simple dot product of the input vector.
In particular the value of each output neuron can be viewed as that of a single dot product with the input vector. The entire output vector is then given by a matrix of such dot products applied to the input.
For both types of neural network a particular input causes the switches in the system to be thrown decidedly one way or the other, inducing a particular matrix, and a matrix multiply mapping of the input vector to the output vector.
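The condensing step can be checked numerically. This is a small sketch for a bias-free ReLU network with made-up random weights (the sizes and the names `forward` and `induced_matrix` are just for illustration): once the switch states for an input are known, the whole network collapses to one matrix for that input.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-layer ReLU net without biases: 4 inputs, 3 outputs.
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((8, 8))
W3 = rng.standard_normal((3, 8))

def forward(x):
    """Ordinary forward pass."""
    h1 = np.maximum(W1 @ x, 0.0)
    h2 = np.maximum(W2 @ h1, 0.0)
    return W3 @ h2

def induced_matrix(x):
    """Record each ReLU's switch state for this input, then condense
    the layers into a single matrix: W3 * diag(s2) * W2 * diag(s1) * W1."""
    s1 = (W1 @ x > 0).astype(float)
    h1 = s1 * (W1 @ x)
    s2 = (W2 @ h1 > 0).astype(float)
    return W3 @ (s2[:, None] * W2) @ (s1[:, None] * W1)

x = rng.standard_normal(4)
M = induced_matrix(x)
# Each row of M is the condensed dot product for one output neuron.
assert np.allclose(M @ x, forward(x))
```

A different input can throw the switches differently and so induce a different matrix, which is the shape changing referred to here.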
Then what they have in common is some kind of dancing matrix operation, like Proteus changing shape on the beach.
What if most of the synapses in the brain were fixed and random? Then the thing would be a vast 3D fast transform random projection. Only a small percentage of synapses/neurons would need to act as activation functions within that matrix of fast transform random projections. That would make learning more efficient in the sense that far fewer parameters would need to be adjusted to get a particular wanted behavior or response.
I’m not saying that is how it is, I’m just putting it forward as an idea.
It does intuitively sound like the expressiveness of having large numbers of adjustable parameters in a conventional artificial neural network would mean a fast transform network was more or less useless. However if you frame the issue in terms of the number of parameters per switch, the number of parameters needed to make a switching decision then things appear in a decidedly different light. It is the switching decisions which actually fit changes in curves, the more per parameter the better all other things being equal.
I don’t see any problem with the biological brain constructing quite efficient random projections; it should actually be able to do that far better than a digital computer. With a neural fan-in and fan-out of more than 1000, 2 layers of neurons would be able to randomly spread out a change in 1 input over 1,000,000 neurons. Then say you have 1 million parameterized switches leading into the next random projection. That is better than 1 million neurons needing 1 million parameters each, as in a conventional artificial neural network layer.
So if you had a conventional ReLU network of width 1 million, then 1 million weight parameters are needed per neuron for it to fully connect back to the previous layer.
And for those 1 million parameters you only get one bend in the fitting curve, basically. Of course you can go into more details because there are multiple layers, but it seems you don’t get much for so many parameters.
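To put numbers on that, here is a back-of-envelope count per layer. The figure of 2 parameters per switch in the fast transform case is my assumption (one slope for each throw of the switch), and biases are ignored.

```python
def dense_relu_params(width):
    """Weights in one fully connected layer of the given width (biases ignored)."""
    return width * width

def switched_transform_params(width, params_per_switch=2):
    """Adjustable parameters in one fixed-transform layer, assuming each
    switch carries params_per_switch slopes and the transform has none."""
    return width * params_per_switch

width = 1_000_000
dense = dense_relu_params(width)          # 1,000,000 parameters per switch
fast = switched_transform_params(width)   # 2 parameters per switch
print(dense, fast, dense // fast)
```

Under those assumptions the dense layer spends 500,000 times as many parameters for the same number of switching decisions.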
I think I highlighted this video on the ‘breakpoints’ of a ReLU neural network before: https://youtu.be/QEWe-aRBUAs
They suggest it is possible for the biological brain to compute random projections.
I don’t think it is necessary to have strictly separate parametric switching layers to create a fast transform neural network out of that. Parametric switching could be interspersed in the random projection calculating system to yield efficient dot product decision making and recombination.
Is there any biological justification at all for the predominant artificial neural network circuit arrangement? Or was it simply plucked out of the air by someone in the 1950s?
I am not aware of any reason to think the brain does random projections.
If I understand this correctly, it copies inputs and attempts to recall this input when presented with similar inputs. If it is unable to recall the prior input it learns the delta between recall and sensation.
I agree there is no reason to think the biological brain uses random projections as part of its main form of processing. Maybe evolution missed an opportunity or found something better. There is some wiring in insect brains where random projections appear to be used for olfactory discrimination.
In terms of artificial neural networks, somewhere along the line there was a jump from the single layer perceptron to the multilayer perceptron.
An assumption being that no great inefficiency is caused by stacking perceptron layers and indeed what else would you do?
Thinking about the types of “easy” development/construction methods, duplicating layers and maps is the kind of thing that the DNA decoding/expression system uses to make neural systems. Evolution is very opportunistic. Functional stacking was sure to be tried and clearly nature found it useful enough to keep doing it in a big way.
I was thinking about how fast transform neural networks with nearly fixed sets of dot-products and only some variability could possibly work and I found a very interesting answer:
“ReLU is a switch. Switching dot products into compositions. If all the switch states are known each composition can be condensed back to a single dot product.
The numeric output of a neuron then is I dot C where I is the input vector and C is the condensed vector.
The exact components of the C vector are not particularly relevant. Only the magnitude of C and the angle between I and C govern the numeric output of the given neuron, the magnitude of I usually being fixed.
The magnitude of C is a summary measure where the exact individual components of C only have a limited influence.
Likewise the angle between I and C is a summary measure.
That means there are numerous choices for the components of C that will produce a wanted output value.
Even in the case of a highly restricted choice of dot products it should be possible to find a reasonable composition to produce a particular approximate dot product output. This explains how Fast Transform (aka. Fixed Filter Bank) Neural Networks can still fit the data very well even with only a few layers and limited variability of the dot products involved. You are not relying on specific components of C being specific values, just the summary measures being specific values, allowing a lot of flexibility.”
That also applies to conventional neural networks particularly when pruning or Lottery Ticket selection is involved.
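The claim in the quoted passage, that only the magnitude of C and its angle to I matter, follows from I dot C = |I| |C| cos(angle). A small check, with an illustrative helper `vector_with_angle` of my own that builds quite different C vectors sharing the same two summary measures:

```python
import numpy as np

def vector_with_angle(i_vec, magnitude, angle, rng):
    """Return a vector of the given magnitude at the given angle (radians)
    to i_vec, with the perpendicular direction chosen at random."""
    u = i_vec / np.linalg.norm(i_vec)
    r = rng.standard_normal(i_vec.shape)
    r -= (r @ u) * u              # remove the component along I
    v = r / np.linalg.norm(r)     # random unit vector perpendicular to I
    return magnitude * (np.cos(angle) * u + np.sin(angle) * v)

rng = np.random.default_rng(1)
I_vec = rng.standard_normal(16)
C1 = vector_with_angle(I_vec, 2.0, 0.7, rng)
C2 = vector_with_angle(I_vec, 2.0, 0.7, rng)

# Different components, same magnitude and angle, same neuron output.
assert not np.allclose(C1, C2)
assert np.isclose(I_vec @ C1, I_vec @ C2)
```

In 16 dimensions there is already a whole family of such C vectors for any wanted output, and the family grows with dimension.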
ReLU is not really a fully non-linear function, there are amplitude invariant aspects.
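One way to see the amplitude invariance: scaling the input by any positive factor leaves every switch state unchanged and simply scales the output. A quick sketch, with tanh included only as a contrasting amplitude dependent function:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
for a in (0.1, 1.0, 7.0):
    # Positive scaling passes straight through ReLU ...
    assert np.allclose(relu(a * x), a * relu(x))
    # ... and does not change which switches are on.
    assert np.array_equal(a * x > 0, x > 0)

# An amplitude dependent function such as tanh does not have this property.
assert not np.allclose(np.tanh(7.0 * x), 7.0 * np.tanh(x))
```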
There are also a lot of unanswered questions why it works so well in conventional neural networks. If all you are seeking is a couple of summary measures being specific values, and even some interchangeability allowed between those, then certainly ReLU is sufficient unto the day.
Attempts to do more than what is sufficient, with say amplitude dependent non-linear activation functions, will make training much more difficult and also very likely make generalization worse.
I don’t know if there are any papers showing that pruning followed by retraining of conventional ReLU networks actually improves generalization, but I would not be too surprised, since the more restricted dot products (weighted sums) available after pruning keep the system from tuning in too specifically to one-off features in the training set (overfitting).
People have a very strong intuition that adding amplitude dependent aspects to activation functions would surely help the network “fit” the data better by smoothing fitting curves etc.
And I have tried that many times within the limited hardware I have for testing and it has never helped.
Perhaps that intuition is simply wrong in higher dimensional space, as many intuitions are.
And as I indicated simple switching functions are sufficient to fit the training data without appeal to any kind of magical properties.
There is the assumption that conventional artificial neural networks are the simplest arrangement possible. Certainly diagrammatically but actually not mathematically.
Rather than benefiting from adding things in (like amplitude dependent activation functions), conventional artificial neural networks would very likely benefit from taking things out, such as restricting the dot products (weighted sums) available to certain patterns and/or forcing them to be orthogonal.
A real case of less is more.
I wonder if biological neural networks are being thought about in terms of the simplest possible rudiments? Or are mid-level constructs being mistakenly thought of as the simplest exemplars?