Non-linearity sharing in deep neural networks (a flaw?)

You can view the hidden layers in a deep neural network in an alternative way.
First a nonlinear function acts on the elements of an input vector. Then each neuron is an independent weighted sum of that small/limited number of non-linearized elements.
An alternative construction would be to take multiple invertible (information-preserving) random projections of the input data, each giving a different mixture of the input elements. Then apply the nonlinear function to every element of those projections. Then, using the dimension increase, give each independent weighted sum its own set of non-linearized values, instead of having them all share just one small set.
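
Here is a minimal sketch of the two constructions in NumPy, just to make the shapes concrete. The function names and the use of square Gaussian matrices as stand-ins for the invertible random projections are my own illustrative choices, not anything from an existing library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def conventional_layer(x, W):
    # One shared nonlinearity on x, then every neuron takes a weighted
    # sum of the same non-linearized vector.
    return W @ relu(x)  # W has shape (n_out, n_in)

def rp_layer(x, W, projections):
    # One random projection per neuron, nonlinearity applied to each
    # projection separately, then each neuron takes a weighted sum of
    # its own non-linearized vector.
    return np.array([w @ relu(R @ x) for w, R in zip(W, projections)])

n_in, n_out = 8, 4
x = rng.standard_normal(n_in)
W = rng.standard_normal((n_out, n_in))  # same weight count in both cases
# Square Gaussian matrices are invertible with probability one; they are
# fixed (not trained), so they add no weight parameters.
projections = rng.standard_normal((n_out, n_in, n_in))

print(conventional_layer(x, W))
print(rp_layer(x, W, projections))
```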

What’s the difference? The network layer has the same number of weights either way. I’ll have to think about it.

If you are looking for a difference from biology: in the biological brain you are unlikely to find the kind of structured non-linearity sharing used in current deep networks.


If you consider the case of the ReLU nonlinearity, where you would expect the output to be zero about half the time, there is certainly a considerable difference between the two cases. In the conventional case, roughly half of the input vector elements are mapped to zero by the nonlinear function and become unavailable to all of the weighted sums at once.
In the unconventional (RP) case you don’t have that shared loss of information: each weighted sum has a different nonlinear window on the original input, so different elements get zeroed out for different neurons. Anyway, I still need to think about it some more.
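
A quick way to see that difference is to look at which elements the ReLU zeroes out. Under the same illustrative assumptions as the sketch above (square Gaussian matrices standing in for the random projections):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda v: np.maximum(v, 0.0)

n_in, n_out = 8, 4
x = rng.standard_normal(n_in)
projections = rng.standard_normal((n_out, n_in, n_in))

# Conventional case: one shared zero pattern, seen by every neuron in the layer.
print("shared mask: ", (relu(x) > 0).astype(int))

# RP case: each neuron's projected vector has its own zero pattern,
# so different neurons lose different mixtures of the input.
for i, R in enumerate(projections):
    print(f"neuron {i} mask:", (relu(R @ x) > 0).astype(int))
```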

In a conventional network the weighted-sum outputs within a single layer will be quite correlated/entangled, because they are all based on the same small set of values. That is somewhat mitigated by the nonlinearities in later layers, but you are still making inefficient use of the weight parameters.

In the random projection neural networks you are making full use of the weights, and the weighted-sum outputs are much more independent.
The price you pay is that, even though fast algorithms for the random projections exist, they will still dominate the time cost of the network.
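
For reference, one standard fast construction multiplies the input by random signs and then applies a fast Walsh–Hadamard transform, costing O(d log d) per projection rather than O(d²) for a dense matrix. A minimal sketch (my own, assuming the input dimension is a power of two):

```python
import numpy as np

def fwht(v):
    # Iterative fast Walsh-Hadamard transform, O(d log d); len(v) must be a power of two.
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(len(v))  # orthonormal scaling, so the transform is its own inverse

def fast_random_projection(x, signs):
    # Random sign flip followed by the Hadamard transform gives a cheap,
    # invertible mixing of the input elements.
    return fwht(signs * x)

rng = np.random.default_rng(2)
d = 8  # power of two
x = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=d)  # one fixed sign pattern per neuron

y = fast_random_projection(x, signs)
# Invertibility check: applying the transform and the signs again recovers x.
print(np.allclose(x, signs * fwht(y)))
```

With one such projection per neuron, a layer of n neurons spends O(n·d log d) on the projections versus O(n·d) on the weighted sums, which is the extra time cost mentioned above.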

Random projection neural networks may be more mathematically or empirically understandable than entangled conventional neural networks, as independent units of associative memory lookup combined with separating nonlinearities. Perhaps that entanglement is one reason conventional nets are so poorly understood.

Actually, the way to reason about it is the ‘line separating points in the dataset’ classification viewpoint that everyone is taught, not the linear associative memory viewpoint that it seems no one has been taught since the 1970s.
In a conventional neural network every neuron in a layer makes its classification from the same vector; none of them has a different view of the data.
In the random projection neural network each neuron has a different nonlinear window on the dataset and can make far more independent classifications based on that.
Consider that both use exactly the same number of weight parameters. Which do you choose?