Why you should use wide neural networks

I’ve been learning about the virtues of preserving information (in the information-theory sense) as it
flows through a deep neural network:

That may explain why ResNets work so well, even though they mightn’t actually be the best solution: a ResNet lets more information through per layer than a normal net.

You may be interested in a couple papers [1][2] that study how deep networks may compress information (lose mutual information between the input and the hidden layers, while gaining mutual information between the hidden layers and the output) during training. I find that line of work very interesting.

[1] Shwartz-Ziv, Ravid, and Naftali Tishby. “Opening the Black Box of Deep Neural Networks via Information.” arXiv preprint arXiv:1703.00810 (2017). https://arxiv.org/pdf/1703.00810.pdf

[2] Tishby, Naftali, and Noga Zaslavsky. “Deep learning and the information bottleneck principle.” Information Theory Workshop (ITW), 2015 IEEE. IEEE, 2015. https://arxiv.org/pdf/1503.02406

Thanks for the links. It is amazing that current deep neural networks don’t let you dial in how much non-linearity to use in the activation function, especially when you consider the compounding of non-linearity as the data passes through a number of layers. I get a very good improvement by tuning that parameter to match the number of layers and the problem.
With the proviso that I only have the hardware/time to explore simple test problems.

If you are old enough you may remember that iterated function systems (IFS) for data compression caused a big stir at one stage. I think you could liken deep neural networks to learned fractals, given the compounding of non-linearity layer after layer. That would suggest that the activation function need only stretch and squeeze its response at different input values. And given the chaos-theory butterfly effect, the activation function ought to be smooth.
Of course any non-linearity with a finite digital representation will cause information loss, exactly because the stretching and squeezing causes numbers that would be distinct at higher precision to collapse to the same value. And you have to account for that too.
Also the butterfly effect works both ways. Sometimes a butterfly flapping its wings in the Amazon causes a storm in the Atlantic, sometimes not. Some places in a Mandelbrot fractal are completely uniform in value, other places are of an extreme complexity. And so it likely is with deep neural networks.
Anyway, learning a fractal is always going to have a very high computational cost. And Google may have made a mistake with the low-precision arithmetic in the specialized hardware they designed. Though arithmetic errors are a non-linearity anyway.
I view deep neural networks as learning the physical computational structure of the human brain, rather than the dynamic aspects of learning. In which case you can simulate millions of years of evolution in the biological domain in a few weeks on a GPU cluster. Still, I can’t help thinking that such networks may not be the optimal thing to do, or they could be.

Would this be fundamentally different than e.g. a sigmoid unit that has a linear regime bounded by varying degrees of nonlinearity? Or a soft rectifier like an exponential linear unit or a softplus?

How would you like this to be implemented? Do you have an equation in mind?

In most (all?) deep learning frameworks you can define your own activation functions. You could make the parameters that tweak the nonlinearity into a trainable parameter that the network can learn (or a non-trainable hyperparameter that you adjust yourself).

If you give me an equation you’d like to use for this variable degree of nonlinearity, I can write you up a quick activation function in Tensorflow or Torch.
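
A framework-free sketch of the trainable-parameter idea (the tanh choice, the toy targets, and the learning rate are my own assumptions, not anything from the thread; in TensorFlow or PyTorch you would register `c` as a trainable variable instead). With a hypothetical activation f(x) = x + c·tanh(x), the gradient ∂f/∂c = tanh(x) is analytic, so even plain gradient descent can fit `c`:

```python
import math

def act(x, c):
    """Activation with a tunable amount of non-linearity: f(x) = x + c*tanh(x)."""
    return x + c * math.tanh(x)

def train_c(samples, targets, c=0.0, lr=0.1, steps=200):
    """Fit the scalar c by gradient descent on a squared-error loss.

    d/dc (f(x;c) - t)^2 = 2*(f(x;c) - t)*tanh(x), since df/dc = tanh(x).
    """
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(samples, targets):
            grad += 2.0 * (act(x, c) - t) * math.tanh(x)
        c -= lr * grad / len(samples)
    return c

# Toy problem: targets generated with c = 0.7, so training should recover it.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [x + 0.7 * math.tanh(x) for x in xs]
learned = train_c(xs, ts)
```

In a real framework the same `c` would simply be declared trainable and the autograd engine would supply the gradient.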

I’m going with the premise that it doesn’t much matter what activation function you use, as long as it squeezes its response to input changes at some input values and expands its response at other input values. And maybe it helps if it is a smooth function.
I’m viewing the overall net in the light of chaos theory, where there is high sensitivity to initial conditions for some input vectors (bifurcation points) and far less sensitivity for others (like the bland areas of the Mandelbrot fractal), irregularly splitting the input into different categories.
In support of such a fractal view of deep neural networks there is this paper on single pixel attacks: https://arxiv.org/abs/1710.08864. A single pixel change at a bifurcation point can cause a big change in response, especially for deep highly non-linear networks.

To control the amount of non-linearity I don’t mean anything more than f(x) = x + c·g(x),
where f is the activation function, c is a constant controlling the amount of non-linearity, and g is some non-linear function such as the sigmoid, or even x².
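
As a quick framework-free sketch of that equation (the sigmoid choice for g and the sample values are mine): with f(x) = x + c·g(x), setting c = 0 recovers a purely linear unit, and raising c dials in more non-linearity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(x, c):
    """f(x) = x + c*g(x) with g = sigmoid; c controls the non-linearity."""
    return x + c * sigmoid(x)

# c = 0: the unit is exactly linear, f(x) == x.
# c > 0: a smooth non-linear component is mixed in on top of the identity.
print(f(1.0, 0.0))  # 1.0 (purely linear)
print(f(0.0, 2.0))  # 1.0 (= 0 + 2*sigmoid(0) = 2*0.5)
```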

I’m not sure about the unconditional linear term. That’s equivalent to giving some of your units a linear activation function and the rest a nonlinear one (this is in fact a trick some people use: just make half the units in each layer linear). It’s super easy to do in any DL framework; I’d be interested in seeing what happens if you try it out.
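
A minimal sketch of that half-linear trick (plain Python rather than a DL framework; the even split and the tanh choice are my own assumptions): apply the non-linearity to the first half of a layer’s pre-activations and leave the second half linear.

```python
import math

def half_linear_layer(pre_activations):
    """Apply tanh to the first half of the units, identity to the rest."""
    n = len(pre_activations)
    half = n // 2
    return [math.tanh(v) for v in pre_activations[:half]] + \
           list(pre_activations[half:])

# First two units saturate near +/-1; last two pass through unchanged.
out = half_linear_layer([5.0, -5.0, 5.0, -5.0])
```

In a framework you would implement the same thing by slicing the layer’s output tensor and concatenating the two halves back together.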

Not sure about x^2 as an activation function either… you’ll hit numerical issues when calculating the gradient on large activations, and even when calculating the activations themselves in deeper networks, and it doesn’t give you much of a squashing region. Worth trying out though, and if it doesn’t work too well then a softplus or exponential linear unit is probably the next best option.
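
A small illustration of the numerical issue with x² under depth (the depth and inputs here are arbitrary choices of mine): composing the squaring activation layer after layer makes magnitudes explode for |x| > 1 and collapse toward 0 for |x| < 1, which is exactly where gradients blow up or vanish.

```python
def square_through_depth(x, depth):
    """Pass x through `depth` layers whose activation is f(x) = x*x."""
    for _ in range(depth):
        x = x * x
    return x

big = square_through_depth(1.5, 6)   # 1.5**64: explodes past 1e10
tiny = square_through_depth(0.9, 6)  # 0.9**64: collapses below 0.01
```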

I have tried mixing linear and non-linear neurons, which is very similar, as you say. I can use the x^2 function because I evolve the networks rather than using backpropagation. If neural nets are fractal in nature, then using too deep a net will break the input state space into too many pieces, and you will need a very dense final readout layer (probably linear) to stitch it all back together. Anyway, it all ends up being very computationally expensive. I don’t have full confidence that deep neural networks are an optimal approach.
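
For readers unfamiliar with evolving weights instead of backpropagating: here is a minimal (1+1)-style hill-climbing sketch (a stand-in for whatever evolutionary scheme is actually used above; the one-weight toy net and mutation scale are my own assumptions). Perturb the weights and keep the perturbation only if the loss improves. Nothing needs to be differentiable, so activations like x² pose no problem.

```python
import random

def loss(w, samples):
    """Squared error of a one-weight net y = (w*x)**2 against target (2*x)**2."""
    return sum(((w * x) ** 2 - (2.0 * x) ** 2) ** 2 for x in samples)

def evolve(samples, steps=500, sigma=0.1, seed=0):
    """(1+1) hill climbing: accept a mutated weight only if the loss improves."""
    rng = random.Random(seed)
    w = 0.0
    best = loss(w, samples)
    for _ in range(steps):
        cand = w + rng.gauss(0.0, sigma)
        cand_loss = loss(cand, samples)
        if cand_loss < best:
            w, best = cand, cand_loss
    return w, best

samples = [0.5, 1.0, 1.5]
w, final_loss = evolve(samples)  # should settle near the optimum w = +/-2
```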

If you start with the premise that you can adjust the weights of a deep neural network to get a stable, wanted response in some parts of the input state space at the expense of a more chaotic response in other parts, then perhaps the best thing to do is to use an ensemble of such networks. That would cause the networks’ chaotic responses to cancel out to low-level Gaussian noise when summed together. Of course the networks cannot be of uniform construction, or they will all have the same response behavior. I’ll try that today, but it’s pushing at the limits of what I can test on available hardware.
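
A toy sketch of the cancellation argument (the “networks” here are just a shared signal plus independent Gaussian noise, an assumption standing in for non-uniform nets with independent chaotic responses; the ensemble size of 16 is arbitrary): averaging K such responses shrinks the noise variance by roughly 1/K.

```python
import random
import statistics

def noisy_response(x, rng, noise_scale=1.0):
    """Stand-in for one network: the true signal plus its own 'chaotic' error."""
    return x ** 2 + rng.gauss(0.0, noise_scale)

def ensemble_response(x, rngs):
    """Average the responses of several independently-seeded networks."""
    return sum(noisy_response(x, rng) for rng in rngs) / len(rngs)

inputs = [i / 100.0 for i in range(1000)]
single_rng = random.Random(1)
ensemble_rngs = [random.Random(seed) for seed in range(2, 18)]  # 16 "nets"

single_err = [noisy_response(x, single_rng) - x ** 2 for x in inputs]
ens_err = [ensemble_response(x, ensemble_rngs) - x ** 2 for x in inputs]

# The ensemble's error variance should be roughly 1/16 of a single net's.
print(statistics.pvariance(single_err), statistics.pvariance(ens_err))
```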

That seems to work fine. Of course one experiment is not the same as a bunch of peer-reviewed papers all saying the same thing. I wrote a Java (+ Processing) example should anyone wish to look through it (alpha code, and not the code I used for testing). The computational burden wasn’t any higher than normal, because you can ensemble networks with far fewer weights while still getting the chaos-cancelling effect.