Deep linear networks are okay


One thing I noticed while experimenting is that deep linear neural networks seem to work fine.

Mr Saxe’s thesis paper gives some nice justifications:

A simpler notion is that each layer forms a soft associative memory that you then compound into something more complex. Also, as the number of dimensions increases, the dot product operations involved become increasingly selective in what they will respond to, which is a nonlinearity in itself.
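The selectivity claim can be checked numerically: random directions in high-dimensional space are nearly orthogonal, so a stored pattern's dot-product response to unrelated inputs concentrates near zero. A minimal sketch of this (plain NumPy, illustrative sizes only):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_spurious_response(d, n_probes=1000):
    """Largest |cosine| between one stored unit key and random unit probes."""
    key = rng.standard_normal(d)
    key /= np.linalg.norm(key)
    probes = rng.standard_normal((n_probes, d))
    probes /= np.linalg.norm(probes, axis=1, keepdims=True)
    return np.max(np.abs(probes @ key))

# Higher dimension -> weaker spurious responses -> a more selective dot product.
print(max_spurious_response(10))    # sizable accidental overlaps in low dimension
print(max_spurious_response(1000))  # overlaps concentrate near zero
```

The effect is just concentration of measure, but it does give high-dimensional dot products the thresholding-like selectivity described above.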


I’d clarify by saying that deep linear networks are in principle no more powerful than a single-layer linear network, because they can be reduced to a single linear equation, as the thesis states. The network can still only compute linear functions, no matter how deep. The thesis is really about the dynamics of learning in these overparameterized linear networks, which do depend on the depth.
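The reduction is easy to verify directly: stacking linear layers just multiplies the weight matrices together, so the composed network is exactly one linear map. A quick NumPy check (bias-free layers for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# A three-layer linear network: x -> W3 @ W2 @ W1 @ x
W1 = rng.standard_normal((20, 10))
W2 = rng.standard_normal((20, 20))
W3 = rng.standard_normal((5, 20))

x = rng.standard_normal(10)

deep_out = W3 @ (W2 @ (W1 @ x))  # layer-by-layer forward pass
W_collapsed = W3 @ W2 @ W1       # the equivalent single-layer network
shallow_out = W_collapsed @ x

# Identical up to floating-point rounding: the deep net computes
# the same linear function as the collapsed single matrix.
print(np.allclose(deep_out, shallow_out))  # True
```

So expressivity is unchanged by depth; only the optimization landscape differs.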


Well, yes, I think it all ends up as one big dot product system. However, that does not necessarily detract from its capacity to generalize. Or you could view it as a compact (compressed) way to learn partial correlations (partial dot product responses).
I keep being told deep linear networks are a bad idea, and yet my practical experiments tell me a different story. One slightly different thing I sometimes do is normalize the vector length of the input to each layer, which makes initialization less critical.
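For reference, the per-layer length normalization described above might look something like this (my own reading of it, not the poster's actual code):

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Scale the layer input to unit Euclidean length."""
    return x / (np.linalg.norm(x) + eps)

def forward(x, weights):
    """Deep linear net that normalizes the input to each layer."""
    for W in weights:
        x = W @ normalize(x)
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(8)]
y = forward(rng.standard_normal(64), weights)

# Activations can no longer blow up or collapse with depth,
# so the scale of the initial weights matters much less.
print(np.linalg.norm(y))
```

Worth noting: the normalization itself breaks exact linearity (the map is no longer additive in x), so a "deep linear network" with this trick is already mildly nonlinear.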

Anyway Mr Saxe explores the issues a bit more in this video:


The impression I have is that deep linear nets are creating a compressed approximation to a much larger dot product. If that is so then they are probably not worth the training time involved.

That is the question I have been trying to answer.

There may be better ways to proceed with associative memory (AM). If you have two or more different AMs trained on the same data, then inputs they have seen before will produce very low-variance agreement, while inputs they haven't seen before, and are guessing an output for, will produce high-variance disagreement. With that it might be possible to create an interesting context-sensitive AM decision tree, with the lowest-variance response point indicating a leaf (stopping point).
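One toy version of the agreement idea, using two linear least-squares "memories" fit on different bootstrap samples of the same data (my own construction for illustration, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# A ground-truth linear mapping plus noise stands in for the training data.
W_true = rng.standard_normal((d, d)) / np.sqrt(d)
X = rng.standard_normal((200, d))
Y = X @ W_true.T + 0.1 * rng.standard_normal((200, d))

def fit_am(idx):
    """One 'associative memory': a least-squares map fit on a bootstrap sample."""
    W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    return W

W1 = fit_am(rng.integers(0, 200, 200))
W2 = fit_am(rng.integers(0, 200, 200))

def disagreement(Z):
    """Mean distance between the two memories' outputs: low = familiar input."""
    return np.mean(np.linalg.norm(Z @ W1 - Z @ W2, axis=1))

seen = X[:50]                               # inputs from the training distribution
unseen = 10 * rng.standard_normal((50, d))  # inputs far outside the training data

print(disagreement(seen))    # the two memories agree closely
print(disagreement(unseen))  # agreement breaks down off the training data
```

The disagreement between independently fit memories acts as a cheap novelty signal, which is the quantity the proposed decision tree would branch on.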


Another look at this question is the paper, “Why does deep and cheap learning work so well?”

The TL;DR is this: deep learning is good at approximating low-order polynomials, most of the phenomena in physics we are interested in are low-order polynomials, therefore deep learning works well in practice.


I was curious about this so I ran a little experiment. I trained a bunch of networks to classify handwritten digits (MNIST) using simple networks of N layers, each 784 units wide (the number of pixels in the input), and compared three different activation functions: linear, exponential linear unit, and hyperbolic tangent:


I was interested in seeing how the linear unit compares to the other two traditional activation functions, both in terms of training speed and final classification error. Here are the training and validation curves for the different activation functions and different numbers of layers (in {2, 4, 6, 8}). I've also tested much deeper networks (10, 30, 100 layers), but the gradient tends to vanish, so they don't train very well. The figures show a training error and a validation error for each activation function; the lower curve is the training error.
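The vanishing behaviour at 10+ layers is easy to reproduce in isolation: backprop through each tanh layer multiplies the gradient by tanh'(pre) ≤ 1 and by the weight matrix, so unless the weights are tuned to compensate, the gradient norm shrinks roughly geometrically with depth. A small sketch with slightly-contractive random weights (illustrative numbers, not a tuned experiment; exact rates depend on initialization):

```python
import numpy as np

def tanh_grad_norm(depth, d=64, scale=0.8, seed=0):
    """Gradient norm after backprop through `depth` random tanh layers."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    layers = []
    for _ in range(depth):  # forward pass, saving pre-activations
        W = rng.standard_normal((d, d)) * (scale / np.sqrt(d))
        pre = W @ x
        layers.append((W, pre))
        x = np.tanh(pre)
    g = np.ones(d)          # gradient arriving at the output
    for W, pre in reversed(layers):  # backward pass
        g = W.T @ (g * (1.0 - np.tanh(pre) ** 2))
    return np.linalg.norm(g)

print(tanh_grad_norm(2))    # still a usable gradient
print(tanh_grad_norm(100))  # vanished: many orders of magnitude smaller
```

At 100 layers the early weights receive essentially no learning signal, matching the failure to train reported above.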

I stopped training once it was clear the linear network had plateaued. So it looks like, for all the depths I tested, the linear unit compares quite poorly to the nonlinear activation functions, and its final performance seems to be independent of depth, as I expected. It does learn more quickly at the start, however, which is not surprising given that the other two activation functions have saturation regions where their gradient goes to zero.


It looks like the (slight) nonlinearity is buying you a 5 to 1 improvement. It still could be that all the hard work to train deep networks is only really buying you partial correlations with a large dot product, with the non-linear aspect improving things somewhat.
It would be interesting if current deep nets were a profoundly awkward way to do category learning. Maybe you could chain associative memory (AM) to learn category vectors: the first AM learning a very large number of weak associations to one specific category vector (using a low learning rate to allow more examples), the second doing ensemble learning and error-correcting the associations while helping weed out cross-talk with other categories, and a third mapping to an actual category scalar.

The commonality between deep nets and AM is partial dot product correlations.

I suppose you could put some non-linear preprocessing before a deep linear network and see if that improves things somewhat. That would provide some useful clues.
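As a toy version of that experiment: freeze a random nonlinear feature map as the "preprocessing" and put only a linear model after it. The linear part alone cannot fit a nonlinear target, but the same linear part on top of fixed random tanh features can (all names and numbers here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D nonlinear regression target that defeats any purely linear model.
x = np.linspace(-3, 3, 200)
y = np.sin(2 * x)

def lstsq_error(features, y):
    """Least-squares fit (the 'linear network' part) and its RMS error."""
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return np.sqrt(np.mean((features @ w - y) ** 2))

# Purely linear model: features are just [x, 1].
linear_feats = np.stack([x, np.ones_like(x)], axis=1)

# Fixed (untrained) random tanh preprocessing, then the same linear fit.
a = 2.0 * rng.standard_normal(100)   # random slopes
b = rng.uniform(-3, 3, 100)          # random offsets
random_feats = np.tanh(np.outer(x, a) + b)

print(lstsq_error(linear_feats, y))  # poor fit
print(lstsq_error(random_feats, y))  # much better fit
```

Any improvement the deep linear stack shows on top of such features would then be attributable to the preprocessing rather than to depth itself.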

Anyway, when I find the personal motivation I will try layered AM, starting with learning multitudinous weak associations and then refining those down in later layers. I get about a 30% yield on the experiments I do, which, while sometimes disappointing, is far better than in the physical sciences, where a 10% yield would often be great.


Okay, there is some non-linearity in “linear neural networks” due to precision errors with, say, 32-bit float values:
It’s amazing that such a slight thing could have an effect.
Maybe really small non-linear activations are the way to go?
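The precision point is real: float32 addition is not associative, so a "linear" layer evaluated in float32 is only approximately linear, and the tiny discrepancies depend on the order of accumulation. A minimal demonstration:

```python
import numpy as np

# float32 carries ~7 decimal digits, so adding 1.0 to 1e8 is lost to rounding.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

print((a + b) + c)  # 1.0 -- the large terms cancel first
print(a + (b + c))  # 0.0 -- the 1.0 is swallowed by -1e8 before cancelling

# The order of accumulation in a dot product therefore changes the result,
# which breaks exact linearity: f(x + y) need not equal f(x) + f(y).
```

In a deep network these rounding effects compound layer by layer, which is presumably the "slight thing" having an effect.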
Anyway, you can see in the pictures I put here:
the exponential increase in the complexity of the response you get from deep neural networks as you add layers. In the end you land in a chaotic regime where I doubt you would get good generalization. There is also a butterfly effect, where slight changes in the input, or in the weights of the early layers, can cause dramatic changes in the output.
That would mean that trying to train very deep neural networks on specialized low-precision hardware mightn't be the best idea. It also means that SGD and similar methods train the net the wrong way around, or at least very slowly. In contrast, NEAT trains the early layers first and then elaborates. I'll have to think about it, for surely there is a better way to do things.


Well… then why not just use a non-linear function to begin with? :stuck_out_tongue:

It’s definitely interesting. I’m sitting in a lab with Andrew Saxe for the next few weeks and he’s got a lot of cool stuff to say about the training dynamics of linear networks, mostly because they’re much easier to understand and analyze than non-linear networks. But even he wouldn’t claim they should be used for any practical problem.

For a more practical approach that still captures most of the appeal of linear networks, have you had a look at Tensor Switching Networks? (