Towards demystifying over-parameterization in deep learning

I’m watching this video about over-parameterization in neural networks:
https://youtu.be/XfzlCYHkhmI

To which I added this comment (using my daughter’s account because I’m banned forever by that site):

“If each layer is normalized to a constant vector length (or nearly so) then the weighted sum in each neuron is an associative memory (AM). It can recall n patterns to give n exact scalar value outputs. The non-linearity separates patterns that are not linearly separable. Thus you have a pattern of AM, Separation, AM, Separation… It is not surprising in that light that if you over-parameterize the AM it will simply result in repetition code error correction, which is more likely to be helpful than harmful…”

I think I demystify better! Or at least I can explain every point in the video to myself in a way that impresses itself as fully coherent to my neural processing unit.
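To make the associative memory claim concrete, here is a minimal NumPy sketch of my own (not from the video): a single weight vector fitted by least squares stores n (pattern, scalar) pairs, and the weighted sum recalls each scalar exactly while n stays below the input dimension d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                                   # input dimension and number of stored patterns
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize each pattern to (nearly) constant length
y = rng.choice([-1.0, 1.0], size=n)             # the exact scalar values we want recalled

w, *_ = np.linalg.lstsq(X, y, rcond=None)       # one weight vector stores all n pairs

recall = X @ w                                  # the weighted sum (dot product) does the recall
print(np.max(np.abs(recall - y)))               # ~1e-15 while n <= d: exact recall
```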

If you recognize the weighted sum as a linear associative memory with limited storage capacity, you might choose to adjust the training algorithm to acknowledge that. For example, updating the weighted sum in a way that favors an output of zero most of the time, and sometimes +1 or -1, within its storage capacity. That would result in a neuron with good signaling ability and a high signal to noise ratio.
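As a hedged sketch of that idea (my own toy construction, not an established training algorithm): give one neuron a target of 0 for most patterns and ±1 for a few patterns well within its capacity, fit the weighted sum by least squares, and check how cleanly the ±1 responses stand out from the near-zero ones.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_signal, n_zero = 256, 16, 64              # assumed sizes, total patterns well under d
X = rng.standard_normal((n_signal + n_zero, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

targets = np.zeros(n_signal + n_zero)
targets[:n_signal] = rng.choice([-1.0, 1.0], size=n_signal)   # +1/-1 for a few patterns, 0 for the rest

w, *_ = np.linalg.lstsq(X, targets, rcond=None)               # fit the single weighted sum

response = X @ w
print("signal:", np.abs(response[:n_signal]).mean())          # close to 1
print("noise :", np.abs(response[n_signal:]).max())           # close to 0 -> high signal to noise ratio
```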

Other interventions may be possible, such as pruning the network after training and then actually using the zeroed weights for repetition code error correction. That would be similar to the repetition code error correction that appears to happen in an over-parameterized neural network with early stopping, except done more rationally.
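A toy illustration of what I mean by repetition code error correction in a dot product (the sizes and noise level are just assumptions): repeat each weight k times across k independent noisy reads of the same input, standing in for re-used pruned weight slots, and the output noise variance drops by roughly a factor of k.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, sigma, trials = 32, 4, 0.1, 2000          # assumed sizes and input noise level
w = rng.standard_normal(d)
x = rng.standard_normal(d)
clean = w @ x

plain, repeated = [], []
for _ in range(trials):
    noisy = x + sigma * rng.standard_normal(d)              # one noisy read, ordinary dot product
    plain.append(w @ noisy - clean)

    copies = x + sigma * rng.standard_normal((k, d))        # k independent noisy reads
    repeated.append((w / k) @ copies.sum(axis=0) - clean)   # each weight repeated k times, scaled by 1/k

print(np.var(plain), np.var(repeated))   # the second is roughly 1/k of the first
```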

Then with a linear associative memory at exactly its capacity you can recall things, but you get no error correction.
I suppose you can view error correction as a pull toward an attractor state.

Used under capacity (storing fewer patterns than it can hold), you do get repetition code error correction.

Used over capacity, maybe that is good for generalization. The recall is contaminated with Gaussian noise; however, vectors generated from such noisy recalls can be very close in angular distance to a target.
Unfortunately, with AMs used over capacity there is no focus in the network: slight changes in the input will always produce large changes in the output. Maybe you could add in artificial attractor states.
Then your network would have the form AMs, Separator, Attractor, AMs, Separator, Attractor…
I’m just throwing out ideas here.
https://ai462qqq.blogspot.com/
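To put rough numbers on the three regimes above, here is a small experiment of my own (a least-squares linear associative memory mapping d-dimensional keys to m-dimensional values, with n = d taken as the capacity): it reports the recall error on the stored pairs, how much a small input perturbation moves the output, and the cosine similarity of the recalls to their targets.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, noise = 64, 32, 0.05                        # key dimension, value dimension, input perturbation
for n in (32, 64, 128):                           # under, at, and over capacity (capacity taken as n = d)
    X = rng.standard_normal((n, d)) / np.sqrt(d)  # keys of roughly unit length
    Y = rng.standard_normal((n, m))               # values to recall
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)     # store every pair in one weight matrix

    clean = X @ W
    perturbed = (X + noise * rng.standard_normal(X.shape) / np.sqrt(d)) @ W

    recall_err = np.linalg.norm(clean - Y) / np.linalg.norm(Y)
    input_sensitivity = np.linalg.norm(perturbed - clean) / np.linalg.norm(clean)
    cosine = np.mean(np.sum(clean * Y, axis=1)
                     / (np.linalg.norm(clean, axis=1) * np.linalg.norm(Y, axis=1)))
    print(f"n={n:3d}  recall error {recall_err:.2f}  "
          f"input sensitivity {input_sensitivity:.2f}  cosine to targets {cosine:.2f}")
```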

The dot product is all your Christmases come at once. At capacity, your Christmases come one at a time. In the under-capacity case your Christmases come two or more at a time, but you get two or more of the same present. At over capacity you get more Christmases, but you only get a partial present for each one and Santa gives you some trash as well.
Hope that helps!!!
I presume this information is in the early literature but has been poorly communicated forward and therefore people are not reasoning with it.

After a little digging around I found this paper:
https://archive.org/details/DTIC_ADA189315/page/n7

It is amazing that something so simple as the weighted sum can store memories.
The only disadvantage is that inputs can cause the recall of many things at the same time, the maximum dot product response being the most extreme version of that. You can apply non-linearities to the input data elements before the weighted sum to partly mitigate that mixing, by making different inputs more different from each other and more orthogonal in a higher dimension. However, you are actually shedding information if you do that, so you have to be careful. That gets you almost to the extreme learning machine concept.
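For reference, a minimal extreme learning machine style sketch in NumPy (my own toy example): a fixed random projection and non-linearity spread the inputs out in a higher dimension, and only the final weighted sum is trained, by least squares. It handles an XOR-like problem that a weighted sum on the raw inputs cannot.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy two-class problem that is not linearly separable in the raw 2-D inputs (XOR-like quadrants).
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.sign(X[:, 0] * X[:, 1])

# Fixed random projection plus a non-linearity: inputs become more spread out in a higher dimension.
d_hidden = 200
A = rng.standard_normal((2, d_hidden))
b = rng.standard_normal(d_hidden)
H = np.tanh(X @ A + b)

# Only the output weighted sum is trained, by least squares.
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print("training accuracy:", np.mean(np.sign(H @ w) == y))
```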
The information loss issue is often not well understood. If you apply, say, the square function to the input information before the weighted sum, you lose the information about the sign bits. If you use the threshold function, obviously you have cut things down to one bit of information. You could then argue that if you used an invertible function no information would be lost. But it would still cause information loss in terms of the output of the weighted sum, due to magnitude mismatch issues. The sigmoid function is invertible; however, in terms of the weighted sum it can act like soft binarization, in some (high magnitude) cases effectively acting to produce a single bit of information.
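A few quick numeric checks of those claims (trivial, but they make the point):

```python
import numpy as np

x = np.array([-2.0, 2.0])
print(np.square(x))                                     # [4. 4.] -> the sign bit is gone

print(np.heaviside(np.array([-0.1, 0.1, 5.0]), 0.0))    # threshold: one bit per element

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([8.0, 12.0, 20.0])))   # all ~1.0: invertible in principle, but at high
                                              # magnitude the weighted sum sees roughly one bit
```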
I’ll stop here because it is a topic that on the one hand is boring and obvious, and on the other hand does not seem to be much used in reasoning about artificial neural networks. For example, you could think about clean separation of recall in neurons by forcing sparsity of neuron response during training.
