Maybe it is a little elementary, but I think there are things to be understood about information storage in the weighted sum. For instance, I think it is nice that the weight vector doesn’t grow wildly in magnitude when you try to store excessive amounts of information in it. As far as I know, current neural network books don’t explain the behavior of the weighted sum very well. Perhaps they ought to do a better job: Algorithm Shortcuts - Weighted Sum Info Storage

There are some more classical techniques that you may want to look into. For example, projective geometry and/or spline interpolation.

If a single NN layer is performing some kind of linear transformation in a high-dimensional homogeneous space, then the normalization procedure can be viewed as projecting the resulting vector back down onto the unit sphere. In this case, the useful information is encoded solely in the direction of the vector rather than its magnitude.
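A minimal NumPy sketch of that view (all names illustrative): normalization projects a vector onto the unit sphere, so any information carried by magnitude is discarded and only the direction survives.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(16)          # layer output in a high-dimensional space

u = v / np.linalg.norm(v)            # project onto the unit sphere

# only the direction is kept: any rescaling of v normalizes to the same point
assert np.isclose(np.linalg.norm(u), 1.0)
assert np.allclose((3.0 * v) / np.linalg.norm(3.0 * v), u)
```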

There might also be some insights obtained by looking at spline interpolation. If you squint a little, you might see something that looks a bit like an autoencoder. Say you begin with a large number of sample points, but during regression a control net is generated in a manner that minimizes the error between the spline interpolation and the sample points. Thus, the control net could be interpreted as a form of reduced-order, latent-space encoding of the behavior of the sample points.
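That interpretation can be sketched in a few lines of NumPy (a piecewise-linear spline stands in for a general spline here; the sizes and names are illustrative): many noisy samples are compressed into a small set of control-point coefficients, which act as the latent code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)               # 200 sample points
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)    # noisy observations

# control net: 8 knots with piecewise-linear (hat) basis functions
knots = np.linspace(0.0, 2.0 * np.pi, 8)
A = np.stack([np.interp(x, knots, np.eye(8)[j]) for j in range(8)], axis=1)

# least-squares regression: 8 coefficients encode 200 samples
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
recon = A @ coeffs                                   # "decoded" curve
rmse = np.sqrt(np.mean((recon - y) ** 2))
```

The 8-element `coeffs` vector is the reduced-order encoding; multiplying by the basis matrix `A` plays the role of the decoder.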

Thanks for the interesting idea about spline interpolation. Parametric activation functions that can be trained by gradient descent, and that don’t introduce deep local minima into a network, would be useful. Someone was bugging me to find out what I read to end up with the viewpoints I have about neural networks. I couldn’t really answer; all I could do was write out my main observations. https://sites.google.com/view/algorithmshortcuts/foundation-neural-net-topics

I posted this on Discord; it may be of interest after reading the foundation topics:

There are lessons from fast transform algorithms for neural networks. A single non-zero input (x) to a fast transform projects one of the basis vectors of the transform across all the output terms with intensity x.

The output (x) of a single neuron in a dense network is forward connected to n weights in the next layer, one for each neuron in that layer. The pattern in those weights is projected with intensity x across the full width n of the next layer. You can say the forward-connected weights of a single neuron form a basis vector, just like in a fast transform.
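To make that concrete (a Sylvester–Hadamard transform stands in as the fast transform; names are illustrative): a single non-zero input of intensity x just scales one basis column across all n outputs.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n)

x = np.zeros(n)
x[3] = 2.5                # single non-zero input with intensity 2.5
out = H @ x

# the output is basis vector 3 projected across all n outputs, scaled by 2.5
assert np.allclose(out, 2.5 * H[:, 3])
```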

To produce an arbitrary target pattern in the next layer, you need to sum basis patterns (vectors) of various intensities, and ideally those basis patterns should be orthogonal and complete. However, it is also the case that the activation functions must be able to provide the required intensities. With binary +1/-1 outputs you have no control over the intensities except the sign; with sigmoid activation functions that saturate at, say, +1 and -1, you would have to make sure they were operating in their more linear region for full control. However, there is nothing in gradient descent that pushes in that direction. It isn’t smart enough to understand that sigmoid activation functions should be operated in their linear region most of the time.
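A small sketch of that point, using an orthonormal Hadamard basis (illustrative): real-valued intensities reconstruct an arbitrary target exactly, while sign-only (+1/-1) intensities generally cannot.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
B = hadamard(n) / np.sqrt(n)          # orthonormal, complete basis
rng = np.random.default_rng(0)
target = rng.standard_normal(n)

coeffs = B @ target                   # intensity required for each basis vector
assert np.allclose(B.T @ coeffs, target)      # full control: exact reconstruction

sign_only = np.sign(coeffs)           # binary +1/-1 "activations"
err = np.linalg.norm(B.T @ sign_only - target)
assert err > 0.1                      # sign-only intensities fall short
```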

ReLU doesn’t really have that problem, but it does cut off (block) information about half the time, which is not good. I guess that is why leaky versions of ReLU abound.
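For illustration (plain NumPy, illustrative names): ReLU zeroes out negative inputs entirely, while a leaky variant lets them through at reduced intensity, so less information is blocked.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))        # negatives blocked:    0, 0, 0.5, 2.0
print(leaky_relu(z))  # negatives attenuated: -0.02, -0.005, 0.5, 2.0
```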

I think that, in theory, highly parametric activation functions with fast transforms acting as fixed weight matrices are more rational than conventional dense networks. For example, there are no wasteful permutation symmetries, where the same network can be represented by multiple different permutations of the weights.
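The permutation-symmetry point is easy to verify numerically (toy two-layer net, illustrative names): permuting the hidden units, and their weights accordingly, leaves the function unchanged, so a dense net represents the same function many different ways; a fixed fast transform has no such freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))     # input -> hidden weights
W2 = rng.standard_normal((2, 4))     # hidden -> output weights
x = rng.standard_normal(3)

def net(W1, W2, x):
    h = np.maximum(W1 @ x, 0.0)      # ReLU hidden layer
    return W2 @ h

P = np.eye(4)[[2, 0, 3, 1]]          # permute the 4 hidden units
# different weights, identical function: a wasteful symmetry
assert np.allclose(net(P @ W1, W2 @ P.T, x), net(W1, W2, x))
```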

I think the point of ReLU is that it can be either a linear pass-through or switched off. Thus, it can act as if it has a sort of if/else functionality. The sheer size of the network ensures sufficient redundancy that both branches will be encoded by some combination of weights.
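A tiny example of that branching view (plain NumPy): a pair of ReLU units, one per branch, together compute |x|, with each unit active on only one side of zero.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

x = np.linspace(-2.0, 2.0, 9)
# unit 1 handles the "if x > 0" branch, unit 2 the "else" branch
assert np.allclose(relu(x) + relu(-x), np.abs(x))
```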

Could someone remind me why weighted sums make sense? Aren’t (biological) synaptic connections binary anyway?

Isn’t this what gradient explosion/vanishing is about (happening in NNs that are too deep, or in RNNs with too many time steps), along with all the tricks that were invented to avoid it, like residual connections?

Or do you mean shallower networks, and why their weights don’t go wild?

Because a weighted sum *is* an approximation of binary connections, with a high-performance means (backprop@gpu) of making it a better approximation?

Right. Ok. But then not in an HTM sense. Purely a NN with fully connected point neurons.

Even if each synapse were assumed to be a binary signal, the fact that an axon arbor could potentially form multiple synapses with the same dendrites would allow for that axon to have a variable amount of influence on the downstream neuron. In this case, the weight of the connection between the upstream and downstream neurons would be proportional to the number of connected synapses.
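A toy sketch of that mechanism (illustrative names): with binary spikes, the effective connection weight is simply the number of synapses the axon forms on the dendrite.

```python
import numpy as np

rng = np.random.default_rng(1)
n_axons = 5
synapse_counts = rng.integers(0, 5, size=n_axons)  # synapses per axon-dendrite pair
spikes = rng.integers(0, 2, size=n_axons)          # binary axon activity

# each active axon contributes once per synapse: weight == synapse count
drive = np.dot(synapse_counts, spikes)
assert drive == np.sum(synapse_counts[spikes == 1])
```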

I wonder how prevalent this kind of synapse weighting is in actual brains.

And, eventually, what mechanisms prevent growing an excessive number of synapses between the same axon-dendrite pair?

## Morphological Diversity Strongly Constrains Synaptic Connectivity and Plasticity

Michael W. Reimann, Anna-Lena Horlemann, Srikanth Ramaswamy, Eilif B. Muller and Henry Markram (2017)

*Cerebral Cortex*, Oxford Academic

## Abstract

Synaptic connectivity between neurons is naturally constrained by the anatomical overlap of neuronal arbors, the space on the axon available for synapses, and by physiological mechanisms that form synapses at a subset of potential synapse locations. What is not known is how these constraints impact emergent connectivity in a circuit with diverse morphologies. We investigated the role of morphological diversity within and across neuronal types on emergent connectivity in a model of neocortical microcircuitry. We found that the average overlap between the dendritic and axonal arbors of different types of neurons determines neuron-type specific patterns of distance-dependent connectivity, severely constraining the space of possible connectomes. However, higher order connectivity motifs depend on the diverse branching patterns of individual arbors of neurons belonging to the same type. Morphological diversity across neuronal types, therefore, imposes a specific structure on first order connectivity, and morphological diversity within neuronal types imposes a higher order structure of connectivity. We estimate that the morphological constraints resulting from diversity within and across neuron types together lead to a 10-fold reduction of the entropy of possible connectivity configurations, revealing an upper bound on the space explored by structural plasticity.