If the WHT is anything like the DFT, then your network is in fact a form of CNN whose kernel size equals the sequence length, with an unusual activation, no less.

Well, yes and no. The random sign flips at the start prevent the WHT (which, as you say, is similar to the DFT) from correlating with anything specific, though it is certainly still correlating with something, lol. The result is an initially fair distribution of information about the input data across all the first WHT outputs.
Here are some early papers on the WHT that someone recently prompted me to dig up: https://archive.org/details/DTIC_ADA045066
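As a rough illustration of the sign-flip point (my own sketch in Python, not the Switch Net code): after a fixed random sign flip, a length-n WHT spreads a change in any single input across every output.

```python
import numpy as np

def fwht(x):
    # fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 16
signs = rng.choice([-1.0, 1.0], size=n)  # fixed random sign flips

x = rng.normal(size=n)
y = fwht(signs * x)

# Perturb a single input; every one of the n outputs moves.
x2 = x.copy()
x2[3] += 1.0
y2 = fwht(signs * x2)
print(np.count_nonzero(y2 - y))  # prints 16
```

Because the WHT matrix has only +1/-1 entries, a unit change in one input shifts every output by exactly +1 or -1: one-to-all connectivity.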

I coded up Switch Net 4 in FreeBasic, which is nearly C with pointers etc. https://archive.org/details/switch-net-4-bpfb
There is a generic FreeBasic version and a Linux AMD64 version with assembly language speed-ups.
I am initially training a net with 128 by 128 color image inputs and outputs on one of the weaker CPUs around, and it seems to work fine. I'm training on about 150 images to start with; I would imagine several thousand images would be possible if you put the CPU time in. Multi-threading to use multiple CPU cores should also be doable.
It should be possible to port the code to other programming languages, especially ones with pointer capability, though Java would be possible too.
I don't know what happens if you put the code on a GPU with tensor cores to do the many 4 by 4 matrix operations involved!

In the comments I had to explain why you would want to destructure the spectral response of a fast transform. People don't usually think of its other behaviors, such as one-to-all (or one-to-nearly-all) connectivity, or the fact that the outputs of a fast transform are basically various statistical summary measures of the input vector. That means an error in one input to the fast transform has only a minimal impact on any particular output, and there is a lot of averaging-type behavior going on.
So even if the output of, say, a simple parametric activation function used in conjunction with a "frozen" neural network weight matrix is quite crude, all the averaging effects in a further fast transform will smooth the response out.
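A minimal sketch of that layer pattern, under my own assumptions (a two-slope "switch" activation between two WHTs; the function names and the two-slope form are mine, and the actual Switch Net 4 code works in 4 by 4 blocks rather than this scalar form):

```python
import numpy as np

def fwht(x):
    # fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def switch_layer(x, signs, a, b):
    """Fixed random sign flip -> WHT -> per-element two-slope activation -> WHT.

    Only the slope vectors a and b would be trained; the transforms and
    the sign pattern stay frozen.
    """
    n = len(x)
    f = fwht(signs * x) / np.sqrt(n)      # orthonormal scaling
    g = np.where(f >= 0.0, a * f, b * f)  # parametric "switch" activation
    return fwht(g) / np.sqrt(n)

rng = np.random.default_rng(1)
n = 8
signs = rng.choice([-1.0, 1.0], size=n)
a = np.ones(n)  # with a == b == 1 the activation is the identity,
b = np.ones(n)  # so the whole layer reduces to an orthogonal map
x = rng.normal(size=n)
y = switch_layer(x, signs, a, b)
print(np.allclose(y, signs * x))  # prints True
```

With identity slopes the two normalized WHTs cancel, so the layer is exactly the sign flip; training the slopes away from 1 is what deforms the spectral response.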

You could consider fast transforms as averaging machines: https://ai462qqq.blogspot.com/2023/05/the-case-for-fast-transforms-as.html
In the brain, random projections could provide similar averaging and error-cancellation effects; random projections have certainly been found in insect brains.

Latest information:
Training a neural network on 128 by 128 color images on a single celeron (aka junk) CPU core: https://youtube.com/shorts/8kS6Xcg4GbA
The training set is 1100 images, while the number of parameters is equivalent to only 24 images, so the net is vastly under-parameterized.
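To make that parameter arithmetic concrete (my own back-of-envelope, assuming each 128 by 128 color image is 128 x 128 x 3 raw values):

```python
pixels_per_image = 128 * 128 * 3        # raw values per color image
data_values = 1100 * pixels_per_image   # total values in the training set
param_budget = 24 * pixels_per_image    # "equivalent to 24 images"
print(pixels_per_image, data_values, param_budget)
print(round(data_values / param_budget, 1))  # prints 45.8
```

So by this counting there are roughly 46 times more training values than parameters.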

In statistical learning theory, where under- and over-parameterization in machine learning are studied, over-parameterization is determined not by comparing (the number of data points)\times(the dimensionality of the data) against the number of parameters as you've described, but by comparing just the number of data points against the number of parameters (or sometimes the total number of bits!).
For example, this shows up in a phenomenon in machine learning coined double descent: the training error converges to zero when the number of parameters is on the same order of magnitude as the number of data points, entering the interpolation regime. So from this perspective, your model is over-parameterized.
But I'm not an expert in this field, so I could be wrong.

Also, you can get the intuition by observing a linear model in a regression problem. If you have a dataset of n data points with dimensionality d, represented as an input matrix \mathbf{X}_{n\times d} and a target vector \mathbf{y} \in \mathbb{R}^n, the model has to learn a weight vector \mathbf{w} \in \mathbb{R}^d. Writing down what we want, \mathbf{Xw}=\mathbf{y}, you can immediately see that a solution is \mathbf{w}=\mathbf{X^+y}, where \mathbf{X^+} is the pseudoinverse of \mathbf{X}. This tells us that when n=d you get the exact solution (i.e. the training error is zero); when n>d the problem becomes overdetermined/under-parameterized and the training error is generally larger than zero; and when n<d the problem becomes underdetermined/over-parameterized and there are infinitely many distinct solutions that give a training error of zero.
You can see how over-parameterization is not about n\times d<d (which doesn't make sense in this case anyway), but is more closely related to n<d.
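A quick numerical check of the n versus d argument above (a sketch with numpy on random data; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def min_norm_fit_error(n, d):
    # Fit w = X^+ y via the pseudoinverse and return the training error ||Xw - y||.
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w = np.linalg.pinv(X) @ y
    return float(np.linalg.norm(X @ w - y))

print(min_norm_fit_error(50, 100) < 1e-8)   # n < d: underdetermined, exact fit (True)
print(min_norm_fit_error(100, 50) > 1e-3)   # n > d: overdetermined, residual remains (True)
```

With random Gaussian data, n < d fits the training set exactly (over-parameterized/interpolation regime), while n > d leaves a nonzero residual.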

I don't understand that. All I am saying is that the number of parameters in the net is far too small for a linear-system (matrix-mapping) fit of 1100 images.
I found that I can get about 70% of the efficiency of tuned assembly language code using Java and a server Java Virtual Machine (JVM), which is just at the point where you shouldn't write very specialised (machine-dependent) code. https://archive.org/details/switch-net-java-2
I think back-propagation in Switch Net is very interesting because there is zero information loss going backward: there are only sign flips and information-preserving transforms. You could potentially have extremely deep nets with no vanishing gradients, and it might also be good for recurrent nets.
Anyway, the gradient can never go the wrong way, though it might not always point exactly in the direction of steepest descent.
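A small check of the information-preservation claim (my own sketch, not the Switch Net back-propagation code): the WHT is symmetric, and orthogonal up to a factor of sqrt(n), so back-propagating a gradient through a normalized WHT is just the same transform again, and through a sign-flip layer it is just the same sign flips. Neither step shrinks the gradient norm.

```python
import numpy as np

def fwht(x):
    # fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(2)
n = 16
g = rng.normal(size=n)           # some upstream gradient
back = fwht(g) / np.sqrt(n)      # backward through a normalized WHT (self-transpose)
signs = rng.choice([-1.0, 1.0], size=n)
back2 = signs * back             # backward through the sign-flip layer
print(np.isclose(np.linalg.norm(back2), np.linalg.norm(g)))  # prints True
```

Since both steps are isometries, stacking many such layers cannot make the gradient vanish on its own; only the activation slopes can scale it.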

The next phase would be to do proper comparison metrics with conventional artificial neural networks and to move the code up to GPU form. I simply flag the code/idea and leave it to others to follow up if they want to.

Ok, I understand your under-parameterization argument now. If the data is compressible, then the number of parameters required for that compressed form is normal parameterization, and fewer than that is under-parameterization.