If the WHT is anything like the DFT, then your network is in fact a form of CNN with the kernel size equal to the sequence length, and with an unusual activation, no less.
Well, yes and no. There are random sign flips at the start that stop the WHT (which is similar to the DFT, as you say) from correlating with anything specific in the input. It is certainly still correlating with something, lol. The result is an initial, fair distribution of information about the input data across all of the first WHT's outputs.
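To make that concrete, here is a minimal NumPy sketch (illustration only, not the FreeBasic code in the repo) of a fast WHT with a fixed random sign flip in front of it. A structured input dumps nearly all of its energy into a couple of Walsh coefficients, while the sign-flipped version spreads it fairly evenly across all the outputs. The function and variable names are mine.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (Hadamard ordering), O(n log n)."""
    x = x.astype(np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)          # orthonormal scaling

rng = np.random.default_rng(0)
n = 256
signs = rng.choice([-1.0, 1.0], size=n)   # fixed random sign flips

x = np.zeros(n)
x[:n // 2] = 1.0                  # a very structured input

plain = fwht(x)                   # energy piles up in a couple of Walsh coefficients
flipped = fwht(signs * x)         # energy is spread fairly across all outputs

print("largest |coefficient| without sign flips:", np.abs(plain).max())
print("largest |coefficient| with sign flips:   ", np.abs(flipped).max())
```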
Here are some early papers on the WHT that someone prompted me to dig up recently: https://archive.org/details/DTIC_ADA045066
I coded up Switch Net 4 in FreeBasic, which is nearly C (pointers and so on): https://archive.org/details/switch-net-4-bpfb
There is a generic FreeBasic version and a Linux AMD64 version with assembly language speed-ups.
I am initially training a net with 128 by 128 color image inputs and outputs on one of the weaker CPUs around, and it seems to work fine. I'm training on about 150 images to start with; I would imagine several thousand images would be possible if you put the CPU time in. As for multi-threading to use multiple CPU cores, I think that should be doable.
It should be possible to port the code to other programming languages, especially ones with pointers, though Java would be possible too.
I don’t know what happens if you put the code on a GPU and use the tensor cores to do the many 4 by 4 matrix operations involved!
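For what it's worth, here is how I read the 4 by 4 part, sketched in NumPy rather than FreeBasic: cut the vector into 4-element blocks and give each block its own little 4 by 4 matrix, which on a GPU collapses into a single batched matmul of exactly the shape tensor cores are built for. This is my interpretation, not the actual kernel in the code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024                                   # vector length, a multiple of 4
W = rng.normal(size=(n // 4, 4, 4))        # one 4x4 weight matrix per block (hypothetical)
x = rng.normal(size=n)

blocks = x.reshape(n // 4, 4)              # cut the vector into 4-element blocks
y = np.einsum('bij,bj->bi', W, blocks)     # batched 4x4 matrix-vector products
y = y.reshape(n)                           # back to a flat vector
print(y.shape)
```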
In the comments I had to explain why you would want to destructure the spectral response of a fast transform. People don’t usually think about its other behaviors, such as one-to-all (or one-to-nearly-all) connectivity, or the fact that the outputs of a fast transform are basically various statistical summary measures of the input vector. That means an error in one input to the fast transform has only a minimal impact on any particular output, and there is a lot of averaging-type behavior going on.
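A quick numerical check of that claim (plain NumPy, nothing from the actual project): inject an error of size 1 into a single input of an orthonormally scaled WHT and every output moves by only 1/sqrt(n).

```python
import numpy as np
from scipy.linalg import hadamard

n = 1024
H = hadamard(n) / np.sqrt(n)      # orthonormally scaled Walsh-Hadamard matrix

rng = np.random.default_rng(2)
x = rng.normal(size=n)
x_bad = x.copy()
x_bad[7] += 1.0                   # inject an error of size 1 into one input

change = H @ x_bad - H @ x
print(np.abs(change).max())       # about 0.03125 = 1/sqrt(1024), for every output
```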
Because of all that averaging, even if the output of, say, a simple parametric activation function used in conjunction with a “frozen” neural network weight matrix is quite crude, the further fast transform will smooth the response out.
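Here is a rough sketch of the kind of layer I mean, written in NumPy as a guess at the structure rather than a copy of the Switch Net 4 code: frozen random sign flips and a WHT act as the fixed weight matrix, a per-element two-slope switch supplies the trainable parameters, and a second WHT does the smoothing.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling."""
    x = x.astype(np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

rng = np.random.default_rng(3)
n = 256
signs = rng.choice([-1.0, 1.0], size=n)   # frozen, never trained
pos_slope = rng.normal(size=n)            # trainable parameters of the
neg_slope = rng.normal(size=n)            # two-slope "switch" activation

def layer(x):
    h = fwht(signs * x)                                   # frozen "weight matrix"
    h = np.where(h >= 0.0, pos_slope * h, neg_slope * h)  # crude piecewise-linear response
    return fwht(h)                                        # second transform averages it out

print(layer(rng.normal(size=n))[:4])
```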
You could consider fast transforms as averaging machines: https://ai462qqq.blogspot.com/2023/05/the-case-for-fast-transforms-as.html
In the brain, random projections could provide similar averaging and error-cancellation effects. Random projections have definitely been found in insect brains.
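As a toy illustration of that error-cancellation idea (a plain dense random sign matrix, nothing biological about it): add independent noise to every one of the n inputs and the per-output error after the projection stays about the same size as the per-input noise, because the individual errors mostly cancel inside the weighted sums.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1024
R = rng.choice([-1.0, 1.0], size=(n, n)) / np.sqrt(n)  # dense random projection

x = rng.normal(size=n)
noise = rng.normal(scale=0.1, size=n)   # independent error on every input

out_err = R @ (x + noise) - R @ x
print(np.std(noise), np.std(out_err))   # both around 0.1: the n input errors
                                        # largely cancel instead of piling up
```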