FNet using the Fourier Transform to mix tokens

There is a paper on FNet in which the Fourier transform is used to mix tokens, relying on the property that a change in one input element alters all the output elements.
The fast Walsh Hadamard transform (WHT) can achieve the same thing, which is not too surprising since the fast WHT is essentially the FFT with the multiplies removed! :butterfly: :butterfly: :butterfly:
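
To make the mixing property concrete, here is a minimal sketch in plain Python. The function name `fwht` and the test vector are my own choices for illustration, not taken from the FNet paper. It changes one input element and looks at how the outputs move.

```python
def fwht(values):
    """Unnormalized fast Walsh Hadamard transform, natural order.

    Uses only additions and subtractions; len(values) must be a power of two.
    """
    x = list(values)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly with no multiply
        h *= 2
    return x


base = [3, 1, 4, 1, 5, 9, 2, 6]
bumped = list(base)
bumped[3] += 1  # change a single input element

diffs = [after - before for before, after in zip(fwht(base), fwht(bumped))]
print(diffs)  # every entry is +1 or -1, so every output element moved
```

Because the transform is linear, the differences are just the transform of a vector with a single 1 in it, which is one of the +1/-1 patterns.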

The fast Walsh Hadamard transform can also convert uniformly distributed random numbers into (approximately) Normally distributed ones with an ultra short piece of code.
There are some slight technical points about that for high precision applications.
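
A rough sketch of that idea in plain Python follows. It is not the example originally linked here. The names `fwht_orthonormal` and `fraction_within_one_std`, the vector length and the seed are all assumptions of mine. Each output is a scaled sum of plus or minus every input, so by the central limit theorem the outputs look approximately Normal.

```python
import math
import random


def fwht_orthonormal(values):
    """Fast Walsh Hadamard transform scaled by 1/sqrt(n) (power-of-two length)."""
    x = list(values)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def fraction_within_one_std(samples):
    """Share of samples lying within one standard deviation of the mean."""
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    return sum(1 for s in samples if abs(s - mean) <= std) / len(samples)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 4096
uniform = [rng.uniform(-1.0, 1.0) for _ in range(n)]
mixed = fwht_orthonormal(uniform)

# In theory a uniform distribution gives about 0.58 and a Normal one about 0.68.
print(fraction_within_one_std(uniform))
print(fraction_within_one_std(mixed))
```

The scaling keeps the total energy (sum of squares) unchanged, so the output has the same variance as the input.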

If you are a connectionist, you should use fast transforms to do the connecting for you. They are very efficient at it.
The main problem is that fast transforms compute a spectrum, which is a very biased thing to do: it does not connect everything in a fair way. However, a little bit of cheap preprocessing can solve that.

Just for amusement.
Consider the underlying sequency patterns of the Walsh Hadamard transform in natural order (the sequency of a pattern is its number of transitions from +1 to -1 and from -1 to +1).
A Walsh Hadamard transform then simply measures how much of each pattern is embedded in, say, an input image. That information is complete and invertible. Conceptually it is a lot simpler than a Fourier transform.
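
The patterns are easy to generate yourself. The sketch below is plain Python and the helper names `hadamard_row` and `sequency` are mine. It prints the 8 natural-order patterns with their sequencies, then measures how much of each pattern is in a small made-up signal and rebuilds the signal from those numbers.

```python
def hadamard_row(i, n):
    """Row i of the n x n natural-order Hadamard matrix, entries +1 or -1."""
    return [-1 if bin(i & j).count("1") % 2 else 1 for j in range(n)]


def sequency(row):
    """Number of sign transitions along a pattern."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


n = 8
patterns = [hadamard_row(i, n) for i in range(n)]
for row in patterns:
    print("".join("+" if v > 0 else "-" for v in row), sequency(row))

sequencies = [sequency(row) for row in patterns]  # [0, 7, 3, 4, 1, 6, 2, 5]

# "How much of each pattern is embedded": one dot product per pattern.
signal = [2, 7, 1, 8, 2, 8, 1, 8]
coeffs = [sum(p * s for p, s in zip(row, signal)) for row in patterns]

# Invertible: add the patterns back up, weighted by the coefficients, divide by n.
restored = [sum(patterns[k][j] * coeffs[k] for k in range(n)) / n for j in range(n)]
print(restored)
```

In natural order the sequencies come out shuffled rather than increasing, which is why a separate sequency ordering also exists.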

I will say this again about ReLU.
ReLU is a switch: f(x)=x is connect, f(x)=0 is disconnect. A light switch in your house is binary (on or off), yet it connects and disconnects a continuously variable AC voltage signal. In your house you decide the switch state. In a ReLU the switch state is decided by a predicate (x<0).
Then a ReLU neural network is a switched composition of weighted sums. The outputs of weighted sums are disconnected from the inputs to other weighted sums, or remain connected.
For a particular input vector all the switch states become known during feedforward. The network then is a particular (switched) composition of weighted sums connecting back to the input vector.
Since that is an entirely linear system (because the switch states are known), it can be simplified down to a single matrix mapping the input vector to the output vector (a square matrix when the input and output have the same size).
That viewpoint is so unexpected to some people that it can actually prompt not very good behavior.
To me it is a way of getting a handle on the mathematics of a ReLU neural network. However, I mainly use it to understand the behavior of an unconventional system.
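
Here is a small sketch of the switched-composition argument in plain Python. It assumes a tiny two-layer network with random weights and no bias terms. The sizes, seed and helper names (`matvec`, `matmul`) are arbitrary choices for illustration. Once the switch states for one input are known, the whole network folds into a single matrix for that input.

```python
import random


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


rng = random.Random(1)
n_in, n_hidden, n_out = 4, 6, 4
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

# Ordinary feedforward pass; the switch states become known here.
pre = matvec(w1, x)
switches = [1.0 if p > 0 else 0.0 for p in pre]  # 1 = connected, 0 = disconnected
hidden = [p * s for p, s in zip(pre, switches)]  # same values as ReLU(pre)
y = matvec(w2, hidden)

# With the switches fixed, zero the disconnected rows of w1 and multiply through.
w1_switched = [[s * w for w in row] for row, s in zip(w1, switches)]
effective = matmul(w2, w1_switched)  # one n_out x n_in matrix for this input
y_linear = matvec(effective, x)

print(y)
print(y_linear)  # matches y up to floating point rounding
```

The matrix `effective` is only valid for inputs that produce the same switch states. A different input can flip some switches and give a different matrix.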

I will clarify even more.
When you graph out ReLU there is a 45 degree line when the input x is greater than zero.
If you graph out the behavior of an electrical switch in the on position, zero volts in gives 0 volts out, 1 volt in gives 1 volt out, 2 volts in gives 2 volts out. That gives a 45 degree line. Obviously if the electrical switch is off you get zero volts out.

I believe the brain meltdown I have caused in some senior researchers is due to their belief that turning on a switch should result in a Heaviside step function type of response.
That would only be the case if there were a fixed non-zero voltage on the input to an electrical switch.
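
The difference is easy to see in a few lines of plain Python. The function names here are made up for illustration. The same switch gives a ReLU when the signal itself is on its input, and a Heaviside step only when a fixed voltage is on its input.

```python
def switch(is_on, v_in):
    """Ideal switch: passes the input voltage when on, outputs zero when off."""
    return v_in if is_on else 0.0


def relu(x):
    # The switched signal is x itself, so "on" gives the 45 degree line.
    return switch(x > 0, x)


def step_from_fixed_input(x, v_fixed=1.0):
    # A fixed voltage sits on the switch input, so "on" gives a flat level.
    return switch(x > 0, v_fixed)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([relu(x) for x in xs])                   # [0.0, 0.0, 0.0, 1.0, 2.0]
print([step_from_fixed_input(x) for x in xs])  # [0.0, 0.0, 0.0, 1.0, 1.0]
```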

The issue for me is that this makes it very difficult to explain an unconventional neural network of my devising to people.