I was watching a nice video on the ReLU activation function and the approximating capacity of neural networks:

https://youtu.be/UXs4ZxKaglg

I don’t know why they throw away half the information with ReLU (f(x)=x for x>=0, f(x)=0 for x<0) when they could use a switch-slopes-at-zero activation function: f(x)=a·x for x>=0, f(x)=b·x for x<0.

Depending on the slopes, that can act as ReLU, a straight pass-through, abs, negate, and a lot of other things useful for piecewise-linear approximation.
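A minimal sketch of that two-slope idea (the name `switchSlope` and the parameter names a, b are my own labels, not from the video):

```java
public class SwitchSlope {
    // f(x) = a*x for x >= 0, b*x for x < 0; a and b could be fixed or learned per unit.
    public static float switchSlope(float x, float a, float b) {
        return x >= 0f ? a * x : b * x;
    }

    public static void main(String[] args) {
        // Special cases of (a, b):
        System.out.println(switchSlope(-2f,  1f,  0f)); // (1,  0) = ReLU     -> 0.0
        System.out.println(switchSlope(-2f,  1f,  1f)); // (1,  1) = identity -> -2.0
        System.out.println(switchSlope(-2f,  1f, -1f)); // (1, -1) = abs      -> 2.0
        System.out.println(switchSlope(-2f, -1f, -1f)); // (-1,-1) = negate   -> 2.0
    }
}
```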

He also speaks of greater expressiveness using ReLU squared.

ReLU is only nonlinear at its switch point; the square of ReLU is nonlinear everywhere in its response to inputs greater than zero. You can get smoother interpolation rather than a crystalline-looking piecewise-linear response.
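To make the contrast concrete (my own sketch): ReLU's slope jumps from 0 to 1 at the origin, while ReLU squared has slope 2x, which grows continuously from zero.

```java
public class ReluSquared {
    public static float relu(float x)  { return x > 0f ? x : 0f; }

    // ReLU squared: piecewise-linear kink replaced by a smooth curve for x > 0.
    public static float relu2(float x) { float r = relu(x); return r * r; }

    public static void main(String[] args) {
        for (float x = 0f; x <= 2f; x += 0.5f) {
            System.out.println(x + "  relu=" + relu(x) + "  relu2=" + relu2(x));
        }
    }
}
```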

The square is a bit extreme, though. There are very fast bit-hack approximations for the square root. x to the power of 1.5 (x·sqrt(x)) would be nicer, I think, because it would not decrease so fast as x approaches zero.
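A straightforward (non-bit-hack) version of that x^1.5 activation, as I mean it here:

```java
public class Relu15 {
    // f(x) = x^1.5 = x * sqrt(x) for x > 0, else 0.
    // Near zero this falls off as x^1.5, slower than the x^2 of ReLU squared.
    public static float relu15(float x) {
        return x > 0f ? x * (float) Math.sqrt(x) : 0f;
    }
}
```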

Just sqrt(x) on its own is not a nice activation function because it acts as soft binarization, with an attractor state at 1 and a strong push away from zero.
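A quick numeric illustration of that attractor (my own, not from the video): repeatedly applying sqrt drags any positive input toward 1.

```java
public class SqrtAttractor {
    public static void main(String[] args) {
        float x = 0.01f;
        // A single sqrt already pushes 0.01 well away from zero (to about 0.1).
        System.out.println("one application: " + (float) Math.sqrt(x));
        for (int i = 0; i < 10; i++) {
            x = (float) Math.sqrt(x);
        }
        // After ten applications the value sits very close to the fixed point 1.
        System.out.println("ten applications: " + x);
    }
}
```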

Combined with the weighted sum, plain sqrt(x) really throws away a lot of information.

There probably are bit hacks you could create to get something like x to the power of 1.5 while avoiding the cost of the multiply in x·sqrt(x). I’ll look at that over the next few days.
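One direction that might work (purely my speculation, in the spirit of the classic fast-inverse-square-root trick): the raw bits of a positive float are roughly a fixed-point log2 of the value, so scaling them by 3/2 around the bias of 1.0f approximates x^1.5 with just a shift and two integer adds.

```java
public class FastPow15 {
    private static final int ONE_BITS = 0x3f800000; // raw bits of 1.0f

    // Speculative fast x^1.5 for x > 0, no multiply: scale the
    // bit pattern (a rough log2) by 3/2 around the bias of 1.0f.
    // Exact for powers of 4; within roughly 10% on spot checks elsewhere.
    public static float fastPow15(float x) {
        int d = Float.floatToRawIntBits(x) - ONE_BITS;
        return Float.intBitsToFloat(ONE_BITS + d + (d >> 1));
    }
}
```

For example, fastPow15(4.0f) lands exactly on 8.0f, while fastPow15(2.0f) gives 3.0 against the true 2^1.5 ≈ 2.83 — crude, but there is no multiply anywhere in it.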

I have some code for a fast sign-preserving square root on an array:

```
// Approximate square root, retaining sign: a squashing function for neural networks.
// x >= 0 -> sqrt(x); x < 0 -> -sqrt(-x)
public void signedSqRt(float[] rVec, float[] x) {
    for (int i = 0; i < x.length; i++) {
        int f = Float.floatToRawIntBits(x[i]);
        // Halve the biased exponent field (fast sqrt approximation on the
        // magnitude bits), then restore the original sign bit.
        int sri = (((f & 0x7fffffff) + (127 << 23)) >>> 1) | (f & 0x80000000);
        rVec[i] = Float.intBitsToFloat(sri);
    }
}
```