There is a nice discussion of the Concatenated ReLU (CReLU) activation function on page 35 of this thesis:
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=8531&context=etd
So you can view the output of CReLU as a two-dimensional vector (ReLU(x), ReLU(-x)), where x is the scalar input.
Used with a conventional dense layer, you get double the number of neuron output values per layer once those two-dimensional vectors are broken up into their component elements.
That means the next layer needs double the number of weights to connect to the doubled outputs of the previous layer.
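A minimal sketch of that doubling (NumPy; the layer sizes here are just placeholders I picked for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def crelu(x):
    # Concatenated ReLU: (ReLU(x), ReLU(-x)) stacked along the feature axis,
    # so the output has twice as many features as the input
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

n_in, n_out = 8, 4
x = rng.standard_normal((3, n_in))           # batch of 3 input vectors
h = crelu(x)                                 # shape (3, 16): feature count doubled
W = rng.standard_normal((2 * n_in, n_out))   # the next layer needs twice the weights
y = h @ W
print(h.shape, y.shape)                      # (3, 16) (3, 4)
```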
The information flowing through a deep neural network is often lower-dimensional, embedded in a higher-dimensional space.
So even though ReLU is a seriously information-blocking activation function, there are enough weight connections in an ordinary dense layer for the necessary information to flow through.
Therefore the benefit of CReLU in dense layers is small; convolution-type layers are a different matter.
In ReLU-based convolutional layers the network can be seen to directly construct filter pairs that let information flow around the ReLU.
CReLU allows the information to flow through directly.
Of more interest to me is whether CReLU provides better generalisation in associative memory.
What does it mean for information to flow freely through an activation function and then into a weighted sum, versus activation functions that block information in various ways, e.g. threshold activation functions that let only one bit through, or ReLU, which blocks completely about 50% of the time and is otherwise completely non-blocking?
I suppose the simplest way to understand it is that the weighted sum is looking at the input data through an activation-function window.
Some activation functions are like looking at the data through a window that hasn't been washed in 30 years: you lose a lot of the fine detail of what is going on.
CReLU would be nearly transparent. A problem arises if both outputs of the CReLU lead into weights of the same sign (say w1 and w2 are both positive).
Then w1·ReLU(x) + w2·ReLU(-x) is always non-negative, and you have lost information about the sign of x.
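A tiny numeric illustration of that sign loss (the weight values are arbitrary):

```python
w1, w2 = 0.7, 1.3                # both positive

def unit(x):
    # one CReLU pair feeding a weighted sum with same-sign weights
    return w1 * max(x, 0.0) + w2 * max(-x, 0.0)

print(unit(2.0))                 # 1.4, from a positive input
print(unit(-1.4 / 1.3))          # also ~1.4, from a negative input:
                                 # same output, so the sign of x is lost
```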
You can have a Decoupled CReLU (DCReLU), where the switching decision is based on a random projection of the input vector; then DCReLU is almost completely information transparent.
A difference, though, is where the behaviour switching happens: with CReLU it always occurs around the zero point of x.
Switching with DCReLU depends on characteristics of the whole input vector rather than on the input scalar x alone, and the two are not strongly correlated with each other; DCReLU definitely does not switch around the zero point of x.
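Since DCReLU is my own construction, here is only a minimal sketch of what I mean, assuming the gate for each unit comes from the sign of a fixed random projection of the whole input vector and the raw value is routed to one of two output slots:

```python
import numpy as np

rng = np.random.default_rng(0)

def dcrelu(v, R):
    # Sketch of Decoupled CReLU: each element of v is routed to one of two
    # output slots, but the switching bit comes from the sign of a fixed
    # random projection of the whole input vector, not from the sign of the
    # element itself, so the value passes through unchanged either way.
    gate = (R @ v) >= 0.0                 # one switching bit per unit
    pos = np.where(gate, v, 0.0)          # slot used when the gate is on
    neg = np.where(gate, 0.0, v)          # slot used when the gate is off
    return np.concatenate([pos, neg])     # doubled output, like CReLU

n = 8
v = rng.standard_normal(n)
R = rng.standard_normal((n, n))           # fixed random projection matrix
print(dcrelu(v, R).shape)                 # (16,)
```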
I imagine that with CReLU you get much smoother decision boundaries, and it seems to be very easy to train.
With DCReLU you maybe get smarter, clearer decision boundaries, but very jagged ones, and DCReLU seems to require more training.
The question I would like resolved is whether CReLU or DCReLU gives better generalisation.
I have to find some kind of test to apply.
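One possible starting point, not the definitive test: fix a random hidden layer, fit only a linear readout on top of it with each activation, and compare held-out error on a toy target. Everything in this sketch (the toy target, the fixed random weights, the least-squares readout) is an assumption of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def crelu(h):
    # CReLU: (ReLU(h), ReLU(-h)) concatenated along the feature axis
    return np.concatenate([np.maximum(h, 0.0), np.maximum(-h, 0.0)], axis=-1)

def dcrelu(h, g):
    # DCReLU sketch: route h to one of two slots using a gate taken from a
    # separate random projection of the input, not from the sign of h itself
    gate = g >= 0.0
    return np.concatenate([np.where(gate, h, 0.0), np.where(gate, 0.0, h)], axis=-1)

def target(x):
    # toy 2-D regression target, purely for illustration
    return np.sin(3.0 * x[:, :1]) * np.cos(2.0 * x[:, 1:])

d, n_hidden, n_train, n_test = 2, 50, 100, 2000
x_train = rng.uniform(-2, 2, size=(n_train, d))
x_test = rng.uniform(-2, 2, size=(n_test, d))

W = rng.standard_normal((d, n_hidden))    # fixed random hidden weights
R = rng.standard_normal((d, n_hidden))    # fixed random gating projection (DCReLU only)

def features(x, kind):
    h = x @ W
    return crelu(h) if kind == "crelu" else dcrelu(h, x @ R)

for kind in ("crelu", "dcrelu"):
    # least-squares linear readout on the training features
    w, *_ = np.linalg.lstsq(features(x_train, kind), target(x_train), rcond=None)
    mse = np.mean((features(x_test, kind) @ w - target(x_test)) ** 2)
    print(kind, "held-out MSE:", mse)
```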