Introducing: GPT-4-powered, scientific summaries - Consensus - Evidence-Based

2 Likes

Five years down the road, at the current pace of advance, we will be living in a different world.
Two Minute Papers has this video on GPT-4:
https://youtu.be/7VSWyghVZIg

When they actually realise and fix all the foundational errors in the understanding of artificial neural networks (information storage in weighted sums, ReLU as a literal switch, etc.), that should lead to further improvements and a reduction in model size.
For the moment, however, they are simply locked into those errors.

1 Like

The next step after not showing what's up your sleeve is to claim your trick is not a mere trick but an actual wonder: switching your business from research to priesthood.

2 Likes

What "errors" are you talking about?

1 Like

If you can get past some very heavy preconditioning and view ReLU as a literal switch rather than as a function, that opens the door to many improvements.
You can see how to directly incorporate computationally cheap dot-product algorithms like the FFT and the WHT, you can change ReLU into a 2-way switch, and so on.
In terms of understanding the weighted sum better, you can see how to reduce the effect of adversarial inputs by dealing with the case where the input vector correlates too closely with the weight vector, and you can understand why and how that is a problem. You can also see how a neuron and its forward-connected weights in the next layer project a pattern onto that layer, like Plato's shadows on a cave wall.
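To make the switch viewpoint concrete, here is a bare NumPy sketch (my own illustration, nothing more): relu(x) = max(x, 0) is numerically identical to a switch whose on/off state is decided by the sign of the weighted sum, and the weighted sum is largest exactly when the input lines up with the weight vector, which is where the adversarial problem comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_as_function(x):
    return np.maximum(x, 0.0)

def relu_as_switch(x):
    # the switch state is decided by sign(x); when "on" the value passes through unchanged
    return np.where(x > 0.0, x, 0.0)

v = rng.normal(size=8)    # input vector
w = rng.normal(size=8)    # the neuron's weight vector
x = np.dot(w, v)          # the weighted sum feeding the ReLU

assert np.allclose(relu_as_function(x), relu_as_switch(x))

# The adversarial point: an input aligned with w gives the largest possible
# weighted sum for its length, so correlating the input too closely with the
# weight vector drives the neuron to an extreme response.
aligned = w / np.linalg.norm(w) * np.linalg.norm(v)
print(np.dot(w, v), np.dot(w, aligned))
```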
All this has been explained before on this forum.
However, the preconditioning is so intense that it is nearly impossible for people to get past the purely functional view of ReLU, for example.
It also seems that in human science facts don't speak for themselves; the social standing of the person presenting the information counts very highly.

1 Like

If you replace ReLU with a switch you say goodbye to backpropagation, and the follow-up question is then: what do you replace it with in order to train a 1B-parameter model?

2 Likes

What? That's not how it works, my friend.

Then you lose differentiability and can't backprop through it. In practice you usually use softer versions of the activation function, like GELU, SwiGLU and Softplus, for well-behaved gradients.
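For reference, a quick sketch of those smoother activations using their standard textbook formulas (SwiGLU is a gated variant built on SiLU, so it's left out to keep this short):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)     # kink at 0: the gradient jumps from 0 to 1

def softplus(x):
    return np.log1p(np.exp(x))    # smooth everywhere; its gradient is sigmoid(x)

def gelu(x):
    # the widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-3, 3, 7)
print(np.round(relu(xs), 3))
print(np.round(softplus(xs), 3))
print(np.round(gelu(xs), 3))
```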

lmao :rofl: I never expected to see Plato's allegory of the cave used to describe NNs like that.

Adversarial attacks exist, and will continue to exist, because fundamentally they compromise a brittle system. Larger models are less susceptible to adversarial attacks, but with access to the full weights you can compromise any system.

The brain isn't resistant to this either. Dynamical systems in general depend on certain assumptions being met. You can't theoretically prove an equilibrium point at which those attacks won't work for any network, because that's impossible.

There isn’t any hierarchy here. Nobody even knows a scrap of information about me.

The only thing valued here is facts, which seem to be lacking in your response.

1 Like

Lol. Anyway, it's not my job to convert anyone to any perspective or to make anyone give up their dogma.
A ReLU neuron projects a pattern onto the following layer through the weights it connects to in that layer, the forward-connected weights if you like. That pattern has intensity x when x > 0 (x being the input to the ReLU). By 2-way switching I mean that a different pattern is projected when x <= 0, through an alternative set of forward-connected weights. With plain ReLU, nothing is projected when x <= 0.
Backpropagation still works as far as I have tested it, though I mainly evolve weights. One Google engineer said they had tried something like that, but maybe that is not true, since the paper I was pointed to didn't seem to contain the same concept, or if it did, they didn't try very hard to get it to work.
One key point about 2-way switching is that information can flow freely through the net under nearly all circumstances, rather than simply being cut off by ReLU whenever x <= 0.
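In code, the forward pass I'm describing looks roughly like this, as a bare NumPy sketch (with the assumption spelled out that the negative-side pattern is also projected with intensity x):

```python
import numpy as np

rng = np.random.default_rng(1)
width = 4

W_in  = rng.normal(size=(width, width))   # produces the pre-activations x
W_pos = rng.normal(size=(width, width))   # forward weights used when x_i > 0
W_neg = rng.normal(size=(width, width))   # alternative forward weights when x_i <= 0

def two_way_layer(v):
    x = W_in @ v                 # the weighted sums feeding the switches
    gate = x > 0.0
    # each x_i projects its pattern through one of the two forward weight sets
    return (x * gate) @ W_pos + (x * ~gate) @ W_neg

def relu_layer(v):
    x = W_in @ v
    return np.maximum(x, 0.0) @ W_pos   # plain ReLU: nothing projected when x_i <= 0

v = rng.normal(size=width)
print(two_way_layer(v))
print(relu_layer(v))
```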

1 Like

This can be accomplished with two ReLU nodes with opposite biases (one negative, the other positive).

More parameters, yes, but it avoids conditional processing, which is very hostile to the GPU workflow.

That doesn't sound quite right. I think you mean having two ReLU "functions", one with input x and one with input -x, where x is the value of one of the weighted sums in the net.

Each of those ReLUs then forward-connects to n weights in the next layer, where n is the width of the net.
Since there are two ReLUs for each weighted sum in the net, the total number of weights doubles, except perhaps in the final layer, depending on what you do.
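If I've read it right, that mirrored-ReLU pair reproduces the 2-way switch exactly once the ReLU(-x) unit gets its own forward weights. A small numerical check, under that assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
width = 4

x     = rng.normal(size=width)            # the pre-activations (weighted sums)
W_pos = rng.normal(size=(width, width))   # forward weights for the x > 0 side
W_neg = rng.normal(size=(width, width))   # forward weights for the x <= 0 side

def relu(z):
    return np.maximum(z, 0.0)

two_way  = (x * (x > 0)) @ W_pos + (x * (x <= 0)) @ W_neg
mirrored = relu(x) @ W_pos + relu(-x) @ (-W_neg)   # the second ReLU sees -x

assert np.allclose(two_way, mirrored)
print(two_way)
```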

1 Like

Obviously backprop would still "work", except the gradients you calculate would be so sharp and noisy (on top of the dying-neuron problem) that the network's overall performance would suffer heavily at scale.

1 Like

I don't see that at all, but then I'm no expert in backprop. When I evolve such nets I get perfect behaviour, with very smooth progress to a very low loss. In contrast, with ReLU the loss landscape is much rougher and progress only reaches a moderately low loss.
That is what you would expect from information-loss considerations, since a single ReLU blocks its input entirely about half the time.
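A rough back-of-the-envelope check of that blocking claim (just a sanity check, assuming roughly zero-mean, symmetric pre-activations):

```python
import numpy as np

# stand-in pre-activations: zero-mean and symmetric
x = np.random.default_rng(3).normal(size=100_000)

# fraction of units that plain ReLU zeros out, i.e. that project nothing forward
print((x <= 0).mean())   # ~0.5; a 2-way switch always projects something
```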

1 Like

Well, yeah, kind of. Instead of inputs w and -w they can have opposite biases.

The number of parameters theoretically doubles, but in practice the model might not need to learn (or evolve) a negative (inhibiting) node for every possible positive node.

1 Like

That would be interesting to see, if you can share it. Obviously the elephant in the room with evolving networks (or anything beyond the GPU's reach) is how big your elephant can get :disguised_face:

1 Like

Opposite biases would allow gaps (both ReLUs inactive) or overlaps (both ReLUs active), which would make the loss (cost) landscape rather rough again.
Switching exactly at zero is quite important for good results.
I'll provide some very simple code in a day or two. There is other code, but it involves dot-product shortcuts such as the FFT or WHT.
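Meanwhile, a tiny numerical illustration of the gap/overlap point (assuming the opposite-bias pair means ReLU(x + b) and ReLU(x - b) on the same weighted sum x, as opposed to the ReLU(x) / ReLU(-x) pair, which switches exactly at zero):

```python
def opposite_bias_pair(x, b):
    a = max(x + b, 0.0) > 0.0   # unit with bias +b
    c = max(x - b, 0.0) > 0.0   # unit with bias -b
    return a, c

def mirrored_pair(x):
    a = max(x, 0.0) > 0.0       # ReLU(x)
    c = max(-x, 0.0) > 0.0      # ReLU(-x): switches exactly at zero
    return a, c

for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(f"x={x:+.2f}  opposite biases: {opposite_bias_pair(x, 0.5)}  mirrored: {mirrored_pair(x)}")

# With biases +/-0.5, neither unit fires for x < -0.5 (a gap) and both fire
# for x > 0.5 (an overlap); the mirrored pair is complementary everywhere
# except exactly at zero.
```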

1 Like

OK, here's your code. Fortunately it wasn't too time-consuming.
Since it is online, I will fill in some comments explaining it during the day, as is convenient. You can edit the code, since you get a local copy.
https://editor.p5js.org/congchuatocmaydangyeu7/sketches/iZwTnULQ2

3 Likes

I did a blog post.
https://ai462qqq.blogspot.com/2023/03/2-siding-relu-via-forward-projections.html

1 Like