The grokking challenge?

firstly, whats this encoder? I cant find anything about it.

2 Likes

Sorry it’s a bit too long a story, I opened a different topic for it: CycleEncoder and VarCycleEncoder

3 Likes

Here-s a little update…
In my previous tests I didn’t include pairs of the same value in the dataset, e.g:

for i in range(1,97):
    for j in range(i):
    ...

When I corrected by including the same-value pairs too:

for i in range(97):
    for j in range(i+1):
      ...

Then accuracy dropped significantly, including on training data.

I also implemented a rudimentary classifier on top of bit pair value maps, which gets close to the MLPClassifier and interestingly, both failed with any SDR encoding I tested other than the VarCycleEncoder dense embeddings addition + thresholding.

I also cannot yet comprehend why concatenating the two SDRs (in a 2x SDR size) with the same encoder, failed, while overlapping dense+thresholding makes the same encoder very useful

There-s some “magic” in the simple addition of two embeddings of the same size.
Interestingly, transformers use this technique a lot - each transformer block input embedding is added(&softmaxed) to the attention-provided embedding representing a weighted “summary” of past inputs for the block.

Which begs a few questions, like:

  • what is special about this (or any “useful”) encoding?
  • would it be possible to develop a theory on what makes an encoding useful for a specific task?
  • is it possible that in the grokking paper referenced in the first message here, the whole phenomenon could be the transformers, after many useless iterations, end up discovering a “useful” intermediate encoding in bottom blocks that allow the top transformer blocks to generalize?
1 Like

My hypothesis is that the more the mod 97 encoder shows the following properties the better it becomes:

Property 1:
-the encoded input representation must be as small as possible

Property 2:
-if f(x,y) = ((x + y) mod 97) and enc(x,y) is the encoding of inputs x and y
-then for all i from 0 to 96, enc(x,y) must be constant for all x and y that satisfies f(x,y) = i

Property 3:
-enc(x_i,y_i) must be similar to enc(x_j,y_j) where |(x_i + y_i) - (x_j + y_j)| = 1 (or some other small number)
-this condition would be helpful only(?) if the output’s representation of neighboring numbers (where I consider 0 and 96 as neighbor) is similar to each other
-i’m not sure if this property is necessary

Would interesting if someone can think up an encoder to support or disprove the above.

How much of a failure is concatenation? Have you tried doubling the width of the hidden layers and see how it performs?

How do you represent the 97 outputs you mentioned above? Did you also use the varCycleEncoder at the output layer?

1 Like

The paper in the first message uses an encoding size of 128, and you may say the encodings are concatenated (transformer sees a stream of symbols)

I had better results with wider encodings than smaller. Though it takes longer to train.

For a large-ish SDR size of 400 bits (overlapped) and NN layers (800,400,400,400,400) the net got 100% train accuracy and 96.93% test accuracy in 25 iterations
With concatenating the two SDRs I got an double sized embedding, and I also doubled the network widths above. After same 25 iterations (which took much longer) the network couldn’t learn anything. the train/test accuracies were 1.3% and 0.97% - that’s as bad as random guessing.

They are simply one hot encodings for the output vector of size 97. The last layer of a classifier. The chosen class (0 to 96) is the one with the highest output, that’s how it’s trained.

1 Like

I think the second property I mentioned will limit how small one can make the encodings to be. That property I don’t think can be completely satisfied as the encodings of all X and Y pairs that have the same output can’t be exactly the same, so the more bits an encoding has the more it can try to be similar to each other using the extra bits in the encoding.

To me an encoding is simply a way to process the inputs into a form more readily mappable into its outputs. Hence, I hypothesized some properties that should make sense. When I think about it the 3rd property (or something similar) to guarantee a form of similarity between inputs or encoded inputs is important to provide some sort of structure for grokking to occur more readily.

Such failure. That’s very puzzling. More parameters might require much more training data. Concatenating also requires the ANN to learn double the amount of input pair relations (e.g. f(X,Y) and f(Y,X)). It might also need to learn the separation/concatenation boundary of the two inputs. Also, concatenating X and Y together would make it harder to satisfy the second property. As the encoding size get bigger dissimilarity between pairs that have same output increases. Smaller sized encoding has more potential to satisfy the second property. These are just guesses for brainstorming.

Also, what you said above got me thinking that maybe SDR-ification in this case is just a method to reduce variance/overfitting? It could be that instead of 0 and 1 binning it into 1, 0.5, and 0 or 1, 0.75, 0.5, 0.25, and 0 could work similarly as good.

1 Like

If you think that any i or j value from x = (i+j) % 97 does not have a preference for any result x kind of makes sense.

The fact the dataset is small does not help either.

One conundrum I had is what should enter the dataset and what not.
Since addition of encoded vectors is commutative, the pair (i,j) has the same encoding as pair (j,i). Which means that when you split the dataset in train and test cases you want to avoid having the encoding (i,j) in training and (j,i) in testing. That’s why I used only half of pairs (with i >= j) to build the whole dataset before shuffling and splitting in train/test subsets of 2376/2377 samples each

However when concatenating the i, j encodings (instead of adding) , the encoding of pairs (i,j) and (j,i) are different. Which means 9409 possible (i,j) distinct encodings. However I considered “fair” to compare the two cases in equal terms, so both (i,j) and (j,i) samples would be put together in either train or test data set, not separated.

I’ll try to see what happens if I relax this restriction.

I also tried RandomForest it kind of fails in various ways, in some cases it would overfit on training data without having a clue about testing, in other case it slightly learns on training data (~50%) while testing gets 10% accuracy - poor but significantly greater than random chance.

1 Like

I’m not sure what parameters you’re using in VarCycleEncoder function to generate encodings, so if you’re interested you could try to calculate the overall degree of similarity between all possible input pairs for the two encodings. Since one encoding is using concatenation and floating number while the other is combined and SDR-ified it’s hard to do direct comparison. Instead can try to leave out the SDR-rification step in order to compare similarity of two encodings of floating-point vectors. Similarity can be done using cosine similarity or whatever you think is appropriate.

The pseudocode would be: for each encoded pair find out how similar it is to all other encoded pairs besides itself using some similarity score. After calculating all the similarity scores find the average similarity score. Do this for each of the two encoding types and compare the average similarity scores between the two.

I think we can glimpse something from the result and could help explain why one is better than the other.

The input space being used is also very small (e.g. we only have 97 different inputs in a 400 dimensional space). I don’t know if this matters or not just pointing it out.

1 Like

I think what is making the encoder work so well is the fact that its already doing part of the computation since it works on the basis of modulos. so in a sense its partitioning the data in several bits and doing essentialy a bunch of if (x * m % 1) + (y * n % 1) >= theshold .

it you get enough of those randomly, some are bound to contain the right answer.

so my prediction is that some bits in the encoded SDRs are directly correlated with the answers we are looking for.

2 Likes

Yes it’s quite likely the fact encoder uses similar operation as the one modeled matters a lot.

It is also interesting that test predictions for i == j fail at high rate (eg. 43 out of 45 samples) , otherwise (excluding i==j pairs from dataset), test accuracy can get in high 90%

This was an useful experiment, first because I never tried to use VarCycleEncoder before, second because I mobilized myself to work on and test my own ValueMap classifier to compare it with the MLP.

I think I’ll postpone further testing here, division modulo N feels quite unnatural.

1 Like

I’m kinda curious to see how my Grid Cell Inspired Encoder would work on a problem like this since it explicitly encodes modulo remainders with respect to a set of chosen prime values.

1 Like

@CollinsEM - I see there-s a javascript example there, any chance you have a python incarnation too?

PS, never mind, description is clear & simple enough. Quite similar concept yet using integers.

I can’t see a means to represent the pair of (a,b) values other than concatenating their corresponding SDRs. The encoding I used did not perform well by concatenating the two SDRs, but by “overlapping” (add the dense encodings for a and b, followed by k-WTA)

Overlapping SDRs directly (bitwise OR)… that can be tried too

1 Like

I do not, but it’s a fairly simple algorithm. I could probably hack a python version, but it would take me a bit.

1 Like

Yes, no need to bother with python. Question though: how do you represent 0, 1 and 2 values?

These would have few (if any) bits active, I doubt ML classifiers like that.

1 Like

I found this 1999 paper that I am about to read:
https://archive.org/details/arxiv-cond-mat9901179
It’s about weight decay and phase transitions.
I definely will experiment with weight decay in fast transform neural networks, the low compute cost allows for thousands of epochs. And anyway I use thousands of epochs because I’m not a backpropagation expert to do better.

1 Like

The encoder always generates the same number of on bits, so same sparsity. For example, a simple encoder using only three primes: 2, 3, 5, would have 10 total bits (2+3+5) with three active bits at all times. It would also be able to encode 30 unique values (2*3*5) before repeating itself. The first few integers would be:

0: 10 100 10000
1: 01 010 01000
2: 10 001 00100
3: 01 100 00010
4: 10 010 00001
5: 01 001 10000

Here, the spaces have been added for clarity to illustrate the individual modulus modules (i.e. N%2 N%3 N%5). Please see the linked thread for additional details.

2 Likes

A little update here, another paper on the same problem, with a caption reading:

The algorithm implemented by the one-layer transformer for modular addition. Given two numbers a and b, the model projects each point onto a corresponding rotation using its embedding matrix. Using its attention and MLP layers, it then composes the rotations to get a representation of a + b mod P .

Here-s the arxiv link

Author is interested in understanding what happens in NNs and transformers in particular (interpretability issue), by running experiments on smaller models.

This page contains a list of specific topics 200 Concrete Open Problems in Mechanistic Interpretability: Introduction - AI Alignment Forum

2 Likes