The grokking challenge?

I did try with and without decay, it changes general sparsity and hurts memorization a bit but had no effect on the test performance.

but it could just be that the enforced sparsity networks, I’m so fond of are no good for this kind of stuff.

modular arithmetic is not fundamentally a seq2seq task. it can very well be encoded as one-hot inputs and one-hot outputs in a regular NN.


Yes… that’s seq2seq… the sequence is just in a more sparse format

1 Like

I think we-re talking more of a NN with 2N inputs and N outputs where N is e.g.97.

It can be a MLP or anything else in between.

1 Like

I have played a bit around this problem and so far I had best results with the following setup:

  • Each 0 to 96 input integer is encoded as a 400 long dense vector with a VarCycleEncoder(*).
  • Since addition (modulo N) is commutative the dataset contains only pairs (X,Y) without the corresponding (Y,X) pair.
  • For the same reason, instead of having a 800 long dense input for a pair of values, I added the dense representation of Y and X to get a 400 long dense representation of Y and X together
  • the resulting vector (Y and X) is SDR-ified at different sparsification levels e.g. 50/400, 80/400, 100/400, 133/400, 200/400 to obtain different bit only input data sets for all possible (97*96/2=4656) pairs.
  • a random half of these pairs are used for training, half for testing
  • training was done on a sklearn MLP regressor with 97 outputs instead of a classifier. The reasons are is I figured out how to keep the regressor train indefinitely, long after a classifier would have stopped with 100% accuracy on training data. Probably MLP Classifier have similar settings too.
  • The hidden layers for these results were (400,200,200,200,200,200) . Various depths and widths might work, adding more depth has slight improvements.

The best result I got was 99.4% on the testing half after 200 iterations, which takes ~103 sec. on an old, 2 core/4 threads laptop.

What I found it interesting is:

  • encoding matters. A lot. Other encoders had much poorer performance.
  • “SDR-fication” matters again, a lot. Just feeding the MLP the overlapped dense representations had much poorer results within the same compute budget, than shifting highest values in the dense vector to 1 and lower ones to 0
  • Contrary to Numenta’s SDR theory, best results (in this case of MLP trained on SDR) were obtained with very low sparsity, 1/3 bits 1 and even half 1 half 0 bits yielded close to top results.

Now, sure it raises the question upon why using a “special” encoding instead of the “neutral” one-hot encoding. That’s a long discussion. In my opinion, searching for and finding out efficient encodings which might also be useful for different problems is a valid path for several reasons I would gladly discuss.


firstly, whats this encoder? I cant find anything about it.


Sorry it’s a bit too long a story, I opened a different topic for it: CycleEncoder and VarCycleEncoder


Here-s a little update…
In my previous tests I didn’t include pairs of the same value in the dataset, e.g:

for i in range(1,97):
    for j in range(i):

When I corrected by including the same-value pairs too:

for i in range(97):
    for j in range(i+1):

Then accuracy dropped significantly, including on training data.

I also implemented a rudimentary classifier on top of bit pair value maps, which gets close to the MLPClassifier and interestingly, both failed with any SDR encoding I tested other than the VarCycleEncoder dense embeddings addition + thresholding.

I also cannot yet comprehend why concatenating the two SDRs (in a 2x SDR size) with the same encoder, failed, while overlapping dense+thresholding makes the same encoder very useful

There-s some “magic” in the simple addition of two embeddings of the same size.
Interestingly, transformers use this technique a lot - each transformer block input embedding is added(&softmaxed) to the attention-provided embedding representing a weighted “summary” of past inputs for the block.

Which begs a few questions, like:

  • what is special about this (or any “useful”) encoding?
  • would it be possible to develop a theory on what makes an encoding useful for a specific task?
  • is it possible that in the grokking paper referenced in the first message here, the whole phenomenon could be the transformers, after many useless iterations, end up discovering a “useful” intermediate encoding in bottom blocks that allow the top transformer blocks to generalize?
1 Like

My hypothesis is that the more the mod 97 encoder shows the following properties the better it becomes:

Property 1:
-the encoded input representation must be as small as possible

Property 2:
-if f(x,y) = ((x + y) mod 97) and enc(x,y) is the encoding of inputs x and y
-then for all i from 0 to 96, enc(x,y) must be constant for all x and y that satisfies f(x,y) = i

Property 3:
-enc(x_i,y_i) must be similar to enc(x_j,y_j) where |(x_i + y_i) - (x_j + y_j)| = 1 (or some other small number)
-this condition would be helpful only(?) if the output’s representation of neighboring numbers (where I consider 0 and 96 as neighbor) is similar to each other
-i’m not sure if this property is necessary

Would interesting if someone can think up an encoder to support or disprove the above.

How much of a failure is concatenation? Have you tried doubling the width of the hidden layers and see how it performs?

How do you represent the 97 outputs you mentioned above? Did you also use the varCycleEncoder at the output layer?

1 Like

The paper in the first message uses an encoding size of 128, and you may say the encodings are concatenated (transformer sees a stream of symbols)

I had better results with wider encodings than smaller. Though it takes longer to train.

For a large-ish SDR size of 400 bits (overlapped) and NN layers (800,400,400,400,400) the net got 100% train accuracy and 96.93% test accuracy in 25 iterations
With concatenating the two SDRs I got an double sized embedding, and I also doubled the network widths above. After same 25 iterations (which took much longer) the network couldn’t learn anything. the train/test accuracies were 1.3% and 0.97% - that’s as bad as random guessing.

They are simply one hot encodings for the output vector of size 97. The last layer of a classifier. The chosen class (0 to 96) is the one with the highest output, that’s how it’s trained.

1 Like

I think the second property I mentioned will limit how small one can make the encodings to be. That property I don’t think can be completely satisfied as the encodings of all X and Y pairs that have the same output can’t be exactly the same, so the more bits an encoding has the more it can try to be similar to each other using the extra bits in the encoding.

To me an encoding is simply a way to process the inputs into a form more readily mappable into its outputs. Hence, I hypothesized some properties that should make sense. When I think about it the 3rd property (or something similar) to guarantee a form of similarity between inputs or encoded inputs is important to provide some sort of structure for grokking to occur more readily.

Such failure. That’s very puzzling. More parameters might require much more training data. Concatenating also requires the ANN to learn double the amount of input pair relations (e.g. f(X,Y) and f(Y,X)). It might also need to learn the separation/concatenation boundary of the two inputs. Also, concatenating X and Y together would make it harder to satisfy the second property. As the encoding size get bigger dissimilarity between pairs that have same output increases. Smaller sized encoding has more potential to satisfy the second property. These are just guesses for brainstorming.

Also, what you said above got me thinking that maybe SDR-ification in this case is just a method to reduce variance/overfitting? It could be that instead of 0 and 1 binning it into 1, 0.5, and 0 or 1, 0.75, 0.5, 0.25, and 0 could work similarly as good.

1 Like

If you think that any i or j value from x = (i+j) % 97 does not have a preference for any result x kind of makes sense.

The fact the dataset is small does not help either.

One conundrum I had is what should enter the dataset and what not.
Since addition of encoded vectors is commutative, the pair (i,j) has the same encoding as pair (j,i). Which means that when you split the dataset in train and test cases you want to avoid having the encoding (i,j) in training and (j,i) in testing. That’s why I used only half of pairs (with i >= j) to build the whole dataset before shuffling and splitting in train/test subsets of 2376/2377 samples each

However when concatenating the i, j encodings (instead of adding) , the encoding of pairs (i,j) and (j,i) are different. Which means 9409 possible (i,j) distinct encodings. However I considered “fair” to compare the two cases in equal terms, so both (i,j) and (j,i) samples would be put together in either train or test data set, not separated.

I’ll try to see what happens if I relax this restriction.

I also tried RandomForest it kind of fails in various ways, in some cases it would overfit on training data without having a clue about testing, in other case it slightly learns on training data (~50%) while testing gets 10% accuracy - poor but significantly greater than random chance.

1 Like

I’m not sure what parameters you’re using in VarCycleEncoder function to generate encodings, so if you’re interested you could try to calculate the overall degree of similarity between all possible input pairs for the two encodings. Since one encoding is using concatenation and floating number while the other is combined and SDR-ified it’s hard to do direct comparison. Instead can try to leave out the SDR-rification step in order to compare similarity of two encodings of floating-point vectors. Similarity can be done using cosine similarity or whatever you think is appropriate.

The pseudocode would be: for each encoded pair find out how similar it is to all other encoded pairs besides itself using some similarity score. After calculating all the similarity scores find the average similarity score. Do this for each of the two encoding types and compare the average similarity scores between the two.

I think we can glimpse something from the result and could help explain why one is better than the other.

The input space being used is also very small (e.g. we only have 97 different inputs in a 400 dimensional space). I don’t know if this matters or not just pointing it out.

1 Like

I think what is making the encoder work so well is the fact that its already doing part of the computation since it works on the basis of modulos. so in a sense its partitioning the data in several bits and doing essentialy a bunch of if (x * m % 1) + (y * n % 1) >= theshold .

it you get enough of those randomly, some are bound to contain the right answer.

so my prediction is that some bits in the encoded SDRs are directly correlated with the answers we are looking for.


Yes it’s quite likely the fact encoder uses similar operation as the one modeled matters a lot.

It is also interesting that test predictions for i == j fail at high rate (eg. 43 out of 45 samples) , otherwise (excluding i==j pairs from dataset), test accuracy can get in high 90%

This was an useful experiment, first because I never tried to use VarCycleEncoder before, second because I mobilized myself to work on and test my own ValueMap classifier to compare it with the MLP.

I think I’ll postpone further testing here, division modulo N feels quite unnatural.

1 Like

I’m kinda curious to see how my Grid Cell Inspired Encoder would work on a problem like this since it explicitly encodes modulo remainders with respect to a set of chosen prime values.

1 Like

@CollinsEM - I see there-s a javascript example there, any chance you have a python incarnation too?

PS, never mind, description is clear & simple enough. Quite similar concept yet using integers.

I can’t see a means to represent the pair of (a,b) values other than concatenating their corresponding SDRs. The encoding I used did not perform well by concatenating the two SDRs, but by “overlapping” (add the dense encodings for a and b, followed by k-WTA)

Overlapping SDRs directly (bitwise OR)… that can be tried too

1 Like

I do not, but it’s a fairly simple algorithm. I could probably hack a python version, but it would take me a bit.

1 Like

Yes, no need to bother with python. Question though: how do you represent 0, 1 and 2 values?

These would have few (if any) bits active, I doubt ML classifiers like that.

1 Like

I found this 1999 paper that I am about to read:
It’s about weight decay and phase transitions.
I definely will experiment with weight decay in fast transform neural networks, the low compute cost allows for thousands of epochs. And anyway I use thousands of epochs because I’m not a backpropagation expert to do better.

1 Like

The encoder always generates the same number of on bits, so same sparsity. For example, a simple encoder using only three primes: 2, 3, 5, would have 10 total bits (2+3+5) with three active bits at all times. It would also be able to encode 30 unique values (2*3*5) before repeating itself. The first few integers would be:

0: 10 100 10000
1: 01 010 01000
2: 10 001 00100
3: 01 100 00010
4: 10 010 00001
5: 01 001 10000

Here, the spaces have been added for clarity to illustrate the individual modulus modules (i.e. N%2 N%3 N%5). Please see the linked thread for additional details.