The grokking challenge?

I have played around a bit with this problem, and so far I've had the best results with the following setup:

  • Each input integer from 0 to 96 is encoded as a 400-long dense vector with a VarCycleEncoder(*).
  • Since addition (modulo N) is commutative, the dataset contains only the pair (X,Y), without the corresponding (Y,X) pair.
  • For the same reason, instead of using an 800-long dense input for a pair of values, I added the dense representations of X and Y to get a single 400-long dense representation of the pair.
  • The resulting vector is SDR-ified at different sparsification levels, e.g. 50/400, 80/400, 100/400, 133/400, 200/400, to obtain different binary-only input datasets for all possible (97*96/2 = 4656) pairs; see the sketch after this list.
  • A random half of these pairs is used for training, the other half for testing.
  • Training was done with a sklearn MLP regressor with 97 outputs instead of a classifier (see the training sketch below). The reason is that I figured out how to keep the regressor training indefinitely, long after a classifier would have stopped at 100% accuracy on the training data. The MLPClassifier probably has similar settings too.
  • The hidden layers for these results were (400, 200, 200, 200, 200, 200). Various depths and widths might work; adding more depth gives slight improvements.
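
For concreteness, here is a minimal sketch of the dataset construction described above. The VarCycleEncoder is not specified in this post, so a fixed random projection stands in for it; the pair summing, top-k SDR-ification, and the 4656-pair enumeration follow the steps above.

```python
import numpy as np
from itertools import combinations

N = 97   # modulus
D = 400  # dense vector length
K = 133  # active bits after SDR-ification (one of the levels tried)

rng = np.random.default_rng(0)

# Stand-in for the VarCycleEncoder: a fixed random dense vector per
# integer. The real encoder is not described in this post.
codebook = rng.standard_normal((N, D))

def sdrify(dense, k=K):
    # Set the k highest values to 1 and the rest to 0.
    out = np.zeros(dense.shape)
    out[np.argsort(dense)[-k:]] = 1.0
    return out

# All 97*96/2 = 4656 unordered pairs; the (Y, X) duplicates are
# omitted because addition mod N is commutative.
pairs = list(combinations(range(N), 2))
X = np.array([sdrify(codebook[a] + codebook[b]) for a, b in pairs])
y = np.array([(a + b) % N for a, b in pairs])
print(X.shape, y.shape)  # (4656, 400) (4656,)
```

Sweeping k over the listed levels (50, 80, 100, 133, 200) reproduces the different sparsification levels.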

The best result I got was 99.4% on the testing half after 200 iterations, which takes ~103 seconds on an old 2-core/4-thread laptop.
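
And a sketch of the regressor-as-classifier training step, using synthetic stand-ins for the X/y arrays built above so the block is self-contained. The exact solver settings that keep the regressor training indefinitely aren't given in the post; tol=0.0 is just one way to discourage early convergence stopping.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the SDR inputs and labels built above.
X = (rng.random((4656, 400)) < 1/3).astype(float)
y = rng.integers(0, 97, size=4656)

# Random half for training, half for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(400, 200, 200, 200, 200, 200),
                   max_iter=200,
                   tol=0.0)  # discourage early stopping; exact settings not given
mlp.fit(X_tr, np.eye(97)[y_tr])  # one-hot targets, 97 regression outputs

# Classify by taking the argmax over the 97 regression outputs.
acc = (mlp.predict(X_te).argmax(axis=1) == y_te).mean()
print(f"test accuracy: {acc:.3f}")
```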

What I found interesting:

  • Encoding matters. A lot. Other encoders had much poorer performance.
  • “SDR-ification” matters, again a lot. Just feeding the MLP the overlapped dense representations gave much poorer results within the same compute budget than thresholding, i.e. setting the highest values in the dense vector to 1 and the lower ones to 0.
  • Contrary to Numenta’s SDR theory, the best results (in this case, an MLP trained on SDRs) were obtained at very low sparsity: 1/3 of the bits set to 1, and even half 1s and half 0s, yielded close-to-top results.

Now, sure, this raises the question of why one would use a “special” encoding instead of the “neutral” one-hot encoding. That’s a long discussion. In my opinion, searching for and finding efficient encodings that might also be useful for different problems is a valid path, for several reasons I would gladly discuss.
