The grokking challenge?

Regarding your dataset generator: I haven’t seen it mentioned in the paper that x and y should be smaller than 97. Of course, learning over larger numbers would be even more difficult.
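For reference, here is a minimal sketch of such a generator, assuming the (x op y) % 97 setup from the paper; the function names and the train/validation split fraction are my own choices, not the paper’s code:

    import itertools
    import random

    P = 97  # prime modulus from the paper; both operands stay below it

    def make_dataset(op=lambda x, y: (x + y) % P, train_fraction=0.5, seed=0):
        """Enumerate all pairs with 0 <= x, y < P and split into train/val."""
        examples = [(x, y, op(x, y)) for x, y in itertools.product(range(P), repeat=2)]
        random.Random(seed).shuffle(examples)
        cut = int(len(examples) * train_fraction)
        return examples[:cut], examples[cut:]

    train, val = make_dataset()
    print(len(train), len(val), train[0])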

A few more observations on the paper:

  • do I understand correctly that the token embeddings for x and y were random (learned from scratch)? Using a similarity-preserving representation like a scalar encoder should be much easier to extrapolate over (see the sketch after this list).
  • the model is indeed small (about 400k parameters), with only two transformer blocks (width 128) stacked on top of each other, whereas typical transformers stack dozens of blocks at widths of 1k-10k.
  • it failed to generalize on more complex equations such as
    (x**3 + x*y**2 + y) % 97
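As a toy illustration of the embedding point (my own construction, not from the paper): a sinusoidal scalar encoding gives neighbouring integers similar vectors, while random embeddings do not.

    import numpy as np

    def scalar_encode(n, dim=128, max_val=97):
        """Sinusoidal scalar encoding: nearby integers map to nearby vectors."""
        i = np.arange(dim // 2)
        freqs = 1.0 / (max_val ** (2 * i / dim))
        return np.concatenate([np.sin(n * freqs), np.cos(n * freqs)])

    rng = np.random.default_rng(0)
    random_emb = rng.normal(size=(97, 128))  # random embeddings (my stand-in for the paper's)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Neighbouring values are similar under the scalar encoding...
    print(cos(scalar_encode(41), scalar_encode(42)))  # relatively high
    # ...but essentially uncorrelated under random embeddings.
    print(cos(random_emb[41], random_emb[42]))        # close to zero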

I wonder if the latter could be solved with some form of curriculum (a rough sketch follows this list):

  • have the model learn simple operations first;
  • add a couple more blocks on top of the already trained ones;
  • continue training on the more complex equations.
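A rough sketch of that curriculum idea (my own naming and hyperparameters, not a tested recipe): train a small stack on the easy operations, then freeze it, stack extra blocks on top, and continue on the harder equation.

    import torch
    import torch.nn as nn

    def make_block(width=128, heads=4):
        return nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)

    class StackedModel(nn.Module):
        def __init__(self, vocab=99, width=128, n_blocks=2):  # 97 residues + a couple of special tokens (assumption)
            super().__init__()
            self.embed = nn.Embedding(vocab, width)
            self.blocks = nn.ModuleList(make_block(width) for _ in range(n_blocks))
            self.head = nn.Linear(width, vocab)

        def add_blocks(self, n=2, freeze_existing=True):
            if freeze_existing:
                for p in self.parameters():
                    p.requires_grad_(False)
            self.blocks.extend(make_block() for _ in range(n))  # new blocks stay trainable
            for p in self.head.parameters():                    # let the head adapt to the new blocks
                p.requires_grad_(True)

        def forward(self, tokens):
            h = self.embed(tokens)
            for block in self.blocks:
                h = block(h)
            return self.head(h[:, -1])  # predict the answer from the last position

    model = StackedModel()
    # ... train on simple operations like (x + y) % 97 ...
    model.add_blocks(n=2)   # grow the stack, keeping earlier blocks frozen
    # ... continue training on (x**3 + x*y**2 + y) % 97 ...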

Even better would be to rethink the transformer metaphor: instead of a simple, very long “ladder” of blocks, use swappable blocks plus recursion, driven by a “router”.
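A very loose toy of that idea (purely illustrative, not from any paper): a small set of shared blocks is applied recursively for a fixed number of steps, and a router decides how to mix them at each step.

    import torch
    import torch.nn as nn

    class RoutedRecursiveEncoder(nn.Module):
        def __init__(self, width=128, n_blocks=4, n_steps=8, heads=4):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
                for _ in range(n_blocks)
            )
            self.router = nn.Linear(width, n_blocks)  # scores each block from the mean state
            self.n_steps = n_steps

        def forward(self, h):
            for _ in range(self.n_steps):
                # Soft routing: mix the block outputs by the router's weights.
                weights = torch.softmax(self.router(h.mean(dim=1)), dim=-1)           # (batch, n_blocks)
                outputs = torch.stack([block(h) for block in self.blocks], dim=1)     # (batch, n_blocks, seq, width)
                h = (weights[:, :, None, None] * outputs).sum(dim=1)
            return h

    enc = RoutedRecursiveEncoder()
    x = torch.randn(2, 6, 128)   # (batch, seq, width)
    print(enc(x).shape)          # torch.Size([2, 6, 128])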

Which slowly leads me towards the hive of micro agents concept. I know, I’m biased towards that idea.
