Regarding your dataset generator:
- I haven’t seen it mentioned in the paper that x and y should be smaller than 97. Of course, learning over larger numbers would be even more difficult.
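For reference, here is a minimal sketch of how I understand the generator (the names and structure are my own, not the paper’s code), which makes the x, y < 97 constraint explicit:

```python
# Minimal sketch of the modular-arithmetic dataset (my own naming, not the paper's code).
# Both operands are drawn from 0..96, i.e. always smaller than the modulus 97.
P = 97

def make_dataset(op=lambda x, y: (x + y) % P):
    # x, y and the result all live in 0..P-1, so a vocabulary of P number tokens
    # (plus operator/equals tokens) covers everything the model ever sees.
    return [(x, y, op(x, y)) for x in range(P) for y in range(P)]

data = make_dataset()
print(len(data))   # 97 * 97 = 9409 equations
print(data[:3])    # [(0, 0, 0), (0, 1, 1), (0, 2, 2)]

# The same generator works for harder targets, e.g.:
# make_dataset(op=lambda x, y: (x**3 + x * y**2 + y) % P)
```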
A few more observations on the paper:
- do I understand correctly that the token embeddings for x and y were random? Using a similarity-preserving representation, like a scalar encoder, should make it much easier to extrapolate. (A sketch of what I mean follows this list.)
- the model is indeed small (~400k parameters), with only two transformer blocks (width 128) stacked on top of each other, whereas typical transformers stack dozens of blocks at widths of 1k–10k. (A rough parameter-count check also follows this list.)
- it failed to generalize on more complex equations like
(x**3 + x*y**2 + y) % 97
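On the embedding point: by a scalar encoder I mean something like the following (purely hypothetical, not from the paper): map each number to a vector whose geometry reflects numeric closeness, instead of learning a randomly initialized table.

```python
import numpy as np

P, D = 97, 128  # modulus and embedding width (width assumed to match the model)

def random_embeddings(seed=0):
    # My understanding of the paper's setup: a learned table, randomly initialized,
    # so 3 and 4 start out no closer to each other than 3 and 90.
    return np.random.default_rng(seed).normal(size=(P, D))

def scalar_embeddings():
    # Similarity-preserving alternative: sinusoidal features of the value itself,
    # so numbers that are close (modulo P) get close vectors without any training.
    freqs = np.arange(1, D // 2 + 1)
    angles = 2 * np.pi * np.arange(P)[:, None] * freqs / P
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

print(scalar_embeddings().shape)  # (97, 128)
```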
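And a back-of-the-envelope check of the quoted size, assuming a standard block with four attention projections and a 4x-wide MLP (the exact layout in the paper may differ):

```python
d_model, n_blocks, vocab = 128, 2, 99  # width, depth, ~97 numbers plus a few special tokens (assumed)

attn = 4 * d_model * d_model           # Q, K, V and output projections
mlp = 2 * d_model * (4 * d_model)      # up- and down-projection of a 4x MLP
per_block = attn + mlp                 # ~197k weights per block (ignoring biases and norms)
embed = vocab * d_model                # token embedding table

print(n_blocks * per_block + embed)    # 405888 -- consistent with the quoted ~400k
```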
I wonder whether that harder equation could be solved with some form of curriculum (a rough sketch follows this list):
- have the model learn the simple operations first;
- add a couple more blocks on top of the already trained ones;
- continue training on the complex equations.
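Concretely, I imagine something like this (PyTorch-style sketch, every name here is hypothetical): train a shallow stack on the simple operations, then grow it and keep training on the harder ones.

```python
import torch.nn as nn

def grow_model(trained_blocks, extra_blocks, freeze_old=True):
    # Curriculum step 2: stack fresh blocks on top of the already-trained ones,
    # optionally freezing the old blocks so the new ones learn to reuse their features.
    if freeze_old:
        for p in trained_blocks.parameters():
            p.requires_grad = False
    return nn.Sequential(trained_blocks, extra_blocks)

# Step 1: train `base` (e.g. two blocks) on simple equations like (x + y) % 97.
# Step 2: model = grow_model(base, two_more_blocks)
# Step 3: continue training `model` on (x**3 + x*y**2 + y) % 97.
```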
Even better would be to rethink the transformer metaphor: from a single, very long “ladder” to swappable blocks plus recursion, driven by a “router” (sketch below).
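Something in this spirit (again purely hypothetical, and with hard argmax routing only for readability; a real version would need a differentiable or RL-style routing signal):

```python
import torch.nn as nn

class RoutedTransformer(nn.Module):
    # A small pool of swappable blocks plus a router that repeatedly picks which
    # block to apply next (or STOP), so depth becomes recursion, not a fixed ladder.
    def __init__(self, blocks, d_model, max_steps=8):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.router = nn.Linear(d_model, len(blocks) + 1)  # last logit means STOP
        self.max_steps = max_steps

    def forward(self, h):                                  # h: (batch, seq, d_model)
        for _ in range(self.max_steps):
            choice = int(self.router(h.mean(dim=(0, 1))).argmax())  # one decision per step (simplification)
            if choice == len(self.blocks):                 # router voted STOP
                break
            h = self.blocks[choice](h)                     # the same block may be chosen again
        return h

# e.g. RoutedTransformer([nn.TransformerEncoderLayer(128, 4, batch_first=True) for _ in range(3)], 128)
```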
That slowly leads me towards the hive-of-micro-agents concept. I know, I’m biased towards that idea.