Regarding your dataset generator:
- I haven’t seen it mentioned in the paper that x and y should be smaller than 97. Of course, learning over larger numbers would be even more difficult.
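For reference, here is a minimal sketch of how I understand the generator (the names and structure are my own, not the paper’s code), which makes the x, y < 97 constraint explicit:

```python
# Minimal sketch of the modular-arithmetic dataset (my own naming, not the paper's code).
# Both operands are drawn from 0..96, i.e. always smaller than the modulus 97.
P = 97

def make_dataset(op=lambda x, y: (x + y) % P):
    # x, y and the result all live in 0..P-1, so a vocabulary of P number tokens
    # (plus operator/equals tokens) covers everything the model ever sees.
    return [(x, y, op(x, y)) for x in range(P) for y in range(P)]

data = make_dataset()
print(len(data))   # 97 * 97 = 9409 equations
print(data[:3])    # [(0, 0, 0), (0, 1, 1), (0, 2, 2)]

# The same generator works for harder targets, e.g.:
# make_dataset(op=lambda x, y: (x**3 + x * y**2 + y) % P)
```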
A few more observations on the paper:
- do I understand correctly that the token embeddings for x and y were random? Using a similarity-preserving representation, like a scalar encoder, should make it much easier to extrapolate. (A sketch of what I mean follows this list.)
- the model is indeed small (~400k parameters), with only two transformer blocks (width 128) stacked on top of each other, whereas typical transformers stack dozens of blocks at widths of 1k–10k. (A rough parameter-count check also follows this list.)
- it failed to generalize on more complex equations like
(x**3 + x*y**2 + y) % 97
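On the embedding point: by a scalar encoder I mean something like the following (purely hypothetical, not from the paper): map each number to a vector whose geometry reflects numeric closeness, instead of learning a randomly initialized table.

```python
import numpy as np

P, D = 97, 128  # modulus and embedding width (width assumed to match the model)

def random_embeddings(seed=0):
    # My understanding of the paper's setup: a learned table, randomly initialized,
    # so 3 and 4 start out no closer to each other than 3 and 90.
    return np.random.default_rng(seed).normal(size=(P, D))

def scalar_embeddings():
    # Similarity-preserving alternative: sinusoidal features of the value itself,
    # so numbers that are close (modulo P) get close vectors without any training.
    freqs = np.arange(1, D // 2 + 1)
    angles = 2 * np.pi * np.arange(P)[:, None] * freqs / P
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

print(scalar_embeddings().shape)  # (97, 128)
```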
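And a back-of-the-envelope check of the quoted size, assuming a standard block with four attention projections and a 4x-wide MLP (the exact layout in the paper may differ):

```python
d_model, n_blocks, vocab = 128, 2, 99  # width, depth, ~97 numbers plus a few special tokens (assumed)

attn = 4 * d_model * d_model           # Q, K, V and output projections
mlp = 2 * d_model * (4 * d_model)      # up- and down-projection of a 4x MLP
per_block = attn + mlp                 # ~197k weights per block (ignoring biases and norms)
embed = vocab * d_model                # token embedding table

print(n_blocks * per_block + embed)    # 405888 -- consistent with the quoted ~400k
```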
I wonder whether that harder equation could be solved with some form of curriculum (a rough sketch follows this list):
- have the model learn the simple operations first;
- add a couple more blocks on top of the already trained ones;
- continue training on the complex equations.
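Concretely, I imagine something like this (PyTorch-style sketch, every name here is hypothetical): train a shallow stack on the simple operations, then grow it and keep training on the harder ones.

```python
import torch.nn as nn

def grow_model(trained_blocks, extra_blocks, freeze_old=True):
    # Curriculum step 2: stack fresh blocks on top of the already-trained ones,
    # optionally freezing the old blocks so the new ones learn to reuse their features.
    if freeze_old:
        for p in trained_blocks.parameters():
            p.requires_grad = False
    return nn.Sequential(trained_blocks, extra_blocks)

# Step 1: train `base` (e.g. two blocks) on simple equations like (x + y) % 97.
# Step 2: model = grow_model(base, two_more_blocks)
# Step 3: continue training `model` on (x**3 + x*y**2 + y) % 97.
```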
Even better would be to rethink the transformer metaphor: from a single, very long “ladder” to swappable blocks plus recursion, driven by a “router” (sketch below).
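Something in this spirit (again purely hypothetical, and with hard argmax routing only for readability; a real version would need a differentiable or RL-style routing signal):

```python
import torch.nn as nn

class RoutedTransformer(nn.Module):
    # A small pool of swappable blocks plus a router that repeatedly picks which
    # block to apply next (or STOP), so depth becomes recursion, not a fixed ladder.
    def __init__(self, blocks, d_model, max_steps=8):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.router = nn.Linear(d_model, len(blocks) + 1)  # last logit means STOP
        self.max_steps = max_steps

    def forward(self, h):                                  # h: (batch, seq, d_model)
        for _ in range(self.max_steps):
            choice = int(self.router(h.mean(dim=(0, 1))).argmax())  # one decision per step (simplification)
            if choice == len(self.blocks):                 # router voted STOP
                break
            h = self.blocks[choice](h)                     # the same block may be chosen again
        return h

# e.g. RoutedTransformer([nn.TransformerEncoderLayer(128, 4, batch_first=True) for _ in range(3)], 128)
```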
That slowly leads me towards the hive-of-micro-agents concept. I know, I’m biased towards that idea.