Like many people, I’ve been impressed by ChatGPT. After obsessing over how transformers work, I’ve gradually realised that the principles mirror those we learned from the earlier work of researchers such as Hopfield and Kanerva. I was delighted when I discovered videos and papers that confirmed this (I wasn’t searching to satisfy confirmation bias; they just popped up):
Hopfield Networks is All You Need
Attention Approximates Sparse Distributed Memory
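To make the connection concrete, here is a minimal numpy sketch of the equivalence those works describe: one update of a modern (continuous) Hopfield network is exactly one step of softmax attention, with the stored patterns playing the role of both keys and values. The sizes, noise level and variable names below are my own made-up choices for illustration, not anything from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                        # pattern dimension, number of stored patterns

X = rng.standard_normal((n, d))      # stored patterns as rows: these are the "memories"
query = X[3] + 0.3 * rng.standard_normal(d)   # a noisy cue for pattern 3
beta = 1.0 / np.sqrt(d)              # inverse temperature, same role as 1/sqrt(d_k)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Modern Hopfield update: xi_new = X^T softmax(beta * X xi).
# This is one step of softmax attention with the cue as the query and the
# stored patterns serving as both keys and values.
weights = softmax(beta * X @ query)  # attention weights over the stored patterns
retrieved = weights @ X              # weighted sum of memories = attention output

print("best-matching stored pattern:", int(np.argmax(weights)))   # expect 3
cos = retrieved @ X[3] / (np.linalg.norm(retrieved) * np.linalg.norm(X[3]))
print("cosine similarity to the clean pattern:", round(float(cos), 3))
```

Running this, the noisy cue snaps back to the stored pattern in a single step, which is the associative-memory reading of what an attention head does.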
We know that the cortex, cerebellum, etc. rely on Hebbian learning rules, and that neural activity in them is sparse. So why is modern deep learning research so obsessed with dense representations trained with back-propagation? This is an important question because transformers such as GPT-3 are massive, slow and super expensive. The brain as a whole consumes something like 0.002% of the energy that GPT-3 does. These technologies are awesome, but they’re also disappointing because of their extreme bloat and expense.
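Since I keep gesturing at “Hebbian learning with sparse activity”, here is a toy sketch of what I mean, not any particular published model: sparse binary patterns stored with a clipped Hebbian (Willshaw-style) outer-product rule, then recalled from a partial cue. All the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, n_patterns = 256, 8, 20      # neurons, active units per pattern, patterns stored

# Sparse binary patterns: only k of n units are active in each one
patterns = np.zeros((n_patterns, n), dtype=np.uint8)
for p in patterns:
    p[rng.choice(n, size=k, replace=False)] = 1

# Hebbian rule: strengthen connections between co-active units
# ("fire together, wire together"), clipped to 0/1 as in a Willshaw net
W = np.zeros((n, n), dtype=np.uint8)
for p in patterns:
    W |= np.outer(p, p)

# Recall from a partial cue: half of pattern 0's active units
cue = patterns[0].copy()
cue[np.flatnonzero(cue)[: k // 2]] = 0             # knock out half the active bits

drive = W @ cue                                    # input each unit receives from the cue
recalled = (drive >= cue.sum()).astype(np.uint8)   # fire only if connected to every cue unit

print("recovered", int(recalled @ patterns[0]), "of", k, "active units")
print("spurious active units:", int(recalled.sum()) - int(recalled @ patterns[0]))
```

There is no back-propagation anywhere: storage is a local outer product and recall is one matrix-vector product plus a threshold, and the sparsity is what keeps the spurious activations down.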
Although hardware and cloud companies enjoy people using their dedicated GPUs/TPUs to train and run dense representations with back-propagation, there should really be more focus on sparse representations trained with Hebbian learning. We could potentially get something like GPT-3 running fast on CPUs while even extending d_model. Although we have companies such as NeuralMagic that can sparsify the weight matrices, I feel the paradigm needs to be pruned back to first principles - a throwback to the beautiful ideas of the early researchers. The relatively recent papers above suggest this is possible. “It turns out they were developing Hopfield networks without realising it” - paraphrasing Sepp Hochreiter on transformer researchers.
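To give a feel for why sparsity helps on a CPU, here is a rough benchmark sketch. It uses scipy’s CSR format as a generic stand-in (not NeuralMagic’s actual engine), and the layer width and 90% pruning level are made-up numbers; the exact speedup you see depends on your BLAS and the sparsity pattern.

```python
import time
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
d_model, density = 4096, 0.1                 # hypothetical layer width; 90% of weights pruned

W_dense = rng.standard_normal((d_model, d_model))
W_dense[rng.random((d_model, d_model)) >= density] = 0.0   # prune 90% of the weights
W_csr = sparse.csr_matrix(W_dense)           # CSR stores only the surviving 10%
x = rng.standard_normal(d_model)

def bench(fn, reps=50):
    fn()                                     # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

print(f"dense  matvec: {bench(lambda: W_dense @ x) * 1e3:.2f} ms")
print(f"sparse matvec: {bench(lambda: W_csr @ x) * 1e3:.2f} ms")
print(f"stored weights: dense {W_dense.size:,} vs CSR {W_csr.nnz:,}")
```

The point is simply that a pruned layer stores and touches a tenth of the values per matrix-vector product, which is the kind of headroom that makes CPU inference look less crazy.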
This turned into a rant, I think.
Anyway, hello to the folks who are still here from years ago