https://arxiv.org/pdf/2509.26507
This is a paper on BDH (Baby Dragon Hatchling): basically an attempt to replace the Transformer with a scale-free graph of “neuron particles.”
The technical gist:
- No KV Cache: Instead of storing context in a massive memory buffer, it keeps context in synaptic states, which are updated with local Hebbian rules during inference.
- Sparsity: It operates at roughly 5% activation. Everything is sparse and positive-only, which keeps it closer to SDRs (Sparse Distributed Representations) than to traditional dense LLM vectors.
- Graph Dynamics: It’s structured as a graph rather than a stack of layers, running an “integrate-and-fire” cycle (Firing → Competition → Update → Transmission); a toy sketch of one such step follows this list.
- Scaling: They managed to hit GPT-2 performance levels at 1B parameters. That’s the part that actually matters: a biologically plausible model that doesn’t fall apart at scale.
- Interpretability: Because of the sparsity and local rules, the authors claim “monosemanticity”: you can basically trace a concept to a specific physical path in the graph rather than to a high-dimensional mystery.
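To make the synaptic-state / integrate-and-fire story concrete, here is a minimal NumPy toy of one step. This is not the paper’s actual math: the random graph, the ~5% top-k competition, and the outer-product Hebbian rule are all illustrative assumptions standing in for BDH’s specific update rules.

```python
# Toy sketch of one Firing -> Competition -> Update -> Transmission step.
# NOT the paper's equations: the random graph, top-k competition, and
# outer-product Hebbian rule below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000            # neuron count
sparsity = 0.05      # keep roughly 5% of neurons active

# Slow weights (the learned graph) and a fast synaptic state that plays
# the role a KV cache plays in a Transformer.
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
S = np.zeros((n, n))

def step(x_prev, ext_input, W, S, lr=0.1):
    # Firing: integrate recurrent + synaptic-state + external input,
    # keeping only the positive part (activations are non-negative).
    drive = W @ x_prev + S @ x_prev + ext_input
    fired = np.maximum(drive, 0.0)

    # Competition: winner-take-most, keep only the top ~5% of activations.
    k = max(1, int(sparsity * len(fired)))
    cutoff = np.partition(fired, -k)[-k]
    active = np.where(fired >= cutoff, fired, 0.0)

    # Update: local Hebbian rule -- co-active neurons strengthen the synapse
    # between them. This in-place update is where the context accumulates.
    S += lr * np.outer(active, x_prev)

    # Transmission: the sparse positive vector is what propagates next step.
    return active

x = np.zeros(n)
for _ in range(10):                      # feed ten random "token" inputs
    x = step(x, rng.normal(size=n), W, S)
```

The point of the sketch is just the division of labor: W stays fixed at inference time, while all context lives in S and is built purely from local rules.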
They’ve got a BDH-GPU implementation that maps these graph interactions into linear algebra kernels so it actually runs on current hardware.
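For intuition on what that mapping looks like, here is a hedged sketch (my own illustration, not the authors’ kernel): the per-synapse state becomes a dense matrix, each token contributes a rank-1 outer-product update, and readout is a matrix-vector product, so the whole recurrence reduces to standard matmul-style tensor ops. The shapes, names, and ReLU/top-k details are assumptions.

```python
# Illustrative only: how a synaptic-state recurrence can be phrased as
# plain tensor ops (matmuls + outer products) instead of per-edge graph
# message passing. Shapes, names, and the update rule are assumptions.
import torch

def state_recurrence(tokens, W_in, W_out, sparsity=0.05):
    """tokens: (seq_len, d_model) -> outputs: (seq_len, d_model)."""
    n = W_in.shape[0]                  # neuron count
    S = torch.zeros(n, n)              # synaptic state (stands in for a KV cache)
    k = max(1, int(sparsity * n))
    outs = []
    for x in tokens:                   # recurrent over tokens, no attention matrix
        a = torch.relu(W_in @ x)                    # positive-only firing
        a = a * (a >= a.topk(k).values[-1])         # keep top ~5%
        y = S @ a                                   # read from the state
        S = S + torch.outer(a, a)                   # Hebbian rank-1 update
        outs.append(W_out @ (a + y))
    return torch.stack(outs)

# Example shapes: 16 tokens, 64-dim embeddings, 512 "neurons" (arbitrary).
out = state_recurrence(torch.randn(16, 64), torch.randn(512, 64), torch.randn(64, 512))
```

Because every step is a matmul or an outer product, the loop can be batched and fused into ordinary GPU kernels, which is presumably the whole point of the BDH-GPU reformulation.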
The “thermodynamic limit” they discuss is what keeps the local updates from diverging/exploding as you scale past 1B parameters.
Podcast on this