BDH (Baby Dragon Hatchling)

https://arxiv.org/pdf/2509.26507

This paper introduces BDH (Baby Dragon Hatchling). It’s basically an attempt to replace the Transformer bottleneck with a scale-free graph of “neuron particles.”

The technical gist:

  • No KV Cache: Instead of storing context in a massive memory buffer, it keeps context in synaptic states and uses local Hebbian rules to update them during inference.

  • Sparsity: It operates at ~5% activation. Everything is sparse and positive-only, which keeps it closer to SDRs (Sparse Distributed Representations) than traditional dense LLM vectors.

  • Graph Dynamics: It’s structured as a graph rather than a stack of layers, and it runs an “integrate-and-fire” cycle (Firing → Competition → Update → Transmission); see the sketch after this list.

  • Scaling: They managed to hit GPT-2 performance levels at 1B parameters. That’s the part that actually matters: it’s a biologically plausible model that doesn’t fall apart at scale.

  • Interpretability: Because of the sparsity and local rules, the authors claim “monosemanticity.” You can basically trace a concept to a specific physical path in the graph rather than a high-dimensional mystery.

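To make that cycle concrete, here’s a minimal NumPy sketch of one step as I understand it: firing with positive-only activations, top-k competition to enforce roughly 5% sparsity, a local Hebbian update of a fast synaptic state, and transmission over the graph. The names (`W`, `S`, `step`) and the rates are my own placeholders, not the paper’s actual dynamics or kernels.

```python
import numpy as np

# Hypothetical sizes and rates -- illustration only, not from the paper.
N_NEURONS = 1000          # neuron "particles" in the graph
SPARSITY = 0.05           # ~5% of neurons active per step
ETA, DECAY = 0.01, 0.99   # Hebbian learning rate and state decay

rng = np.random.default_rng(0)
W = rng.normal(0, 1 / np.sqrt(N_NEURONS), (N_NEURONS, N_NEURONS))  # static graph weights
S = np.zeros((N_NEURONS, N_NEURONS))                               # fast synaptic state (the KV-cache substitute)

def step(x, S):
    """One Firing -> Competition -> Update -> Transmission cycle."""
    # Firing: positive-only activations.
    a = np.maximum(0.0, x)

    # Competition: keep only the top ~5% of activations, zero the rest.
    k = int(SPARSITY * N_NEURONS)
    thresh = np.partition(a, -k)[-k]
    a = np.where(a >= thresh, a, 0.0)

    # Update: local Hebbian rule -- co-active neuron pairs strengthen their synapse.
    S = DECAY * S + ETA * np.outer(a, a)

    # Transmission: signal flows over the static weights plus the learned fast state.
    y = (W + S) @ a
    return y, S

x = rng.normal(size=N_NEURONS)
for _ in range(10):
    x, S = step(x, S)
```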
They’ve got a BDH-GPU implementation that maps these graph interactions into linear algebra kernels so it actually runs on current hardware.
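As far as I can tell, the mapping is similar in spirit to linear attention: the Hebbian write is a rank-1 outer product and the read-out is a matrix-vector product, so context lives in a fixed-size state matrix instead of a growing KV cache, and the whole thing vectorizes well on a GPU. A rough sketch of that reading (hypothetical function name and shapes, not the actual BDH-GPU code):

```python
import numpy as np

def linear_state_scan(K, V, Q, decay=0.99):
    """Process a sequence with a recurrent synaptic-state matrix S.

    K, V, Q: (seq_len, d) arrays standing in for the per-token signals
    that write to and read from the fast synaptic state.
    """
    seq_len, d = K.shape
    S = np.zeros((d, d))
    out = np.empty_like(Q)
    for t in range(seq_len):
        S = decay * S + np.outer(K[t], V[t])   # Hebbian write: rank-1 update
        out[t] = S.T @ Q[t]                    # read: a single matvec, no KV cache
    return out
```

The point is that the per-token state update and read-out are plain matrix operations, which is what lets the graph dynamics run on current hardware.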

The “Thermodynamic Limit” they mention actually prevents the local updates from diverging/exploding when you move past 1B parameters.

Podcast on this


Very interesting and promising development. I found it curious that you can just concatenate two of these models to produce a bigger one that combines the knowledge of both parts into a single model.
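If I read the merging trick right, it amounts to placing the two neuron graphs side by side: the neuron-to-neuron matrices combine block-diagonally and the embedding projections are stacked along the neuron axis. A toy sketch of that interpretation (hypothetical names and shapes, not the authors’ procedure):

```python
import numpy as np

def concat_models(W_a, W_b, E_a, E_b):
    """Combine two BDH-style models into one wider model.

    W_a, W_b: (n_a, n_a) and (n_b, n_b) neuron-to-neuron weights.
    E_a, E_b: (d, n_a) and (d, n_b) embedding-to-neuron projections
    (assuming both models share the same embedding width d).
    """
    n_a, n_b = W_a.shape[0], W_b.shape[0]
    # Block-diagonal graph: no edges between the two original populations.
    W = np.block([[W_a, np.zeros((n_a, n_b))],
                  [np.zeros((n_b, n_a)), W_b]])
    # Inputs fan out to both populations; the merged model has n_a + n_b neurons.
    E = np.concatenate([E_a, E_b], axis=1)
    return W, E
```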


They seem to add a limited attempt at sparsity and a beginning of continuous learning (after classic pre-training with backpropagation), but they still use point neurons.

Thanks for the post @MTIzNDU2Nzg5.
