What is next?

The problem with HTM is that it abstracts away too many details. Simplifying the model makes it easier to understand how it works, but it also removes many interesting and important details. HTM is a tool for learning and now that we have mastered it, it is time to move on to more advanced tools.

The obvious next step is to move on to conductance-based models. Their defining feature is that they model the electricity inside of each neuron. Conductance-based models are much closer to the biological reality, and so new scientific data can be incorporated directly into these models without abstracting away any important details. Furthermore, conductance-based models are modular and can be composed together very easily because they all operate on the same underlying electrical and chemical principles.
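To make "modeling the electricity inside each neuron" concrete, here is a minimal single-compartment Hodgkin-Huxley sketch in plain Python (not NEURON itself; the parameters are the standard squid-axon values, and the Euler integration is the simplest choice, not the most accurate):

```python
# Minimal single-compartment Hodgkin-Huxley model (standard squid-axon
# parameters), integrated with forward Euler. Illustrative only.
import math

C = 1.0                                  # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3        # max conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387    # reversal potentials, mV

def vtrap(x, y):
    """x / (1 - exp(-x/y)), safe at x == 0."""
    return y if abs(x) < 1e-9 else x / (1.0 - math.exp(-x / y))

def simulate(i_ext, t_stop=50.0, dt=0.01):
    """Return the spike count for a constant current i_ext (uA/cm^2)."""
    v, m, h, n = -65.0, 0.0529, 0.5961, 0.3177   # resting steady state
    spikes, above = 0, False
    for _ in range(int(t_stop / dt)):
        # Channel gating kinetics (rate constants in 1/ms).
        am = 0.1 * vtrap(v + 40.0, 10.0)
        bm = 4.0 * math.exp(-(v + 65.0) / 18.0)
        ah = 0.07 * math.exp(-(v + 65.0) / 20.0)
        bh = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
        an = 0.01 * vtrap(v + 55.0, 10.0)
        bn = 0.125 * math.exp(-(v + 65.0) / 80.0)
        m += dt * (am * (1.0 - m) - bm * m)
        h += dt * (ah * (1.0 - h) - bh * h)
        n += dt * (an * (1.0 - n) - bn * n)
        # Ionic currents through the three conductances.
        i_ion = (G_NA * m**3 * h * (v - E_NA)
                 + G_K * n**4 * (v - E_K)
                 + G_L * (v - E_L))
        v += dt * (i_ext - i_ion) / C
        if v > 0.0 and not above:
            spikes += 1                  # count upward crossings of 0 mV
        above = v > 0.0
    return spikes
```

With a suprathreshold current (e.g. `simulate(10.0)`) the model fires action potentials; with no current it sits quietly at rest. Nothing here is a "spike" abstraction: the upstroke and downstroke emerge from the sodium and potassium conductances themselves, which is exactly the detail HTM abstracts away.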

There are several existing open-source conductance-based simulators: GENESIS, BRIAN, and NEURON. In my opinion, NEURON is the best of these options. It has the largest uptake within the neuroscience research community and the developers are still working to improve it. One of the primary objectives for the developers is to maintain compatibility with ModelDB, which is a large collection of published neuroscience experiments.

As a community I think we should embrace conductance-based simulators.
I’d like for us to do two things in particular:

  1. We should build new HTM models using conductance-based simulators. This has actually been done before, but we should all do it a few more times regardless.
    See: Sequence learning, prediction, and replay in networks of spiking neurons
  2. We should work to improve the simulation software itself. Most of this software is open source and was written in the '90s by scientists. Needless to say, it could do with some refurbishment. We, the people on this forum, are good with computers, and if we tried we could really make a mark on these projects.

The math of random projections for real-world neural systems. The math of information storage in weighted sums. The central limit theorem applied to weighted sums, and other linear-algebra topics relevant to neural networks.
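On the last of those topics, the central limit theorem is easy to check numerically: a weighted sum of many independent inputs is approximately Gaussian with a mean and variance you can predict in closed form. A small sketch (the input model, binary "spikes" with p = 0.5, is my own illustrative choice):

```python
# Numerical check of the CLT applied to a weighted sum: with many
# independent 0/1 inputs, the sum w . x is approximately Gaussian with
# mean p*sum(w) and variance p*(1-p)*sum(w_i^2).
import random
import statistics

random.seed(42)
N_INPUTS, N_SAMPLES = 1000, 2000
w = [random.gauss(0.0, 1.0) for _ in range(N_INPUTS)]

def weighted_sum():
    # Each input is an independent 0/1 "spike" with p = 0.5.
    x = [random.random() < 0.5 for _ in range(N_INPUTS)]
    return sum(wi for wi, xi in zip(w, x) if xi)

samples = [weighted_sum() for _ in range(N_SAMPLES)]

# CLT predictions for p = 0.5.
pred_mean = 0.5 * sum(w)
pred_var = 0.25 * sum(wi * wi for wi in w)
emp_mean = statistics.fmean(samples)
emp_var = statistics.pvariance(samples)
```

The empirical mean and variance land very close to the predictions, which is the starting point for analyzing how much information a single weighted sum can store.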

The DL world seems to have accepted backpropagation as its level-1 architecture, and it uses this architecture to create model components for building more sophisticated models such as LLMs. Judging by DL advancements today, backpropagation-based learning has proved that it can scale as parameters are increased and, most importantly, that it works with any input as long as it's differentiable.

For HTM systems, I think that we need to come up with a level-1 architecture that we can extend to build more sophisticated models. What is the status of this architecture? Are we still looking for a contender for backpropagation that is biologically constrained?

I'm interested in looking back and re-evaluating why HTM systems don't scale. My first thought is that we don't yet have an algorithm that updates model weights with fine-grained values or that mimics gradients. I don't know whether the brain is computing these gradients, though.

@dmac In the path you’ve mentioned above, do you think that will progress HTM system as a level 1 architecture?

When I say level 1, it's similar to a computer's architecture: level 1 is the processor plus memory, etc., and we build programs on top of it.


I agree. I don't think the brain is computing gradients. Though I do think the solution is seeking predictive energy minima. So the same goal as computing gradients, just a different way of finding those minima.

Instead of computing gradients, I think the brain finds predictive energy minima as network resonances using oscillations.

No need to update model weights with fine-grained values. And with the additional benefit that the energy minima can vary dynamically. Perhaps explaining why LLMs have such enormous parameter blow outs.

I outlined what I think is the appropriate contrast here:


Please, don’t be distracted by the shiny object.

Scientists have detailed models describing almost every aspect of the brain. Although we don’t know everything, we do know enough to do some pretty interesting things. However, neuroscience uses conductance-based models and so in order to directly use their results we also need to use conductance-based models.


Do you know a good summary of how these are functionally different from LIF?


LIF models are a type of conductance-based model.
Typically, LIF models simplify the action potentials into abstract "spikes",
and often they replace the units of voltage with an abstract unit that ranges between 0 and 1.
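A minimal sketch may make the abstraction concrete; the parameters below are illustrative choices, not values from the thread:

```python
# Minimal leaky integrate-and-fire neuron in abstract units: "voltage"
# runs from 0 to 1, and the spike is just a recorded event plus a
# reset, not a modeled action potential.
def simulate_lif(drive, t_stop=100.0, dt=0.1, tau=10.0,
                 v_thresh=1.0, v_reset=0.0):
    """Return spike times (ms) for a constant input drive."""
    v, t, spikes = 0.0, 0.0, []
    while t < t_stop:
        v += (dt / tau) * (drive - v)   # leak toward 0, charge toward drive
        if v >= v_thresh:
            spikes.append(t)            # abstract "spike": record and reset
            v = v_reset
        t += dt
    return spikes
```

With `drive` above threshold (e.g. 1.5) the neuron fires periodically; below threshold (e.g. 0.8) the voltage saturates under 1 and it never fires. Compare this to the Hodgkin-Huxley picture: the spike shape, refractoriness, and channel dynamics have all been collapsed into a threshold and a reset.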


I have played around with a LIF-type approach (bn+ parameters) with a variation which I think is what is missing from prior models. All of the prior models assume that the internal temporal model is structured/represented with the same temporal modality as the external world. Using that external relativity, I believe, creates significant issues with hierarchical concept creation where concept sequences with a decay are involved.

I think that the temporal aspect may be critical, because I have tried languages with a self-learning time shift and it splits words out into attention/focus, gather-type sense (noun), verb (fan-out concept), and more complex temporal relativity (i.e. time-shifting the sensory input sequence). This becomes a bit abstract, because the shift can apply to the inferred associated hierarchical concepts that an input-sequence fragment represents (and has previously learnt), and not to the source input directly as we might understand it. We buffer time and throw away raw sensory triggers, mapping them onto the temporally smaller complex hierarchical concepts we have previously learnt. This buffering effectively alters what we learn as a sequence and allows a lot of raw sensory memory to be thrown away (decayed, then deleted) without degradation in learning.

It does this well with English and French, where the verb-object-noun sequencing is different. I'm still experimenting, as I don't yet have a definitive answer, although something fits because it splits the language into the same groups regardless of the source verb-object-noun sequence.

Sleep spindle waves could then allow for concept creation by bridging points in temporal relativity that are otherwise too distant (fire-together decay, or disparate sequence fragments), but only if the internal representation has a different temporal representation.

This, I think, is the key aspect and part of what has been missing since the '90s. We wire together and manipulate time. If you model sequences in external time, it does not work.

With the LIF-type approach, learning is then just an ongoing sequence of new sensory temporal events added to the memory pool, with the equivalent of sleep spindles to learn new hierarchical concepts (new abstract concepts / new dendrite formation).

The process ends up very sparse in compute, unlike the compute implications for LLMs as measured by the Cerebras CS-2 with GPT.

Source : Cerebras Smashes AI Wide Open, Countering Hypocrites

Still learning, experimenting and convinced this is part of the key. Internal temporal relativity.


I don’t fully understand what you did. It sounds like you encode sequences, and then rearrange them with a form of pooling during a sleep phase??

And you are doing this encoding of sequences with a spiking model? That sounds different to HTM with its chaining of column cells? Or are you training the same chaining of column cells, but doing it with spikes?

Does a form of “sleep spindle waves” then take sub-sequences encoded in this way and rearrange them?

Why waves?


I’m looking at it a little differently. I’m more hoping this will help trigger thoughts outside the box, so to say. I’ll try and explain.

Input sequences are assigned two time references (64-bit int, ms resolution), with the first used so that updates can be calculated relative to the elapsed time since the last calculation (this helps a lot with sparse compute). I.e. the brain has its own internal awareness of time as part of the way the system works.
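The "update relative to elapsed time" trick can be sketched like this (the class name, decay form, and time constant are my own illustrative choices, not the poster's implementation):

```python
# Event-driven ("lazy") decay: instead of decaying every synapse on
# every tick, store a last-touched timestamp and apply all the
# accumulated decay in one step when the synapse is next used.
import math

class LazySynapse:
    def __init__(self, strength, now_ms, tau_ms=500.0):
        self.strength = strength
        self.last_ms = now_ms        # the 64-bit ms timestamp in the post
        self.tau_ms = tau_ms

    def value(self, now_ms):
        # One exponential covers any number of skipped ticks.
        elapsed = now_ms - self.last_ms
        self.strength *= math.exp(-elapsed / self.tau_ms)
        self.last_ms = now_ms
        return self.strength
```

Because the exponential composes (decaying 500 ms twice equals decaying 1000 ms once), the result is identical to per-tick updating, but compute is only spent on synapses that actually participate in an event. That is where the sparseness comes from.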

The input sequences form micro-fragments of temporal sequences that contain connectivity to and from whatever sensory inputs were triggered. This way the sensory stream is agnostic as to whatever it represents: symbol, audio frequency, sensory touch, etc. All the senses are normalised to just a point-in-time trigger.

Sensory triggers of concepts (neuron firing) echo to all the other points in temporal fragments that concept exists within (i.e. all the connected synapses for the neuron).

This is where continuous learning automatically connects to the whole memory: new memories are just new micro-sequences, not additions to matrices to cater for a new input, or alterations of all the weights to cater for one change.

The system is not pre-wired on a random basis, as I believe the brain only does this because (a) it can't form new connections that are further than a few µm apart, so it needs a high pre-existing-probability network for potential new connections, and (b) a starting point for a feedback loop with some random connections is needed.

I think following this random pre-wired route is part of the problem rather than the solution. The pre-wired nature creates a massive compute overhead for the learning phase that ends up needing a planet of compute to work.

Theta frequency, I think, is a key part, as it is a sort of definition of the temporal buffer size, or the maximum extent to which a sequence fragment can exist. I.e. you can't have sequences that are the biological equivalent of, say, 400 ms, because the chain of firing would not complete before the next theta cycle starts. All fragments are also shrunk in sleep (or during input) by hierarchical concept construction, or resolved to the equivalent of smaller time fragments. I.e. hierarchical concept creation is the brain's equivalent of a time-compression algorithm.

The second time reference is then a form of relative temporal proximity for the sequence (a proxy for one part of the synapse-strength calculation), relative to any other points that are close in the time domain. You have to think about this from the perspective of memory fragments: when and where in a fragment sequence activation occurs, and then what is in temporal proximity from an STDP perspective.

HTM is sort of the equivalent of one memory fragment, whereas I look at the system as a collection of millions of fragments, i.e. millions of interconnected HTM equivalents.

The typical approach seems to assume that all sensory STDP connections are established over the same duration (say 8 ms), when the brain actually creates various micro-bursts: say, 3 pulses at 100 Hz for one output and 2 pulses at 110 Hz for another. These bursts create the equivalent of a time shift when looked at from a LIF perspective. You then need to be able to change the relative time differential for synapses. HTM assumes a consistent elapsed time per layer, so time is constant for the whole network.

Think of it along the lines of an input sequence "big brown house", where the inputs may ordinarily be buffered internally in short-term memory and replayed at theta frequency, which may then create roughly 10-20 ms between inputs during internal replay cycles. This would allow STDP to create the connections big-brown and brown-house when looked at from a standard approach.
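For readers unfamiliar with the timing dependence being invoked here, a standard pair-based STDP window has an exponential form; the amplitudes and the ~20 ms time constant below are common illustrative choices, not values from this thread:

```python
# Pair-based STDP window: the weight change depends on the interval
# between presynaptic and postsynaptic spikes, decaying exponentially
# with a ~20 ms time constant.
import math

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for dt_ms = t_post - t_pre."""
    if dt_ms >= 0:
        return a_plus * math.exp(-dt_ms / tau_ms)    # pre before post: potentiate
    return -a_minus * math.exp(dt_ms / tau_ms)       # post before pre: depress
```

A 10 ms pre-to-post interval produces a much larger potentiation than a 40 ms one, and reversing the order produces depression, which is why the inter-input intervals during replay matter so much in the argument above.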

Sleep-frequency bursts are double, at around 14 Hz, which would then allow the formation of a big-house STDP connection. The higher frequency bridges points in a temporal sequence, both those that are already connected and those that are not. What's interesting is that the waves propagate across the brain in all directions as a wave, rather than along singular pre-connected path threads. This would allow any activations in close temporal proximity to be bridged (without the sequence being triggered) and the formation of new synapses to points that were previously unconnected.

You have to think about this all in the context of everything being stored as micro-sequences, like the zebra finch memory storing its short song sequence. The connections are limited in temporal duration because theta waves would otherwise run the risk of creating a positive feedback loop, aka a seizure.

From an input perspective, the sensory inputs are buffered, and each sensory input may have a relative temporal offset associated with it: the shift that occurs to the inputs within the sequence after the last active input. This example uses words, but think about it from the aspect of an agnostic sensory input; don't look at the words as such.

i.e. “red and blue make purple” where the sensory input “and” would create a negative temporal offset for blue to bring red and blue closer in temporal proximity for STDP to occur.

I have tried to look at the whole process from a time perspective rather than connections-first. Time defines the connections.

I have missed out a lot of what I think is going on, as it's difficult for me to write it down properly, so apologies if it's not clear.

Does that make sense, trigger any thoughts ?


It still sounds to me like you encode shorter sequences, and then rearrange them in some idea of a sleep cycle.

I understand you to say you use spike-time-dependent-plasticity (STDP) to encode the sequences. I’m not sure how that can work. I’m vaguely familiar with STDP for encoding images as practiced by hardware startup BrainChip, with apparent biological plausibility proposed by neuroscientist Simon Thorpe. But that was coding light intensity with spike time. I think you get a kind of brightness histogram in spike times, with earliest spikes representing the brightest light… To code sequence… Do you mean Hebbian, if two neurons fire in close sequence they synapse?

Looking at it “from a time perspective” is fine. It is certainly natural to language. And I think it generalizes to be of importance more broadly. And that would fit with the “temporal” emphasis in HTM.

I’m not sure how reliable time alone is going to be as a binding parameter though. You talked about “and” creating a “negative temporal offset”. And somehow a higher frequency “sleep” cycle time allowing connections across an original sequence to link “big” and “house” in the sequence “big brown house”. But I’m not sure how much simply replaying the sequence faster can squash the sequence to capture longer distance dependencies. If that is what you mean.

The way I would capture long-distance dependencies like "big" spanning "brown" to connect with "house" is I would form a cluster which contained both "brown house" and "house". Let's say {"house", "brown house"}. So that "big" would become associated with the whole cluster.

You can do that iteratively, so that arbitrarily distant connections can be made.

It’s still time dependent. Because the grouping {“house”, “brown house”} is made based on shared sequence. But it is not locked into a specific time delay.

Similarly “red and blue” can form a cluster with “red”, “blue”… “pink”, “purple”, “indigo” etc. Because they will tend to share sequences. Not an exact time delay. Just a lot of observed (STDP?) shared sequences.
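A toy version of that clustering idea can be sketched by grouping tokens that share immediate contexts; the tiny corpus below is invented purely for illustration:

```python
# Tokens that occur in the same (left, right) contexts share sequences,
# so "blue" and "pink" cluster without depending on any exact time delay.
from collections import defaultdict

sentences = [["red", "and", "blue", "make", "purple"],
             ["red", "and", "pink", "make", "purple"],
             ["blue", "and", "pink", "make", "purple"]]

contexts = defaultdict(set)
for s in sentences:
    for i, tok in enumerate(s):
        left = s[i - 1] if i > 0 else "<s>"
        right = s[i + 1] if i + 1 < len(s) else "</s>"
        contexts[tok].add((left, right))

def shared_contexts(a, b):
    """How many (left, right) contexts two tokens have in common."""
    return len(contexts[a] & contexts[b])
```

Here "blue" and "pink" share the context ("and", "make"), while "blue" and "make" share none, so the colour words group together. Iterating the same idea over clusters, rather than single tokens, is what lets arbitrarily distant connections form.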


Yes to the first part: memory fragments with the HTM equivalent of, for example, 7 layers (say 70 ms).

The sequences are also rearranged (temporally shifted) at the input stage, before they are embedded into long-term memory. This is one aspect I think is missing, and it occurs as part of the short-term memory buffer / active train of thought. Our whole world of language is a sensory derivative built from very long elapsed intervals (of pattern recognition over smaller temporal fragments) per communicated word (a typical 140 words per minute is 500 ms per word, or 2-3 theta cycles per word). Words, as an example, are not primary sensory inputs, and yet the cortex has (concept) columns recognising them, so we know that columns are word-triggered and buffered. The words are buffered before they are fully recognised as to their intended meaning, i.e. the real intended concept. This buffering is where I think the temporal alignment differs from the external experience, to allow STDP to work.

We process time internally differently from how we experience it. At this point, though, transformers with LSTM are the equivalent of an iterative time approach, but they need a moving window to add that context when learning. They need the equivalent of the active train of thought to learn.

This is just an example using words, with longer timings for illustration; replace the words with anything, and the timings with what you might expect in biology.

Theta cycle (7 Hz, with a gap between cycles and a memory sequence of, say, 7 symbols):
A 20 ms STDP window means big and house occur at the edge of 20 ms and are unlikely to create a synapse.

Sleep spindle (say 14 Hz instead of theta's 7 Hz: double the rate):
big -5 ms- brown -5 ms- house
Now big and house are 10 ms apart and STDP creates a synapse.
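That arithmetic can be made explicit. In the sketch below (my own framing of the numbers above), replaying the same 3-item sequence with half the gap brings the big-house interval inside a 20 ms STDP window; the strict `<` treats exactly 20 ms as the edge case that is unlikely to wire:

```python
# Which pairs of a replayed sequence fall inside a 20 ms STDP window,
# at theta-rate replay vs. spindle-rate (double-speed) replay?
WINDOW_MS = 20.0

def pairs_within_window(items, gap_ms):
    """All ordered pairs whose interval is strictly inside the window."""
    times = {item: i * gap_ms for i, item in enumerate(items)}
    return [(a, b) for i, a in enumerate(items)
                   for b in items[i + 1:]
                   if times[b] - times[a] < WINDOW_MS]

seq = ["big", "brown", "house"]
theta = pairs_within_window(seq, 10.0)    # big->house = 20 ms: at the edge
spindle = pairs_within_window(seq, 5.0)   # big->house = 10 ms: inside
```

At theta spacing only the adjacent pairs (big-brown, brown-house) can wire; at spindle spacing the big-house pair joins them, which is exactly the long-distance connection being claimed.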

Then take the sequence and alter the timings between words based on the relative context each word really has.

I think the iterative approach has problems when looking at various verb-noun-adjective sequences, because it misses the right context. It just creates every permutation regardless of validity. With a sentence of, say, 20 words, you end up with a pyramid of iterative constructs that would take 19 cycles to form all permutations, with the final step being the equivalent of the whole sequence.

I know what I'm explaining has flaws and is missing a lot. What I suspect is right is that the inputs have a different internal temporal alignment or representation from our external perception, for higher-level concepts like words. Yet we model strictly in alignment with what we experience.

GPT-4 does make me pause and question what I am doing though.


If you seek feedback:

All I can say is that I’m still unable to resolve anything beyond splitting sequence fragments based on a time signal, and then attempting to build a hierarchy of those fragments based on a different time signal of a “sleep” phase.

No doubt fragments of sequences need to re-order themselves into a hierarchy. And sequence seems fundamental even beyond language.

But I don’t see how a fixed frequency shift in a sleep phase can be the mechanism to produce the complexity of hierarchy we see.


I do not know about conductance-based models in this context. However, this reminded me of Michael Levin's work on non-neural Bio-Electric Networks (BENs). In that paper, they were able to use machine learning to tune a BEN to perform elementary logic gates, and compound logic gates built from the elementary ones.

I might experiment with the examples you gave above in my spare time.


Where should I start reading about “predictive energy minima” or basically the fundamental computation here that mimics gradients? What are its underlying theories or domain of study for example?

If you're familiar with the idea of following gradients, then you're already dealing with energy minima, at least implicitly. That's what those gradients chase. In supervised learning you compute an error between the network and a model; that error is a kind of "energy" of the system. As I recall it's usually explicit in the equations. In unsupervised learning it's even more explicit. That's all you're chasing: some kind of simplification of the system.
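The point in miniature: gradient descent on a squared error is explicit energy minimization. In this one-parameter sketch (my own toy example), E(w) = (target - w*x)^2 is the "energy", and each step moves w downhill on E:

```python
# Gradient descent on a one-parameter squared-error "energy".
def energy(w, x, target):
    return (target - w * x) ** 2

def descend(w, x, target, lr=0.1, steps=50):
    for _ in range(steps):
        grad = -2.0 * x * (target - w * x)   # dE/dw
        w -= lr * grad                       # step downhill on the energy
    return w

w_final = descend(0.0, 1.0, 3.0)   # converges toward w = 3, where E = 0
```

The minimum of E is the same thing the gradient is chasing; the open question in this thread is whether the brain finds such minima by a different route (resonance and oscillation) rather than by computing the gradient itself.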

In a statistical context (which personally I tried to avoid, because I saw a deeper solution in chaos, and didn't want to get trapped in statistical ways of thinking) you might look for a comprehensive treatment in a standard text like Schütze and Manning. I tried to keep my distance from approaches that are purely statistical, but I believe their text has become a standard.

Or just Wikipedia:

Maybe it's familiar to me because physics can be described in this way. The Lagrangian describes all motion as the minimization of the energy of a path, as I recall. (It gets interesting, because the definition of all objects becomes waves, oscillations, over such equations, with possible parallels to indeterminate cognitive "objects". The best treatment of those parallels I recall might be Giuseppe Vitiello: Relations between many-body physics and nonlinear brain dynamics, available on the Internet Archive.)

That last, with Vitiello, might start to hint at parallels between energy minimization and oscillations.

I don’t think you need all that theory to start working on a practical basis. But maybe it flavours the way I present arguments.