Is HTM capable of long-term memory?

Hey, I came across HTM theory recently and I must admit it really resonates with me. I have watched the tutorials on YouTube and I must say they are really cool! To give you a bit more context before I ask my question: I am a software engineer, not a researcher per se, even though I have been reading a lot of neurobiology research papers lately.

Here is my question: I am wondering how long-term memory is supposed to be handled in the HTM model. Perhaps it is not handled yet, in which case I would be glad to hear about potential ideas to move forward. I might try some of them.

I am asking this question because, given the model described in the tutorial, if an input (say, an image of a bridge) is seen only a couple of times initially (and technically stored) and never seen again, I don't see how the model could retain the pattern in the columns if the permanence values are decremented linearly. I understand that some synapses might not be decremented due to a low overlap between the current perception and the original one. But what happens if the perceptions are close together? In that case it is highly likely that the initial perception will vanish. I feel that, given a long enough history of varied inputs, the model could "unexpectedly" unlearn most of what it learned at some point, even though that knowledge might still be worth keeping for the long run.
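To make the worry concrete, here is a toy sketch of the decay mechanism. The constants (threshold, decrement) are invented for illustration, not Numenta's actual defaults: a connected synapse whose permanence is decremented linearly on each mismatched activation crosses the connection threshold after only a handful of updates, and the stored pattern is gone.

```python
# Hypothetical illustration of linear permanence decay (all values invented).
CONNECTED_THRESHOLD = 0.5   # below this, the synapse is disconnected
PERM_DEC = 0.02             # linear decrement per mismatched activation

def decay_until_disconnected(permanence, perm_dec=PERM_DEC,
                             threshold=CONNECTED_THRESHOLD):
    """Count how many mismatched activations it takes for a connected
    synapse to fall below the connection threshold."""
    steps = 0
    while permanence >= threshold:
        # round to sidestep float drift in this toy example
        permanence = round(permanence - perm_dec, 4)
        steps += 1
    return steps

# A synapse learned up to permanence 0.6 survives only a few
# overlapping-but-different inputs before the memory is lost.
print(decay_until_disconnected(0.6))   # prints 6
```

With these made-up numbers, six "close but different" perceptions are enough to erase the blue-bridge synapse entirely, which is exactly the failure mode described above.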

If I had to draw a parallel, it would be like this: I saw a beautiful blue bridge in my country when I was young, and I can still visualize its details when I need to; today I see the Golden Gate in San Francisco. In that case, the old permanence values of the cells mapping to the blue bridge are likely in a good position to be decremented (both are bridges, perhaps around the same size, so the layer which memorized the blue bridge might be selected for an update). However, even though seeing the Golden Gate might affect the perception I initially had of the blue bridge, it should probably not impact much the "long-term memory" I have of it (the color or the location, for instance). Do you see what I mean?

I'm thinking there is, or should be, some kind of regularization that unlearns in a sharper, non-linear manner, to balance the need for integrating new information against forgetting something we still had a good idea about. Or maybe some kind of replay buffer could be used to occasionally replay memories and strengthen the knowledge. That could draw a parallel with dreams and/or internal visualization.
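A minimal sketch of the replay-buffer idea, with everything invented (the class, the capacity, the eviction policy): keep a bounded store of past input patterns (SDRs represented as frozensets of active bits) and occasionally hand one back to the learning step, so rarely-seen memories get refreshed instead of silently decaying.

```python
import random

class ReplayBuffer:
    """Toy replay buffer: stores SDRs, evicts the oldest when full."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.patterns = []

    def store(self, sdr):
        self.patterns.append(frozenset(sdr))
        if len(self.patterns) > self.capacity:
            self.patterns.pop(0)   # drop the oldest memory

    def replay(self, rng=random):
        """Pick a stored pattern to feed back through the learning rule."""
        return rng.choice(self.patterns) if self.patterns else None

buf = ReplayBuffer(capacity=3)
for sdr in ([1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]):
    buf.store(sdr)
# The oldest pattern was evicted; only the 3 most recent remain,
# and replay() returns one of them for re-learning.
print(len(buf.patterns), buf.replay() in buf.patterns)
```

A smarter eviction policy (e.g. keeping patterns proportionally to how "surprising" they were) would map more closely onto the dream/consolidation analogy, but even FIFO shows the shape of the mechanism.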

Any input on that would be greatly appreciated.



I guess every time we look at a thing or event again, or later when we recall it, the synapses fire again and again, so with sufficient interest they become permanent.

Even when we say we have "seen it only once", for the synapses that single occasion may have triggered sufficient activity to become permanent.


The question is whether this knowledge is eventually stored in the synapses themselves or somewhere else. The more I look at it, the more it seems to me that synapses are not meant to store long-term information; but if they do, I don't see how that would be represented in the HTM implementation.

Lately I'm thinking that maybe the increase or decrease of the permanence of each synapse should be modulated by the correctness of the predictions. That would mean the model remains stable as long as it is not surprised by brand-new perceptions. However, when it is surprised a lot, it should adapt accordingly, even if its model of the world has to shift quite a bit and some memory is lost in the process.

In the case of the bridge, that would mean that since both are classified as bridges, this new perception is not much of a surprise, so the decrease of the old synapses should be much smaller in order to keep the information.
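A minimal sketch of this surprise-modulated update, where the function, its name, and all values are my own invention rather than anything from the HTM codebase: scale the permanence decrement by a surprise term in [0, 1], so a well-predicted input (the second bridge) barely erodes old memories, while a truly novel one applies the full decrement.

```python
def permanence_delta(base_dec, surprise):
    """surprise in [0, 1]: 0 = fully predicted input, 1 = total novelty.
    A predicted input produces almost no decrement; a complete surprise
    applies the full decrement (hypothetical rule, not standard HTM)."""
    assert 0.0 <= surprise <= 1.0
    return base_dec * surprise

# Seeing the Golden Gate when "bridge" is already known: no surprise,
# so the blue-bridge synapses are left untouched.
print(permanence_delta(0.04, 0.0))   # prints 0.0
# A half-surprising input erodes them at half strength.
print(permanence_delta(0.04, 0.5))   # prints 0.02
# A truly novel input applies the full decrement and reshapes the model.
print(permanence_delta(0.04, 1.0))   # prints 0.04
```

The interesting design question is where the surprise signal comes from; the bursting/false-prediction statistics discussed further down this thread are one natural candidate.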


Honestly, I think the HTM model is incomplete.

It's a nice idea, but it doesn't do much. It can store sequences, and to answer your question: as long as the activation pattern is sparse enough and well regularized (think boosting or dropout), and as long as the layer has enough capacity, it should essentially never forget an input.

The maximum theoretical storage capacity I ever managed to get for those kinds of networks is roughly 1/3 of the size of the synaptic matrix, measured in bits.

But that's it; the memory item is there… doing absolutely nothing, because the sparsity prevents this memory from even being retrieved normally. And if you can retrieve it often enough then, as you said before, it might be overwritten.

And I don't think an HTM can learn good abstract knowledge, at least not at the primary sensory cortices. It might do better once it gets fed already-abstracted, nicely formatted representations at higher levels in the cortex.

But I think an HTM might indeed be nice as temporary storage for data. If you link it to a hippocampus for indexing, so that chains don't get lost, it may work as a good fast replay system that could train another network at a faster pace (more epochs) than just passively taking in sensory input. I think that's what could be happening during sharp-wave ripples.

Also, this might be unrelated, but after painstakingly trying to get the spatial pooler to learn to represent MNIST digits, I realized it's simply not good enough. Pure boosting isn't enough to counter the Hebbian rule's tendency to completely ignore rarer arrangements of bits, because of its averaging effect. RBMs + boosting work much better for learning sparse distributed encodings.


Did you stack multiple layers of spatial poolers? Did you use temporal memory?
I am curious to know what success rate you managed to get on MNIST with HTM.

I am also trying to work on MNIST, and the first issue I can see is that the first SP layer will always be way too plastic, because its permanences keep moving as new inputs come in. A workaround could be to tweak the plasticity, but if the layer is too static then it will not generalize well, unless each column focuses on one digit, and I don't know how that could plausibly happen in the brain.
Do we know if plasticity is reduced in the first neurons of columns compared to higher levels?


Yes, stacking spatial poolers doesn't do much; it only loses information at each layer.

The major problem I am facing with the spatial pooler is that it is way too lossy at encoding data: it completely ignores some bits and patterns, even when a bit is the most important one for the task, since it has no way of knowing.

I haven't even tried to use the HTM temporal memory with it. Honestly, triadic memory is just better and more efficient most of the time.

The best accuracy I managed at MNIST classification with the spatial pooler was 75%.

And decent reconstruction of the digit was nearly impossible.

But I noticed that it's easy to turn a spatial pooler into a variation of an RBM: all you need to do is generate a reconstruction, subtract the reconstruction from the input, and what you get is a signed error; this error can be used to directly update the synapses.
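A toy sketch of that signed-error update, with all sizes, the seed, and the learning rate invented (and real-valued weights instead of HTM permanences, to keep it short): the k winning columns reconstruct the input from their weights, and the signed error (input minus reconstruction) is applied directly to the active columns' synapses.

```python
import random

random.seed(0)
N_IN, N_COLS, K_ACTIVE, LR = 16, 8, 2, 0.1

# weights[c][i]: connection strength from input bit i to column c
weights = [[random.uniform(0, 1) for _ in range(N_IN)] for _ in range(N_COLS)]

def step(x):
    # 1. columns with the highest overlap with the input win (k-WTA)
    overlaps = [sum(w * xi for w, xi in zip(col, x)) for col in weights]
    active = sorted(range(N_COLS), key=lambda c: -overlaps[c])[:K_ACTIVE]
    # 2. reconstruct the input as the mean of the active columns' weights
    recon = [sum(weights[c][i] for c in active) / K_ACTIVE
             for i in range(N_IN)]
    # 3. the signed error (input minus reconstruction) updates the synapses
    error = [xi - ri for xi, ri in zip(x, recon)]
    for c in active:
        for i in range(N_IN):
            weights[c][i] += LR * error[i]
    return sum(e * e for e in error)   # squared reconstruction error

x = [1.0 if i % 3 == 0 else 0.0 for i in range(N_IN)]
losses = [step(x) for _ in range(50)]
# Repeated presentations drive the reconstruction error toward zero.
print(round(losses[0], 3), round(losses[-1], 6))
```

Because the active columns' weights move toward the input, the same winners keep winning and the reconstruction error shrinks geometrically, which is the "almost lossless reconstruction" behavior described above, at least for this single-pattern toy case.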

I haven't tested this RBM-pooler on MNIST so far, but it can get almost lossless reconstructions of the data, so I'm using it as an encoder for natural image patches. I've had moderate success at image upscaling with it, although it's only slightly better than bicubic interpolation, since it is mainly good at anti-aliasing.


I did not know about triadic memory but it seems very interesting. Thanks for the insight.


It's definitely possible to score 97-98% on MNIST using a spatial pooler, or at least a variation of it, combined with Numenta's SDR classifier.

In my experience, the problem with stacking spatial poolers is that it does not add any information. It does not lose any information, but it doesn't gain any either.


Like… where, the Akashic records? Kidding, but curious whether you have a plausible candidate in mind.

HTM has the concept of permanent synapses: when a learning synapse reaches a permanence threshold it becomes permanent. I haven't delved into Numenta's or the neurology literature to check how biologically justified this is.


I'm glad to see someone else coming to this conclusion as well. I've got a few theories as to how this might work, but I don't think any of them are worth sharing just yet. Instead, I'd like to take a few moments to review some details of my current working theory of how HTM can be used to create what I've dubbed a "behavioral AI". I'd like to preface this with the fact that a lot of the ideas below are unsubstantiated hypotheses or speculation on my part, and that my understanding of HTM and its applications may differ significantly from the common understanding.

The two parts of the HTM architecture are the spatial pooler and the temporal memory algorithms. Let’s first look carefully at the functions of these two pieces, then I’d like to offer an explanation of how they can be used as part of a larger learning system.

The spatial pooler is a very powerful and robust algorithm that acts as a sort of autoencoder, but not the kind that most people are familiar with. Contemporary autoencoders are typically thought of as an ML model that learns to generate encodings of data that preserve all or most of the important features of the data while reducing the encoding footprint. Typically this is done by sending an error signal back through an ANN encoder/decoder combo to allow the network to map the "feature space" of a given dataset; the network will learn over time to prioritize features that correlate closely with the error signal (in this case, the reconstruction loss).

The spatial pooler is a bit different, primarily in that there is no error signal to learn from. Instead, the spatial pooler learns to encode things by essentially creating a sort of "hash map" for things it has never seen before. Its set of basic rules ensures that, over time, inputs will create unique encodings such that inputs with similar features should generate similar encodings. The way I like to imagine this is that each column in the spatial pooler essentially becomes a "feature detector", and each column learns which feature(s) from the input it wants to represent. So on an individual level the columns become flags that say "the thing I've learned to represent is either present or not".

By itself, this algorithm isn't really anything special; most ANN architectures can considerably outperform the spatial pooler. It's really just a basic reinforcement learning algorithm without the key component: temporal memory.
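The "hash map of feature detectors" picture can be sketched in a few lines. Everything here is invented for illustration (sizes, seed, fixed potential pools, no learning or boosting): columns each watch a random subset of the input, and k-winners-take-all inhibition keeps the output sparse, so similar inputs tend to share active columns.

```python
import random

random.seed(42)
N_IN, N_COLS, K_ACTIVE = 64, 32, 4

# each column "watches" a fixed random subset of input bits (its pool)
pools = [set(random.sample(range(N_IN), 16)) for _ in range(N_COLS)]

def encode(active_bits):
    """Overlap scoring + k-winners-take-all inhibition."""
    overlaps = [len(pool & active_bits) for pool in pools]
    return set(sorted(range(N_COLS), key=lambda c: -overlaps[c])[:K_ACTIVE])

a = encode(set(range(0, 20)))     # input A
b = encode(set(range(4, 24)))     # input B: heavily overlaps A
c = encode(set(range(40, 60)))    # input C: disjoint from A
# A typically shares more active columns with the similar input B
# than with the disjoint input C.
print(len(a & b), len(a & c))
```

The real spatial pooler adds permanence learning, boosting, and local inhibition on top of this skeleton, which is what lets columns specialize into the "feature detectors" described above.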

The temporal memory algorithm takes the spatial pooler to the next level (or dimension, I guess) by adding functionality that creates some interesting emergent properties. The first property of this algorithm is given away by its name: it allows the HTM system to learn temporal sequences. It does this through two mechanisms. First, it "granularizes" the feature detectors such that a single feature (column) may be encoded in many different ways (the specific set of neurons that are active in the column). Second, it uses a predictive state in the individual neurons to differentiate similar but distinct sequences. So if the column says "the feature I have learned to detect is present", then the individual neurons in the column say "the feature I've learned to detect is present, and we've seen it before in the context of what we just saw a second ago".

At first glance, these functions don’t actually seem to be all that useful. Even if you can reliably generate an encoding for a specific sequence, you don’t have any way to classify that encoding besides looking at all the previous encodings and trying to use the similarity to find an output that was generated by a similar input. As dmac put it:

Notice the qualifier "combined with numenta's SDR-classifier". The HTM system doesn't do any sort of classification on its own; it only provides a feature encoding for the actual classifier (which, as far as I'm aware, involves a lot of hand-waving with other algorithms that would be just as effective even without HTM).

For the longest time I have been trying to figure out a way around this: how do we make HTM useful for actual agent-based tasks, so that we can get to the next phase of using HTM to build AI agents? So far I have three key ideas that I'm working with to develop something like this.

First is the fundamental idea that most people are expecting too much from HTM. Most people seem to view HTM as a general learning algorithm, but it is only a single part of a much larger puzzle. My view is that the goal of the HTM algorithm is to produce what I call the "ground truth", i.e. its entire function is to create an encoding of its input that captures the most salient features of that input. I'm probably not doing a great job of articulating the distinction, but I tend to think of HTM as a learning algorithm for the part of the brain that says "what am I looking at right now?", while the part that says "what do we do about what we're looking at right now?" is a completely different section that follows completely different rules. Making this distinction is important, I think, because if we view HTM as a reinforcement learning algorithm, then analyzing the system with respect to the correct goal matters. Once I started to view things this way, I started to consider how we can make HTM better at its actual goal, instead of trying to get it to reach some abstract goal (like MNIST classification) when it can't even do what it was designed for effectively.

This leads me to my second idea: modulating the learning signal based on some criterion (namely an error signal). With the temporal memory algorithm, the HTM system is constantly trying to predict the next input by putting neurons into a predictive state, such that columns with a predicted neuron will activate only that neuron. This means that incorrect predictions manifest in two ways: first, neurons that are predicted but not activated indicate that the system predicted something that didn't happen, and second, bursting columns indicate that the system did not correctly predict a feature that was present. This is an interesting concept because, considered in the context of the previous paragraph, it provides a potential metric by which to modulate the learning signal so that the HTM system becomes more effective at achieving its ACTUAL goal, which is to encode the "ground truth" such that the rest of the brain can use that encoding to make decisions. My current idea for implementing this is to make the permanence delta a function of a "confidence score" for the current input, determined by some combination of the metrics listed above. This is still in early testing.
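One possible shape for that confidence score, hedged heavily: the formula below is my own guess at "some combination of the metrics listed above", not an established HTM quantity. It mixes the fraction of bursting columns with the fraction of falsely predicted neurons, and scales the learning delta by the result.

```python
def confidence(n_active_cols, n_bursting_cols,
               n_predicted_neurons, n_false_predictions):
    """Hypothetical confidence metric: 1.0 = everything predicted
    correctly, 0.0 = total surprise (bursting everywhere plus every
    prediction wrong)."""
    burst_rate = n_bursting_cols / max(n_active_cols, 1)
    false_rate = n_false_predictions / max(n_predicted_neurons, 1)
    return max(0.0, 1.0 - 0.5 * (burst_rate + false_rate))

def modulated_delta(base_delta, conf):
    """Learn at full strength only from encodings the system trusts."""
    return base_delta * conf

# Well-predicted input: full-strength learning on this encoding.
print(modulated_delta(0.1, confidence(40, 0, 100, 0)))     # prints 0.1
# Every column bursting, every prediction false: learning suppressed.
print(modulated_delta(0.1, confidence(40, 40, 100, 100)))  # prints 0.0
```

This also connects to the third idea further down: the same score can gate whether an encoding is reliable enough to be used for learning the associated behavior.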

The third idea that I’m working with is an extrapolation from the first two. If the HTM system is generating a reliable encoding for a given state, then a behavior (output) can be mapped to that encoding through a general reinforcement learning algorithm such that the same encoding generates the same output (behavior) at a later time. On the other hand, if an encoding is not considered reliable (e.g. it has too many non-activated predicted neurons or too many bursting columns) then the encoding should not be used for learning the associated behavior (output).

So to summarize: The general idea is to separate a goal-oriented AI agent into two sections, one is the HTM encoder, and the other is the behavioral mapping. The HTM section can use its own error signal (bursting, etc) to modulate its own learning signal to improve performance and generate more reliable encodings more effectively. Additionally, the behavioral mapping uses a completely different learning signal to reinforce the connections between the output and the HTM encoding such that when the output satisfies some goal, similar encodings to that input will generate the same behavior at a later time. I’m hoping that these modifications will help to both alleviate some of the issues with HTM as it currently stands (like forgetting, etc) and create a framework for building ‘agents’ that can use HTM to accomplish some task.

I’m currently working on a few projects to validate some of these hypotheses, but it’s slow going so I don’t have much to show for it yet. If anyone would like to work together to establish a more constructive workflow for testing these ideas, I’d love to hear from you. If these changes show promising results, then I already have a few ideas for how more complex agents (read: general learning agents) can be developed using this or similar methods.

I hope that makes sense, if not please let me know and I’ll try to clear things up if I can.


I think HTM is already good as it is; the problem is that we are expecting it to work for any kind of input.

It works perfectly fine as long as the input format is already disentangled. The issue is that we try to feed it pixels, random vectors, or whatnot; even if we run the spatial pooler, the result is still too entangled.

We are basically feeding HTM feature-space data while believing it is a semantic space.

An example: a 256x256 pixel grid with a 16x16 ball bouncing around obstacles. We encode it with the spatial pooler, feed it to the HTM, and expect it to be able to predict the path of the ball.

Except that a ball at location (20,20) has zero overlap with a ball at (36,20), so as far as HTM is concerned they are not the same object, and sequence transitions for them are stored separately. It would take literally every possible permutation, collision, and location before it could make any reasonable prediction.

But this problem becomes trivial if we transform the reference frame to that of the ball, so that it is always centered and the obstacles move towards it, and we instead feed the ball's relative velocity alongside as a time series. In this reference frame, the HTM should have no problem learning to predict a change in velocity when a collision happens.
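The overlap argument is easy to verify with a quick sketch (the helper functions and coordinates are just for illustration): encode the ball as its set of active pixels, and compare the world-frame encodings against ball-centered ones.

```python
def ball_pixels(cx, cy, size=16):
    """Active pixels of a size x size ball whose corner is at (cx, cy)."""
    return {(cx + dx, cy + dy) for dx in range(size) for dy in range(size)}

def recenter(pixels, cx, cy):
    """Transform to the ball's own reference frame."""
    return {(x - cx, y - cy) for (x, y) in pixels}

a = ball_pixels(20, 20)
b = ball_pixels(36, 20)   # shifted by exactly one ball-width

# World frame: the two encodings share no bits at all.
print(len(a & b))                                   # prints 0
# Ball-centered frame: they are the identical pattern.
print(recenter(a, 20, 20) == recenter(b, 36, 20))   # prints True
```

Zero overlap in the world frame means HTM sees two unrelated SDRs, while the recentered encodings are bit-for-bit identical, which is exactly why the egocentric frame makes the prediction problem tractable.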


Have you tried experimenting with something like a grid of patch-based spatial poolers for learning local input signals while the temporal memories learn the spatial relationships between the signals that the spatial poolers represent, rather than learning temporal sequences?
I have a feeling that this kind of configuration might work better, because a single spatial pooler with a global receptive field would recognize, say, multiple different 4x4-patches worth of input signals, with some common patch-wise signals, as unique signals with almost no semantic similarity among them, but each of the 16 patch-based spatial poolers would recognize the inputs that share a common patch-wise signal as necessarily the same spatial input, increasing the capacity (combinatorially?).
One caveat might be that the "temporal memories" may have to be updated multiple times for a single input, because the information has to be propagated to all temporal memories, and connecting each temporal memory to all the others may end up costing a lot of memory.
I have no grounds to claim this, but I think a spatial pooler should be “overcomplete” as in it should be able to learn every single unique input signal it encounters, and the unresolved necessary spatial relationships should be learned by something akin to the temporal memory.
Edit: oh I guess a spatial pooler with local receptive fields and local inhibition, with the temporal memory that learns spatial relationships, would work well as well, huh?
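A small sketch of the patch-grid intuition, with sizes and images invented: split a 16x16 binary image into 4x4 patches stored in patch-local coordinates, so two images that share a patch's content produce exactly the same per-patch code even though they differ elsewhere.

```python
def patches(img, patch=4):
    """img: set of (x, y) active pixels on a 16x16 grid.
    Returns {patch_coord: set of patch-local active pixels}."""
    grid = {}
    for (x, y) in img:
        key = (x // patch, y // patch)
        # patch-local coordinates, so identical patch content matches
        grid.setdefault(key, set()).add((x % patch, y % patch))
    return grid

img1 = {(0, 0), (1, 1), (10, 10)}
img2 = {(0, 0), (1, 1), (14, 2)}   # same top-left patch, rest differs

p1, p2 = patches(img1), patches(img2)
shared = [k for k in p1 if p1.get(k) == p2.get(k)]
print(shared)   # prints [(0, 0)]: that patch is recognized as identical
```

A global pooler would treat img1 and img2 as two unrelated inputs, while the patch grid sees one shared sub-pattern, and it would then be up to the temporal-memory-like lateral connections to learn how the patches relate spatially.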


Yeah, shared weights would work; it would become a very complicated form of convnet. But brains don't do that, or rather, they can't.

Somehow brains solve this issue without needing to copy the weights everywhere, but how that is possible is the big question.

An example is that motor skills are transferable: you could write your own name in sand with your big toe, even though I bet you have never done it.

I'm fairly sure the brain doesn't simply copy the weights from your hand motor cortex into your leg motor cortex.


I don’t know if it matters but I wasn’t talking about weight sharing but just individual spatial poolers arranged as a grid. Plus, convnets do not have a direct analog to lateral connections as in the temporal memory.


In that case I don't think it would make much difference at all.

Each spatial pooler would still have to see the ball at every possible location, and each HTM instance would also have to see every possible transition.


Not necessarily: each pooler mainly has to know how the ball relates to things that are spatially and temporally close to it, and the poolers would help each other confirm what they are seeing, both spatially and temporally, through some kind of vote, as described by Jeff in TBT. A good example: if the poolers together detect that the ball is in a specific patch, the poolers watching other patches might directly "know" that they should not expect the ball in their patch. If the relative position is also encoded, that could help predict what is supposed to happen in a given patch, by combining the feedforward input with the information provided by the other poolers.

This idea was something I thought about too recently and this is something I will test soon. I need to optimize my code to make it run faster for that though.


And I think all of that is plausible: a pooler is a cortical column, the votes come from the connections inside minicolumns and between macrocolumns, and the position encoding would come from the part of the brain handling movement.


Well, suppose one of the pooler+HTM units by chance only ever sees the ball moving vertically, while its neighbors see it moving horizontally.

Would it learn to represent horizontal motion from its neighbors?


My hypothesis is that one neuron can indeed correlate what it observes from direct sensory input (feedforward) with what neighboring neurons are reacting to (via proximal and basal dendrites). I would even go further and hypothesize that the first neuron, observing the vertical movement, should try as much as possible to inhibit any cue from the feedforward signal about the horizontal movement, because it has learned that this specific signal can reliably come from a neighboring cell. The innovative principle in this idea is that by inhibiting feedforward synapses that predict the same thing as another neighboring cell, we would free some information bandwidth (important due to sparsity constraints) to capture as much information as possible about the actual patterns the cell should become an expert on, and basically avoid crosstalk. This would implement a kind of orthogonality of information that could later serve as a better prediction framework, where all cells collaboratively (spatially and temporally) predict the right thing via column voting.

Concretely, I think this approach might also be much less sensitive to permanence updates in the long run compared to standard HTM, because once a cell has become an expert at a given feature (as in neural pattern recognition, not abstract recognition), it should not update its sensitivity much anymore.

Also, it is important to note that this way of thinking introduces some kind of signal preference, leading to different treatment of the sensory signal versus the neighboring signal. I don't know if this is plausible, though.


Yes, that makes sense, and I think it is probably happening.

But you are missing my point.
A column may learn that its vertical signal is anticorrelated with the horizontal signal from the neighboring column, but that's true for any signal coming from a ball smaller than the width of the column.

If a ball crosses this column horizontally, the column will still fail to predict its path.
