Trying to understand how to advance HTM

Hi all,

I’ve been looking at the HTM algorithm as a topic for my thesis. The idea that we’re modeling the neocortex fascinates me, and the algorithm is really interesting. However, it seems to need improvement in an area of great interest to the current machine learning community: vision tasks. I know this isn’t the first time this topic has been brought up, but I’m new to the community and to HTM as a whole. What are HTM’s current limitations? Why doesn’t it work well for vision tasks in general?

My goal would be to use HTM for video classification, as that is a spatio-temporal problem, but I’m not sure where to start or what I should research in order to come up with an algorithm or additions to current HTM theory. Any resources or ideas would be very much appreciated.

I have made my own implementation of the Spatial Pooler based on the HTM papers, and I can get around 90% on MNIST using a softmax classifier at the end. I haven’t used NuPIC yet, but I have it fully installed.

Thank you for your time and have a nice day. :slight_smile:


I think there are a couple of reasons HTM isn’t currently suited for vision tasks.

One problem is that hierarchy is not yet a part of HTM. The algorithms which are good at vision tend to utilize multi-layer hierarchies.

The second is that vision is really a sensorimotor task, if I understand it properly. The eyes move over the image to explore it. A coordinate system and scale need to be established from these movements (and through voting between neighboring sensory surfaces).

This quote from Jeff was helpful to me for understanding the idea:


As @Paul_Lamb has pointed out, the fovea is essentially a straw, with a large lower resolution, motion sensitive, “aiming” window around it.

The fovea is employed for accurate vision in the direction where it is pointed. It comprises less than 1% of retinal size but takes up over 50% of the visual cortex in the brain. The fovea sees only the central two degrees of the visual field (approximately twice the width of your thumbnail at arm’s length).

The brain steers this around to take multiple small snapshots of the world that are assembled into a total picture. I made an example of this process here:


Macrocolumns will probably help. The retina (or pixels of a video) is a much bigger sensory input than HTM normally gets. It needs to be broken into pieces and re-integrated, whether through hierarchy like Paul Lamb mentioned or by a lateral method like the new ideas about object recognition.


Are you already familiar with deep learning and how it applies to image recognition?

90% on MNIST is pretty good… you’re essentially using spatial pooling for convolutions, with softmax on the end?

How are you feeding data in? Are you using raw full-image pixels, binary white/not-white, or a sliding frame approach?

Are you employing any methods for strengthening or weakening your connections between your input space and your spatial pooler? And if you “activate” one of your units in your spatial pooler (heading out into your input space), are you seeing the formation of rudimentary edges/lines/curves?

Next step (unless others have a different opinion) would be to see if you could use HTM going deeper with your network (still using HTM methods as opposed to backprop), or use an HTM classifier to make predictions.

An interesting experiment would be to backfeed through your network… when your softmax says “It’s a 7”, feed its output value down your network and see what pops out of your “input” space… if you could prove consistently from end to end how your network was arriving at its conclusions, you’d already contribute to the area of “Network Provability”, which is sorely lacking in Deep Learning. It could also have potential as an alternative to GANs.
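
For anyone who wants to try the backfeeding experiment, here is a rough sketch of one possible reading of it. It assumes you can get at the softmax weights for one class (one weight per SP column) and at each column’s list of connected input bits. The names `class_weights`, `connected`, and `backproject` are made up for illustration and are not from NuPIC or from the code discussed in this thread.

```python
def backproject(class_weights, connected, n_inputs, top_columns=3):
    """Project one class's softmax weights back onto the input space.

    class_weights: one weight per SP column for the chosen class (hypothetical).
    connected:     connected[c] lists the input bits column c is connected to.
    Returns a per-input-bit score; high values mark the input bits that the
    strongest columns for this class are listening to.
    """
    # Rank columns by how strongly the classifier relies on them for this class.
    ranked = sorted(range(len(class_weights)), key=lambda c: (-class_weights[c], c))
    heat = [0.0] * n_inputs
    for c in ranked[:top_columns]:
        for i in connected[c]:
            heat[i] += class_weights[c]
    return heat


# Tiny made-up example: 4 columns, 4 input bits.
weights_for_seven = [0.1, 2.0, -1.0, 1.0]
connections = [[0, 1], [1, 2], [2, 3], [3]]
heat = backproject(weights_for_seven, connections, n_inputs=4, top_columns=2)
```

Reshaping `heat` to the image size and plotting it would show whether anything digit-like “pops out”. This only looks at connectivity, not at permanence values, so treat it as a starting point.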

Of course for me, the greatest appeal of HTM theory is both its biological foundations, and that it can potentially use much smaller amounts of training input. This second aspect gives it a leg up on areas where millions of samples just don’t exist for a Deep Learning approach. Also, backprop is super inefficient, its inferences are expensive, and it just wastes a lot of electricity (slowing the distribution rate of locally-based AI into daily life).

For folks on sidelines:


I have some familiarity with the topics in deep learning, and I know that for image recognition tasks convolutional neural networks (CNNs) are currently the best models to use, where deeper CNNs tend to give better accuracies.

Exactly like you said, I’m pretty much using the spatial pooler for convolutions and putting a trained softmax layer at the end. I’ve posted the code on my GitHub here: and for accuracies on MNIST I received - Training Accuracy: 92.418% and Testing Accuracy: 90.660%. The GitHub page explains the hyperparameters I used. I tried playing around with different numbers of columns to see how HTM performed, and even with 484 columns and 16x16 MNIST I can get around 87-89% testing accuracy. Just a quick note: I know HTM is supposed to be online learning, but for the sake of testing the network, the testing set wasn’t learned on. Only the training set was learned on, similar to other machine learning problems.

For feeding in the data, I’m binarizing the image and either scaling it down or using it at the size given. So for MNIST I have a function that binarizes the image and scales it down or up to NxN pixels, and then writes the result to a new CSV file to be read in.
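
As an illustration of that preprocessing step, here is a minimal sketch. It is not the code from my repository; the function name, the nearest-neighbour resampling, and the threshold of 128 are all assumptions made for the example.

```python
def binarize_and_resize(image, n, threshold=128):
    """Resample a grayscale image to n x n and threshold it to 0/1.

    image: list of rows of 0-255 ints. Nearest-neighbour sampling is used,
    so this works for both scaling down and scaling up.
    """
    rows, cols = len(image), len(image[0])
    return [
        [1 if image[r * rows // n][c * cols // n] >= threshold else 0
         for c in range(n)]
        for r in range(n)
    ]


def flatten(bits_2d):
    """Flatten the n x n bit grid into the 1-D input vector an SP expects."""
    return [b for row in bits_2d for b in row]


tiny = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
small = binarize_and_resize(tiny, 2)   # scale down to 2x2
large = binarize_and_resize(tiny, 8)   # scale up to 8x8
```

Each row of the CSV would then be `flatten(...)` of one image plus its label.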

When you say going deeper with HTM, do you mean trying to add another layer onto it? Sort of like spatial pooling the data twice/three times/etc.? Could you clarify what you mean here?

Backfeeding the network would be an interesting experiment. I’ll have to figure out how to do that in terms of code but I understand that would help in understanding how the overall network is learning/proving its capabilities.

Agreed, HTM’s biological foundations are its most intriguing part to me, and it’s able to generalize well even from small sample sizes.

Thank you for the help and suggestions :slight_smile:


Thank you for the insights and the quote from Jeff Hawkins. I know there isn’t hierarchy currently implemented in HTM but wouldn’t improving the temporal memory’s ability to retain more of a memory help in this kind of task? That way it can get more from an image similar to eye movement?


Would it be more applicable then to make an eye encoder that would represent the data in a more appropriate way for HTM? The encoder would probably need some temporal aspect to it so that you can look at different portions of an image (similar to moving the straw) and then build upon what the image is overall.

Could you elaborate on what you mean by the lateral methods in object recognition?

Also thank you all for the replies! :slight_smile:


If you include the location signal (pointing direction) then you have a way forward in recognition.

This is actually an improvement on the “standard” convolution method. The standard method drags the kernel over the entire image regardless of what is in that part of the picture.

In practice you could do a very coarse image analysis to determine the “clumps” of an image (non-foveal vision) and then use that to guide gaze (pointing direction) for detailed foveal image recognition. This may have the additional benefit of moderate scale invariance.
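
A rough sketch of that two-stage idea, under simplifying assumptions: the image is already binarized, the “coarse analysis” is just counting active pixels in non-overlapping blocks, and the gaze is steered to the centres of the densest blocks. The function names and parameters are hypothetical.

```python
def coarse_fixations(image, block, k):
    """Return centres (row, col) of up to k densest block x block cells."""
    rows, cols = len(image), len(image[0])
    scored = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            mass = sum(image[r + i][c + j]
                       for i in range(block) for j in range(block))
            scored.append((mass, r + block // 2, c + block // 2))
    # Densest cells first; ties broken by position so the result is stable.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for mass, r, c in scored[:k] if mass > 0]


def foveal_patch(image, centre, radius):
    """Crop a (2*radius+1)-wide window around centre, zero-padded at edges."""
    rows, cols = len(image), len(image[0])
    r0, c0 = centre
    return [
        [image[r][c] if 0 <= r < rows and 0 <= c < cols else 0
         for c in range(c0 - radius, c0 + radius + 1)]
        for r in range(r0 - radius, r0 + radius + 1)
    ]


# 8x8 test image: a 2x2 clump near the top right and a lone pixel lower left.
img = [[0] * 8 for _ in range(8)]
for r, c in [(0, 4), (0, 5), (1, 4), (1, 5), (6, 1)]:
    img[r][c] = 1

fixations = coarse_fixations(img, block=4, k=3)
# Each (location, patch) pair is what would be fed onward for recognition.
glimpses = [(loc, foveal_patch(img, loc, radius=2)) for loc in fixations]
```

Keeping the location alongside each patch is the point: the downstream stage sees “what” together with “where it was pointing”.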

I strongly suspect that the sub-cortical structures use a small set of “canned” scan patterns based on the low-resolution shape of the “clumps” (primitives). There is considerable supporting evidence that the amygdala has certain built in shapes that drive attention (interest) and emotions. Faces are a well known example. Snakes and insects are other examples. There also seems to be a sexual dimorphism for the features for secondary sexual characteristics.


So essentially we’d need some sort of attention mechanism along with the pointing direction to improve HTM performance on vision tasks?

Yes, although that could be very simple.
If you include gaze as part of your image decomposition, then “something” has to direct it.

Two approaches:

  1. Multiple spatial poolers that are then concatenated together before sending out to your classifier. (sort of the macro approach)
  2. A spatial pooler would act as the input space into another spatial pooler (stacking them), so that you go from low level details (which in deep learning tend to be rudimentary edges/corners), to an abstraction.

Those two are not exclusive of each other either. I would try combinations of them.
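
To make the difference between the two arrangements concrete, here is a toy sketch. `ToySpatialPooler` is a made-up, untrained stand-in (random connections, overlap scores, top-k global inhibition, no learning or boosting); it is not NuPIC’s Spatial Pooler and all sizes are arbitrary.

```python
import random


class ToySpatialPooler:
    """Untrained stand-in for a spatial pooler (illustration only)."""

    def __init__(self, n_inputs, n_columns, n_active, potential=0.5, seed=0):
        rng = random.Random(seed)
        self.n_active = n_active
        # Each column is connected to a random subset of the input bits.
        self.connected = [
            [i for i in range(n_inputs) if rng.random() < potential]
            for _ in range(n_columns)
        ]

    def compute(self, bits):
        # Overlap = number of active input bits a column is connected to.
        overlaps = [sum(bits[i] for i in conn) for conn in self.connected]
        # Global inhibition: keep the n_active columns with highest overlap.
        winners = sorted(range(len(overlaps)),
                         key=lambda c: (-overlaps[c], c))[: self.n_active]
        sdr = [0] * len(overlaps)
        for c in winners:
            sdr[c] = 1
        return sdr


def parallel_features(poolers, bits):
    """Approach 1: same input to every SP, outputs concatenated."""
    out = []
    for sp in poolers:
        out.extend(sp.compute(bits))
    return out


def stacked_features(poolers, bits):
    """Approach 2: each SP's output SDR is the next SP's input."""
    for sp in poolers:
        bits = sp.compute(bits)
    return bits


x = [1, 0] * 32  # 64 input bits
side_by_side = parallel_features(
    [ToySpatialPooler(64, 100, 10, seed=1), ToySpatialPooler(64, 100, 10, seed=2)], x)
# In a stack, the second SP's input size must equal the first SP's column count.
stacked = stacked_features(
    [ToySpatialPooler(64, 100, 10, seed=1), ToySpatialPooler(100, 50, 5, seed=3)], x)
```

Either result (`side_by_side` or `stacked`) would then be the feature vector handed to the classifier.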

As for my reference to a sliding window or sliding frame (used in other areas of machine learning vision problems), here’s a nice explanation of it. @Bitking, is this akin to what you’re thinking of? At least in deep learning, this seems to have assisted in the generalization abilities of networks, in addition to convolutions.

In the end, my suggestions to you are also something I’d like to try out later myself. Since I stumbled into HTM, though (around the time I started a company 11 months ago, resigning from a job in fintech; my site:), I’ve just been too damn busy to try applying these concepts (or even maintain my poor website). It makes me very happy to see somebody who is, though. I’m hoping to have a bit more breathing room in about a month so that I can actually contribute to this space directly. In the meantime, I’m happy to give guidance/directions/suggestions as you (@flarelink et al.) push into this.


If synapse (that is, the connection mapping from your input space to a single ‘unit’ in your spatial pooler) strengthening or weakening were enabled, I imagine a system that would simultaneously filter for:

  1. non-anomalous regions (that’s to say, things the system recognizes such as borders/edges)
  2. anomalous regions (areas for which there are few activations)

Then I’d focus my network’s gaze on the familiar, assuming that we know the edges/boundaries have been decently trained into the network.

Assuming stacks of spatial poolers (I’m biased and influenced by deep learning as well as recent Numenta research into layers within the cortical columns), you would want your initial layers to care only about what they recognize, while higher layers ask the question “Do I know this?”, perhaps with a random rate of curiosity (fluctuating between caring about recognized vs. unrecognized). Maybe even have columns set aside in a plastic state (rather than a frozen trained state) specifically for pursuing the unrecognized regions, to see if there are any new patterns.

It’s a rough idea, but keep asking for clarification where it isn’t clear what I’m saying.


[quote=“MaxLee, post:11, topic:4582”]
is this akin to what you’re thinking of? At least in deep learning, this seems to have assisted in the generalization abilities of networks, in addition to convolutions.
[/quote]

No. Imagine a 2D wavelet transform implemented with a neural network; this is not exactly right, but a 2D FFT is not exactly right either. The network in question is building spatial-frequency-dependent pools of activity. If we were not committed to biologically accurate mechanisms, the FFT butterflies would be a nice start on doing the spatial frequency extraction, and there is good theoretical grounding for deriving the required function. Doing it with a neural network might mean blazing a new path (not terrible for a PhD project!).

It would build an activation sheet that is “draped” over the image where the lowest frequency energy peaks would be the “tall poles” holding up the sheet. These are the centroids of the clusters of features.

This activation profile would select the nearest match primitive shape to drive the gaze centers.


HTM’s object recognition (or an older version of it) narrows down the possible identities of the object by figuring out which objects all macrocolumns agree it could be. That’s just an example of what I mean by lateral methods: communication between cells in the same group with the same inputs, or, if you prefer, between cells in different groups (in the same level of the hierarchy, if there is a hierarchy). By a group of cells, I mean e.g. the spatial pooler’s cells or the temporal memory’s cells. They might be in different macrocolumns (called cortical columns in HTM, each of which receives input from a limited patch of the sensor) or not.
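
A deliberately oversimplified sketch of that narrowing-down step: each column holds a set of objects consistent with what its own patch has sensed, and the “vote” is just the intersection. The real mechanism works on cell activity rather than Python sets, and the object names here are made up.

```python
def vote(candidates_per_column):
    """Keep only the objects every column still considers possible."""
    sets = [set(c) for c in candidates_per_column]
    return set.intersection(*sets) if sets else set()


# Three columns, each sensing a different patch of the same object.
column_beliefs = [
    {"cup", "bowl", "can"},   # felt a curved surface
    {"cup", "can"},           # felt a rim
    {"cup", "pen"},           # felt a handle-like edge
]
agreed = vote(column_beliefs)
```

No single column can identify the object alone here, but together they settle on one candidate without any hierarchy above them.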


Careful with the temptation to simply stack SP and TM regions on top of one another. Of course it seems like the most intuitive thing to try, but it would not be a decision that has any rigorous mathematical or even biological motivation (not that much of anything in HTM has rigorous mathematical motivations). It is well known that the communication between neocortical columns that together process information in a hierarchical fashion is vastly more complex than what DL models have proposed in order to utilize backprop. I’ve always been more of a utilitarian myself, so if something works, I don’t really care if it’s biologically motivated or not. However, my understanding of the purpose and value of HTM research itself is to always look to biology in the effort to push AI materially ahead (which is arguably much more difficult than designing mathematical tricks on top of existing DL methodology). So I don’t think deviating from biology now necessarily falls in line with everything that has led up to the current state of HTM. In other words, bandaging up an incomplete/experimental biologically-inspired model with algorithmic tricks to compete with CNNs might seem fruitless or even unnecessary, especially when the engineer could just choose to use CNNs in the first place, which have been designed from the very beginning for solving exactly the problem of interest (visual classification) and have tons of research/support from all over the world, whereas HTM arguably seeks to explore and answer different questions (at least that is my interpretation).

Jeff explains here more reasons why hierarchies are of less interest to Numenta:

Moreover, I’m nearly positive somebody has already experimented with stacking SP and TM regions on top of each other. I know I’ve heard it brought up many times in the past. Perhaps somebody can help me find where this has been done before.


HTM theory is still experimental and experimenting with visual data is incredibly time consuming. Vision is hard and there’s a lot of pixel-data to process. I’ve tried multiple times to make HTMs use visual data and most times I’ve had to redo the experiment with smaller and simpler datasets. My best experiments are the ones which use made-up datasets to prove theories about single aspects of the brain.


I agree with you and completely understand what the goals of Numenta are. Being a practitioner of DL, I’m also keen to apply, even if not biologically correct, whatever works best (where “best” == “good generalization” && “less training data” || “more efficient calculations” || “provability of how the system arrived at its conclusion”).

I support the Gospel according to Bami, but I’ll encourage anyone who feels to try applying variations of that to fit their specific situation. Maybe this thread should be moved to the “tangential” area?

My main thrust is that the inherent structure of SPs already creates some pattern recognition, even without training. Combined with some minimal training, HTM can already start assisting with a wider array of real-world applications outside of strictly time series data (at which it excels), even before it reaches its end goal. On a real level, the potential efficiency gains translate into real environmental savings. In the meantime, Numenta will continue its mission of understanding the biological functions of the brain, creating an ever more complete picture of how we learn and think. I don’t think there should be exclusivity between the two.

But it is good to have a moderating voice, and I appreciate you also taking the time to put your thoughts together into a response. Perhaps it would be appropriate to make sure we classify things as either in the “Pure HTM” vs. “HTM hybrid” categories. Ultimately, understanding how intelligence works in nature will lead to the most robust and efficient design, and it should be pursued without fail. But if we self-limited ourselves from using DL, because it wasn’t a biologically accurate model, we’d be needlessly limiting ourselves.

I don’t want to get stuck in a world of false dichotomies and artificial binary boundaries between ideas when it comes to applied solutions.

Deep Learning, or the ideas around it had been around for ~30 years before somebody was able to crack it and make it work (realizing that we just needed to have more data to process). Who’s to say that the feedback that occurs on those tangential projects doesn’t somehow assist with the main efforts of Numenta? (where some abstract observation might provide a clue to the main mission of research)

I feel we shouldn’t close off opportunities prematurely. Just because I and perhaps hundreds of others don’t succeed at something, doesn’t mean others still shouldn’t try.

One thing I observe here is a bias to form into camps (I suspect that’s just human nature), so that folks may naturally resist forming those cross-domain skill sets and knowledge bases. So somebody who has failed previously, whose only background has been neuroscience, trying to model the neurons in a strictly biological fashion (nothing wrong with that, by the way), might as a result only have the perspective of their domain. Our only real limits here are that we can’t violate the laws of math and physics. Beyond that, it’s open for exploration.

I have worked heavily with computer vision for the past year, using both C++ and Python. I don’t believe I’m the only one here with this background either. In a month’s time, I’ll be devoting solid blocks of research time to this area, without the strict biological constraints or complete adherence to HTM theory (as it currently publicly stands), because I believe the overall efficiency gain over purely backprop-based systems will be worth it. After all, nobody’s up in arms over the slight divergence from strict HTM by

Feel free to point out when something isn’t pure HTM, but I do suggest avoiding discouraging folks (or even appearing to) from exploring this area. The community as a whole only loses out from that behavior.


I think you are right, thanks for pointing that out.

So I’ve tried both of these approaches and I agree with others that stacking SPs probably isn’t a good method after seeing some results of my own.

With the first approach, I send the same input images to two different SPs, evaluate the output of each one with softmax classification, then take the dot product of the two SPs’ SDRs and evaluate the performance again with another softmax. I’ve improved my SP implementation such that the output of both SPs is now around 94% for training and 92% for testing. However, once I take the dot product of the two outputs and feed it to the softmax, I get around 50% for training and testing.
After implementing this, I realized that taking the dot product of the output SDRs didn’t really make sense: the permanence values were randomized separately for each SP, so the winner-take-all step would end up picking different winning columns in each SP.
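
A quick back-of-the-envelope check of that intuition, as a sketch. It assumes “dot product” means multiplying the two SDRs bit by bit, and it treats the two SPs’ outputs as independent random SDRs, which is a simplification. The sizes (1000 columns, 20 active) are hypothetical, not the ones from my runs. Under those assumptions the expected number of surviving bits is k·k/n.

```python
import random


def random_sdr(n, k, rng):
    """An SDR with exactly k of n bits active, chosen uniformly at random."""
    active = set(rng.sample(range(n), k))
    return [1 if i in active else 0 for i in range(n)]


def elementwise_product(a, b):
    """Bit i survives only if column i won in BOTH SPs."""
    return [x * y for x, y in zip(a, b)]


n, k, trials = 1000, 20, 2000   # hypothetical sizes
rng = random.Random(0)
surviving = [
    sum(elementwise_product(random_sdr(n, k, rng), random_sdr(n, k, rng)))
    for _ in range(trials)
]
mean_surviving = sum(surviving) / trials
expected = k * k / n            # 0.4 bits out of 20
```

With numbers like these, most combined vectors are empty or carry a single bit, so the second softmax has almost nothing to work with. Concatenating the two SDRs instead keeps all the active bits from both.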

With the second approach where I’ve concatenated two SPs together I apply a softmax on the output of the first SP and then the second SP to evaluate performance of the network. The output of the first SP is around 94% accuracy for training and around 92.5% accuracy for testing. The output of the second SP, however, is around 80% for training and testing.

I agree with the ideas of both Pure HTM models and HTM hybrid models. I believe the approach I will be trying is an HTM hybrid model. However, I think the SP has been worked on a fair amount, so I should look at the temporal memory (TM) to see if I can add some sort of attention mechanism that may help with a task such as video processing.


@flarelink Good luck.