I've used the backpropagation algorithm to implement HTM



First off, hello HTM forum! This is my first post but I’ve been around for a while. I’ve waited until I had something to show you guys.
I’ve been interested in AI (or MI, as you call it) since I was young, when I read the Korean version of ‘On Intelligence’ and was very impressed. Shocked, even.
I’ve studied and implemented HTM and deep learning systems a couple of times and have a moderate understanding of both.
Then I thought: why not combine HTM and deep learning? I’m sure everyone in this forum has thought of it at least once.
Some might think it’s not possible, but deep learning is really just about applying backpropagation to a function and minimizing error using its derivative. Since everything, including HTM, can be expressed as a function, why not?

So I kept trying different approaches until one succeeded.
Of course, HTM and deep learning are fundamentally different, so it was not easy.
HTM uses binary values, while deep learning typically uses decimal (real-valued) ones. It’s totally possible to use binary, but I thought decimal was more interesting, so I stuck with it.
I faced a lot of problems, such as the information flow between the spatial pooler and the temporal memory, but I’ve tried to stick to the HTM philosophy as much as possible.

I call this version of HTM ‘DeepHTM’.
I’ve implemented this using purely C++.
Using backpropagation on HTM comes with several advantages:

  1. You don’t have to hardcode the encoder and the decoder (SDR classifier?).
    1-1. Since you don’t have to hardcode the encoder, you can easily feed complex data types like images to HTM, and you can even generate images from the output of HTM, which would be near impossible to do with hardcoding.
  2. You can easily build a hierarchical structure with HTM. (In the plain form of HTM, with just SP and TM, it’s not so effective, I think?)
  3. You can easily test a theory, as even without explicit learning rules, backpropagation will figure out what to do with the information it has. You just have to decide how the information flows.

… And so much more!
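To give a rough idea of advantage 1 (this is a simplified Python/NumPy sketch of the concept, not my actual C++ code; the weights here are random stand-ins that backprop would train in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned encoder: scalar input -> two ReLU layers -> sparse code.
# These weights are random placeholders; in DeepHTM they would be trained
# end to end by backpropagation instead of being hand-designed.
W1 = rng.normal(0.0, 0.1, (1, 256))
W2 = rng.normal(0.0, 0.1, (256, 256))

def relu(x):
    return np.maximum(x, 0.0)

def encode(x, k=5):
    """Map a scalar to a 256-dim code with at most k active units (SDR-like)."""
    h = relu(np.array([[x]]) @ W1)   # first ReLU layer
    a = relu(h @ W2)[0]              # second ReLU layer, shape (256,)
    winners = np.argsort(a)[-k:]     # keep only the k strongest units
    sdr = np.zeros_like(a)
    sdr[winners] = a[winners]        # decimal values survive; the rest are zero
    return sdr

sdr = encode(np.sin(0.3))            # sparse code for one sine sample
```

The decoder is the mirror image: another small network trained to reconstruct (or predict) the input from the sparse code, which removes the need for a hand-built SDR classifier.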

Top left: the network structure. Top right: first 1000 steps of training.
Bottom left: 4000 steps of training. Bottom right: after 10000 steps of training.
(The site doesn’t let me post with multiple images since I’m a new user so I had to do it this way.)

The network structure:
The input is a sine value ranging between -1 and 1.
The encoder consists of two fully connected ReLU (it’s a deep learning thing) layers with 256 units (cells) each.
The spatial pooler consists of 256 columns with 5 winner columns using global inhibition (done by backpropagation, not simple boosting), and the temporal memory has 5 cells per column.
The decoder consists of two fully connected ReLU layers with 256 and 1 units, respectively.
The output is the predicted value of the next input.
This runs at about 1,500 steps per second and, last time I checked, doesn’t suffer from scaling problems.
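To make the shapes concrete, here is roughly how the forward pass fits together (again a NumPy sketch with random stand-in weights; the temporal memory’s sequence state is omitted, and the real system trains all of these weights with backprop):

```python
import numpy as np

rng = np.random.default_rng(42)
relu = lambda x: np.maximum(x, 0.0)

COLS, CELLS, K = 256, 5, 5            # 256 columns, 5 cells each, 5 winners

# Random stand-in weights; in DeepHTM all of these are learned by backprop.
enc1 = rng.normal(0.0, 0.1, (1, 256))             # encoder layer 1
enc2 = rng.normal(0.0, 0.1, (256, 256))           # encoder layer 2
sp_w = rng.normal(0.0, 0.1, (256, COLS))          # spatial pooler connections
dec1 = rng.normal(0.0, 0.1, (COLS * CELLS, 256))  # decoder layer 1
dec2 = rng.normal(0.0, 0.1, (256, 1))             # decoder layer 2

def step(x):
    h = relu(relu(np.array([[x]]) @ enc1) @ enc2)   # two-layer ReLU encoder
    overlap = (h @ sp_w)[0]
    winners = np.argsort(overlap)[-K:]              # global inhibition: top-K columns
    sp = np.zeros(COLS)
    sp[winners] = overlap[winners]                  # sparse column activity
    # Temporal-memory placeholder: 5 cells per winner column (state omitted)
    tm = np.zeros((COLS, CELLS))
    tm[winners, :] = sp[winners, None]
    y = relu(tm.reshape(1, -1) @ dec1) @ dec2       # decoder predicts the next value
    return float(y[0, 0])

pred = step(np.sin(0.5))   # one forward step on a sine sample
```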

Well, it’s not the best, but the model is quite small by deep learning standards (some might say it’s not even deep learning), and small even for the HTM parts.
And as for the training time, I’ve used the most basic version of backpropagation; today’s deep learning has nifty tricks that make training much faster and even more stable.
It definitely needs improvements but it’s just a proof of concept.

I’m planning on making this run on the GPU via compute shaders and combining in more deep learning techniques such as convolutional neural networks.
I’m also planning on implementing a differentiable grid cell module and the sensorimotor theory.

I’m happy to answer any questions you may have! :smiley:

P.S. English is not my native language so excuse me as I might have some mistakes here and there… :’(


Hey @hsgo, cool work and thanks for sharing! I wonder, what do you see as the advantages of using backprop instead of standard HTM encoding/decoding and spatial pooling methods?

It makes total sense to me that backprop-based methods can be useful to combine with HTM systems, like having CNNs generate encodings of images to then feed into the TM. I see the first practical limit of current HTM as its limited set of encoders, which leaves out some data types like images. With scalar values, however, it seems the current encoders, SP and TM have done quite well. How do you see the ReLU networks and backprop helping here?

Also why only 5 winner columns in the SP and 5 TM cells per column? Something to do with the DNN runtime? Thanks


Thanks for replying!
Well, I forgot to mention that it allows feeding complex data types like images into the TM easily, which you’ve pointed out here. Thanks for that, and I’ll add it to the list of advantages.
I used scalar values just as a proof of concept. I don’t think it changes things much. Well, it might give more semantic meaning to the input. IDK.
I used 5 winner columns because there are only 256 columns; that makes the sparsity about 2% (5/256 ≈ 0.02).
As for the 5 cells per column, for this particular case it doesn’t seem to use more than 5 cells, and I wanted to debug as quickly as I could.
It is true that it runs slower than vanilla HTM. That’s why I’m planning on implementing a GPU version of it. It might run even faster than vanilla HTM with GPU.


Have you tried running it on any images or videos? I’d love to see the results if you do, but I’m a bit skeptical. Wouldn’t it need a supercomputer to run on images/videos and backpropagate? Especially with a fully connected layer? Would the images/video need to be shrunk?

I’m still working on my own encoder for vision, and I was going to release it soon, but I recently hit a lot of points for optimization. Since I used multiple layers, as well as variables with memory, for video and image recognition, I don’t think a ReLU layer could capture it.

That said, I’d absolutely love to see encoders produced from this! I guess you’ve made a sine encoder, but if you entered multiple frequencies, would it learn to encode the tone of a waveform for HTM?

Edit: also, you might want to see this for a GPU HTM implementation: https://github.com/calclavia/htm-tensorflow


Have you tried running it on any images or videos?

I haven’t. I only just got this working yesterday XD…

Wouldn’t it need a supercomputer to run on images/videos and backpropagate? Especially with a fully connected layer?

I think a GPU can kinda handle it. We’ll see.
Although, I don’t think I’ll use fully connected layers for encoding images. I’m planning on using CNNs.

I guess you’ve made a sine encoder, but if you entered multiple frequencies, would it learn to encode the tone of a waveform for HTM?

I guess so. That’s an interesting point! BTW, I didn’t know how to describe what I’d done with the encoder; it was just on the tip of my tongue. And yes, it’s just like you said: I’ve made a sine encoder!

you might want to see this for a GPU HTM implementation: https://github.com/calclavia/htm-tensorflow

Isn’t it just the spatial pooler part?


Ooh, maybe the project itself would be an auto-encoder?

Yes, but I think it could still be helpful.

Okay, that’s fair.


maybe the project itself would be an auto-encoder?

With this particular setup, it has the form of an autoencoder. But I think other configurations will work just fine.

Yes, but I think it could still be helpful.

Thanks, but I’m planning to use compute shaders for the GPU implementation because I don’t have any TensorFlow skills.


Interesting work.

I’m trying to understand your implementation, what do you mean by,

Does the encoder try to learn the best representation of an input?
And in turn the decoder tries to learn the input from the TM output?



Yes. That’s exactly how it works.
With this setup, the TM output is the predictive cells, but I’ve implemented an active-cell version as well.
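Stripped of the HTM parts, the learning signal is just next-step prediction error pushed back through the decoder and encoder. A minimal NumPy illustration of that objective (a generic one-hidden-layer predictor with manual backprop, standing in for the full DeepHTM pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.5, (1, 32))   # stands in for the learned encoder
W2 = rng.normal(0.0, 0.5, (32, 1))   # stands in for the learned decoder
lr = 0.02

xs = np.sin(np.linspace(0, 8 * np.pi, 2000))
losses = []
for t in range(len(xs) - 1):
    x, target = xs[t], xs[t + 1]      # supervise with the NEXT input
    xv = np.array([[x]])
    h = np.maximum(xv @ W1, 0.0)      # "encoder": one ReLU layer
    y = h @ W2                        # "decoder": linear readout
    err = y - target                  # next-step prediction error
    # Manual backprop of the loss 0.5 * err**2 through both layers:
    gW2 = h.T @ err
    gh = (err @ W2.T) * (h > 0)       # gradient gated by the ReLU
    gW1 = xv.T @ gh
    W2 -= lr * gW2
    W1 -= lr * gW1
    losses.append(float(err[0, 0] ** 2))
```

The running loss drops as training proceeds, which is all the supervision the encoder and decoder get.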


I’m very interested in seeing how this will work on images, and especially the outputs of the decoder. Most importantly, how will it perform in an unsupervised environment: for example, given 10k inputs, how well will it perform with different sequences of those inputs? At the least, the SP memory’s state is a function of these input sequences, so different sequences will likely produce different SP states/moulds, which may affect the learning in the encode/decode phases.


Actually, even with this setup (with only a single scalar input), if bursting happens rarely, the output tends to get gradually unstable without constant learning.
It gets noticeable after hundreds of thousands of iterations.
I suppose it’s because the TM’s state is based on decimal values.
Well, if it were predicting complex data like images, I guess it would get unstable way sooner.
I have an idea to resolve this problem but I think it makes the model less flexible.
Edit: I’m sorry but I think I’ve partially resolved this issue already.


Maybe, and the DL parts also depend on the HTM parts’ stability. For example, if the input sequences result in a robust SP memory, I’d expect the DL parts to generalize well. On the other hand, if there aren’t enough input sequences to mould a robust SP, then it will be hard for the DL parts to generalize.

But I’m just guessing based on my understanding of SP and DL. Cool!


Yup, it is kinda apparent that HTM’s sparsity affects DL’s generalization.
But I think their balance is okay so far.


I am very interested in understanding exactly what you’ve done here. Can you share your code? Would something like this run on a platform like tensorflow or pytorch where online learning is essentially disabled? @subutai is running experiments with different platforms, trying to find out where we can inject sparsity and apply HTM theory within these platforms.


I think the code is nowhere near complete and too messy to be shared because of all the experiments and debugging I’ve done. But I’ll be glad to answer any questions you might have.

I don’t have any experience with platforms like those, but I think it would work. I didn’t know online learning is disabled in them, though, so I’m not sure.


@hsgo Thanks for the experiment results, I do have some questions. :slight_smile:

If I understand you correctly, you have replaced the Temporal Memory algorithm used to predict future states with DL backprop? Have you thought about a path towards local inhibition? That’s one big pro for HTM: you can do local computations.


Thanks for the reply!

I don’t think I’ve understood that statement entirely. Forgive my English reading skills… :’(
I guess you’re asking whether I have replaced the TM with DL backprop?
If so, no. I’ve just replaced the learning rule (i.e. local Hebbian learning) with backprop. Everything else stays… in decimal-based form.
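For anyone wondering how backprop can coexist with a hard winner-take-all step: the usual trick (a NumPy sketch of the general idea, not necessarily my exact method) is to treat the winner mask as a constant during the backward pass, so gradients flow only through the winning units and the non-differentiable argsort is never differentiated.

```python
import numpy as np

def k_winners_forward(a, k):
    """Hard top-k selection on the forward pass."""
    mask = np.zeros_like(a)
    mask[np.argsort(a)[-k:]] = 1.0
    return a * mask, mask

def k_winners_backward(grad_out, mask):
    """Backward pass: the mask is treated as a constant, so gradients are
    routed only to the winning units (no gradient through the argsort)."""
    return grad_out * mask

a = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
y, mask = k_winners_forward(a, k=2)            # only the two strongest survive
g = k_winners_backward(np.ones_like(a), mask)  # gradient reaches winners only
```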

I suppose you’re referring to local receptive fields with local k-winners (i.e. topology?). I’m not sure exactly how that works, especially the inhibition part. I’ve watched every HTM School episode several times, but it went right over my head… It’s the limit of my understanding of English, actually.
But I’m planning on another kind of local computation, inspired by CNNs: cortical columns instead of convolutional kernels, with relative coordinates to represent distal synapses. I think this method would really help with data types that have complex spatial structure, like images. Do you think this would work?


I think I understand what you are doing now. But remember that we don’t want to perform global calculations at each temporal time step, because it won’t scale to the size we need to process true topological sensory input. Being able to do the local computations is a key component of a general solution IMO. There must be feedback, there must be credit assignment, but it must be applicable locally as well as globally. Just something to think about as you build these hybrid systems.


Couldn’t agree more! :smiley:
I’m just using backprop as a useful tool to engineer ML applications,
not to achieve true intelligence.