HTM + OpenCL



I’ve begun writing an implementation of HTM using OpenCL to improve performance, since there are loads of places in the algorithms where we can parallelise. Any help would be appreciated (especially ideas about how the Temporal Memory can be parallelised), so issues and pull requests are welcome. Once it’s done I’ll write a paper about it as part of my PhD work.

Currently, I plan to implement:

  • Spatial Pooler
  • Temporal Memory
  • CLA Classifier
  • SDR Classifier

It’ll probably take a while since a lot of work when writing OpenCL involves experimenting to find the best ways of dividing up the workload.

Ideally it would be a drop-in replacement for the existing algorithms in NuPIC (with hopefully better performance).

Can I use GPU for HTM?

Might I suggest that you write this using TensorFlow rather than raw OpenCL? In my opinion you’d get similar speedups with significantly less code. Since the HTM algorithms continue to evolve, it is important to strike a balance between maintainability and performance.

Note that TensorFlow uses CUDA but has experimental support for OpenCL.


I think @jbw825 has been trying to run NuPIC on TensorFlow. He might be interested in this conversation.


I’ve looked into speeding up HTM as well. I agree with @EronWright about using TensorFlow. It should take care of most of the basic setup. I think most of the work left to do would be setting up the HTM algorithms as custom tensor operations, which would have to be written in C++ and exported to a library for use with Python.

Also, if you want local inhibition, learning, or other spatial algorithms to be implemented, then I think tensors should be easier to work with.


Luckily, we have a full C++ implementation in nupic.core. If Python is a luxury you can live without, the door is wide open.


Yeah I saw someone else working on a TF version. The problem is that I don’t have ready access to a machine with a CUDA card. Also I don’t know TF and I have limited time. Nevertheless, it would be good to compare performance between our two implementations.


Nice, I contacted the person working on it and asked them to join us and maybe describe what they’re doing on our forums. Hopefully they show up!


I’m working on this implementation as a way to help myself understand HTM theory, and also as a bridge to (hopefully) get more people in the deep learning community to take a look at HTM. I personally came from a deep learning background and thought it’d help to get more people interested in HTM by implementing it with the tools familiar to the deep learning community such as TF.

Current Progress: A very basic implementation of spatial pooling. Haven’t added Boosting etc.

If you have any suggestions, feel free to give me feedback! I hang around the Gitter channel as well.


Update on progress:

I implemented a basic spatial pooling learning algorithm without boosting using TensorFlow. I wrote 2 basic unit tests but still need to verify the correctness of the algorithm. I based my implementation on the spatial pooler paper.
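For readers who want a concrete picture, a spatial pooler without boosting might be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the TensorFlow code described in this post: the function names, the 0.5 connection threshold, the initial permanence range, the increment and decrement values, global (rather than local) inhibition, and the choice of 41 active columns are all picked for the example.

```python
import numpy as np

def make_pooler(n_inputs, n_columns, rng, connected=0.5):
    # Assumption: permanences start close to the connection threshold,
    # so roughly half of the synapses begin connected.
    return rng.uniform(connected - 0.1, connected + 0.1,
                       size=(n_columns, n_inputs))

def sp_compute(perm, x, k, connected=0.5):
    """Return the indices of the k columns with the highest overlap."""
    connected_syn = perm >= connected            # (columns, inputs), boolean
    overlap = connected_syn.astype(np.int64) @ x.astype(np.int64)
    # Global inhibition: keep the k columns with the largest overlap.
    return np.argsort(-overlap, kind="stable")[:k]

def sp_learn(perm, x, active, inc=0.01, dec=0.01):
    """Hebbian update applied in place to the winning columns only."""
    # Strengthen synapses on active input bits, weaken the others.
    delta = np.where(x > 0, inc, -dec)
    perm[active] = np.clip(perm[active] + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
perm = make_pooler(n_inputs=784, n_columns=2048, rng=rng)
# Random binary vector standing in for one binarized 784-pixel image.
x = (rng.random(784) < 0.2).astype(np.int64)
active = sp_compute(perm, x, k=41)               # 41 winners is an assumption
sp_learn(perm, x, active)
```

Each input is processed on its own here, which matches the online setting: one overlap computation and one small permanence update per image.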

Current benchmark:

  • Intel Core i5-6600K CPU
  • NVIDIA GTX 1080 GPU with TensorFlow GPU
  • Feeding MNIST 784 vector input images one at a time (online learning)
  • Spatial pooling layer has 2048 cells
  • Running the spatial pooler output computation and the SP learning algorithm

It took my computer 19 minutes 39 seconds to train on 55,000 images, so each input takes around 0.021 seconds.

Are there any computation speed benchmarks for NuPIC for comparison?

I believe some sparse tensor optimizations are possible, but I haven’t had a chance to really dig into it.

Using TensorFlow CPU seems to be faster with my current implementation: it takes around 8 minutes for the same dataset. The GPU would probably run faster for batch operations with large tensors rather than inputting one data point at a time. I may experiment with that a bit more.
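The per-input figures behind the two timings above can be checked with a few lines of arithmetic. The helper name is made up for this example, and the CPU run is taken as exactly 8 minutes even though the post only says "around 8 minutes", so the ratio is approximate.

```python
def seconds_per_input(minutes, seconds, n_inputs):
    """Average wall-clock time per input for a run of the given length."""
    return (minutes * 60 + seconds) / n_inputs

N_IMAGES = 55_000

gpu = seconds_per_input(19, 39, N_IMAGES)   # 1179 s in total
cpu = seconds_per_input(8, 0, N_IMAGES)     # assumption: exactly 480 s

print(f"GPU: {gpu:.4f} s per image")
print(f"CPU: {cpu:.4f} s per image")
print(f"CPU is roughly {gpu / cpu:.1f}x faster in this online setting")
```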


There are so many permutations of parameters and input that could affect performance. I don’t think we have any benchmarks currently. It would be nice to have some.


Some updates on my progress!

Experimented with online learning:

Feeding the output of the spatial pooler into a logistic classifier results in 80.60% validation accuracy on the MNIST dataset with 1024 cells. It takes around 8 minutes per epoch.

Experimented with mini-batch learning:

Here we run the SP algorithm on a batch of inputs and then update the permanences all at once. This allows parallel training that takes advantage of the GPU and makes computation much faster: each epoch drops from 8 minutes to ~30 seconds. I used a batch size of 32, 2048 output cells, and a learning rate (AKA delta permanence) of 0.01 with 2% sparsity.
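A rough NumPy sketch of what such a mini-batch step could look like is given below. It is not the TensorFlow implementation from this post. The function name, the 0.5 connection threshold, global inhibition, and in particular the choice to sum (rather than average) the per-example updates over the batch are assumptions. The batch size, column count, learning rate, and 2% sparsity (41 of 2048 columns) follow the numbers quoted above.

```python
import numpy as np

def sp_minibatch_step(perm, X, k, lr=0.01, connected=0.5):
    """Compute SP outputs for a whole batch, then update permanences once."""
    X = X.astype(np.float64)                               # (batch, inputs)
    # Overlap of every column with every example in one matrix product.
    overlap = X @ (perm >= connected).T.astype(np.float64)  # (batch, columns)
    # Top-k columns per example (global inhibition, order within k ignored).
    top = np.argpartition(-overlap, k - 1, axis=1)[:, :k]
    active = np.zeros_like(overlap)
    np.put_along_axis(active, top, 1.0, axis=1)
    # Per winning column: +lr on active input bits, -lr on inactive ones.
    # Assumption: contributions are summed over the batch, not averaged.
    delta = active.T @ (lr * (2.0 * X - 1.0))               # (columns, inputs)
    perm += delta
    np.clip(perm, 0.0, 1.0, out=perm)
    return active

rng = np.random.default_rng(0)
perm = rng.uniform(0.4, 0.6, size=(2048, 784))
# Random binary batch standing in for 32 binarized MNIST images.
X = (rng.random((32, 784)) < 0.2).astype(np.float64)
active = sp_minibatch_step(perm, X, k=41, lr=0.01)
```

Because the winners for all 32 examples are chosen against the same permanences, the result can differ slightly from 32 sequential online updates. That is the deviation from the standard algorithm, and it is also what lets the work reduce to a few large matrix products.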

It’s a slight deviation from the HTM theory’s proposed algorithm but my brief experiments show no degradation in final accuracy. Interestingly, it achieves a 95% validation accuracy (98.37% training accuracy).

For comparison, a simple linear classifier reportedly reaches 88% accuracy. This probably suggests the SP creates a better representation of the data than a classifier trained directly on the raw data has to work with. But I think more work is required to verify my results. Suggestions are welcome!


Good job @calclavia! It would be super exciting to see some TensorBoard visualization of this.