Releasing Etaler - A very fast and flexible HTM framework with full GPU support

Hello all.
I’m very excited to finally release a part of my graduation project - Etaler, a very flexible high-performance HTM framework.

Why build Etaler

NuPIC is research-oriented, and thus it is not that fast. I think that’s one of the reasons why we are still stuck with small-scale experiments. Inference and learning on an 8192x16 TM in NuPIC can take up to 3 seconds per time step. That’s a big problem for me, working on real-time SMI (Sensorimotor Inference)/RL (Reinforcement Learning). Therefore I set out to make my own framework and make sure it is fast.


Etaler is designed around these concepts.

  1. Integrated Tensors
  2. Separate frontend/API and backend
  3. Data Orientated Design as a first-class citizen
  4. Attempts to support research

Tightly Integrated Tensors

Instead of relying on libraries like numpy or xtensor to handle multi-dimensional arrays (neither of which supports GPUs), Etaler implements its own Tensors. They are tightly integrated into the core framework and can be easily extended, which makes future development easier.

And… These Tensors support broadcasting, GPU acceleration and basic indexing. :smiley:
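
To illustrate what broadcasting means for shapes, here is a minimal sketch of the NumPy-style rule. It assumes Etaler follows the same convention; `Shape` and `broadcastShape` are hypothetical names for this example, not Etaler's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::size_t>;

// NumPy-style broadcasting rule (an assumption about the semantics, not Etaler's code):
// shapes are aligned from the trailing dimension, and two sizes are compatible
// when they are equal or when one of them is 1.
Shape broadcastShape(const Shape& a, const Shape& b)
{
	const std::size_t rank = std::max(a.size(), b.size());
	Shape result(rank, 1);
	for (std::size_t i = 0; i < rank; ++i) {
		// Walk both shapes from the last dimension; missing dimensions count as 1.
		const std::size_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
		const std::size_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
		if (da != db && da != 1 && db != 1)
			throw std::invalid_argument("shapes cannot be broadcast together");
		result[rank - 1 - i] = (da == 1) ? db : da;
	}
	return result;
}
```

Under this rule a `{4, 1}` tensor combined with a `{3}` tensor yields a `{4, 3}` result, while `{2, 3}` and `{4}` are rejected.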

Separate frontend/API and backend

Etaler provides different backends to run on different devices, while all backends are connected to the same frontend API. This enables simple optimization strategies and the ability to run on the GPU with a one-line code change, like most DL frameworks.

Data Orientated Design

Like most modern Deep Learning frameworks, Etaler takes a DOD approach to its design. For example, instead of having a class called Synapse that stores a connection target and a permanence, synapses are described as two Tensors: one storing the connections to other neurons and one storing the permanences. This reduces the number of memory accesses and increases efficiency.
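
As a rough illustration of the two-array layout, the sketch below stores synapses in two flat `std::vector`s and counts active connected synapses per cell. The struct, the function name and the `-1` empty-slot convention are assumptions made for this example, not Etaler's actual Tensor code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: every cell owns `maxSynapses` consecutive slots.
// A target of -1 marks an unused slot.
struct SynapseArrays {
	std::size_t cells;
	std::size_t maxSynapses;
	std::vector<int> targets;       // cells * maxSynapses connection targets
	std::vector<float> permanences; // cells * maxSynapses permanences
};

// For each cell, count the synapses that are connected (permanence >= threshold)
// and whose target input bit is active.
std::vector<int> countActiveSynapses(const SynapseArrays& s,
	const std::vector<bool>& input, float connectedPermanence)
{
	assert(s.targets.size() == s.cells * s.maxSynapses);
	assert(s.permanences.size() == s.targets.size());
	std::vector<int> overlap(s.cells, 0);
	for (std::size_t c = 0; c < s.cells; ++c) {
		for (std::size_t j = 0; j < s.maxSynapses; ++j) {
			const std::size_t idx = c * s.maxSynapses + j;
			if (s.targets[idx] < 0)
				continue; // empty slot
			const std::size_t target = static_cast<std::size_t>(s.targets[idx]);
			if (input[target] && s.permanences[idx] >= connectedPermanence)
				++overlap[c];
		}
	}
	return overlap;
}
```

Both arrays are read linearly, with no per-synapse object to chase through a pointer, which is the memory-access benefit the paragraph above refers to.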

A data orientated approach also results in a highly reusable API. Now writing new layers is like writing tensor operators in DL frameworks - just write some code that chains the operations together. No more inheritance hell when developing new layers!

Supporting research and innovation

This is a very ambitious goal. I designed Etaler to support future research and communication between researchers and developers/hobbyists. By supplying a clean and high-performance interface, researchers need not invent their own hacks to get the system running at speed, while devs/hobbyists can share their ideas with clear, expressive code. Fewer loops, fewer conditions, just function calls.

To support the previous two claims, TemporalMemory with apical synapses can be implemented in just a few lines.

std::pair<Tensor, Tensor> compute(const Tensor& x, const Tensor& apical, const Tensor& last_state)
{
	et_assert(x.dimentions() == 1); //This is a 1D implementation

	//Feed forward TM predictions
	auto [pred, active] = TemporalMemory::compute(x, last_state);

	//Apical feedback
	Tensor feedbacks = cellActivity(apical, apical_connections, apical_permance, 0.21, 3);

	//With more than one predictive cell in a column, a cell stays predictive only if its apical feedback is active
	auto s = pred.sum(1);
	pred = (pred && (s>1 && feedbacks)) || (s==1 && pred);
	return {pred, active};
}

void learn(const Tensor& active_cells, const Tensor& apical, const Tensor& last_active)
{
	//Let the distal synapses grow and learn
	TemporalMemory::learn(active_cells, last_active);

	//Let the apical synapses learn
	learnCorrilation(apical, last_active, apical_connections, apical_permance, 0.1, 0.1);
	growSynapses(apical, last_active, apical_connections, apical_permance, 0.21);
}
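
For readers who prefer explicit loops, the apical gating expression in `compute` can be spelled out on plain arrays. This is only a sketch of the same boolean rule; `gateWithApical` and the row-major `[columns x cellsPerColumn]` layout are assumptions for illustration, not part of Etaler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// pred and feedback are row-major [columns x cellsPerColumn] binary arrays.
// Columns with more than one predictive cell keep only the cells that also
// receive apical feedback; columns with zero or one predictive cell are unchanged.
std::vector<bool> gateWithApical(const std::vector<bool>& pred,
	const std::vector<bool>& feedback, std::size_t cellsPerColumn)
{
	assert(cellsPerColumn > 0 && pred.size() % cellsPerColumn == 0);
	assert(pred.size() == feedback.size());
	std::vector<bool> out(pred.size(), false);
	for (std::size_t base = 0; base < pred.size(); base += cellsPerColumn) {
		// Equivalent of pred.sum(1) for this column
		std::size_t predicted = 0;
		for (std::size_t j = 0; j < cellsPerColumn; ++j)
			predicted += pred[base + j] ? 1 : 0;

		for (std::size_t j = 0; j < cellsPerColumn; ++j) {
			const std::size_t i = base + j;
			if (predicted > 1)
				out[i] = pred[i] && feedback[i];
			else
				out[i] = pred[i];
		}
	}
	return out;
}
```

With two cells per column, `pred = {1,1, 0,1}` and `feedback = {0,1, 0,0}` give `{0,1, 0,1}`: the first column is disambiguated by feedback, the second is left alone.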


On CPU, Etaler is as fast as, if not slightly faster than, my previous framework, tiny-htm. On an adequate GPU, Etaler outperforms both NuPIC.core and tiny-htm by a large margin in the benchmarks below.

Performance of SpatialPooler:
(9000 input bits, 9000 output bits, 10% target density, no boosting, no topology, 75% potential pool, potential radius = 4500, learning enabled, random scalar encoded input)

| | NuPIC.core | tiny-htm | Etaler |
|---|---|---|---|
| Processor | R7 1700X - 1 core | R7 1700X - 16 cores | RTX 2080Ti |
| Time | 181ms | 25ms | 6.9ms |


Performance of Temporal Memory:
(8192 columns, 16 cells per column, max 1024 synapses per column. tiny-htm and Etaler using the connect-to-all method described here)

| | NuPIC.core | tiny-htm | Etaler |
|---|---|---|---|
| Processor | R7 1700X - 1 core | R7 1700X - 16 cores | RTX 2080Ti |
| Time | 3074ms | 36.8ms | 4.91ms |


  • NuPIC.core compiled with clang 8.0.0 (running into linking issues with GCC)
  • tiny-htm compiled with GCC 8.3.0
  • Etaler compiled with GCC 8.3.0
    • RTX 2080Ti benchmarked using Nvidia’s official OpenCL SDK.

OS/Device support

Etaler has only been tested and proven to work on the following systems. Given these results, there is no reason for it not to work on other systems.

| | system1 | system2 | system3 | system4 | MacBook Air |
|---|---|---|---|---|---|
| OS | Arch Linux | Arch Linux | Arch Linux | Manjaro Linux | OS X |
| CPU | R7 1700X | I7 8700 | I5 8250U | I7 6700 | I5 5250U |
| GPU | GTX 780Ti | RTX 2080Ti | HD 520 | GTX 970 | HD 6100 |

  • For Intel iGPUs, tested on the new Neo OpenCL SDK
  • For NVIDIA GPUs, tested on the official OpenCL SDK
  • OpenCL on CPU is not tested.
  • I want to test Etaler on an AMD card, but I don’t have the budget to get one. :frowning:
    • A Radeon VII (7nm Vega + HBM) should theoretically be faster than an RTX 2080, since the main limitation is memory bandwidth.
  • Etaler is buildable under Windows using MSYS2, but it crashes immediately due to an MSYS2-specific compiler bug (the resulting DLL is not loadable). Other build environments are not tested.
  • I see no reason for it not to work on ARM.
  • Built with GCC and libstdc++ on OS X.

Future plans

I’m planning to support and develop Etaler long-term after my graduation project if there is enough interest. There are some features I want to add to Etaler but haven’t had the time for yet.

  • port htmresearch layers
  • More optimization (2x performance on GPU may be feasible)
  • Python wrapper
  • More numpy-style array operators
  • Graph mode/lazy evaluation
  • Better documentation
  • Support ongoing research
  • Batch execution (optimization for Thousand Brains Theory)
  • Windows support
  • etc…

I believe Etaler can be a great tool for pushing HTM theory forward, accelerating experiments and promoting innovation. But it is in its early stages and I’m not going to make it just by myself. If you think Etaler is a project you are interested in, by all means, join in.

The easiest way to help is by using the library. If you find a bug or a missing feature, please open an issue and let us know how we can improve. If you want to develop the library itself, you are very welcome! We are excited to see a new PR pop up.


HTM Theory is still young and growing. We’d like to get contributions from you to accelerate the development of Etaler! Just fork, make changes and open a PR!

Special Thanks

A huge thanks to @LiorA for testing the framework, making a Dockerfile and now working on Layer visualizations. Thank you! Your work is amazing.

Also, thank you to all forum/Discord members. I wouldn’t have been able to get this far without you. I have learned a lot from the community and I love you awesome geeks!

Where is the source?

(I’m still working on the logo for the project. Hang on!)

And extra examples.


Looks promising. Does it also have OPF with anomaly detection, or is the plan just to port nupic.core?


It is an entirely new framework, having nothing to do with NuPIC.core. But it is designed in a way that is very close to how ordinary DL frameworks work. So when we eventually have the Python wrapper working, you can plug in any data streaming framework for DL and it should work.


Got it. Thanks :slight_smile:

Nice work, @marty1885 – thanks for sharing!

I thought I would mention a few observations on differences between Etaler and the official TM algorithm. I am not claiming any of this to be wrong – just some points worth noting. I am not very familiar with the specific .cl file syntax, so please correct me if I have misrepresented anything.

  1. The TM algorithm appears to only be growing one segment per cell. This is also implied by some of the variable names (such as MAX_SYNAPSE_PER_CELL). @scott commented on this optimization idea on another unrelated thread. He mentioned that this is a reasonable optimization and that the algorithm would still work. The cost would be increased likelihood of false positives once you get up to around 10-15 predicted patterns, and he mentioned that this could be mitigated with the right learning rates.

  2. The cells chosen for learning in the current timestep will grow synapses with any cells that were active in the previous timestep (see here for example). This differs slightly from the official algorithm, where they will grow synapses only with winner cells that were active in the previous timestep (see here for example).

  3. It appears that every learning cell will connect with every previous active cell, up to MAX_SYNAPSE_PER_CELL. This differs from the official algorithm, where the number of new synapses grown in a single timestep can be throttled with maxNewSynapseCount. While I haven’t tested this, I would expect that not having maxNewSynapseCount might lead to behavior similar to what is described in this thread. If so, this could probably be mitigated by keeping MAX_SYNAPSE_PER_CELL sufficiently low.
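
To make points 2 and 3 concrete, here is a minimal sketch of synapse growth with a per-timestep throttle. The function name, the flat `segment` vector and the in-order selection are assumptions for this example (the official algorithm samples its candidates randomly); it is neither NuPIC's nor Etaler's code. Passing the previous winner cells as `candidates` corresponds to the official behavior, while passing all previously active cells corresponds to the variant described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Grow synapses from one segment towards candidate presynaptic cells.
// `segment` lists the presynaptic cells that are already connected.
// At most `maxNewSynapseCount` synapses are added per call, and the segment
// never exceeds `maxSynapsesPerSegment`. Returns the number of synapses grown.
std::size_t growThrottled(std::vector<std::size_t>& segment,
	const std::vector<std::size_t>& candidates,
	std::size_t maxNewSynapseCount, std::size_t maxSynapsesPerSegment)
{
	std::size_t grown = 0;
	for (std::size_t cell : candidates) {
		if (grown >= maxNewSynapseCount || segment.size() >= maxSynapsesPerSegment)
			break;
		if (std::find(segment.begin(), segment.end(), cell) != segment.end())
			continue; // already connected to this cell
		segment.push_back(cell);
		++grown;
	}
	return grown;
}
```

Setting `maxNewSynapseCount` to a very large value reproduces the unthrottled case, where only the per-segment cap limits growth.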


Sorry for the late reply. I have been too busy lately.

  1. Yes, I think that is because I follow the pseudocode from the original HTM Whitepaper.

  2. Yes, I use a modified version of the learning algorithm to avoid the repeated input problem that plagues TM. It is therefore advisable to decay unused synapses every once in a while to free up space for new ones.

  3. Hmm… Good point that I didn’t catch. I’ll think about it.
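
The periodic decay mentioned in point 2 could look roughly like the sketch below. It assumes a flat two-array synapse layout with `-1` marking an empty slot, and that every stored synapse is decayed by a fixed amount while learning reinforces the ones still in use; `decayAndPrune` and its parameters are hypothetical, not Etaler's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: targets[i] < 0 marks an empty synapse slot.
// Lowers every stored permanence by `decay` and frees slots whose permanence
// drops below `pruneBelow`. Returns the number of slots freed.
std::size_t decayAndPrune(std::vector<int>& targets, std::vector<float>& permanences,
	float decay, float pruneBelow)
{
	assert(targets.size() == permanences.size());
	std::size_t freed = 0;
	for (std::size_t i = 0; i < targets.size(); ++i) {
		if (targets[i] < 0)
			continue; // already empty
		permanences[i] = std::max(0.0f, permanences[i] - decay);
		if (permanences[i] < pruneBelow) {
			targets[i] = -1; // slot can now hold a newly grown synapse
			permanences[i] = 0.0f;
			++freed;
		}
	}
	return freed;
}
```

Synapses that keep getting reinforced stay above the threshold, while stale ones eventually release their slots.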

Congrats on your graduation! I’m a bit confused though, why didn’t you use an existing tensor library?

A few reasons. First, most of them run on CUDA with no OpenCL support, whereas with OpenCL I can run on any GPU and even FPGAs. Second, I can’t find a tensor library that both a) supports multiple GPUs and b) allows easy extension (so I can add HTM algorithms to the tensor operators).

Edit: I’m still a while from graduation. It’s a must-do project, but I’m still at least a year from graduation.


Impressive work, @marty1885 ! :clap:

The API seems really nice to use and I applaud the HW-agnostic design.

Could you give a short comparison to the HTM whitepaper, nupic, or nupic.core, in terms of feature parity, custom design decisions (like those in TM), and API compatibility?

Would it be possible to integrate your work as a GPU backend for htm.core (the community fork of nupic.core)?
We have multiplatform C++ and Python wrappers. So if we can relatively easily write a C++ wrapper for SP and TM, your work should integrate nicely there.

Are you using local inhibition in the SP?
Because if not, I have competitive results. Please see if I’ve replicated your benchmark correctly.

I use: 9000 inputs, 9000 outputs, SP global inhibition, 10% sparsity, 75% potential pool, 4500 potential radius.
TM: 9000 columns (yours: 8192), 16 cells/column.

If you run the benchmark:
nupic.cpp/build/scripts$ cmake ../.. && make -j8 mnist_sp benchmark_hotgym && ./src/benchmark_hotgym 1000

It prints time in secs for 1000 runs, so exactly your ms/iteration:

Init: 0.273688
Random: 0.262973
Encode: 0.0273115
SP (g): 1.69396
TM: 19.2002
AN: 0.00640459
Total elapsed time = 21 seconds

So we get SP 1.7ms/iter and TM 19ms/iter. Would you include htm.core in your comparison?

PS: I wonder what the runtimes are on FPGAs? Do you have some results?

@breznak Thank you! I have been closely following HTM.core’s progress. Great job on refactoring and optimizing NuPIC.core!

Sure! I’ll make some comparisons to the whitepaper and post it later on.

Well… I believe API compatibility basically doesn’t exist. NuPIC uses numpy arrays, NuPIC.core uses raw pointers, HTM.core uses the SDR class and Etaler uses its own Tensors. Besides that, although Etaler’s API is based on NuPIC.core’s, I have made a lot of changes. Mainly:

  1. Encoders/decoders are now plain function calls, not objects
  2. The compute() function of SP/TM is const. Learning is done by the learn() function
  3. Non-intrusive serialization instead of intrusive.
  4. Using Tensors instead of separate SDR/vector of something
  5. etc…
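
As an illustration of point 1, an encoder written as a plain function could look like the sketch below. The name `encodeScalar`, its parameters and the use of `std::vector<bool>` in place of a Tensor are assumptions for this example and do not reproduce Etaler's encoder signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stateless scalar encoder: maps `value` in [minValue, maxValue] to a binary SDR
// with one contiguous run of `activeBits` set bits. Out-of-range values are clamped.
std::vector<bool> encodeScalar(double value, double minValue, double maxValue,
	std::size_t sdrLength, std::size_t activeBits)
{
	assert(maxValue > minValue);
	assert(activeBits > 0 && activeBits <= sdrLength);
	const double clamped = std::min(std::max(value, minValue), maxValue);
	const double fraction = (clamped - minValue) / (maxValue - minValue);
	// First active bit; the run always fits inside the SDR.
	const std::size_t start = static_cast<std::size_t>(fraction * (sdrLength - activeBits));
	std::vector<bool> sdr(sdrLength, false);
	for (std::size_t i = 0; i < activeBits; ++i)
		sdr[start + i] = true;
	return sdr;
}
```

Because the function holds no state, the same call can be used anywhere an SDR is needed without constructing or configuring an encoder object first.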

As much as I would like to say it is possible, I think the answer is mostly no. Etaler (including its GPU backend) handles data in a very different way. To be specific, Etaler represents an SDR as a binary tensor, but NuPIC represents an SDR as both a binary array and a sparse array. And Etaler represents synapses as two N-dimensional tensors (one for permanences and one for connection targets), whereas NuPIC, as far as I know, stores synapses in an array of objects.
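
The two SDR representations mentioned here carry the same information, and converting between them is straightforward. The sketch below uses `std::vector` stand-ins and hypothetical helper names; it is not code from either library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense binary SDR -> sorted list of active bit indices (sparse form).
std::vector<std::size_t> toSparse(const std::vector<bool>& dense)
{
	std::vector<std::size_t> indices;
	for (std::size_t i = 0; i < dense.size(); ++i)
		if (dense[i])
			indices.push_back(i);
	return indices;
}

// Sparse index list -> dense binary SDR of the given size.
std::vector<bool> toDense(const std::vector<std::size_t>& indices, std::size_t size)
{
	std::vector<bool> dense(size, false);
	for (std::size_t i : indices) {
		assert(i < size);
		dense[i] = true;
	}
	return dense;
}
```

A bridge between the two frameworks would need a conversion of this kind at every boundary, on top of the differing synapse layouts.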

No, I’m only doing global inhibition. Wow, your speeds are amazing! I’ll check it out over the weekend.

I haven’t gotten around to porting my OpenCL kernels to an FPGA yet. :frowning: It is going to be very time-consuming and it’s not high on my list of priorities. I’ll get to it eventually. (And I have someone working on a low-power Verilog core for embedded systems, but that’s at a very early stage.)

Hey @marty1885, I’m getting the following error while building on my Mac. I’ve just followed the instructions in the readme. I’ve already installed TBB; how does the build resolve the header files?

[ 10%] Building CXX object Etaler/CMakeFiles/Etaler.dir/Algorithms/SpatialPooler.cpp.o
[ 10%] Building CXX object Etaler/CMakeFiles/Etaler.dir/Backends/CPUBackend.cpp.o
/Users/admin/mlearn/Etaler/Etaler/Backends/CPUBackend.cpp:8:10: fatal error: 'tbb/tbb.h' file not found
#include <tbb/tbb.h>
1 error generated.
make[2]: *** [Etaler/CMakeFiles/Etaler.dir/Backends/CPUBackend.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Etaler/CMakeFiles/Etaler.dir/all] Error 2
make: *** [all] Error 2

Sorry for the delayed reply. I’m on a trip to another university to collaborate on other projects.

Hmm… how did you install TBB? I installed it from Homebrew.

I used homebrew to install it.

Hmm… I have no idea now. Could you open an issue so we can keep track of the problem? I’m on a visit so I don’t have access to a Mac right now. I’ll have to solve it next week.

No need to solve it right now, I’ll have another look this weekend and create an issue when necessary.


Hi Jose,
You can also use a Docker environment. Please see it in (it’s a fork I’m using for my own tinkering). It will be merged into Martin’s repo soon, but in the meantime you can look at the Dockerfile.


Hey @LiorA thanks a lot.

Hi @marty1885

I know you’re busy, same here, but I’ve created some issues in the Etaler repo. When you have a chance, please advise or provide a fix where necessary.

Nice to have:

  1. 1-step install script per OS
  2. Standard docker build (e.g. Dockerfile in root dir)
  3. Docker compose file for ease of integrating other apps/systems later on

I’m happy to contribute once I’m able to build this consistently. Also, I would not focus too much on the VSC setup, as it is very specific and might not work properly in other local setups. Just my 2 cents.


@Jose_Cueto Thanks!

I think it’s time for me to consider making a stable build environment and maybe have a CI to check stuff. Since I have no experience with CI, it might take a while to set that up.

I’ll try to get the OS X build working again soon.

@marty1885 No worries, I’m happy to set up a CI for this as soon as we get a stable build.