Releasing Etaler - A very fast and flexible HTM framework with full GPU support

Hello all.
I’m very excited to finally release a part of my graduation project - Etaler, a very flexible high-performance HTM framework.


Why build Etaler

NuPIC is research oriented, and thus it is not that fast. I think that’s one of the reasons why we are still stuck at small-scale experiments. Inference and learning on an 8192x16 TM in NuPIC can take up to 3 seconds per time step. That’s a big problem for me working on real-time SMI (sensorimotor inference)/RL (reinforcement learning). Therefore I set out to make my own framework, and to make sure it is fast.

Features

Etaler is designed around these concepts.

  1. Integrated Tensors
  2. Separate frontend/API and backend
  3. Data Oriented Design as a first-class citizen
  4. Attempts to support research

Tightly Integrated Tensors

Instead of relying on libraries like numpy or xtensor to handle multi-dimensional arrays (neither of which supports the GPU), Etaler implements its own Tensors, which are tightly integrated into the core framework and can be easily extended, allowing easy future development.

And… These Tensors support broadcasting, GPU acceleration and basic indexing. :smiley:
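For a feel of the API, here is a minimal sketch, assuming the header path and free functions are named as below (the exact names are my assumption; check the repository for the real API):

#include <Etaler/Etaler.hpp> // header path assumed
#include <iostream>

using namespace et;

int main()
{
	// Create a 4x4 tensor of ones and a 1D tensor of length 4
	Tensor a = ones({4, 4});
	Tensor b = ones({4});

	// Broadcasting: b is expanded along the first dimension of a
	Tensor c = a + b;

	// Tensors are printable like any other streamable object
	std::cout << c << std::endl;
}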

Separate frontend/API and backend

Etaler provides different backends to run on different devices, while all the backends are connected to the same frontend API. This enables simple optimization strategies and the ability to run on the GPU with one line of code changed, just like how most DL frameworks work.
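For example, moving the whole computation to the GPU could look roughly like this (the backend class and helper names below are assumptions based on the description, not a verbatim copy of the headers):

#include <Etaler/Etaler.hpp>
#include <Etaler/Backends/OpenCLBackend.hpp> // assumed header path
#include <memory>

using namespace et;

int main()
{
	// The one line that changes: make the OpenCL backend the default.
	setDefaultBackend(std::make_shared<OpenCLBackend>());

	// Everything from here on allocates and computes on the GPU.
	Tensor t = ones({8192, 16});
}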

Data Oriented Design

Like most modern Deep Learning frameworks, Etaler uses a DOD approach in its design. For example, instead of having a class called Synapse which stores a connection target and a permanence, synapses are described as two Tensors, one storing the connections to other neurons and one storing permanences. This reduces the amount of memory access and increases efficiency.

A data-oriented approach also results in a highly reusable API. Now writing new layers is like writing tensor operators in DL frameworks - just write some code that chains the operations together. No more inheritance hell when developing new layers!
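As a hedged sketch of the idea (the struct, shapes and names are illustrative, not Etaler’s actual internal layout):

#include <Etaler/Etaler.hpp>

using namespace et;

// Instead of a vector of Synapse objects, a layer owns two parallel tensors.
struct ApicalSegmentState
{
	Tensor connections;  // shape [num_cells, max_synapses]: target cell index of each synapse (-1 = empty slot)
	Tensor permanences;  // shape [num_cells, max_synapses]: permanence of the corresponding synapse
};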

Supporting research and innovation

This is a very ambitious goal. I designed Etaler to support future research and communication between researchers and developers/hobbyists. By supplying a clean, high-performance interface, researchers don’t need to invent their own hacks to make the system run at speed, while devs/hobbyists can share their ideas with clear, expressive code. Fewer loops, fewer conditions, just function calls.

To support the previous two claims, TemporalMemory with apical synapses can be implemented in just a few lines.

std::pair<Tensor, Tensor> compute(const Tensor& x, const Tensor& apical, const Tensor& last_state)
{
	et_assert(x.dimentions() == 1); //This is a 1D implementation

	//Feed forward TM predictions
	auto [pred, active] = TemporalMemory::compute(x, last_state);

	//Apical feedback
	Tensor feedbacks = cellActivity(apical, apical_connections, apical_permance, 0.21, 3);

	//When more than one cell in a column is predictive, a cell stays predictive only if its apical feedback is active
	auto s = pred.sum(1);
	pred = (pred && (s > 1) && feedbacks) || ((s == 1) && pred);
	return {pred, active};
}

void learn(const Tensor& active_cells, const Tensor& apical, const Tensor& last_active)
{
	//Let the distal synapses grow and learn
	TemporalMemory::learn(active_cells, last_active);

	//Let the apical synapses learn
	learnCorrilation(apical, last_active, apical_connections, apical_permance, 0.1, 0.1);
	growSynapses(apical, last_active, apical_connections, apical_permance, 0.21);
}

Performance

On CPU, Etaler is as fast as, if not slightly faster than, my previous framework, tiny-htm. On an adequate GPU, Etaler outperforms any other framework by a huge margin.

Performance of SpatialPooler:
(9000 input bits, 9000 output bits, 10% target density, no boosting, no topology, 75% potential pool, potential radius = 4500, learning enabled, random scalar encoded input)

            NuPIC.core          tiny-htm             Etaler
Processor   R7 1700X - 1 core   R7 1700X - 16 cores  RTX 2080Ti
Time        181ms               25ms                 6.9ms


Performance of Temporal Memory:
(8192 columns, 16 cells per column, max 1024 synapses per column; tiny-htm and Etaler use the connect-to-all method described here)

            NuPIC.core          tiny-htm             Etaler
Processor   R7 1700X - 1 core   R7 1700X - 16 cores  RTX 2080Ti
Time        3074ms              36.8ms               4.91ms


  • NuPIC.core compiled with clang 8.0.0 (running into linking issues with GCC)
  • tiny-htm compiled with GCC 8.3.0
  • Etaler compiled with GCC 8.3.0
    • RTX 2080Ti benchmarked using Nvidia’s official OpenCL SDK.

OS/Device support

Etaler has only been tested and proven to work on the following systems. Based on these results, there is no reason it should not work on other systems.

      system1      system2      system3      system4        MacBook Air
OS    Arch Linux   Arch Linux   Arch Linux   Manjaro Linux  OS X
CPU   R7 1700X     I7 8700      I5 8250U     I7 6700        I5 5250U
GPU   GTX 780Ti    RTX 2080Ti   HD 520       GTX 970        HD 6100
  • For Intel iGPUs, tested on the new Neo OpenCL SDK
  • For NVIDIA GPUs, tested on both cards with the official OpenCL SDK
  • OpenCL on CPU is not tested.
  • I want to test Etaler on an AMD card, but I don’t have the budget to get one. :frowning:
    • A Radeon VII (7nm Vega + HBM) should theoretically be faster than an RTX 2080, since the main limitation is memory bandwidth.
  • Etaler is buildable under Windows using MSYS2, but it crashes immediately due to an MSYS2-specific compiler bug (the resulting DLL is not loadable). Other build environments have not been tested.
  • I see no reason why it would not work on ARM.
  • Built with GCC and stdc++ on OS X.

Future plans

I’m planning on supporting and developing Etaler long-term after my graduation project if there is enough interest. There are some features I want to add to Etaler but haven’t had the time to do so yet.

  • Port htmresearch layers
  • More optimization (a 2x performance gain on GPU may be feasible)
  • Python wrapper
  • More numpy-style array operators
  • Graph mode/lazy evaluation
  • Better documentation
  • Support ongoing research
  • Batch execution (optimization for Thousand Brains Theory)
  • Windows support
  • etc…

I believe Etaler can be a great tool for pushing HTM theory forward, accelerating experiments, and promoting innovation. But it is in its early stages and I’m not going to make it just by myself. If you think Etaler is a project you are interested in, by all means get involved.

The easiest way to help is by using the library. If you find a bug or a missing feature, please open an issue and let us know how we can improve. If you want to develop the library itself, you are very welcome! We are excited to see new PRs pop up.

Contribution

HTM theory is still young and growing. We’d like to get contributions from you to accelerate the development of Etaler! Just fork, make changes and open a PR!

Special Thanks

A huge thanks to @LiorA for testing the framework, making a Dockerfile, and now working on layer visualizations. Thank you! Your work is amazing.

Also, thank you to all forum/Discord members. I wouldn’t have been able to come this far without you. I have learned a lot from the community and I love you awesome geeks!

Where is the source!

(I’m still working on the logo for the project. Hang on!)

And extra examples.

16 Likes

Looks promising. Does it also have an OPF with anomaly detection, or is the plan just to port nupic.core?

2 Likes

It is an entirely new framework and has nothing to do with NuPIC.core. But it is designed to work very much like an ordinary DL framework, so when we eventually have the Python wrapper working, you can slap on any data streaming framework made for DL and it should work.

4 Likes

Got it. Thanks :slight_smile:

Nice work, @marty1885 – thanks for sharing!

I thought I would mention a few observations on differences between Etaler and the official TM algorithm. I am not claiming any of this to be wrong – just some points worth noting. I am not very familiar with the specific .cl file syntax, so please correct me if I have misrepresented anything.

  1. The TM algorithm appears to only be growing one segment per cell. This is also implied by some of the variable names (such as MAX_SYNAPSE_PER_CELL). @scott commented on this optimization idea on another unrelated thread. He mentioned that this is a reasonable optimization and that the algorithm would still work. The cost would be increased likelihood of false positives once you get up to around 10-15 predicted patterns, and he mentioned that this could be mitigated with the right learning rates.

  2. The cells chosen for learning in the current timestep will grow synapses with any cells that were active in the previous timestep (see here for example). This differs slightly from the official algorithm, where they will grow synapses only with winner cells that were active in the previous timestep (see here for example).

  3. It appears that every learning cell will connect with every previous active cell, up to MAX_SYNAPSE_PER_CELL. This differs from the official algorithm, where the number of new synapses grown in a single timestep can be throttled with maxNewSynapseCount. While I haven’t tested this, I would expect that not having maxNewSynapseCount might lead to behavior similar to what is described in this thread. If so, this could probably be mitigated by keeping MAX_SYNAPSE_PER_CELL sufficiently low.
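For reference, here is a rough sketch of what a maxNewSynapseCount-style throttle does (plain standard C++ illustrating the idea; this is not Etaler or NuPIC source, and the function name is made up):

#include <algorithm>
#include <random>
#include <vector>

// Pick at most max_new_synapse_count growth targets for a learning cell,
// sampled from the previous timestep's winner cells that are not yet connected.
std::vector<int> pickGrowthCandidates(std::vector<int> prev_winner_cells,
                                      size_t max_new_synapse_count,
                                      std::mt19937& rng)
{
	std::shuffle(prev_winner_cells.begin(), prev_winner_cells.end(), rng);
	if (prev_winner_cells.size() > max_new_synapse_count)
		prev_winner_cells.resize(max_new_synapse_count);
	return prev_winner_cells;
}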

6 Likes

Sorry for the late reply. I was too busy lately.

  1. Yes, I think that is due to the fact that I follow the pseudocode from the original HTM Whitepaper.

  2. Yes, I use a modified version of the learning algorithm to avoid the repeated-input problem that is plaguing TM. Thus it is advised to decay unused synapses every so often to free up space for new ones.

  3. Hmm… Good point that I didn’t catch. I’ll think about it.

1 Like

Congrats on your graduation! I’m a bit confused though, why didn’t you use an existing tensor library?

1 Like

A few reasons. First, most of them run on CUDA with no OpenCL support, but with OpenCL I can run on any GPU and even FPGAs. Secondly, I can’t find a tensor library that both a) supports multiple GPUs and b) allows easy extension (so I can add HTM algorithms as tensor operators).


Edit: I’m still a while away from graduation. It’s a must-do project, but I’m still at least a year away from graduating.

1 Like