A flexible framework for HTM algorithms. (And another HTM implementation no one asked for)

htm-implementations

#1

Hi all!
After making HTMHelper, I realized that I really don't like NuPIC. There are lots of quirks in the system, an inconsistent API, etc. And I didn't feel that I was deep enough into HTM. So I decided to make my own HTM and find out what I can do with it.

tiny-htm is a C++ implementation of both the Spatial Pooler and the Temporal Memory algorithm, with a twist. All data and states are stored in N-D arrays and the algorithm passes are expressed very abstractly, unlike most other HTM libraries, which store cell states and connections in their own classes. This should allow faster development of new HTM algorithms, since the old ones are easily reusable. It should even allow mixing algorithms: boosting in a TM, inhibition in a TM, changing the potential pool in the SP, etc.

One key observation I made is that in HTM, what we care about the most is the connections between the cells and the cell states; the learning algorithms used in the SP and the TM aren't that different from each other. So, literally, the TM algorithm can be expressed in just a few lines with some highly reusable functions.

// Activate cells: columns that were not predicted burst (all their cells turn on)
xt::xarray<bool> active_cells = applyBurst(predictive_cells_, x);
// Count the active connections each cell sees
xt::xarray<uint32_t> overlap = cells_.calcOverlap(active_cells);
// Cells with enough overlap become predictive for the next time step
predictive_cells_ = (overlap > active_threshold_);
if(learn == true) {
	// Pick the cells to learn on for the current and the previous time step
	xt::xarray<bool> apply_learning = selectLearningCell(active_cells);
	xt::xarray<bool> last_active = selectLearningCell(active_cells_);
	// Reinforce existing synapses and grow new ones between them
	cells_.learnCorrilation(last_active, apply_learning, 0.1, 0.1);
	cells_.growSynasp(last_active, apply_learning, 0.1, 0.1);
}
active_cells_ = active_cells;
// Collapse per-cell predictions into per-column predictions
return xt::sum(predictive_cells_, -1);

Just to prove stuff works.
image
Edit: Oops, I forgot to rename my project.

I'll be adding new features and gradually adapting code from HTMHelper in the future:

  • More encoders
  • Add boosting for SP
  • Parallelize using OpenMP or TBB
  • Saving the models
  • C++20 modules?

Anyway, thanks to everyone on the forum. I wouldn't have been able to get this far without you guys.


#2

Optimization and parallel computing

I have been optimizing and adding parallel processing to the system. It is now literally 100x faster than when I initially posted, and another few times faster again with parallel computing (although the performance does not scale ideally/linearly with the number of cores).

This is the testing environment that I’m using.

Hardware Info
Processor: AMD Ryzen 1700X (8 cores, 16 threads) @ 3.4GHz (Turbo OFF, locked at 3.4GHz for testing)
RAM: DDR4 2400MHz x2
Operating System: Arch Linux x64 (kernel 4.20)
Compiler: GCC 8.2.1
Parallel API: OpenMP

I won't bore everyone with the details of the optimization; I'll just show the results.

Spatial Pooler

To test the performance of my HTM implementation, I decided to measure how long the TM/SP needs to perform a specific task. For the SP, I measure how long it takes to generate and learn a 256-bit representation of input SDRs of different lengths, with a potential pool ratio of 0.75.
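
The timing itself is nothing fancy: average the wall-clock time of a compute-and-learn call over many iterations. Below is a minimal sketch of such a harness (illustrative only; the helper name and placeholder workload are not the actual benchmark code, and in the real test the callable wraps one SP compute-and-learn pass on an input SDR).

#include <chrono>
#include <functional>
#include <iostream>

// Hypothetical helper: average wall-clock time (in ms) of `step` over `iterations` runs.
double benchmarkMs(const std::function<void()>& step, int iterations)
{
	auto start = std::chrono::high_resolution_clock::now();
	for(int i = 0; i < iterations; i++)
		step();
	auto end = std::chrono::high_resolution_clock::now();
	return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}

int main()
{
	double ms = benchmarkMs([]{
		// placeholder workload; in the real test this is one SP compute-and-learn call
	}, 1000);
	std::cout << ms << " ms per iteration\n";
}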
image

The Spatial Pooler doesn't seem to scale well with the number of cores available to solve the problem. There is generally only a ~2.4x speedup going from 1 core to 4 cores. Interestingly, Hyper-Threading / SMT threads don't seem to help at all: using either 8 or 16 threads yields basically the same performance on my system (as I have 8 physical cores). This might indicate that some resource shared by the two logical cores is completely used up. I found this behavior is also present on Intel processors.

I have also tested the performance of tiny-htm vs NuPIC.cpp… I'm ~10x slower in single-thread inference speed, and I totally don't know why. NuPIC.cpp seems unreasonably fast; maybe I have set something wrong for NuPIC somewhere.

Temporal Memory

I did the same measurement for the Temporal Memory: I measure how long a TM needs to infer and learn random sequences of SDRs of different sizes. Sadly, I'm a few orders of magnitude slower than NuPIC.cpp in this case, and I truly don't know why. However, this time the HT/SMT threads do seem to help: running with 16 threads is noticeably (though not much) faster than with 8 threads.

image

A 4x acceleration using 8 threads is good enough for me.

tiny-htm is very much under-performing compared to NuPIC.cpp. But I hope this can serve as a reference for how HTM algorithms may behave under many threads.


#3

Cool work. Obviously there’s some aspect of your code that is not parallelizing in the way that you expect. Either it is not dividing up the problem properly or each thread performs the same function redundantly and increases the run-time. The other possibility is that the overhead to coordinate and reduce the thread results dominates the performance results.

I've never used OpenMP so I can't give an expert opinion based on looking at your code. However, we went through the same task as you but used OpenCL. We made a lot of mistakes that took a while to discover, and it also took some time to optimize properly. Still, it runs very quickly on a multi-core CPU system as well as on any GPU. I'm not sure how it compares to yours; I'd have to look more closely at your tests and see how they compare to ours.

Once we release our code, I think we could really benefit from your expertise and effort in building an API. Right now, our work is config-file based: you specify modules and the connections between them. It is primarily C++/OpenCL but doesn't have a Python wrapper just yet.

High performance HTM is a tough nut to crack! I will spend some more time looking at your code to see what I can learn.

  • Jacob

Edit: I misread your plots. For some reason I interpreted them as decreasing in performance with the number of threads. I’m glad to see I was wrong :slight_smile:


#4

Update:
It ended up being a bug inside my testing code when benchmarking against NuPIC.cpp. tiny-htm is actually ~2.43x faster than NuPIC.cpp in the Temporal Memory department (on average 164ms vs 400ms per inference + training step with a random 16384-bit SDR at 15% density, after 1000 time steps), or at best 12.85x faster with 16 threads.
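
For reference, the random input SDRs for that benchmark can be generated with something like the sketch below (illustrative only, not the actual benchmark code): pick 15% of the 16384 bit positions at random and set them active.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Illustrative helper: build a random binary SDR of `size` bits with `density` of them
// active, e.g. randomSDR(16384, 0.15f, rng) for the benchmark inputs described above.
std::vector<bool> randomSDR(std::size_t size, float density, std::mt19937& rng)
{
	// Shuffle all bit positions and turn on the first size*density of them
	std::vector<std::size_t> indices(size);
	std::iota(indices.begin(), indices.end(), 0);
	std::shuffle(indices.begin(), indices.end(), rng);

	std::vector<bool> sdr(size, false);
	std::size_t num_active = std::size_t(size * density);
	for(std::size_t i = 0; i < num_active; i++)
		sdr[indices[i]] = true;
	return sdr;
}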

image


@jacobeverist Sounds like an awesome project you are working on. Let me know anytime if I can assist with anything.


#5

Congrats! Since the TM is the bottleneck for complex tasks, this is good news.


#6

Processor: AMD Ryzen 1700X (8 cores, 16 threads)

This means you're using hyperthreads? I think your performance graphs show the limitations of hyperthreading. Your program scales perfectly with the number of CPU cores used. Hyperthreads, however, are a way to take one CPU core and run two threads on it. The two threads switch off when one of them either blocks on I/O or has a cache miss and blocks on memory. If your program is compute-heavy but not memory-heavy, hyperthreads get you very little benefit.


#7

I think the problem is that HTM is a memory-intensive algorithm. I'm curious how well HTM runs on the GPU, as @jacobeverist mentioned. GPUs have high-bandwidth, high-latency memory, yet not much cache. Maybe HTM needs another kind of processor. Who knows.


#8

HTM definitely needs another kind of processor to be really fast. There are many companies working on “neuromorphic” chips that have the plasticity to efficiently represent growing dendrites and synapses.


#9

Great work! Could it be cache thrashing that causes the perf to not scale very well with multiple logical cores?


#10

I think hyperthreading not helping is linked to all the cache misses and DRAM accesses. And I suspect the cause of the poor scaling is the overhead of parallelization and the overall limited DRAM bandwidth.


#11

Yes, cache thrashing usually causes RAM accesses. I had a similar problem before; long story short, the workaround (not a complete solution) was to arrange for the OS threads to work on closely related data.
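
With OpenMP, one way to get that effect is static scheduling plus thread pinning: schedule(static) hands each thread one contiguous chunk of the column range, and setting OMP_PROC_BIND=close (or OMP_PLACES=cores) at run time keeps each thread on the same core, so the data it touches stays in that core's cache. The sketch below is illustrative only and assumes a flat synapse layout; it is not tiny-htm's actual code.

#include <cstddef>
#include <vector>

// Hypothetical overlap pass over a flat synapse layout:
// targets[c * synapses_per_column + s] is the input bit that synapse s of column c
// watches, and permanences holds the matching permanence values.
void computeOverlaps(const std::vector<std::size_t>& targets,
                     const std::vector<float>& permanences,
                     const std::vector<bool>& input,
                     std::vector<int>& overlaps,
                     std::size_t num_columns, std::size_t synapses_per_column,
                     float connected_threshold)
{
	// schedule(static) gives each thread one contiguous block of columns, so a
	// thread keeps re-reading its own slice of the synapse arrays instead of
	// bouncing cache lines between cores.
	#pragma omp parallel for schedule(static)
	for(long long c = 0; c < (long long)num_columns; c++) {
		int overlap = 0;
		for(std::size_t s = 0; s < synapses_per_column; s++) {
			std::size_t idx = (std::size_t)c * synapses_per_column + s;
			if(permanences[idx] > connected_threshold && input[targets[idx]])
				overlap++;
		}
		overlaps[c] = overlap;
	}
}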