A flexible framework for HTM algorithms. (And another HTM implementation no one asked for)

Hi all!
After making HTMHelper, I realized that I really don't like NuPIC. There are lots of quirks in the system, an inconsistent API, etc. And I don't feel that I am deep enough into HTM. So, I decided to make my own HTM and find out what I can do with it.

tiny-htm is a C++ implementation of both the Spatial Pooler and the Temporal Memory algorithms, with a twist. All data and states are stored in N-D arrays, and algorithm passes are expressed very abstractly, unlike most other HTM libraries, which store cell state and connections in their own classes. This should allow faster development of new HTM algorithms, as old ones are easily reusable. It even allows mixing algorithms: boosting in a TM, inhibition in a TM, changing the potential pool in an SP, etc…
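To make that concrete, here is a minimal sketch (assuming xtensor, which tiny-htm builds on; the 16384x16 shape is just an example layer size): the whole layer's cell state is a single N-D boolean array indexed by [column, cell], and algorithm passes are array expressions over it.

#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

// All cell state for a layer lives in plain N-D arrays rather than in
// one object per cell/synapse; passes operate on whole arrays at once.
xt::xarray<bool> active_cells     = xt::zeros<bool>({16384, 16});
xt::xarray<bool> predictive_cells = xt::zeros<bool>({16384, 16});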

One key observation I made is that in HTM, what we care about the most is the connections between the cells and the cell states; the learning algorithms used in the SP and the TM aren't too different from each other. So, literally, the TM algorithm can be expressed in just a few lines with some highly reusable functions.

// Burst columns that have no correctly-predicted cells; otherwise
// activate the predicted cells.
xt::xarray<bool> active_cells = applyBurst(predictive_cells_, x);
// For every cell, count its connections to currently active cells.
xt::xarray<uint32_t> overlap = cells_.calcOverlap(active_cells);
// Cells with enough overlap become predictive for the next step.
predictive_cells_ = (overlap > active_threshold_);
if(learn) {
	// Select one learning cell per column for this step and the last one.
	xt::xarray<bool> apply_learning = selectLearningCell(active_cells);
	xt::xarray<bool> last_active = selectLearningCell(active_cells_);
	// Reinforce/punish existing synapses, then grow new ones.
	cells_.learnCorrilation(last_active, apply_learning, 0.1, 0.1);
	cells_.growSynasp(last_active, apply_learning, 0.1, 0.1);
}
active_cells_ = active_cells;
// Collapse per-cell predictions into per-column output.
return xt::sum(predictive_cells_, -1);

Just to prove stuff works.
image
Edit: oops, I forgot to rename my project.

I should be adding new features and adapting code from HTMHelper gradually in the future:

  • More encoders
  • Add boosting for SP
  • Parallelize using OpenMP or TBB
  • Saving the models
  • C++20 modules?

Anyway, thanks to everyone on the forum. I wouldn't have been able to get this far without you guys.

9 Likes

Optimization and parallel computing

I have been optimizing and adding parallel processing to the system. It is now literally 100x faster than when I initially posted, and another few times faster with parallel computing (although the performance does not scale ideally/linearly with the number of cores).
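As a rough illustration of where the parallelism comes from (a simplified sketch, not my exact code): within one time step each cell can be processed independently, so the hot loops take a single OpenMP pragma.

#include <cstddef>
#include <vector>

// Count, for each cell, how many of its presynaptic cells are active.
// Cells are independent within a step, so the outer loop splits cleanly
// across threads; schedule(static) keeps each thread on one contiguous
// chunk of cells.
void computeOverlaps(const std::vector<std::vector<std::size_t>>& connections,
                     const std::vector<char>& active,
                     std::vector<unsigned>& overlap)
{
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < std::ptrdiff_t(connections.size()); ++i) {
        unsigned count = 0;
        for (std::size_t src : connections[i])
            count += active[src] ? 1u : 0u;
        overlap[i] = count;
    }
}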

This is the testing environment that I’m using.

Hardware Info
  • Processor: AMD Ryzen 1700X (8 cores, 16 threads) @ 3.4GHz (Turbo OFF, locked at 3.4GHz for testing)
  • RAM: DDR4 2400MHz x2
  • Operating System: Arch Linux x64 (kernel 4.20)
  • Compiler: GCC 8.2.1
  • Parallel API: OpenMP

Rather than boring everyone with the details of the optimization, I'll just show the results.

Spatial Pooler

To test the performance of my HTM implementation, I decided to measure how long the TM/SP needs to perform a specific task. For the SP, I measure how long it takes to generate and learn a 256-bit representation of input SDRs of different lengths, with a potential pool ratio of 0.75.
image
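The measurement itself is nothing fancy; a minimal sketch of the timing loop (SpatialPooler::compute() here is a stand-in name, not necessarily the exact API):

#include <chrono>
#include <vector>

// Feed every input SDR once with learning enabled, then divide the
// elapsed wall time by the number of inputs.
template <typename SP, typename SDR>
double msPerIteration(SP& sp, const std::vector<SDR>& inputs)
{
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& in : inputs)
        sp.compute(in, /*learn=*/true);  // generate + learn the 256-bit output
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count()
           / inputs.size();
}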

The Spatial Pooler doesn't seem to scale well with the number of cores available to solve the problem. There is generally only a 2.4x speedup going from 1 core to 4 cores. Interestingly, Hyper-Threading / SMT threads don't seem to help: using either 8 or 16 threads yields basically the same performance on my system (as I have 8 physical cores). This might indicate that some resource shared by the two threads on a core is completely used up. I found this behavior is also present on Intel processors.

I have also tested the performance of tiny-htm vs NuPIC.cpp… I'm ~10x slower in single-thread inference speed, and I honestly don't know why. NuPIC.cpp seems unreasonably fast. Maybe I have set something up wrong for NuPIC somewhere.

Temporal Memory

I did the same measurement for the Temporal Memory: I measure how long a TM needs to infer and learn a random sequence of SDRs of different sizes. Sadly, I'm a few orders of magnitude slower than NuPIC.cpp in this case, and I truly don't know why. However, this time the HT/SMT threads do seem to help: running with 16 threads is noticeably (though not much) faster than with 8 threads.

image

A 4x acceleration using 8 threads is good enough for me.

tiny-htm under-performs badly compared to NuPIC.cpp, but I hope this can serve as a reference for how HTM algorithms may behave across many threads.

5 Likes

Cool work. Obviously there’s some aspect of your code that is not parallelizing in the way that you expect. Either it is not dividing up the problem properly or each thread performs the same function redundantly and increases the run-time. The other possibility is that the overhead to coordinate and reduce the thread results dominates the performance results.

I’ve never used OpenMP, so I can’t give an expert opinion based on looking at your code. However, we went through the same task as you but used OpenCL. We made a lot of mistakes that took a while to discover, and it also took some time to optimize properly. However, it still runs very quickly on a multi-core CPU system as well as on any GPU. I’m not sure how it compares to yours; I’d have to look more closely at your tests and see how they compare to ours.

Once we release our code, I think we could really benefit from your expertise and effort on building an API. Right now, our work is config-file based. You specify modules and connections between them. It is primarily C++/OpenCL but doesn’t have the Python wrapper just yet.

High performance HTM is a tough nut to crack! I will spend some more time looking at your code to see what I can learn.

  • Jacob

Edit: I misread your plots. For some reason I interpreted them as decreasing in performance with the number of threads. I’m glad to see I was wrong :slight_smile:

2 Likes

Update:
It ended up being a bug inside my testing code when benchmarking against NuPIC.cpp. tiny-htm is actually ~2.43x faster than NuPIC.cpp in the TemporalMemory department (on average 164ms vs 400ms per inference + training step with a random 16384-bit SDR at 15% density, after 1000 time steps), or at best 12.85x faster with 16 threads.

image


@jacobeverist Sounds like an awesome project you are working on. Let me know anytime if I can assist with anything.

5 Likes

Congratz! Since TM is the bottleneck for complex tasks, this is good news.

1 Like

Processor: AMD Ryzen 1700X (8 cores, 16 threads)

This means you’re using hyperthreads? I think your performance graphs show the limitations of hyperthreads. Your program scales perfectly with the number of CPU cores used. Hyperthreads, however, are a way to take one CPU core and run two threads on it. The two threads switch off when one of them either blocks on I/O or has a cache miss and blocks on memory. If your program is compute-heavy, but not memory-heavy, then hyperthreads get you very little benefit.

1 Like

I think the problem is that HTM is a memory-intensive algorithm. I’m curious how well HTM runs on the GPU, as @jacobeverist mentioned. GPUs have high-bandwidth, high-latency memory yet not much cache. Maybe HTM needs another kind of processor. Who knows.

HTM definitely needs another kind of processor to be really fast. There are many companies working on “neuromorphic” chips that have the plasticity to efficiently represent growing dendrites and synapses.

1 Like

Great work! Could it be cache thrashing that causes the perf to not scale very well with multiple logical cores?

I think the hyperthreads not helping is linked to all the cache misses and DRAM accesses. And I suspect the cause of the poor scaling is the overhead of parallelization plus the limited overall DRAM bandwidth.

Yes, cache thrashing usually causes RAM accesses. I had a similar problem before; long story short, the workaround (not a complete solution) was to make the OS threads work on closely related data.

1 Like

Can you specify how you measured your performance times so that I can replicate them on my own system? It would be useful to compare apples to apples.

The 160ms for the 1-thread case is for what, exactly? 1000 steps of all of these?

  • encoding
  • spatial pooling
  • predictive depolarization
  • column activation and winner selection
  • proximal learning
  • distal learning

How many minicolumns? How many neurons per column? Is the input changing with every step, or is it static? Does the per-step performance of your algorithm change over time? This could happen as your dendritic and synaptic connectivity changes over time.

Thanks

Actually, one of the things I am trying to get done is building and deploying our OpenCL code to an FPGA. I think it would be more efficient on this platform, since most operations are highly local and mostly additions and conditionals. GPUs are mostly designed for floating-point matrix operations, which aren’t really what HTM uses.

Of course, planning and doing it are two separate things :slight_smile:

2 Likes

I have just gotten the execution time down to 25.239ms with 16 threads and 96ms with 1 thread @ 3.8GHz per iteration (so 25.239 seconds and 96 seconds for the entire 1000-iteration run).

The benchmark pseudocode is as follows.

input_sdr_length = 16384
sequence_length = 1000
# Pre-generate a random sequence of SDRs at 15% density.
sdrs = [encodeScalar(value=random.rand(), min=0, max=1, result_sdr_length=input_sdr_length, num_on_bits=input_sdr_length*0.15) for _ in range(sequence_length)]
tm = TemporalMemory(input_shape=[input_sdr_length], cells_per_column=16, max_connections_per_cell=1024, permanence_inc=0.1, permanence_dec=0.1, initial_permanence=0.21, connected_permanence=0.15, activation_thr=2)
startTiming()
for sdr in sdrs:
    tm.burst(sdr)
    tm.calculatePredictiveCells()
    tm.selectWinningCells()
    tm.reinforceSynapses()
    tm.growNewSynapses()
    #tm.decaySynapses(some_thr)  # not enabled; growth-only in this test
    result = tm.getPredictedSDR()  # the predicted SDR, not the predictive cell state
    __compiler_do_not_optimize(result)
endTiming()
print("It takes {}ms".format(timeDelta().asMilliseconds()/sequence_length))

No SpatialPooler is involved in the test. It’s TM only.

And I run it a few dozen times to average out the noise in the time measurements.

Yes, the performance does change over time as there are more dendrites, so to be fair we would both need to train the TM for the same number of iterations. My library does allow the decay/removal of neural connections, but that is not enabled in this test; connections simply grow until their number hits the defined maximum.
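A sketch of that growth-only policy (hypothetical names, not my exact code):

#include <cstddef>
#include <vector>

struct Cell { std::vector<std::size_t> synapses; };  // presynaptic cell indices

// Grow a new synapse toward a target cell only while under the cap;
// decay/removal exists in the library but stays disabled in this test.
void maybeGrowSynapse(Cell& cell, std::size_t target, std::size_t max_connections)
{
    if (cell.synapses.size() < max_connections)
        cell.synapses.push_back(target);
}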

1 Like

Can you explain how you get this? Are you using some kind of classifier?

From your numbers, I’m inferring you have 1024 minicolumns?

No: if a column has a predictive cell, the column outputs a 1, otherwise a 0. There is an [m, n] matrix (where m is the number of columns and n is the number of cells per column) that stores which cells are predictive, and the output is, for each column, whether there is a predictive cell in that column.

There are 16384 columns with 16 cells in each column.
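In array form that reduction is a one-liner; a sketch using xtensor:

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>

// Collapse the [m, n] predictive-cell matrix into a column-level SDR:
// a column is predicted if any of its n cells is predictive.
xt::xarray<bool> predictedColumns(const xt::xarray<bool>& predictive_cells)
{
    return xt::sum(predictive_cells, -1) > 0;
}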

Edit: I have gotten it down to 71ms and 18.3ms with 1 and 16 threads.

He’s getting the raw predictive cells from the temporal memory. This is the data you would run a classifier on.

@jacobeverist I have access to OpenCL-capable FPGA hardware and software licenses from work and can use them in my spare time. Could you share your code so we can get some initial results?

3 Likes

Prompted by this pseudocode and tiny-htm’s tmbench.cpp example, I’ve made a similar HTM-scheme benchmark:
atsmbench.ss runs the learning step of HTM-scheme’s implementation of “apical tiebreak sequence memory” (without apical input, so essentially the temporal memory algorithm).

With the tiny-htm parameters (16384 minicolumns, 16 cells/column, sequence length 1000, sparsity 0.15) the iteration time is 1000ms. *

However, with “biologically plausible” (?) parameters of 164 cortical columns of 100 minicolumns each and sparsity 0.1 (10 active inputs/column), the total iteration time for the same number of cells (164 x 100 x 16) is 78ms.

Column computations can be multithreaded, giving an elapsed real time of 21 seconds for the 1000 iterations.

Times are for my 2013 MacBook Pro (4 cores, 8 hyperthreads, 2GHz -> 3GHz turbo, 1600MHz DDR3)

  • (Note added later): sparsity is a significant parameter for HTM-scheme. With the commonly suggested sparsity of 0.02, the iteration time for 16384 minicolumns is 17.3ms… I would be interested in tiny-htm’s timing for this.