A flexible framework for HTM algorithms. (And another HTM implementation no one asked for)

Hi all!
After making HTMHelper, I realized that I really don't like NuPIC. There are lots of quirks in the system, an inconsistent API, etc. And I don't feel that I am deep enough into HTM. So, I decided to make my own HTM and find out what I can do with it.

tiny-htm is a C++ implementation of both the Spatial Pooler and the Temporal Memory algorithms, with a twist. All data and states are stored in N-D arrays, and algorithm passes are expressed very abstractly, unlike most other HTM libraries, which store cell state and connections in their own classes. This should allow faster development of new HTM algorithms, as old ones are easily reusable. It even allows mixing algorithms: boosting in a TM, inhibition in a TM, changing the potential pool in an SP, etc…
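To make that concrete, here is a minimal sketch (assuming xtensor, which tiny-htm builds on; the 16384x16 shape is just an example layer size): the whole layer's cell state is a single N-D boolean array indexed by [column, cell], and algorithm passes are array expressions over it.

#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

// All cell state for a layer lives in plain N-D arrays rather than in
// one object per cell/synapse; passes operate on whole arrays at once.
xt::xarray<bool> active_cells     = xt::zeros<bool>({16384, 16});
xt::xarray<bool> predictive_cells = xt::zeros<bool>({16384, 16});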

One key observation I made is that in HTM, what we care about the most is the connections between the cells and the cell states; the learning algorithms used in the SP and the TM aren't too different from each other. So, literally, the TM algorithm can be expressed in just a few lines with some highly reusable functions.

// Burst columns that have no correctly-predicted cells; otherwise
// activate the predicted cells.
xt::xarray<bool> active_cells = applyBurst(predictive_cells_, x);
// For every cell, count its connections to currently active cells.
xt::xarray<uint32_t> overlap = cells_.calcOverlap(active_cells);
// Cells with enough overlap become predictive for the next step.
predictive_cells_ = (overlap > active_threshold_);
if(learn) {
	// Select one learning cell per column for this step and the last one.
	xt::xarray<bool> apply_learning = selectLearningCell(active_cells);
	xt::xarray<bool> last_active = selectLearningCell(active_cells_);
	// Reinforce/punish existing synapses, then grow new ones.
	cells_.learnCorrilation(last_active, apply_learning, 0.1, 0.1);
	cells_.growSynasp(last_active, apply_learning, 0.1, 0.1);
}
active_cells_ = active_cells;
// Collapse per-cell predictions into per-column output.
return xt::sum(predictive_cells_, -1);

Just to prove stuff works.
image
Edit: oops, I forgot to rename my project.

I should be adding new features and adapting code from HTMHelper gradually in the future:

  • More encoders
  • Add boosting for SP
  • Parallelize using OpenMP or TBB
  • Saving the models
  • C++20 modules?

Anyway, thanks to everyone on the forum. I wouldn't have been able to get this far without you guys.

9 Likes

Optimization and parallel computing

I have been optimizing and adding parallel processing to the system. It is now literally 100x faster than when I initially posted, and another few times faster with parallel computing (although the performance does not scale ideally/linearly with the number of cores).
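As a rough illustration of where the parallelism comes from (a simplified sketch, not my exact code): within one time step each cell can be processed independently, so the hot loops take a single OpenMP pragma.

#include <cstddef>
#include <vector>

// Count, for each cell, how many of its presynaptic cells are active.
// Cells are independent within a step, so the outer loop splits cleanly
// across threads; schedule(static) keeps each thread on one contiguous
// chunk of cells.
void computeOverlaps(const std::vector<std::vector<std::size_t>>& connections,
                     const std::vector<char>& active,
                     std::vector<unsigned>& overlap)
{
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < std::ptrdiff_t(connections.size()); ++i) {
        unsigned count = 0;
        for (std::size_t src : connections[i])
            count += active[src] ? 1u : 0u;
        overlap[i] = count;
    }
}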

This is the testing environment that I’m using.

Hardware Info
  • Processor: AMD Ryzen 1700X (8 cores, 16 threads) @ 3.4GHz (Turbo OFF, locked at 3.4GHz for testing)
  • RAM: DDR4 2400MHz x2
  • Operating System: Arch Linux x64 (kernel 4.20)
  • Compiler: GCC 8.2.1
  • Parallel API: OpenMP

Rather than boring everyone with the details of the optimization, I'll just show the results.

Spatial Pooler

To test the performance of my HTM implementation, I decided to measure how long the TM/SP needs to perform a specific task. For the SP, I measure how long it takes to generate and learn a 256-bit representation of input SDRs of different lengths, with a potential pool ratio of 0.75.
image
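The measurement itself is nothing fancy; a minimal sketch of the timing loop (SpatialPooler::compute() here is a stand-in name, not necessarily the exact API):

#include <chrono>
#include <vector>

// Feed every input SDR once with learning enabled, then divide the
// elapsed wall time by the number of inputs.
template <typename SP, typename SDR>
double msPerIteration(SP& sp, const std::vector<SDR>& inputs)
{
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& in : inputs)
        sp.compute(in, /*learn=*/true);  // generate + learn the 256-bit output
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count()
           / inputs.size();
}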

The Spatial Pooler doesn't seem to scale well with the number of cores available to solve the problem. There is generally only a 2.4x speedup going from 1 core to 4 cores. Interestingly, Hyper-Threading / SMT threads don't seem to help: using either 8 or 16 threads yields basically the same performance on my system (as I have 8 physical cores). This might indicate that some resource shared by the two threads on a core is completely used up. I found this behavior is also present on Intel processors.

I have also tested the performance of tiny-htm vs NuPIC.cpp… I'm ~10x slower in single-thread inference speed, and I honestly don't know why. NuPIC.cpp seems unreasonably fast. Maybe I have set something up wrong for NuPIC somewhere.

Temporal Memory

I did the same measurement for the Temporal Memory: I measure how long a TM needs to infer and learn a random sequence of SDRs of different sizes. Sadly, I'm a few orders of magnitude slower than NuPIC.cpp in this case, and I truly don't know why. However, this time the HT/SMT threads do seem to help: running with 16 threads is noticeably (though not much) faster than with 8 threads.

image

A 4x acceleration using 8 threads is good enough for me.

tiny-htm under-performs badly compared to NuPIC.cpp, but I hope this can serve as a reference for how HTM algorithms may behave across many threads.

5 Likes

Cool work. Obviously there’s some aspect of your code that is not parallelizing in the way that you expect. Either it is not dividing up the problem properly or each thread performs the same function redundantly and increases the run-time. The other possibility is that the overhead to coordinate and reduce the thread results dominates the performance results.

I’ve never used OpenMP, so I can’t give an expert opinion based on looking at your code. However, we went through the same task as you but used OpenCL. We made a lot of mistakes that took a while to discover, and it also took some time to optimize properly. However, it still runs very quickly on a multi-core CPU system as well as on any GPU. I’m not sure how it compares to yours; I’d have to look more closely at your tests and see how they compare to ours.

Once we release our code, I think we could really benefit from your expertise and effort on building an API. Right now, our work is config-file based. You specify modules and connections between them. It is primarily C++/OpenCL but doesn’t have the Python wrapper just yet.

High performance HTM is a tough nut to crack! I will spend some more time looking at your code to see what I can learn.

  • Jacob

Edit: I misread your plots. For some reason I interpreted them as decreasing in performance with the number of threads. I’m glad to see I was wrong :slight_smile:

2 Likes

Update:
It ended up being a bug inside my testing code when benchmarking against NuPIC.cpp. tiny-htm is actually ~2.43x faster than NuPIC.cpp in the TemporalMemory department (on average 164ms vs 400ms per inference + training step with a random 16384-bit SDR at 15% density, after 1000 time steps), or at best 12.85x faster with 16 threads.

image


@jacobeverist Sounds like an awesome project you are working on. Let me know anytime if I can assist with anything.

5 Likes

Congratz! Since TM is the bottleneck for complex tasks, this is good news.

1 Like

Processor: AMD Ryzen 1700X (8 cores, 16 threads)

This means you’re using hyperthreads? I think your performance graphs show the limitations of hyperthreads. Your program scales perfectly with the number of CPU cores used. Hyperthreads, however, are a way to take one CPU core and run two threads on it. The two threads switch off when one of them either blocks on I/O or has a cache miss and blocks on memory. If your program is compute-heavy, but not memory-heavy, then hyperthreads get you very little benefit.

1 Like

I think the problem is that HTM is a memory-intensive algorithm. I’m curious how well HTM runs on the GPU, as @jacobeverist mentioned. GPUs have high-bandwidth, high-latency memory yet not much cache. Maybe HTM needs another kind of processor. Who knows.

HTM definitely needs another kind of processor to be really fast. There are many companies working on “neuromorphic” chips that have the plasticity to efficiently represent growing dendrites and synapses.

1 Like

Great work! Could it be cache thrashing that causes the perf to not scale very well with multiple logical cores?

I think the hyperthreads not helping is linked to all the cache misses and DRAM accesses. And I suspect the cause of the poor scaling is the overhead of parallelization plus the limited overall DRAM bandwidth.

Yes, cache thrashing usually causes RAM accesses. I had a similar problem before; long story short, the workaround (not a complete solution) was to make the OS threads work on closely related data.

1 Like

Can you specify how you measured your performance times so that I can replicate them on my own system? It would be useful to compare apples to apples.

The 160ms for the 1-thread case is for what, exactly? 1000 steps of all of these?

  • encoding
  • spatial pooling
  • predictive depolarization
  • column activation and winner selection
  • proximal learning
  • distal learning

How many minicolumns? How many neurons per column? Is the input changing with every step, or is it static? Does the per-step performance of your algorithm change over time? This could happen as your dendritic and synaptic connectivity changes over time.

Thanks

Actually, one of the things I am trying to get done is building and deploying our OpenCL code to an FPGA. I think it would be more efficient on this platform, since most operations are highly local and mostly additions and conditionals. GPUs are mostly designed for floating-point matrix operations, which aren’t really what HTM uses.

Of course, planning and doing it are two separate things :slight_smile:

2 Likes

I have just gotten the execution time down to 25.239ms with 16 threads and 96ms with 1 thread @ 3.8GHz per iteration (so 25.239 seconds and 96 seconds for the entire 1000-iteration run).

The benchmark pseudocode is as follows.

input_sdr_length = 16384
sequence_length = 1000
# Pre-generate a random sequence of SDRs at 15% density.
sdrs = [encodeScalar(value=random.rand(), min=0, max=1, result_sdr_length=input_sdr_length, num_on_bits=input_sdr_length*0.15) for _ in range(sequence_length)]
tm = TemporalMemory(input_shape=[input_sdr_length], cells_per_column=16, max_connections_per_cell=1024, permanence_inc=0.1, permanence_dec=0.1, initial_permanence=0.21, connected_permanence=0.15, activation_thr=2)
startTiming()
for sdr in sdrs:
    tm.burst(sdr)
    tm.calculatePredictiveCells()
    tm.selectWinningCells()
    tm.reinforceSynapses()
    tm.growNewSynapses()
    #tm.decaySynapses(some_thr)  # not enabled; growth-only in this test
    result = tm.getPredictedSDR()  # the predicted SDR, not the predictive cell state
    __compiler_do_not_optimize(result)
endTiming()
print("It takes {}ms".format(timeDelta().asMilliseconds()/sequence_length))

No SpatialPooler is involved in the test. It’s TM only.

And I run it a few dozen times to average out the noise in the time measurements.

Yes, the performance does change over time as there are more dendrites, so to be fair we would both need to train the TM for the same number of iterations. My library does allow the decay/removal of neural connections, but that is not enabled in this test; connections simply grow until their number hits the defined maximum.
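A sketch of that growth-only policy (hypothetical names, not my exact code):

#include <cstddef>
#include <vector>

struct Cell { std::vector<std::size_t> synapses; };  // presynaptic cell indices

// Grow a new synapse toward a target cell only while under the cap;
// decay/removal exists in the library but stays disabled in this test.
void maybeGrowSynapse(Cell& cell, std::size_t target, std::size_t max_connections)
{
    if (cell.synapses.size() < max_connections)
        cell.synapses.push_back(target);
}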

1 Like

Can you explain how you get this? Are you using some kind of classifier?

From your numbers, I’m inferring you have 1024 minicolumns?

No: if a column has a predictive cell, the column outputs a 1, otherwise a 0. There is an [m, n] matrix (where m is the number of columns and n is the number of cells per column) that stores which cells are predictive, and the output is, for each column, whether there is a predictive cell in that column.

There are 16384 columns with 16 cells in each column.
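In array form that reduction is a one-liner; a sketch using xtensor:

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>

// Collapse the [m, n] predictive-cell matrix into a column-level SDR:
// a column is predicted if any of its n cells is predictive.
xt::xarray<bool> predictedColumns(const xt::xarray<bool>& predictive_cells)
{
    return xt::sum(predictive_cells, -1) > 0;
}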

Edit: I have gotten it down to 71ms and 18.3ms with 1 and 16 threads.

He’s getting the raw predictive cells from the temporal memory. This is the data you would run a classifier on.

@jacobeverist I have access to OpenCL-capable FPGA hardware and software licenses from work and can use them in my spare time. Could you share your code so we can get some initial results?

3 Likes

Prompted by this pseudocode and tiny-htm’s tmbench.cpp example, I’ve made a similar HTM-scheme benchmark:
atsmbench.ss runs the learning step of HTM-scheme’s implementation of “apical tiebreak sequence memory” (without apical input, so essentially the temporal memory algorithm).

With the tiny-htm parameters (16384 minicolumns, 16 cells/column, sequence length 1000, sparsity 0.15) the iteration time is 1000ms. *

However, with “biologically plausible” (?) parameters of 164 cortical columns of 100 minicolumns each and sparsity 0.1 (10 active inputs/column), the total iteration time for the same number of cells (164 x 100 x 16) is 78ms.

Column computations can be multithreaded, giving an elapsed real time of 21 seconds for the 1000 iterations.

Times are for my 2013 MacBook Pro (4 cores, 8 hyperthreads, 2GHz -> 3GHz turbo, 1600MHz DDR3)

  • (Note added later): sparsity is a significant parameter for HTM-scheme. With the commonly suggested sparsity of 0.02, the iteration time for 16384 minicolumns is 17.3ms… I would be interested in tiny-htm’s timing for this.