Ideas about HTM concurrency

Hi @marty1885,

Can you briefly outline or list which methods or tasks you are applying concurrency to (for the TemporalMemory, since you haven’t applied it to the SpatialPooler yet)? I did some thinking about this a long time ago, and I would be interested to see how you organized the parallelism.

The SP is in fact parallelized too, as a side effect of parallelizing the TM. But the SP is already so fast by itself that I didn’t post about it on the forum.

The design of tiny-htm is that there is a central class, Cells, that stores all the critical values and connections and handles most of the learning logic, while layer states (input, output, predictive/active cells) are passed in as parameters. Layers like TM and SP simply wrap around this class; calling the methods of Cells in a different order leads to different layer behaviour.
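A minimal sketch of that design, to make the split between stored connections and passed-in layer state concrete. This is not tiny-htm's actual API: the member names, `adjustPermanence`, `PoolerSketch`, the threshold values and the simple activation rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the central Cells class (not tiny-htm's real API).
// It owns the synapses; layer state (input, active cells) is passed in.
struct Cells {
    std::vector<std::vector<std::size_t>> connections; // per cell: input indices
    std::vector<std::vector<float>> permanences;       // per cell: synapse strengths
    float connected_threshold = 0.5f;                  // assumed value

    // For each cell, count the active inputs reached through connected synapses.
    std::vector<int> calcOverlap(const std::vector<bool>& input) const {
        std::vector<int> overlap(connections.size(), 0);
        for (std::size_t i = 0; i < connections.size(); ++i) {
            for (std::size_t j = 0; j < connections[i].size(); ++j) {
                const std::size_t src = connections[i][j];
                assert(src < input.size());
                if (input[src] && permanences[i][j] >= connected_threshold)
                    ++overlap[i];
            }
        }
        return overlap;
    }

    // For each learning cell, strengthen synapses to active inputs and weaken
    // the rest (no clamping of the permanence range, for brevity).
    void adjustPermanence(const std::vector<bool>& input,
                          const std::vector<bool>& learning,
                          float inc, float dec) {
        for (std::size_t i = 0; i < connections.size(); ++i) {
            if (!learning[i]) continue;
            for (std::size_t j = 0; j < connections[i].size(); ++j)
                permanences[i][j] += input[connections[i][j]] ? inc : -dec;
        }
    }
};

// Hypothetical layer: a thin wrapper that only decides the order of the calls.
struct PoolerSketch {
    Cells cells;
    int min_overlap = 1; // assumed activation rule, not the real inhibition step

    std::vector<bool> compute(const std::vector<bool>& input, bool learn) {
        const std::vector<int> overlap = cells.calcOverlap(input);
        std::vector<bool> active(overlap.size());
        for (std::size_t i = 0; i < overlap.size(); ++i)
            active[i] = overlap[i] >= min_overlap;
        if (learn) cells.adjustPermanence(input, active, 0.1f, 0.1f);
        return active;
    }
};
```

A different layer would reuse the same `Cells` methods and differ only in which ones it calls and in what order, which is why speeding up the shared methods benefits every layer.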

As I described in the first post of this thread, TM is expressed in a few reusable functions, and so is SP. So by parallelizing and optimizing these shared functions, both algorithms are accelerated.

The parallelized functions are (basically every computation-heavy method):

  • Cells::calcOverlap // Calculate
  • Cells::learnCorrilation // increment/decrement permanence
  • Cells::growSynapse // create new connections from specified cells to cells
  • Cells::sortSynapse // sort the connections in each cell into access order to improve the cache hit rate
  • Cells::decaySynapse // remove synapses that are too weak
  • globalInhibition // select the top N cells
  • applyBurst // burst columns if no cell in columns is on
  • selectLearningCell // the reverse of applyBurst

The current parallelizing strategy is simple, since HTM requires a sequence of steps that depend on each other, so there is not much I can do here. I just parallelize the large loop inside each of those functions. OpenMP itself works as a thread pool, so there’s minimal overhead (though it still causes slowdowns at small work sizes).
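As a rough illustration of "parallelize the large loop", here is a sketch of an overlap calculation with its outer per-cell loop handed to OpenMP. The function name, the data layout and the `kMinParallel` cutoff are assumptions, not taken from tiny-htm. The `if` clause is one possible way to skip threading for small inputs, where the overhead outweighs the gain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free function; connections[i] lists the input indices of cell i,
// and input holds 0/1 activity flags.
std::vector<int> calcOverlapParallel(
        const std::vector<std::vector<std::size_t>>& connections,
        const std::vector<char>& input) {
    std::vector<int> overlap(connections.size(), 0);
    const long n = static_cast<long>(connections.size());
    const long kMinParallel = 1024; // arbitrary cutoff, an assumption

    // Each iteration writes only overlap[i], so iterations are independent.
    // Built without OpenMP support, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for if(n > kMinParallel)
    for (long i = 0; i < n; ++i) {
        int sum = 0;
        for (std::size_t src : connections[i]) {
            assert(src < input.size());
            sum += input[src];
        }
        overlap[i] = sum;
    }
    return overlap;
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel. The steps of the algorithm still execute one after another; only the work inside each step is spread across threads.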

Edit: Loop scheduling turned out to be an important aspect. Simply splitting the loop into N parts and running one part on each thread causes some threads to wait for others. Yet letting each thread pick one iteration at a time introduces too much overhead.
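In OpenMP terms, those two extremes correspond roughly to `schedule(static)` and `schedule(dynamic, 1)`. The post does not say which schedule was finally chosen, so the sketch below only shows one possible middle ground: dynamic scheduling with a larger chunk. The function name, the data layout and the chunk size of 64 are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical decay step: drop synapses whose permanence is below threshold.
// Cells hold different numbers of synapses, so per-iteration cost is uneven.
void decayWeakSynapses(std::vector<std::vector<float>>& permanences,
                       float threshold) {
    assert(threshold >= 0.0f);
    const long n = static_cast<long>(permanences.size());

    // schedule(static):      one contiguous block per thread; cheapest, but
    //                        threads with light blocks wait for heavy ones.
    // schedule(dynamic, 1):  best balance, highest scheduling overhead.
    // schedule(dynamic, 64): threads grab 64 cells at a time; 64 is an
    //                        arbitrary value that would need tuning.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < n; ++i) {
        std::vector<float>& p = permanences[i]; // only this thread touches cell i
        p.erase(std::remove_if(p.begin(), p.end(),
                               [threshold](float v) { return v < threshold; }),
                p.end());
    }
}
```

`schedule(guided)` is another standard option that starts with large chunks and shrinks them over time; which one wins depends on how uneven the cells are, so it is worth measuring.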


Hi @marty1885,

Thank you so much for your VERY thorough response - I’m at work now so I’ll have to digest this later, but I wanted to send my appreciation promptly! :wink:


@marty1885 can you please link your post about parallelisation results? I remember seeing the graphs, but I cannot find it now.

Since this is a hackers’ subforum, I’d like to discuss some implementation concerns we’ve come up with.

We separate parallelization options into 3 levels:

  • low level (implicit): C++17 parallel algorithms (from the Parallelism TS) can run some select routines in parallel
  • manual: what @marty1885 did here; + most benefit for a single-core task, - complicates the code,…
  • high level: NetworkAPI level: it’ll be relatively trivial to run a whole region as a separate thread; with a network, this also achieves the best utilization (no independent tasks)

A fourth option is a fully asynchronous HTM, where each cell computes autonomously. That would be the closest to the biological implementation, and actually quite easy to implement programmatically. Unfortunately, current PC architecture would not perform well under such heavy thread switching. But new hardware computation concepts are coming, so this implementation might prove feasible in the future.


I’ve optimized my code further, so the actual numbers should be a lot lower. But I’m on vacation now and I’ve shut down my workstation. It will be a few days until I can show the latest numbers.

Regarding the levels of parallelism: the ideas are great! But maybe high-level parallelism and async HTM will end up with the same issue? In both cases the CPU is trying to access many different locations in DRAM and flushing the cache all the time. It might be a good idea once we finally get GPU support or hardware accelerators for HTM (anyone interested?) with their dedicated RAM.