Can you briefly outline or list what methods / tasks you are applying concurrency to (for the TemporalMemory, since you haven’t applied it yet to the SpatialPooler)? I did some thinking about this a long time ago, and I would be interested to see how you organized the parallelism.
The SP is in fact parallelized too, as a side effect of parallelizing the TM. But the SP is so fast by itself that I didn’t put it on the forum.
The design of tiny-htm is that there is a central class, Cells, which stores all the critical values and connections and handles most of the learning logic, while layer state (input, output, predictive/active cells) is passed in as parameters. Layers like TM and SP simply wrap around this class; calling the methods of Cells in a different order leads to different layer behaviour.
As I described in the first post of this thread, the TM is expressed in a few reusable functions, and so is the SP. So by parallelizing and optimizing these shared functions, both algorithms are accelerated.
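To make the structure concrete, here is a minimal sketch of that design, assuming the method names listed below and guessing at the signatures; it is not the actual tiny-htm source.

```cpp
#include <cstdint>
#include <vector>

// Central class: owns the connections/permanences and the learning logic.
// The member layout and signatures here are assumptions for illustration.
struct Cells {
    std::vector<std::vector<uint32_t>> connections;  // per-cell presynaptic cell indices
    std::vector<std::vector<float>>    permanences;  // per-cell synapse strengths

    void growSynapse(const std::vector<uint32_t>& from, const std::vector<uint32_t>& to);
    void sortSynapse();                  // sort each cell's synapses for cache locality
    void decaySynapse(float threshold);  // drop synapses that are too weak
};

// Layer state (active/predictive cells) is passed in and returned, not stored in Cells.
struct TemporalMemory {
    Cells cells;
    std::vector<uint32_t> compute(const std::vector<uint32_t>& activeColumns, bool learn);
};

struct SpatialPooler {
    Cells cells;  // same class; calling its methods in a different order gives SP behaviour
    std::vector<uint32_t> compute(const std::vector<uint32_t>& input, bool learn);
};
```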
The parallelized functions are (basically every computation-heavy method):
Cells::growSynapse // create new connections from the specified cells to cells
Cells::sortSynapse // sort the connections in each cell in access order to increase the cache hit rate
Cells::decaySynapse // remove synapses that are too weak
globalInhibition // select the top N cells
applyBurst // burst columns in which no cell is on
selectLearningCell // the reverse of applyBurst
The current parallelizing strategy is simple (since HTM requires a sequence of steps that depend on each other, there isn’t much I can do here): I just parallelize the large loop inside those functions. OpenMP itself maintains a thread pool, so the overhead is minimal (but it still causes slowdowns at small work sizes).
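As a hedged sketch of what that pattern could look like for decaySynapse: the outer per-cell loop gets a plain `#pragma omp parallel for`, and since each iteration only touches its own cell’s synapses, no locking is needed. The data layout and signature are assumptions, not the actual tiny-htm code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-cell synapse storage (layout assumed for this sketch).
struct CellSynapses {
    std::vector<uint32_t> targets;      // postsynaptic cell indices
    std::vector<float>    permanences;  // one permanence per synapse
};

// Remove synapses whose permanence has fallen below the threshold.
void decaySynapse(std::vector<CellSynapses>& cells, float threshold)
{
    // Each iteration works on one cell's private data, so the loop is
    // embarrassingly parallel and OpenMP can split it across its thread pool.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)cells.size(); ++i) {
        auto& c = cells[i];
        std::size_t keep = 0;
        for (std::size_t j = 0; j < c.permanences.size(); ++j) {
            if (c.permanences[j] >= threshold) {  // keep only strong synapses
                c.permanences[keep] = c.permanences[j];
                c.targets[keep]     = c.targets[j];
                ++keep;
            }
        }
        c.permanences.resize(keep);
        c.targets.resize(keep);
    }
}
```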
Edit: loop scheduling turned out to be an important aspect. Simply splitting the loop into N parts and running one part on each thread causes some threads to wait for the others, yet letting each thread pick up one iteration at a time introduces too much overhead.
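As a generic illustration of that trade-off (not tiny-htm code; the work function is hypothetical), OpenMP’s schedule clause covers the spectrum:

```cpp
#include <cstddef>

void processCell(std::size_t i);  // hypothetical per-cell work; cost varies per cell

void runAll(std::size_t n)
{
    // schedule(static): one contiguous chunk per thread. Minimal overhead,
    // but threads with light iterations end up waiting for the heaviest one.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)n; ++i) processCell(i);

    // schedule(dynamic, 1): each thread grabs one iteration at a time.
    // Balances perfectly, but the per-iteration scheduling overhead adds up.
    #pragma omp parallel for schedule(dynamic, 1)
    for (long long i = 0; i < (long long)n; ++i) processCell(i);

    // schedule(dynamic, 64): medium-sized chunks handed out on demand,
    // a common middle ground between the two extremes.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long long i = 0; i < (long long)n; ++i) processCell(i);
}
```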
@marty1885 can you please link your post about parallelisation results? I remember seeing the graphs, but I cannot find it now.
Since this is a hackers’ subforum, I’d like to discuss some implementation concerns we’ve come up with.
We separate parallelization options into 3 levels:
low level (implicit): the C++17 parallel algorithms (Parallelism TS) can run some select routines in parallel (see the sketch after this list of options)
manual: what @marty1885 did here; + most benefit for a single-core task, − complicates the code, …
high level (NetworkAPI): it will be relatively trivial to run a whole region as a separate thread; given a network, this also achieves the best utilization (no independent tasks)
A fourth option is a fully asynchronous HTM, where each cell computes autonomously. That would be the closest biological implementation, and it is actually quite easy to implement programmatically. Unfortunately, current PC architectures would not perform well under such heavy thread switching. But new hardware computation concepts are coming, so this implementation might prove feasible in the future.
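For reference, here is a minimal sketch of what the "low level (implicit)" option could look like with the C++17 parallel algorithms (the merged Parallelism TS): the routine passes an execution policy and lets the standard library parallelize. The permanence-decay loop is only an illustration, not code from either project.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

void decayPermanences(std::vector<std::vector<float>>& perms, float amount)
{
    // Each inner vector belongs to one cell, so the per-cell work is independent
    // and the library is free to spread it across multiple threads.
    std::for_each(std::execution::par, perms.begin(), perms.end(),
                  [amount](std::vector<float>& p) {
                      for (float& x : p) x -= amount;
                  });
}
```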
I’ve optimized my code further, so the actual numbers should be a lot lower. But I’m on vacation now and I’ve shut down my workstation; it will be a few days until I can show the latest numbers.
Regarding the levels of parallelism: the ideas are great! Maybe high-level parallelism and async HTM would end up with the same issue? In both cases the CPU is trying to access many different locations in DRAM and flushing the cache all the time. It might be a good idea once we finally get GPU support or hardware accelerators for HTM (anyone interested?) with their dedicated RAM.