Multi-threading optimization in NuPIC


#1

Let’s discuss this here. I’m not sure we really want to do this. And if we do, what exactly does “Support for multi-threaded run” mean?

  • Option #1: Multi-threaded network, where all nodes in the network have their own thread.
  • Option #2: Multi-threaded within a node itself, which is probably easier.

There are probably more options, too. In The NuPIC Network API (video), Subutai mentioned that we have had parallelism in the Network API in the past, but it was very hard to keep working properly and was a cause of a lot of bugs.

Do we really want to add more complexity? I think the Network API is architected in a way that allows the complexities of parallelism to be offloaded to the host architecture. Let’s discuss here. I’ve locked the ticket above for new comments for now because I want to first resolve what it means.


#2

It means using <thread> (new in the C++11 standard; before that, e.g. OpenMP) and allowing …well, certain methods to run in parallel on multiple threads (all common processors are multi-threaded).

Option #2: Multi-threaded within a node itself, which is probably easier.

It is not easier; it is harder to implement. But it offers improvements even when a single node is used stand-alone (e.g. an SP-only setup could still run on multiple threads). It would also offer greater parallelism on truly many-core systems.
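To make "multi-threaded within a node" concrete, here is a minimal sketch using `<thread>`. It is not NuPIC code: `columnOverlap` and `computeOverlaps` are hypothetical names, and the "overlap" is just a stand-in for any per-column work that does not depend on other columns. The columns are split into contiguous chunks, one per thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-column work (not a NuPIC function):
// counts how many of a column's connected input bits are active.
int columnOverlap(const std::vector<int>& connections,
                  const std::vector<char>& input) {
    int overlap = 0;
    for (int idx : connections) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < input.size());
        overlap += input[idx] ? 1 : 0;
    }
    return overlap;
}

// Split the columns into contiguous chunks, one chunk per thread.
// Each thread writes only to its own slice of `out`, so no locking is needed.
std::vector<int> computeOverlaps(const std::vector<std::vector<int>>& columns,
                                 const std::vector<char>& input,
                                 unsigned numThreads) {
    std::vector<int> out(columns.size(), 0);
    numThreads = std::max(1u, numThreads);
    const std::size_t chunk = (columns.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(columns.size(), t * chunk);
        const std::size_t end = std::min(columns.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&, begin, end] {
            for (std::size_t c = begin; c < end; ++c)
                out[c] = columnOverlap(columns[c], input);
        });
    }
    for (auto& w : workers) w.join();  // wait before reading `out`
    return out;
}
```

The caller sees an ordinary function call, so a node could use something like this internally without the rest of the network knowing. Whether it pays off depends on how much work each column does relative to the cost of starting threads.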

I think the Network API is architected in a way that allows the complexities of parallelism to be offloaded to the host architecture.

I think the Network API currently has nothing to do with parallel execution, and the whole of NuPIC runs single-threaded, which is a shame in terms of performance.


#4

Regarding this: in my view, concurrency within a Network node containing a specific algorithm’s region is only sparsely possible. The TM itself is a sequential being, and thus has a sequentially ordered dependency. Inputs must be processed in order of occurrence, and there is no speed optimization available natively within the algorithm. There are parts that can be parallelized, such as the inhibition phase of the SP and the Encoder’s processing, but I don’t see any other opportunity.

External to the algorithms, an application could run several models in parallel on the Network, for example to aggregate inferences over inputs that are not sequentially dependent on one another.
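As a rough sketch of that "parallel models" idea: each model below consumes its own stream strictly in order, while separate models run concurrently via `std::async`. `ToyModel`, `runModel` and `runModels` are made-up names for illustration, not NuPIC classes, and the update rule is arbitrary.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical model: it keeps internal state, so its own inputs
// must be fed in order. Different models share nothing.
struct ToyModel {
    double state = 0.0;
    double step(double x) {
        state = 0.5 * state + x;  // depends on the previous step
        return state;
    }
};

// Feed one stream through one model, strictly in order.
double runModel(const std::vector<double>& stream) {
    ToyModel model;
    double last = 0.0;
    for (double x : stream) last = model.step(x);
    return last;
}

// Launch one task per stream. Each task is sequential inside,
// but the tasks themselves may run concurrently.
std::vector<double> runModels(const std::vector<std::vector<double>>& streams) {
    std::vector<std::future<double>> pending;
    for (const auto& s : streams)
        pending.push_back(std::async(std::launch::async,
                                     [&s] { return runModel(s); }));
    std::vector<double> results;
    for (auto& f : pending) results.push_back(f.get());  // collect in stream order
    assert(results.size() == streams.size());
    return results;
}
```

The sequential dependency inside each model is untouched; the only parallelism is across models that have nothing to do with each other.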

However, there is a pattern called the LMAX Disruptor, a threading model built around a lock-free ring buffer. It originated in Java, and C++ ports exist. Could that possibly speed up even sequentially dependent processes? It could be a consideration.
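For readers who have not seen it, the core of the Disruptor idea is a pre-allocated ring buffer with sequence counters instead of locks. The sketch below is a heavily simplified single-producer / single-consumer version, assuming only standard C++11 atomics. `SpscRing` and `relay` are illustrative names, not the API of any Disruptor library, and the real thing adds batching, wait strategies and multiple consumers.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Simplified single-producer / single-consumer ring buffer (illustration only).
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Call only from the producer thread. Returns false when the ring is full.
    bool tryPublish(const T& value) {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = value;
        head_.store(head + 1, std::memory_order_release);  // make the slot visible
        return true;
    }
    // Call only from the consumer thread. Returns false when the ring is empty.
    bool tryConsume(T& out) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        out = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::uint64_t> head_{0};  // next sequence to write
    std::atomic<std::uint64_t> tail_{0};  // next sequence to read
};

// Demo: a producer publishes 0..count-1 while a consumer drains them.
std::vector<int> relay(int count) {
    assert(count >= 0);
    SpscRing<int, 8> ring;
    std::vector<int> received;
    received.reserve(static_cast<std::size_t>(count));
    std::thread producer([&] {
        for (int i = 0; i < count; ++i)
            while (!ring.tryPublish(i)) std::this_thread::yield();
    });
    std::thread consumer([&] {
        int value = 0;
        while (static_cast<int>(received.size()) < count) {
            if (ring.tryConsume(value)) received.push_back(value);
            else std::this_thread::yield();
        }
    });
    producer.join();
    consumer.join();
    return received;
}
```

Note what this does and does not buy: items still come out in the order they went in, so it does not remove a sequential dependency. It only lets separate stages of a pipeline (say, encoding and the step after it) overlap in time.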


#5

I can provide some context on:

One example where this would be possible and fruitful is in the Temporal Memory. The main loop over the “excited columns” can be parallelized. The iterations of this loop are independent of one another. So you could imagine a future TM that has 1-worker-thread-per-CPU, and it walks the columns and distributes them to the worker threads. This could all happen internally – the network API would be oblivious to the fact that the TemporalMemory::compute method had used multiple threads. Of course, maybe it wouldn’t be faster at all – maybe the single-threaded approach makes better use of the CPU cache.
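A sketch of what that "walk the columns and distribute them to worker threads" loop could look like, assuming the per-column work really is independent. This is not the actual `TemporalMemory::compute`; `forEachColumnParallel` and `demoSquares` are hypothetical names, and the per-column step is a placeholder.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Apply `work` to every excited column, by default with one worker per
// hardware thread. `work` must be safe to call concurrently for different
// columns, i.e. the iterations must really be independent.
void forEachColumnParallel(const std::vector<int>& excitedColumns,
                           const std::function<void(int)>& work,
                           unsigned numWorkers = std::thread::hardware_concurrency()) {
    if (numWorkers == 0) numWorkers = 1;  // hardware_concurrency() may return 0
    std::atomic<std::size_t> next{0};     // index of the next column to hand out
    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= excitedColumns.size()) return;
            work(excitedColumns[i]);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 1; t < numWorkers; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& th : pool) th.join();
}

// Placeholder per-column step: each column writes only its own slot,
// which is race-free as long as the column indices are distinct.
std::vector<int> demoSquares(const std::vector<int>& columns,
                             std::size_t numColumns) {
    std::vector<int> result(numColumns, 0);
    forEachColumnParallel(columns, [&](int c) {
        assert(c >= 0 && static_cast<std::size_t>(c) < numColumns);
        result[static_cast<std::size_t>(c)] = c * c;
    });
    return result;
}
```

Handing out one column at a time through an atomic counter balances the load when columns differ in cost, but it also scatters neighbouring columns across cores, which is exactly the cache concern mentioned above. Measuring would be the only way to know.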

We haven’t raced to parallelize the TM, partly because it’s currently relatively fast (e.g. if hotgym takes 2 minutes, roughly 4 seconds of that time is spent in the TemporalMemory).

Anyway, I figured an example would be useful.


#6

Whoa! :open_mouth: I’ll need to read more, but it looks damn interesting! Thanks David. One thing that could be problematic is that it’s still in development (and is not part of the standard)… so compatibility, multi-platform support, etc. could be issues.


#7

Jumping in a bit late. Disclaimer: I haven’t dug into the NuPIC code yet, but there’s one thought I’d like to share. If the brain works in parallel (each neuron, I suppose), modelling it in a sequential language may lead to suboptimal use of CPU resources. Objects (columns?) in an HTM could all be seen as running in parallel, if I’m not mistaken. These do not share any data with each other directly. This maps superbly onto the Actor Model of computation:
http://worrydream.com/refs/Hewitt-ActorModel.pdf
There’s an emerging implementation of the actor model that compiles to native code:
https://www.ponylang.org/
There’s also a nice blog article about the choice of this language for data processing:
https://blog.wallaroolabs.com/2017/10/why-we-used-pony-to-write-wallaroo/

So, modelling each SDR as one actor, in my view, could be the road to optimizing the computation, at least on one physical machine. The Actor Model also helps with distributed computing, as there’s no conceptual difference between concurrency and distribution: messages are sent, received and acted upon transparently, whether they come from the same machine or not.

The Disruptor is a good step towards the actor model; however, it models only a few concurrent processes, whereas in the Actor Model each neuron/column could be modeled as an actor, scheduled independently.
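To illustrate the "no shared data, only messages" idea in the language NuPIC’s core is written in, here is a toy actor in plain C++ rather than Pony. `CounterActor` and `sumViaActor` are hypothetical names. Note the big simplification: this sketch gives the actor its own OS thread, which would not scale to one actor per column; actor runtimes typically schedule many lightweight actors over a small pool of threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy actor: private state, a mailbox, and one thread that handles
// messages one at a time. Senders never touch the state directly.
class CounterActor {
public:
    CounterActor() : worker_([this] { run(); }) {}
    ~CounterActor() { stop(); }

    // Asynchronous send: enqueue the message and return immediately.
    void send(int amount) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            mailbox_.push(amount);
        }
        ready_.notify_one();
    }

    // Let the actor drain its mailbox, stop it, and return the final state.
    int stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        if (worker_.joinable()) worker_.join();
        return total_;  // safe: the actor thread has finished
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return stopping_ || !mailbox_.empty(); });
            if (mailbox_.empty()) return;  // stopping and nothing left to handle
            const int amount = mailbox_.front();
            mailbox_.pop();
            lock.unlock();
            total_ += amount;  // only the actor thread modifies total_
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> mailbox_;
    bool stopping_ = false;
    int total_ = 0;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Demo: several sender threads message one actor concurrently.
int sumViaActor(int senders, int messagesEach) {
    assert(senders >= 0 && messagesEach >= 0);
    CounterActor actor;
    std::vector<std::thread> threads;
    for (int s = 0; s < senders; ++s)
        threads.emplace_back([&] {
            for (int i = 0; i < messagesEach; ++i) actor.send(1);
        });
    for (auto& t : threads) t.join();
    return actor.stop();
}
```

The only lock guards the mailbox; the actor’s state itself needs none because a single thread owns it. That ownership rule is what the actor model enforces by construction.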

Thoughts?


#8

@rhyolight the purpose of the Actor in that context, in my view, would be to prevent concurrency bugs while enabling parallelism.

P.S. Wallaroo binds its Pony-based kernel to Python as well. Also, Pony can call C++, which means it could be used for concurrency without rewriting everything.