Strategy for Concurrent HTM Implementation

Paul_Lamb · February 19, 2018, 10:02pm

I had a thought recently about a possible strategy for adding concurrency to the classic HTM algorithms for use in multi-core processing. Could also be worked into a networked solution for horizontal scaling without breaking the bank (large-scale HTM on a Raspberry Pi cluster like this for example). Thought I would see if anyone has gone down this path and maybe identify some potential gotchas.

The basic idea would be to take a cluster of shards that each process a subset of cells or minicolumns in parallel, and assemble their outputs into a completed SDR. This would be done by leveraging a module which I’ll call the Sparse Representation Controller (SRC), which takes chunks of a representation and reassembles them:

An SRC would act as a message bus with receivers and transmitters. Shards that need to know complete activation SDRs would register as receivers, and related shards would register as transmitters to report their output. Once an SRC receives all outputs from transmitting shards, it would construct an activation SDR and transmit it to all registered receivers. Because only the resulting activation SDR is transmitted, the size of traffic within the cluster is small, and most of the processing within a shard can happen in parallel with a relatively small window required for synchronization.
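As a rough illustration of the SRC described above, here is a minimal single-process sketch in Python. The class name, method names, and shard ids are hypothetical, and it assumes each shard reports global indexes of its active bits; a real version would sit behind whatever transport the cluster uses.

```python
from typing import Callable, Dict, List, Set


class SparseRepresentationController:
    """Hypothetical SRC: gathers index chunks from transmitters, then broadcasts."""

    def __init__(self) -> None:
        self.transmitters: Set[str] = set()
        self.receivers: List[Callable[[List[int]], None]] = []
        self.pending: Dict[str, List[int]] = {}

    def register_transmitter(self, shard_id: str) -> None:
        self.transmitters.add(shard_id)

    def register_receiver(self, callback: Callable[[List[int]], None]) -> None:
        self.receivers.append(callback)

    def report(self, shard_id: str, active_indices: List[int]) -> None:
        """Store one shard's chunk; broadcast once every transmitter has reported."""
        if shard_id not in self.transmitters:
            raise KeyError(f"unregistered transmitter: {shard_id}")
        self.pending[shard_id] = active_indices
        if set(self.pending) == self.transmitters:
            sdr = sorted(i for chunk in self.pending.values() for i in chunk)
            self.pending = {}  # reset for the next time step
            for deliver in self.receivers:
                deliver(sdr)


src = SparseRepresentationController()
src.register_transmitter("shard-a")
src.register_transmitter("shard-b")
received: List[List[int]] = []
src.register_receiver(received.append)

src.report("shard-a", [3, 17])       # nothing is broadcast yet
src.report("shard-b", [1030, 2001])  # last chunk arrives, the SDR goes out
print(received)
```

The only synchronization point is the check inside `report`: receivers see nothing until the last transmitter has reported for that step.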

A typical setup for sequence memory would look something like this (where each box could execute in parallel, and clusters could be broken into any number of shards)

The encoder would transmit its output to an SRC which is responsible for the SDR representing active input cells. The spatial pooling process would be sharded (in this example each shard is responsible for half the minicolumns). They score just the minicolumns they are responsible for, then report their winners to a second SRC responsible for assembling the “active minicolumns” SDR.

The TM process would also be sharded (in this example, each shard is responsible for 25% of the minicolumns). The TM shards would be responsible for all their minicolumns’ cells and dendrites. They would receive the “active minicolumns” SDR from the SRC above, and perform the TM algorithm on only the minicolumns they are responsible for. They would report the activated cells to a third SRC responsible for assembling the “active layer cells” SDR. This would then be fed back to the TM shards for use in the TM algorithm (i.e. previously active cells).

abshej · February 20, 2018, 2:35am

Would certainly be implementing this in the Julia implementation at some point. Though, maybe quite differently.

blue2 · February 20, 2018, 4:41am

Is the SP in your pic learning or just inferring? If it was learning
there’d also have to be data flowing back into it (‘up’ in the picture).
In any case the SP is basically all matrix-vector multiplication.
Distributing this is common practice.

Paul_Lamb · February 20, 2018, 9:27am

Well my original idea was that learning would be localized within each shard. This does imply that the output will always be evenly distributed across the shards, though, which has implications for topology. I haven’t personally had a need for topology, but you are right that solving that problem would require feedback from the SRC in order to choose winning columns that may be weighted more heavily in one shard compared to the others.

blue2 · February 20, 2018, 12:31pm

So you want local inhibition instead of global. That’s a concern that’s
separate from “topology”. I think there is a cost to local inhibition
but it may work well enough.

Paul_Lamb · February 20, 2018, 1:25pm

Topology aside, I don’t think functionally there would be a difference. The initial potential connections established in SP are random. If you are wanting 2% sparsity, for example, does it make a difference that the minicolumns chosen are distributed more evenly across the layer? Is there a capacity concern, or am I missing something important in my understanding of the SP process?

blue2 · February 20, 2018, 1:40pm

Well with your “evenly distributed” approach you’re limiting yourself to
a subset of all possible SDRs. I think there will be a quality penalty
but that still has to be quantified. I’m not saying the impact is
significant, especially if the number of shards is low.
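One way to start quantifying the "subset of all possible SDRs" point is to count patterns. The sketch below compares the number of distinct active-minicolumn sets under global inhibition with the number available when every shard must contribute an equal share of winners. The layer size, winner count, and shard count are assumptions chosen for illustration, not values from this thread.

```python
from math import comb, log2

columns = 2048   # assumed layer size
active = 40      # assumed winners per step (about 2% sparsity)
shards = 2       # assumed shard count

# Global inhibition: any 40 of the 2048 minicolumns can win.
global_patterns = comb(columns, active)

# Local inhibition: each shard picks its own equal share of winners.
per_shard = comb(columns // shards, active // shards)
local_patterns = per_shard ** shards

print(f"global inhibition: 2^{log2(global_patterns):.1f} patterns")
print(f"local inhibition:  2^{log2(local_patterns):.1f} patterns")
print(f"bits lost:         {log2(global_patterns) - log2(local_patterns):.1f}")
```

This only counts representable patterns. It says nothing about how the restriction affects learned SP quality, which would still need to be measured.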

Paul_Lamb · February 20, 2018, 1:43pm

Ah, yes of course. I had a feeling this would impact capacity, and it looks like it would.

I’m thinking fixing this problem and topology would involve a specialized “SRC” for the SP process. The shards would do scoring and report their scores to the SRC, and it would select the winners and report them back to the shards to perform learning. I’ll need to think of a better name for this module so it isn’t confused with the other SRCs…

Paul_Lamb · February 20, 2018, 4:47pm

Expanding on this solution a bit more, the amount of traffic transmitted within a spatial pooling cluster could be reduced by having the SP shards only report their top (sparsity * shard count) scores to this Spatial Pooling Controller (SPC). The SPC would join those and clip to just the top (sparsity) minicolumns. It would then report back only those winners which are relevant to each shard for learning.
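A small sketch of that SPC merge step follows. It reads "top (sparsity * shard count)" as each shard reporting as many candidates as there are global winners `k`, which is an interpretation rather than something stated outright. The function names and the toy scores are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Candidate = Tuple[float, int]  # (overlap score, global minicolumn index)


def shard_top_k(scores: Dict[int, float], k: int) -> List[Candidate]:
    """Run inside a shard: return its k best-scoring minicolumns."""
    return heapq.nlargest(k, ((s, col) for col, s in scores.items()))


def spc_select(reports: Dict[str, List[Candidate]], k: int) -> Dict[str, List[int]]:
    """Run inside the SPC: merge reports, keep the global top k, split by shard."""
    owner = {col: shard for shard, cands in reports.items() for _, col in cands}
    merged = [c for cands in reports.values() for c in cands]
    winners = heapq.nlargest(k, merged)  # ties fall to the higher column index
    result: Dict[str, List[int]] = {shard: [] for shard in reports}
    for _, col in winners:
        result[owner[col]].append(col)
    return result


k = 3  # global number of winning minicolumns
shard_scores = {
    "shard-a": {0: 9.0, 1: 8.0, 2: 7.0, 3: 1.0},
    "shard-b": {4: 6.0, 5: 2.0, 6: 1.5, 7: 0.5},
}
reports = {name: shard_top_k(scores, k) for name, scores in shard_scores.items()}
print(spc_select(reports, k))
```

In this toy input all three winners land in one shard, which is exactly the case that purely local selection could not represent.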

blue2 · February 22, 2018, 11:36am

Btw regarding your pic. There is a convention here that the data moves
“upwards” for feed forward and “downwards” for feedback. It’s like having
north up on a map. Your pic is the wrong way round.

Paul_Lamb · February 22, 2018, 12:05pm

dotsteve · February 23, 2018, 12:43pm

Best use of emoji in a sentence. I laughed out loud. blue2 is right about the orientation, though. I don’t see a reason your strategy couldn’t be made to work, but the timing overhead might have to be quite disciplined to keep it from eating the advantages gained.

Paul_Lamb · February 23, 2018, 1:28pm

Yes, that has been my concern with adding concurrency to HTM for some time. I have other strategies, but they tend to deviate from the classic algorithms. I think what makes this one different is that only the activation SDRs are transmitted, as a compressed array of indexes to the 1 bits (granted, the SP cluster must transmit more information due to the problem @blue2 described). Limited information transfer means the window for synchronization between processes can be kept very small. Timing is essentially taken care of by the SRC, which transmits the completed output only once it has been fully assembled.
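To make the "array of indexes to the 1 bits" concrete, here is a minimal sketch of one possible wire format. It is an assumption for illustration, not the format used in the Go proof of concept: each active index is packed as an unsigned 16-bit integer, which limits the SDR to 65,536 bits.

```python
import struct
from typing import List


def encode_sdr(active: List[int]) -> bytes:
    """Pack sorted active-bit indexes as little-endian unsigned 16-bit integers."""
    return struct.pack(f"<{len(active)}H", *sorted(active))


def decode_sdr(payload: bytes) -> List[int]:
    """Inverse of encode_sdr: two bytes per index."""
    return list(struct.unpack(f"<{len(payload) // 2}H", payload))


active = [5, 130, 977, 2040]   # made-up active bits in a 2048-bit SDR
payload = encode_sdr(active)
print(len(payload), decode_sdr(payload))
```

The payload size depends only on the number of active bits, not on the width of the SDR, so message size stays small as long as the representation is sparse.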

I’ve started writing out a proof of concept in Golang to test the idea out. Will report my findings, however it turns out.

Paul_Lamb · February 23, 2018, 3:46pm

Here is an updated diagram capturing the changes discussed so far.

MaxLee · February 23, 2018, 4:26pm

I’ve been going deep into computer vision over the past couple weeks, using the openCV library. I don’t know what others here have experience with, but it strikes me that we should be able to store layers as either binary (black/white, if you will) or greyscale images that could easily be passed back and forth, with optimized bitwise operations already existing for photo manipulation. Layers can then be passed as lossless PNG or compressed JPG.

Also, I haven’t done much with it before, but I do see that tensorflow (which has good distributed graph support) does have bitwise operations as well.

https://www.tensorflow.org/api_docs/python/tf/bitwise

Has anybody already tried this?

( I first got into openCV using this tutorial series, if anyone is interested: https://www.youtube.com/watch?v=Z78zbnLlPUA&list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq )

Paul_Lamb · February 23, 2018, 4:32pm

Excellent idea. In this case, it could be a way to further compress the traffic within the spatial pooling cluster. It would also be interesting to see if a sparse (say 2%) array encoded as a PNG is more compressed than a dense array of indexes to the 1 bits. My suspicion is no, but I’ll have to check into this…
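A quick way to check that suspicion without an image library is sketched below. It uses raw DEFLATE via `zlib` as a stand-in for PNG, since PNG compresses its pixel data with DEFLATE and then adds a signature and chunk headers on top, so a real PNG would come out somewhat larger than the deflated figure here. The SDR width, sparsity, and random seed are assumptions.

```python
import random
import struct
import zlib

size = 2048              # assumed SDR width in bits
active_count = 40        # assumed ~2% sparsity
rng = random.Random(0)   # fixed seed so runs are repeatable
active = sorted(rng.sample(range(size), active_count))

# Option 1: array of 16-bit indexes to the 1 bits.
index_bytes = struct.pack(f"<{active_count}H", *active)

# Option 2: dense bitmap, one bit per position, then DEFLATE.
bitmap = bytearray(size // 8)
for i in active:
    bitmap[i // 8] |= 1 << (i % 8)
deflated = zlib.compress(bytes(bitmap), 9)

print("index array:     ", len(index_bytes), "bytes")
print("deflated bitmap: ", len(deflated), "bytes")
print("round trip ok:   ", zlib.decompress(deflated) == bytes(bitmap))
```

Running it over many random SDRs and several widths would give a fairer comparison than a single sample.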

d-led · February 25, 2018, 5:36pm

Such partitioning looks like a perfect fit for the Actor Model. Each independent activity (SDR/pooler) could be modeled as an actor, even within one machine. The communication overhead on one machine would be negligible. Scaling out to other machines would be natural as well as messaging is transparent in the actor model.

The concurrency would not need complex synchronization given a good actor model implementation. The scheduler would automatically utilize available cores. (E.g. see Pony and Erlang scheduler details).
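For readers unfamiliar with the pattern, here is a toy sketch of the actor structure in Python, with a thread and a mailbox queue standing in for each actor. The class names are hypothetical and the shard logic is a placeholder. It only illustrates the message flow: standard CPython threads do not provide the multi-core scheduling that Pony or Erlang offer.

```python
import queue
import threading
from typing import Any, List


class Actor(threading.Thread):
    """Minimal actor: private state, a mailbox, one message handled at a time."""

    def __init__(self) -> None:
        super().__init__()
        self.mailbox: "queue.Queue[Any]" = queue.Queue()

    def send(self, message: Any) -> None:
        self.mailbox.put(message)

    def run(self) -> None:
        while True:
            message = self.mailbox.get()
            if message is None:   # sentinel: stop the actor
                return
            self.receive(message)

    def receive(self, message: Any) -> None:
        raise NotImplementedError


class Collector(Actor):
    """Plays the SRC role: waits for one chunk per shard, then stores the SDR."""

    def __init__(self, expected: int) -> None:
        super().__init__()
        self.expected = expected
        self.chunks: List[List[int]] = []
        self.sdr: List[int] = []

    def receive(self, message: List[int]) -> None:
        self.chunks.append(message)
        if len(self.chunks) == self.expected:
            self.sdr = sorted(i for chunk in self.chunks for i in chunk)
            self.chunks = []


class Shard(Actor):
    """Owns a slice of minicolumns; replies with the active ones it owns."""

    def __init__(self, owned: range, collector: Collector) -> None:
        super().__init__()
        self.owned = owned
        self.collector = collector

    def receive(self, message: List[int]) -> None:
        self.collector.send([i for i in message if i in self.owned])


collector = Collector(expected=2)
shards = [Shard(range(0, 4), collector), Shard(range(4, 8), collector)]
for actor in [collector, *shards]:
    actor.start()
for shard in shards:
    shard.send([1, 2, 6])   # stand-in input; real shards would score and learn
for shard in shards:
    shard.send(None)
    shard.join()
collector.send(None)
collector.join()
print(collector.sdr)
```

No locks appear anywhere in the actors themselves; each one touches only its own state, and the mailbox is the only shared object.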

To communicate between the nodes, low latency zeromq could be used, serializing with msgpack, thus allowing for heterogeneous implementations.

Sorry for jumping in out of the blue. Will try to catch up with the discussion.

d-led · November 1, 2018, 9:26pm

I’ve set out to try this in Pony. Current strategy is to get synchronous basics off the Go implementation, and then see how actors could fit in. → https://github.com/d-led/htm.pony
If anyone is willing to learn pony via Go + HTM, contributors are welcome

jordan.kay · November 16, 2018, 11:55pm

I’ve read through some of the Pony tutorial, and it sounds like a great match for the idea. What have you completed so far on the htm.pony project and what more needs to be done? I looked for a board on GitHub but I didn’t see any features outlined or completed, so I don’t know what has been done and what still needs to get done.

MaxLee · November 17, 2018, 1:30am

Ditto that. I’d be happy to contribute if somebody writes up some outlines with some milestones.
