Strategy for Concurrent HTM Implementation

concurrency

#1

I had a thought recently about a possible strategy for adding concurrency to the classic HTM algorithms for use in multi-core processing. Could also be worked into a networked solution for horizontal scaling without breaking the bank (large-scale HTM on a Raspberry Pi cluster like this for example). Thought I would see if anyone has gone down this path and maybe identify some potential gotchas.

The basic idea would be to take a cluster of shards that each process a subset of cells or minicolumns in parallel, and assemble their outputs into a completed SDR. This would be done by leveraging a module which I’ll call the Sparse Representation Controller (SRC), which takes chunks of a representation and reassembles them:

image

An SRC would act as a message bus with receivers and transmitters. Shards that need to know complete activation SDRs would register as receivers, and related shards would register as transmitters to report their output. Once an SRC receives all outputs from transmitting shards, it would construct an activation SDR and transmit it to all registered receivers. Because only the resulting activation SDR is transmitted, the size of traffic within the cluster is small, and most of the processing within a shard can happen in parallel with a relatively small window required for synchronization.
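
To make the register/collect/broadcast cycle concrete, here is a minimal single-process sketch in Python. The class and method names (`SparseRepresentationController`, `register_transmitter`, `report`) are made up for illustration, and it assumes each shard reports global indexes of its active bits. A real version would need locking or a message queue around `report`.

```python
from typing import Callable, Dict, List, Set


class SparseRepresentationController:
    """Hypothetical SRC: gathers index chunks, then broadcasts the full SDR."""

    def __init__(self) -> None:
        self.transmitters: Set[str] = set()
        self.receivers: List[Callable[[List[int]], None]] = []
        self.pending: Dict[str, List[int]] = {}

    def register_transmitter(self, shard_id: str) -> None:
        self.transmitters.add(shard_id)

    def register_receiver(self, deliver: Callable[[List[int]], None]) -> None:
        self.receivers.append(deliver)

    def report(self, shard_id: str, active_indices: List[int]) -> bool:
        """Store one shard's chunk; broadcast once every transmitter reported."""
        if shard_id not in self.transmitters:
            raise KeyError(f"unregistered transmitter: {shard_id}")
        self.pending[shard_id] = list(active_indices)
        if set(self.pending) != self.transmitters:
            return False  # still waiting on at least one shard
        # Assemble the complete activation SDR as a sorted list of indexes.
        sdr = sorted(i for chunk in self.pending.values() for i in chunk)
        self.pending = {}  # reset for the next time step
        for deliver in self.receivers:
            deliver(sdr)
        return True


# Two transmitting shards, one receiver that just records what it gets.
received: List[List[int]] = []
src = SparseRepresentationController()
src.register_transmitter("shard-a")
src.register_transmitter("shard-b")
src.register_receiver(received.append)
first = src.report("shard-a", [3, 17])         # nothing broadcast yet
second = src.report("shard-b", [1030, 1999])   # completes the SDR
```

The synchronization window is the gap between the first and last `report` call of a time step; everything a shard does before calling `report` runs independently.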

A typical setup for sequence memory would look something like this (where each box could execute in parallel, and clusters could be broken into any number of shards):

image

The encoder would transmit its output to an SRC which is responsible for the SDR representing active input cells. The spatial pooling process would be sharded (in this example each shard is responsible for half the minicolumns). They score just the minicolumns they are responsible for, then report their winners to a second SRC responsible for assembling the “active minicolumns” SDR.
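
A rough Python sketch of the sharded scoring step follows. It is heavily simplified (binary connections, no permanences, no boosting, no learning), and the names `SPShard` and `local_winners` plus all sizes are assumptions for illustration only. Each shard picks its own share of winners locally, and the chunks are concatenated the way the second SRC would.

```python
import random
from typing import List, Set


class SPShard:
    """Hypothetical shard owning minicolumns [start, start + count)."""

    def __init__(self, start: int, count: int, input_size: int,
                 potential: int, rng: random.Random) -> None:
        self.start = start
        # Each minicolumn is connected to a random subset of the input bits.
        self.connected: List[Set[int]] = [
            set(rng.sample(range(input_size), potential)) for _ in range(count)
        ]

    def local_winners(self, active_inputs: Set[int], k: int) -> List[int]:
        # Overlap score = number of connected inputs that are active.
        scored = [(len(conn & active_inputs), self.start + i)
                  for i, conn in enumerate(self.connected)]
        # Highest overlap first; ties go to the lower column index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return sorted(col for _, col in scored[:k])


rng = random.Random(42)
INPUT_SIZE, COLUMNS, SHARDS, SPARSITY = 256, 1024, 2, 0.02
per_shard = COLUMNS // SHARDS
shards = [SPShard(s * per_shard, per_shard, INPUT_SIZE, 64, rng)
          for s in range(SHARDS)]

active_inputs = set(rng.sample(range(INPUT_SIZE), 20))  # stand-in encoder output
k_local = int(COLUMNS * SPARSITY) // SHARDS             # each shard's share of winners
chunks = [shard.local_winners(active_inputs, k_local) for shard in shards]
active_columns = sorted(col for chunk in chunks for col in chunk)
```

Because every shard always contributes exactly `k_local` winners, the output is evenly spread across shards by construction.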

The TM process would also be sharded (in this example, each shard is responsible for 25% of the minicolumns). The TM shards would be responsible for all their minicolumns’ cells and dendrites. They would receive the “active minicolumns” SDR from the SRC above, and perform the TM algorithm on only the minicolumns they are responsible for. They would report the activated cells to a third SRC responsible for assembling the “active layer cells” SDR. This would then be fed back to the TM shards for use in the TM algorithm (i.e. as the previously active cells).
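
The feedback loop can be sketched in Python as below. This is only the activation half of TM (predicted cells fire, otherwise the minicolumn bursts); segment learning is left out, and `TMShard`, `CELLS_PER_COLUMN`, and the hand-placed segment are all illustrative assumptions. The point is that a shard only touches its own minicolumns, while its segments may refer to cells owned by other shards through the shared "active layer cells" SDR.

```python
from typing import Dict, List, Set

CELLS_PER_COLUMN = 4  # assumed value, chosen small for readability


class TMShard:
    """Hypothetical TM shard owning a contiguous range of minicolumns."""

    def __init__(self, columns: range) -> None:
        self.columns = columns
        # segments[cell] = list of presynaptic cell sets (global cell indexes).
        self.segments: Dict[int, List[Set[int]]] = {}

    def step(self, active_columns: List[int], prev_active_cells: Set[int],
             threshold: int = 1) -> List[int]:
        active: List[int] = []
        for col in active_columns:
            if col not in self.columns:
                continue  # another shard owns this minicolumn
            base = col * CELLS_PER_COLUMN
            cells = list(range(base, base + CELLS_PER_COLUMN))
            predicted = [
                c for c in cells
                if any(len(seg & prev_active_cells) >= threshold
                       for seg in self.segments.get(c, []))
            ]
            # Predicted cells win; with no prediction the whole column bursts.
            active.extend(predicted if predicted else cells)
        return active


# 8 minicolumns split across 4 shards (2 each).
shards = [TMShard(range(s * 2, s * 2 + 2)) for s in range(4)]
# Cell 1 (column 0, shard 0) listens to cell 20 (column 5, shard 2).
shards[0].segments[1] = [{20}]


def layer_step(active_columns: List[int], prev: Set[int]) -> List[int]:
    # Stand-in for the third SRC: gather every shard's cells into one SDR.
    return sorted(c for shard in shards for c in shard.step(active_columns, prev))


step1 = layer_step([5], set())        # column 5 bursts
step2 = layer_step([0], set(step1))   # cell 1 was predicted by cell 20
```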


#2

Would certainly be implementing this in the Julia implementation at some point. Though, maybe quite differently.


#3

Is the SP in your pic learning or just inferring? If it was learning
there’d also have to be data flowing back into it (‘up’ in the picture).
In any case the SP is basically all matrix-vector multiplication.
Distributing this is common practice.


#4

Well my original idea was that learning would be localized within each shard. This does imply that the output will always be evenly distributed across the shards, though, which has implications for topology. I haven’t personally had a need for topology, but you are right that solving that problem would require feedback from the SRC in order to choose winning columns that may be weighted more heavily in one shard compared to the others.


#5

So you want local inhibition instead of global. That’s a concern that’s
separate from “topology”. I think there is a cost to local inhibition
but it may work well enough.


#6

Topology aside, I don’t think functionally there would be a difference. The initial potential connections established in SP are random. If you are wanting 2% sparsity, for example, does it make a difference that the minicolumns chosen are distributed more evenly across the layer? Is there a capacity concern, or am I missing something important in my understanding of the SP process?


#7

Well with your “evenly distributed” approach you’re limiting yourself to
a subset of all possible SDRs. I think there will be a quality penalty
but that still has to be quantified. I’m not saying the impact is
significant, especially if the number of shards is low.


#8

Ah, yes of course. I had a feeling this would impact capacity, and it looks like it would.

I’m thinking fixing this problem and topology would involve a specialized “SRC” for the SP process. The shards would do scoring and report their scores to the SRC, and it would select the winners and report them back to the shards to perform learning. I’ll need to think of a better name for this module so it isn’t confused with the other SRCs…


#9

Expanding on this solution a bit more, the amount of traffic transmitted within a spatial pooling cluster could be reduced by having the SP shards only report their top (sparsity * shard count) scores to this Spatial Pooling Controller (SPC). The SPC would join those and clip to just the top (sparsity) minicolumns. It would then report back only those winners which are relevant to each shard for learning.
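
Here is a small Python sketch of that merge. It assumes one particular reading of the numbers: each shard reports at most k candidates, where k is the number of winners wanted for the whole layer. Under that assumption the merged result matches a global top-k, since no single shard can hold more than k winners. The function names and the scores are invented for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Candidate = Tuple[int, int]  # (overlap score, global minicolumn index)


def shard_candidates(scores: Dict[int, int], k: int) -> List[Candidate]:
    """Shard side: report at most k best (score, column) pairs."""
    return heapq.nlargest(k, ((score, col) for col, score in scores.items()))


def select_winners(reports: Dict[str, List[Candidate]],
                   k: int) -> Dict[str, List[int]]:
    """SPC side: merge candidates, keep the top k, route winners to owners."""
    merged = [(score, col, shard)
              for shard, candidates in reports.items()
              for score, col in candidates]
    winners: Dict[str, List[int]] = {shard: [] for shard in reports}
    for _, col, shard in heapq.nlargest(k, merged):
        winners[shard].append(col)
    return {shard: sorted(cols) for shard, cols in winners.items()}


K = 3  # winners wanted across the whole layer
reports = {
    "a": shard_candidates({0: 5, 1: 9, 2: 1, 3: 7}, K),
    "b": shard_candidates({4: 8, 5: 2, 6: 10, 7: 3}, K),
}
winners = select_winners(reports, K)
# winners == {"a": [1], "b": [4, 6]}
```

Each shard then learns only on the columns listed under its own key, so the reply traffic is no larger than the winner list itself.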


#10

Btw regarding your pic. There is a convention here that the data moves
“upwards” for feed forward and “downwards” for feedback. It’s like having
north up on a map. Your pic is the wrong way round.


#11

:upside_down_face:


#12

Best use of emoji in a sentence :trophy: I laughed out loud. blue2 is right about the orientation though. I don’t see a reason your strategy couldn’t be made to work but the timing overhead might have to be quite disciplined to keep it from eating the advantages gained.


#13

Yes, that has been my concern with adding concurrency to HTM for some time. I have other strategies, but they tend to deviate from the classic algorithms. I think what makes this one different is that only the activation SDRs are transmitted, as a compressed array of indexes to the 1 bits (granted, the SP cluster must transmit more information due to the problem @blue2 described). Limited information transfer means the window for synchronization between processes can be kept very small. Timing is essentially taken care of by the SRC, which transmits the completed output only once it has been fully assembled.
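
For a sense of scale, here is a minimal Python sketch of the "array of indexes to the 1 bits" encoding. The fixed 2-byte little-endian width is an assumption (it caps the layer at 65,536 bits), and the helper names are illustrative rather than taken from any HTM library.

```python
from typing import List


def to_indices(bits: List[int]) -> List[int]:
    """Dense 0/1 list -> sorted indexes of the 1 bits."""
    return [i for i, bit in enumerate(bits) if bit]


def to_dense(indices: List[int], size: int) -> List[int]:
    """Indexes of the 1 bits -> dense 0/1 list of the given size."""
    bits = [0] * size
    for i in indices:
        bits[i] = 1
    return bits


def encode(indices: List[int], width: int = 2) -> bytes:
    """Pack each index into `width` bytes for sending over the wire."""
    return b"".join(i.to_bytes(width, "little") for i in indices)


def decode(payload: bytes, width: int = 2) -> List[int]:
    return [int.from_bytes(payload[p:p + width], "little")
            for p in range(0, len(payload), width)]


SIZE = 2048
indices = list(range(0, SIZE, 52))   # 40 active bits, roughly 2% sparsity
payload = encode(indices)            # 40 indexes * 2 bytes = 80 bytes
dense_bytes = SIZE // 8              # 256 bytes if sent as a packed bitmap
```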

I’ve started writing out a proof of concept in Golang to test the idea out. Will report my findings, however it turns out.


#14

Here is an updated diagram capturing the changes discussed so far.

image


#15

I’ve been going deep into computer vision over the past couple weeks, using the OpenCV library. I don’t know what others here have experience with, but it strikes me that we should be able to store layers as either binary (black/white, if you will) or greyscale images that could easily be passed back and forth, with optimized bitwise operations already existing for photo manipulation. Layers can then be passed as lossless PNG or compressed JPG.

Also, I haven’t done much with it before, but I do see that TensorFlow (which has good distributed graph support) does have bitwise operations as well.

https://www.tensorflow.org/api_docs/python/tf/bitwise

Has anybody already tried this?

(I first got into OpenCV using this tutorial series, if anyone is interested: https://www.youtube.com/watch?v=Z78zbnLlPUA&list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq )


#16

Excellent idea. In this case, it could be a way to further compress the traffic within the spatial pooling cluster. It would also be interesting to see if a sparse (say 2%) array encoded as a PNG is more compressed than a dense array of indexes to the 1 bits. My suspicion is no, but I’ll have to check into this…
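
One cheap way to get a first estimate without an image library is sketched below in Python. It uses `zlib` as a stand-in for PNG, since PNG compresses its pixel data with DEFLATE; a real PNG also adds a signature, chunk headers, and a filter byte per row, so this undercounts the PNG side. The layer size, sparsity, seed, and 2-byte index width are all assumptions, and no particular outcome is claimed here.

```python
import random
import zlib
from typing import Dict, List


def pack_bits(indices: List[int], size: int) -> bytes:
    """Pack the SDR as a 1-bit-per-cell bitmap (what a 1-bit image would hold)."""
    buf = bytearray((size + 7) // 8)
    for i in indices:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)


def index_bytes(indices: List[int], width: int = 2) -> bytes:
    """Pack the SDR as a sorted array of fixed-width indexes."""
    return b"".join(i.to_bytes(width, "little") for i in sorted(indices))


def compare(size: int = 2048, sparsity: float = 0.02, seed: int = 0) -> Dict[str, int]:
    rng = random.Random(seed)
    active = rng.sample(range(size), int(size * sparsity))
    bitmap = pack_bits(active, size)
    indexes = index_bytes(active)
    return {
        "bitmap_raw": len(bitmap),
        "bitmap_deflated": len(zlib.compress(bitmap, 9)),
        "indexes_raw": len(indexes),
        "indexes_deflated": len(zlib.compress(indexes, 9)),
    }


sizes = compare()
for name, byte_count in sizes.items():
    print(f"{name}: {byte_count} bytes")
```

Running `compare` over a range of sizes and sparsities would show where, if anywhere, the bitmap form starts to win.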


#17

Such partitioning looks like a perfect fit for the Actor Model. Each independent activity (SDR/pooler) could be modeled as an actor, even within one machine. The communication overhead on one machine would be negligible. Scaling out to other machines would be natural as well as messaging is transparent in the actor model.

The concurrency would not need complex synchronization given a good actor model implementation. The scheduler would automatically utilize available cores. (E.g. see Pony and Erlang scheduler details).

To communicate between the nodes, low latency zeromq could be used, serializing with msgpack, thus allowing for heterogeneous implementations.

Sorry for jumping in out of the blue. Will try to catch up with the discussion.


#18

I’ve set out to try this in Pony. The current strategy is to port the synchronous basics from the Go implementation, and then see how actors could fit in. → https://github.com/d-led/htm.pony
If anyone is willing to learn Pony via Go + HTM, contributors are welcome.


#19

I’ve read through some of the Pony tutorial, and it sounds like a great match for the idea. What have you completed so far on the htm.pony project, and what more needs to be done? I looked for a board on GitHub but didn’t see any features outlined or marked completed, so I don’t know what has been done and what still needs to get done.


#20

Ditto that. I’d be happy to contribute if somebody writes up an outline with some milestones.