I had a thought recently about a possible strategy for adding concurrency to the classic HTM algorithms for use in multi-core processing. Could also be worked into a networked solution for horizontal scaling without breaking the bank (large-scale HTM on a Raspberry Pi cluster like this for example). Thought I would see if anyone has gone down this path and maybe identify some potential gotchas.
The basic idea would be to have a cluster of shards each process a subset of cells or minicolumns in parallel, and assemble their outputs into a completed SDR. This would be done by leveraging a module which I'll call the Sparse Representation Controller (SRC), which takes chunks of a representation and reassembles them:
An SRC would act as a message bus with receivers and transmitters. Shards that need to know complete activation SDRs would register as receivers, and related shards would register as transmitters to report their output. Once an SRC receives all outputs from transmitting shards, it would construct an activation SDR and transmit it to all registered receivers. Because only the resulting activation SDR is transmitted, the size of traffic within the cluster is small, and most of the processing within a shard can happen in parallel with a relatively small window required for synchronization.
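To make this concrete, here is a minimal Go sketch of what an SRC could look like (all of the names here are made up for illustration, and it is single-threaded for readability; a real version would need a mutex or a dedicated goroutine around Report):

```go
package src

import "sort"

// SDR is represented as the sorted indexes of its 1 bits.
type SDR []int

// SRC collects partial SDRs from transmitting shards and, once all of them
// have reported, broadcasts the assembled SDR to every registered receiver.
type SRC struct {
	numTransmitters int
	pending         map[int]SDR // shard ID -> partial output for this cycle
	receivers       []chan SDR  // channels of registered receivers
}

func NewSRC(numTransmitters int) *SRC {
	return &SRC{numTransmitters: numTransmitters, pending: make(map[int]SDR)}
}

// Register adds a receiver and returns the channel it will get completed SDRs on.
func (s *SRC) Register() <-chan SDR {
	ch := make(chan SDR, 1)
	s.receivers = append(s.receivers, ch)
	return ch
}

// Report is called by a transmitting shard with its partial output. Once every
// shard has reported, the full SDR is assembled, broadcast, and the cycle resets.
func (s *SRC) Report(shardID int, partial SDR) {
	s.pending[shardID] = partial
	if len(s.pending) < s.numTransmitters {
		return // still waiting on other shards
	}
	var full SDR
	for _, p := range s.pending {
		full = append(full, p...)
	}
	sort.Ints(full)
	for _, ch := range s.receivers {
		ch <- full
	}
	s.pending = make(map[int]SDR)
}
```

The only data crossing shard boundaries is the list of active-bit indexes, which is what keeps the per-timestep messages small.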
A typical setup for sequence memory would look something like this (where each box could execute in parallel, and clusters could be broken into any number of shards):
The encoder would transmit its output to an SRC which is responsible for the SDR representing active input cells. The spatial pooling process would be sharded (in this example each shard is responsible for half the minicolumns). The shards score just the minicolumns they are responsible for, then report their winners to a second SRC responsible for assembling the "active minicolumns" SDR.
The TM process would also be sharded (in this example, each shard is responsible for 25% of the minicolumns). The TM shards would be responsible for all of their minicolumns' cells and dendrites. They would receive the "active minicolumns" SDR from the SRC above, and perform the TM algorithm on only the minicolumns they are responsible for. They would report the activated cells to a third SRC responsible for assembling the "active layer cells" SDR. This would then be fed back to the TM shards for use in the TM algorithm (i.e. previously active cells).
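Continuing the sketch above, a TM shard might simply own a contiguous range of minicolumn indexes and filter the assembled "active minicolumns" SDR down to its own range each timestep (hypothetical names again; ownership could just as well be assigned by hashing):

```go
// TMShard (reusing the SDR type from the sketch above) owns a contiguous
// range of minicolumns and keeps the previously active cells fed back by the
// "active layer cells" SRC.
type TMShard struct {
	firstColumn, lastColumn int // inclusive range of minicolumn indexes this shard owns
	prevActiveCells         SDR // "active layer cells" SDR from the previous timestep
}

// localColumns filters the globally assembled "active minicolumns" SDR down to
// just the columns this shard is responsible for.
func (t *TMShard) localColumns(activeColumns SDR) SDR {
	var local SDR
	for _, c := range activeColumns {
		if c >= t.firstColumn && c <= t.lastColumn {
			local = append(local, c)
		}
	}
	return local
}
```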
Is the SP in your pic learning or just inferring? If it was learning there'd also have to be data flowing back into it ("up" in the picture). In any case the SP is basically all matrix-vector multiplication. Distributing this is common practice.
Well my original idea was that learning would be localized within each shard. This does imply that the output will always be evenly distributed across the shards, though, which has implications for topology. I haven't personally had a need for topology, but you are right that solving that problem would require feedback from the SRC in order to choose winning columns that may be weighted more heavily in one shard compared to the others.
So you want local inhibition instead of global. That's a concern that's separate from "topology". I think there is a cost to local inhibition but it may work well enough.
Topology aside, I don't think functionally there would be a difference. The initial potential connections established in SP are random. If you are wanting 2% sparsity, for example, does it make a difference that the minicolumns chosen are distributed more evenly across the layer? Is there a capacity concern, or am I missing something important in my understanding of the SP process?
Well with your "evenly distributed" approach you're limiting yourself to a subset of all possible SDRs. I think there will be a quality penalty, but that still has to be quantified. I'm not saying the impact is significant, especially if the number of shards is low.
Ah, yes of course. I had a feeling this would impact capacity, and it looks like it would.
I'm thinking that fixing this problem (and topology) would involve a specialized "SRC" for the SP process. The shards would do scoring and report their scores to this SRC, and it would select the winners and report them back to the shards to perform learning. I'll need to think of a better name for this module so it isn't confused with the other SRCs…
Expanding on this solution a bit more, the amount of traffic transmitted within a spatial pooling cluster could be reduced by having the SP shards only report their top (sparsity * shard count) scores to this Spatial Pooling Controller (SPC). The SPC would join those and clip to just the top (sparsity) minicolumns. It would then report back only those winners which are relevant to each shard for learning.
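A rough Go sketch of that merge-and-clip step, assuming each shard reports (column, score, shard) tuples for its best candidates and gets back only the winners it owns (the names and shapes here are guesses, not a committed design):

```go
package spc

import "sort"

// ColumnScore is what an SP shard would report for one of its candidate columns.
type ColumnScore struct {
	Column int     // global minicolumn index
	Score  float64 // overlap score computed by the owning shard
	Shard  int     // ID of the shard that owns this column
}

// selectWinners merges the shards' candidate lists, keeps the global top
// numWinners by score, and groups the winning column indexes by owning shard
// so each shard only gets back the winners relevant to its learning step.
func selectWinners(candidates []ColumnScore, numWinners int) map[int][]int {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Score > candidates[j].Score
	})
	if len(candidates) > numWinners {
		candidates = candidates[:numWinners]
	}
	winnersByShard := make(map[int][]int)
	for _, c := range candidates {
		winnersByShard[c.Shard] = append(winnersByShard[c.Shard], c.Column)
	}
	return winnersByShard
}
```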
Btw, regarding your pic: there is a convention here that data moves "upwards" for feed-forward and "downwards" for feedback. It's like having north up on a map. Your pic is the wrong way round.
Best use of emoji in a sentence, I laughed out loud. blue2 is right about the orientation though. I don't see a reason your strategy couldn't be made to work, but the timing might have to be quite disciplined to keep the overhead from eating the advantages gained.
Yes, that has been my concern with adding concurrency to HTM for some time. I have other strategies, but they tend to deviate from the classic algorithms. I think what makes this one different is that only the activation SDRs are transmitted, as a compressed array of indexes of the 1 bits (granted, the SP cluster must transmit more information due to the problem @blue2 described). Limited information transfer means the window for synchronization between processes can be kept very small. Timing is essentially taken care of by the SRC, which transmits the completed output only once it has been fully assembled.
I've started writing a proof of concept in Golang to test the idea out. Will report my findings, however it turns out.
I've been going deep into computer vision over the past couple of weeks, using the OpenCV library. I don't know what others here have experience with, but it strikes me that we should be able to store layers as either binary (black/white, if you will) or greyscale images that could easily be passed back and forth, with optimized bitwise operations already existing for photo manipulation. Layers can then be passed as lossless PNG or compressed JPG.
Also, I haven't done much with it before, but I do see that TensorFlow (which has good distributed graph support) does have bitwise operations as well.
Excellent idea. In this case, it could be a way to further compress the traffic within the spatial pooling cluster. It would also be interesting to see whether a sparse (say 2%) array encoded as a PNG is more compressed than a dense array of indexes to the 1 bits. My suspicion is no, but I'll have to check into this…
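Here is one quick way to check with Go's standard library: encode the same 2048-bit, 2% sparse SDR once as a grayscale PNG and once as a flat list of 16-bit indexes, then compare the byte counts (the exact PNG size depends on the encoder and on how the bits are laid out in the image, so treat this as a rough experiment):

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/png"
	"math/rand"
)

func main() {
	const width, height = 64, 32                 // 2048-bit SDR as a 64x32 image
	const activeBits = width * height * 2 / 100  // 2% sparsity = 40 active bits

	// Pick random active indexes.
	active := rand.Perm(width * height)[:activeBits]

	// Encode as a grayscale PNG: active bit -> white pixel.
	img := image.NewGray(image.Rect(0, 0, width, height))
	for _, idx := range active {
		img.SetGray(idx%width, idx/width, color.Gray{Y: 255})
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}

	// Encode as a flat list of 16-bit indexes (2 bytes per active bit).
	indexBytes := len(active) * 2

	fmt.Printf("PNG: %d bytes, index list: %d bytes\n", buf.Len(), indexBytes)
}
```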
Such partitioning looks like a perfect fit for the Actor Model. Each independent activity (SDR/pooler) could be modeled as an actor, even within one machine. The communication overhead on one machine would be negligible, and scaling out to other machines would be natural as well, since messaging is transparent in the actor model.
The concurrency would not need complex synchronization given a good actor model implementation. The scheduler would automatically utilize available cores. (E.g. see Pony and Erlang scheduler details).
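Since the existing proof of concept is in Go, a rough analogue of the actor idea can be sketched there with goroutines and channels as mailboxes; a real actor runtime like Pony or Erlang adds fair scheduling and supervision on top, so this is only meant to show the shape of the message flow:

```go
package main

import "fmt"

// SDR is a list of active-bit indexes.
type SDR []int

// score stands in for the real per-shard overlap scoring and winner selection.
func score(shardID int, input SDR) SDR {
	return SDR{shardID*100 + len(input)} // dummy output so the wiring can be seen
}

// spShard is an "actor": it owns its state, reads inputs from its mailbox
// channel, and reports its winners to the SRC's mailbox channel.
func spShard(id int, in <-chan SDR, out chan<- SDR) {
	for input := range in {
		out <- score(id, input)
	}
}

func main() {
	in1, in2 := make(chan SDR), make(chan SDR)
	toSRC := make(chan SDR)

	go spShard(0, in1, toSRC)
	go spShard(1, in2, toSRC)

	// Fan the encoder output out to both shards, then collect both replies.
	encoded := SDR{3, 17, 42}
	in1 <- encoded
	in2 <- encoded
	fmt.Println(<-toSRC, <-toSRC)
}
```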
To communicate between nodes, low-latency ZeroMQ could be used, serializing with msgpack, thus allowing for heterogeneous implementations.
Sorry for jumping in out of the blue. Will try to catch up with the discussion.
I've set out to try this in Pony. Current strategy is to get the synchronous basics off the Go implementation, and then see how actors could fit in: https://github.com/d-led/htm.pony
If anyone is willing to learn Pony via Go + HTM, contributors are welcome.
I've read through some of the Pony tutorial, and it sounds like a great match for the idea. What have you completed so far on the htm.pony project, and what more needs to be done? I looked for a board on GitHub but didn't see any features outlined or completed, so I don't know what has been done and what still needs doing.