Sparse Distributed Processing

AKA 100 million brains machine

The problem this machine attempts to solve is scaling. The HTM tests/studies I’m aware of are limited to modelling relatively small “cortex patches” (aka macro columns) spanning a few thousand mini columns, while real mammalian brains may span over 100M such units.

As the title suggests, the unit of computation in this machine is the mini column, which from here on I’ll simply call a column.

I’ll try to be as terse as I can by using the following example architecture (numbers may vary, but multiples of 100 are easy to follow):

  • Columns are arranged in a 10000x10000 2d matrix, aka cortex.
  • The “macro” columns (as in biology) are neither fixed nor delimited within the cortex. Instead, many arbitrarily positioned 100x100 windows can be addressed (opened) at any time and regarded as “macro columns”.
  • There is a fixed underlying SDR representation for both inputs and outputs, which any (mini-)column may “see” as an input. It is e.g. 10000 bits in size (conveniently mapped as a 100x100 bit array). The 10000-bit size is suggested by nothing more than convenience and the intuition that it should be large enough to provide a rich representation (aka embedding) of every context/thing, and of any complex relationships between related things, that this machine might encounter, learn and think about in its lifetime, while staying within a reasonable processing capability of the underlying hardware.

So, every column can see the full 100x100 input SDR, but it projects onto a single point (bit) of a corresponding 100x100 output SDR.

For example, the column placed at address 1573x2545 in the cortex projects onto bit 73x45 of the output SDR (it’s simply modulo 100 for both the X and Y coordinates).
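The modulo mapping above can be sketched in a few lines of Python. The names `CORTEX_SIDE`, `SDR_SIDE`, `output_bit` and `columns_per_output_bit` are made up for this illustration; only the sizes and the modulo rule come from the post.

```python
CORTEX_SIDE = 10_000  # the cortex is a 10000x10000 sheet of columns
SDR_SIDE = 100        # input and output SDRs are 100x100 bit arrays


def output_bit(col_x, col_y):
    """Return the (x, y) output SDR bit that a column projects onto."""
    if not (0 <= col_x < CORTEX_SIDE and 0 <= col_y < CORTEX_SIDE):
        raise ValueError("column address lies outside the cortex")
    # Simple modulo on both coordinates, as described in the post.
    return col_x % SDR_SIDE, col_y % SDR_SIDE


def columns_per_output_bit():
    """Number of columns in the whole cortex that share one output bit."""
    return (CORTEX_SIDE // SDR_SIDE) ** 2


if __name__ == "__main__":
    print(output_bit(1573, 2545))
    print(columns_per_output_bit())
```

With these sizes, 10000 columns across the whole cortex share each output bit, so how their contributions are combined when several of them are active is left open here.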

Now it’s time to go into the concepts of address (already mentioned above) and window activation, but first let’s recap what this machine is meant to achieve:

Implementing a 100M-column model that doesn’t break the national grid and hopefully not the bank.

The way to do that is to have only 1% (1M columns) active at any time step, which should be spreadable on ~100 cores of ordinary machines.

How? By activating only 100 windows (100x100 columns each) within the 10Kx10K cortex.

Ok, but how? By projecting the input SDR onto 100 address points in the cortex in a manner that preserves similarities.
For every address, a corresponding 100x100 square window originating at that point is activated.
100 windows of 100x100 columns each give 1M active columns, i.e. 1% of the cortex (slightly less whenever windows overlap).
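As a rough illustration of window activation, the sketch below enumerates the columns of a window and measures the active fraction for a set of window origins. The helper names are hypothetical, and wrapping windows around the cortex edges is an assumption made here for simplicity; the post does not say what happens at the borders.

```python
from itertools import product

CORTEX_SIDE = 10_000
WINDOW_SIDE = 100


def window_columns(origin_x, origin_y):
    """Columns of the square window whose corner sits at the given address.

    Assumption: windows wrap around the cortex edges (toroidal sheet).
    """
    return [
        ((origin_x + dx) % CORTEX_SIDE, (origin_y + dy) % CORTEX_SIDE)
        for dx, dy in product(range(WINDOW_SIDE), repeat=2)
    ]


def active_fraction(origins):
    """Fraction of all cortex columns covered by the union of the windows."""
    active = set()
    for origin_x, origin_y in origins:
        active.update(window_columns(origin_x, origin_y))
    return len(active) / CORTEX_SIDE ** 2


if __name__ == "__main__":
    # Upper bound for 100 non-overlapping windows: 100 * 100 * 100 columns.
    print(100 * WINDOW_SIDE ** 2 / CORTEX_SIDE ** 2)
    # Two windows that overlap by half cover fewer columns than two separate ones.
    print(active_fraction([(0, 0), (50, 0)]))
```

One consequence of the modulo projection is that any 100x100 window, wherever it is opened, contains exactly one column for each of the 10000 output bits, so every active window can write to the whole output SDR.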

What does preserving similarities mean exactly? It means that for two input SDRs:

  • If they are exactly the same, they will project onto exactly the same set of windows.
  • If they have a certain overlap (are similar), the corresponding active windows should have a sufficient overlap too.

The magic projecting/addressing tricks are hinted at in lots of places: Kanerva’s SDM, associative memories and random projection theory for dimensionality reduction are just a few of them.
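One possible way to get both properties is a MinHash-style scheme, sketched below. This is a swapped-in technique chosen for illustration, not necessarily the projection the post has in mind. Each of the 100 address slots ranks the SDR bits with its own pseudo-random order and lets the lowest-ranked active bit pick the address. The function names and the use of `hashlib.blake2b` as a stand-in for random tables are assumptions.

```python
import hashlib


def _hash64(*parts):
    """Deterministic 64-bit hash of the given values (stands in for a random table)."""
    data = ":".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def project(active_bits, n_addresses=100, cortex_side=10_000):
    """Map the active bit indices of an SDR to n_addresses cortex addresses."""
    active_bits = set(active_bits)
    if not active_bits:
        raise ValueError("the SDR has no active bits")
    addresses = []
    for slot in range(n_addresses):
        # MinHash step: the active bit with the lowest rank in this slot wins.
        winner = min(active_bits, key=lambda bit: _hash64(slot, bit))
        # The (slot, winner) pair selects a pseudo-random cortex address.
        h = _hash64("addr", slot, winner)
        addresses.append((h % cortex_side, (h // cortex_side) % cortex_side))
    return addresses


def shared_addresses(sdr_a, sdr_b):
    """Count the address slots on which two SDRs land on the same address."""
    return sum(a == b for a, b in zip(project(sdr_a), project(sdr_b)))


if __name__ == "__main__":
    a = set(range(0, 200))
    b = set(range(20, 220))      # shares 180 of its 200 bits with a
    c = set(range(5000, 5200))   # shares no bits with a
    print(shared_addresses(a, a), shared_addresses(a, b), shared_addresses(a, c))
```

Identical SDRs always land on identical addresses because the mapping is deterministic. For two different SDRs, the chance that a slot picks the same winning bit equals their Jaccard similarity, so the expected number of shared windows grows with the overlap. Note that in this sketch similar inputs share whole windows rather than landing on nearby, partially overlapping ones.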

The same architecture, which spans in width instead of depth (as Deep Learning does), could be applied to any underlying processing/learning unit instead of HTM’s minicolumns.
The “minicolumns” could be shallow (a few layers) yet wide NNs, or even NEAT-evolved networks or Random Trees. Any small unit able to learn spatio-temporal patterns should be amenable to this wide, sparse processing structure.

In the unlikely case this is in any way a novel idea I’d like to call it Sparse Distributed Processing.

I will gladly deepen the topic, like how to connect actual multi modal sensory encodings, and to output responses, or in what ways this could be psychologically or biologically plausible, or how sequences of SDRs might be mapped into it in a manner that keeps the “threads” connected.


Sounds like an interesting project, with a good bit of thought gone into it.

I can’t but ask: do you have a problem for which this is a solution?

One of the problems this design intends to solve is mentioned at the very beginning of the message.

If you are asking what particular initial test/toy problem I intend to solve as a proof of concept implementation, then this is an interesting question.

I think a good test would be to throw several small arbitrary problems at the same instance of the model.

E.g. teaching it to recognize MNIST digits, play Pong, do speech/command recognition, etc., with performance on each task not being affected by learning to solve the other tasks.

Interesting model. I think this could be used as a basic model to build upon, but I would like some clarifications as I’m not really clear on how it works.

If I read correctly, the input/output SDRs have (10k choose 100) possible patterns,

while the cortex has vastly more patterns, e.g. (10k windows choose 100 windows) * (10k units per window choose 100 units)^(100 windows) * … or something to that effect.

Given that the number of possible input SDR patterns is <<< the number of cortex unit patterns, I’m wondering how you would make full use of all the computing unit patterns in the cortex. Is there an RNN-like mechanism that keeps track of the previous state? That might help use up more computing patterns in the cortex.

Also, is the input a floating point vector encoded into an SDR to activate some cortex computing units that the input vector will then be fed into? If that’s the case, I’m assuming there’s no output SDR, only input SDRs (as the output from the cortex will be encoded back into an input SDR for the next iteration)?

No, it projects onto 100 addresses or points on a 2D sheet, not patterns.
I didn’t say to treat these 100 points out of 100M possible as a meta-SDR. The meaning is handled within the address locations themselves.
Like, to use another analogy, at each address waits a “specialist” that learns to process only a narrow subset of frames out of the whole … let’s call it “experience stream”.
The count of 100 was just an idea. In practice the denser SDRs will probably have to project onto more points, simply because the more points are active in an SDR, the more “stuff” is encoded by it.

Hopefully not - a certain sub-SDR should simply have a consistent projection onto the same processing (specialist) addresses.

These are open details; I don’t have an exact model. I posted here because I started implementing a naive SDR associative map which, within certain limits, works pretty decently.

After all, it can be regarded as a nearest-neighbor search which not only retrieves previously stored patterns but also routes similar patterns towards the same processing nodes, placed within a virtual, arbitrarily large, flat neural network.
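To make the routing idea concrete, here is a deliberately naive sketch: a brute-force overlap search that sends an SDR to the node of its best-matching stored key, or allocates a new node when nothing matches well enough. The class `SDRRouter`, its `min_overlap` threshold and the allocate-on-miss policy are all hypothetical and are not taken from the associative map mentioned above.

```python
class SDRRouter:
    """Naive nearest-neighbour router over stored key SDRs (brute force)."""

    def __init__(self, min_overlap):
        # Minimum number of shared active bits needed to reuse a node.
        self.min_overlap = min_overlap
        # keys[node_id] is the SDR that first opened that processing node.
        self.keys = []

    def route(self, sdr):
        """Return the node id that should process this SDR."""
        sdr = frozenset(sdr)
        best_node, best_overlap = -1, -1
        for node, key in enumerate(self.keys):
            overlap = len(sdr & key)
            if overlap > best_overlap:
                best_node, best_overlap = node, overlap
        if best_overlap >= self.min_overlap:
            return best_node
        # No stored key is similar enough: allocate a new node for this SDR.
        self.keys.append(sdr)
        return len(self.keys) - 1


if __name__ == "__main__":
    router = SDRRouter(min_overlap=15)
    print(router.route(range(0, 20)))     # first pattern opens a node
    print(router.route(range(2, 22)))     # 18 shared bits with the first key
    print(router.route(range(100, 120)))  # nothing in common with stored keys
```

A linear scan like this only works for small numbers of stored keys; at the scale discussed in the thread, an approximate nearest-neighbor index would have to replace the loop.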

This can be applied to any encoding scheme/embedding. Dense float vectors should work as well; actually there are more ANN (Approximate Nearest Neighbor) search libraries available out there for them, because that’s what almost everybody is working with: dense vectors of floats.
PS
Regarding

Well, a simplifying idea to start with is to think of this SDR as an INPUT/OUTPUT space, one that overlaps the output (predictions) from the previous processing step and includes sensory updates too.

This is a tough (probably the toughest) issue. Unlike flesh, machines have the advantage that they can perfectly remember the previous contexts within which an SDR appeared, if there’s an efficient lookup. ANN search pops up again.

ANN here means Approximate Nearest Neighbor, not to be confused with Artificial Neural Network.

Ohh I see. I thought each of the 100 windows activated by the input SDR would have sparse activations too. But you did say 1% of all columns are activated, so I made wrong assumptions. If that’s the case, the number of possible inputs is equal to the number of all possible combinations of activated columns in the cortex.

In my original, wrong assumption I was thinking that only 100 columns are activated per window, so only 0.01% of the columns would be active. That’s why I was wondering how to make full use of all the columns, since the number of possible input SDRs is <<< the number of all possible permutations of activated columns (at 0.01% sparsity). Tying this back to HTM sequence prediction, I started to think that the exact same input SDR would activate different parts of the columns based on past activations (just like HTM sequence prediction). That was why I was asking about an RNN or something.

100x100 = 10k columns, which can learn a fairly decent number of patterns, or more exactly sub-patterns or temporal sequences in the vicinity of the key pattern.

I guess “key pattern” is a useful concept here: from the perspective of any active “window”, it is the pattern that activated it.
