Growing Temporal Memory (GTM)

TL;DR: Here I propose

  • a TM block which, instead of having a fixed depth (column size), starts with all columns shallow, each column growing (stacking new cells) as necessary;
  • how the TM learning rules would have to change in order to accomplish the above;
  • what advantages, in terms of learning rate and compute/memory resources, this new TM breed might have.

We know (at least most of us do) that a TM is organized as a matrix of columns × cells. Each column’s purpose is to predict one future SDR position (or bit), and it is made by stacking a fixed number of cells, with the same depth for all columns. When a cell activates, the whole column activates.

The algorithm is that each cell tries to predict the following time step by “looking” for, and learning, patterns in the activations of all other (or only neighboring) cells at the previous time step.

What is important to notice is that the architecture (and therefore the learning capacity) is fixed from the beginning: a certain column depth (# of cells per column) is assumed for the whole lifecycle of the TM.

A problem with the fixed TM size is that its “Maker” has to guess in advance what architecture is optimal for the given task: a too-shallow HTM can’t learn complex relationships, while a too-large one wastes resources.

Moving on, the Growing TM would be sketched like this (a minimal code sketch follows the list):

  • first, for simplicity’s sake, each cell is responsible for learning a single pattern: it has one segment that turns it active.

  • start with a depth of one cell in each column. Cells mature very quickly, meaning that once they have “locked in” to a pattern, their synapses freeze and non-permanent synapses are discarded.
    That brings in three benefits:

    • the learning algorithm can skip mature segments (efficiency); normally one learning cell per column is sufficient.
    • mature segments do not need to hold synaptic permanences, since all their synapses are permanent (memory savings)
    • mature segments reliably activate each time they see the pattern they locked onto (consistent behavior)
  • there are two errors a segment/cell can make:

    1. to not activate when it should
    2. to activate when it shouldn’t

    How does GTM handle these?

    1. in this case, a new immature cell is stacked on top of the column; its purpose is to learn a new pattern that the cell(s) below it did not notice.
    2. in this case, an “inhibitory” segment is added to the cell; its purpose is to learn a particular exception pattern that should inhibit the cell’s activation.
  • As you noticed, I almost lied above: a cell doesn’t have a single segment, it has a single activation (positive) segment and zero or more inhibitory (negative) segments, yet only one of them can be in the immature stage, i.e. still learning an inhibitory pattern. A new inhibitory segment is added whenever all preexisting inhibitory segments have failed to inhibit the cell (hence column) activation when they should have.

  • as a bonus to speed: as long as a positive segment does not activate, its corresponding inhibitory segments can be skipped (there is no need to check whether they see an inhibitory pattern). So a cell can learn many exception rules without having to evaluate them at every time step. Furthermore, for cells with multiple inhibitory segments, the segments are evaluated in order of maturation, and once one inhibits the cell we (the algorithm, actually) can stop checking further exception rules for the same cell.

  • The consistency rules:

    • a learning segment can only get its input from mature segments. This guarantees that, whatever the reason (pattern) the segment has learned, its future activations will not be hindered by its input cells “changing their mind” by learning new rules. That is why an input synapse should not point towards a whole cell but to one specific segment within that cell (0, 1, 2, … where “0” is the positive segment).
    • mature segments are assigned a maturity value 1…N (where 1 is the oldest mature segment and N is the newest one), so segment i can only read its inputs (i.e. attach synapses) from segments 1, 2, …, i-1.
  • a further optimization, linking inhibitory segments: when a cell in a column misfires, before adding a new immature inhibitory segment, the algorithm can check all previous inhibitory segments in its column, and if any one checks true, it simply symlinks it into the newer cell, which means that “the already learned inhibitory rule applies in this particular case too”.
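
To make this concrete, here is a minimal Python sketch of how such a growing column might be evaluated and grown. It is only an illustration of the proposal, not code from any existing HTM implementation: the class names are mine, segments here match their whole input set exactly, and the gradual maturation of immature segments (accumulating permanences over several encounters) is elided.

```python
# Illustrative sketch only; names and the exact-match rule are assumptions.
from dataclasses import dataclass, field

@dataclass
class Segment:
    inputs: frozenset = frozenset()  # source segments this one attaches to
    mature: bool = False             # frozen: no permanences kept anymore

    def matches(self, active_inputs: set) -> bool:
        # Simplification: a segment fires iff all of its inputs are active.
        return bool(self.inputs) and self.inputs <= active_inputs

@dataclass
class Cell:
    positive: Segment = field(default_factory=Segment)
    inhibitory: list = field(default_factory=list)  # exceptions, oldest first

    def fires(self, active_inputs: set) -> bool:
        if not self.positive.matches(active_inputs):
            return False              # inhibitory segments skipped entirely
        for seg in self.inhibitory:   # evaluated in maturation order
            if seg.matches(active_inputs):
                return False          # one exception is enough; stop here
        return True

class Column:
    def __init__(self):
        self.cells = [Cell()]         # start one cell deep

    def predicts(self, active_inputs: set) -> bool:
        # Oldest cells first; the first firing cell activates the column,
        # so newer cells never need to be evaluated.
        return any(cell.fires(active_inputs) for cell in self.cells)

    def learn(self, active_inputs: set, should_fire: bool):
        fired = self.predicts(active_inputs)
        if should_fire and not fired:
            # Error 1 (missed activation): stack a new immature cell whose
            # positive segment starts learning the missed pattern.
            cell = Cell()
            cell.positive = Segment(inputs=frozenset(active_inputs))
            self.cells.append(cell)
        elif fired and not should_fire:
            # Error 2 (false activation): add an immature inhibitory segment
            # to the offending cell, to learn this exception pattern.
            culprit = next(c for c in self.cells if c.fires(active_inputs))
            culprit.inhibitory.append(Segment(inputs=frozenset(active_inputs)))
```

Note how the two error cases map directly onto the two growth operations, and how both skip rules (inhibitory segments and newer cells) fall out of the evaluation order.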

OK, I hope this looks simple enough to raise your interest.
Opinions are welcome, e.g. which use cases the vanilla TM can solve that this one couldn’t, how difficult it would be to implement, or whether you think the presumed performance improvements would really matter.

I personally like the following benefits:

  • that a column can be as shallow as necessary. If a pattern bit is easy to predict, why allocate dozens of cells?
  • and even that a column can have no learning segments/cells at all, as long as it makes correct predictions.
  • once an older cell decides to activate (its positive pattern says “GO!” and its negative patterns do not oppose it), there is no need to ask/evaluate newer cells from the same column.
  • since, by the Pareto rule, 20% of the relevant patterns for each column would be encountered 80% of the time, the most relevant patterns would tend to be learned sooner, and hence the advantage above (bypassing rare, complicated rules) is further magnified.
2 Likes

@cezar_t Our BrainBlocks SequenceLearner works pretty close to this. The cells per column are still fixed, because otherwise we would have to build dynamic memory allocation, which I haven’t gotten around to. However, it is conservative in allocating activations to new cells: preferably, it tries to reuse previously used cells if any of those cells were predicted; otherwise, it activates a previously unused cell to represent the novel state.

Keeping track of this historical property is something that HTM doesn’t do but BrainBlocks does. In practice, it results in insanely fast sequence learning, and oftentimes many of the cells of each column are never activated or used. Unused cells have no dendrites or synapses, so there’s no risk of them activating unless there are no predicted cells on the active column. In that case, a raw cell is activated and a dendrite is grown connecting it to a collection of previously active cells on other columns that could presumably predict this new pattern.

The activate loop starts here:

We try to find a predicted cell to activate on the column here:

And if none is found, a new cell is selected and connected to previously active cells here:

We only needed to add dynamic memory allocation to increase the size of columns, but it was never a high enough priority.
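
In rough pseudo-Python, the allocation policy described above looks something like this (my own illustration; the names and the Cell/Column structure are invented, not the actual SequenceLearner code):

```python
# Illustrative sketch of the cell allocation policy; not BrainBlocks source.
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality, so `in` checks work per cell
class Cell:
    used: bool = False
    dendrites: list = field(default_factory=list)

    def grow_dendrite(self, prev_active):
        self.dendrites.append(set(prev_active))

@dataclass(eq=False)
class Column:
    cells: list = field(default_factory=list)

def activate(column, predicted_cells, prev_active_cells):
    # 1. Prefer reusing a previously used cell that was predicted this step.
    for cell in column.cells:
        if cell in predicted_cells:
            return cell
    # 2. Otherwise take a raw (never used) cell and grow a dendrite to the
    #    previously active cells, so this novel state becomes predictable.
    for cell in column.cells:
        if not cell.used:
            cell.used = True
            cell.grow_dendrite(prev_active_cells)
            return cell
    # 3. Column exhausted: this is where growing the column (the GTM
    #    proposal) or reusing an old cell would kick in.
    return column.cells[0]
```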

3 Likes

OK, that’s cool!

The allocation doesn’t need to be dynamic; this kind of reallocation makes sense via “nap breaks”: dump the network to a file, instantiate a new, deeper one, and reload the saved network into the larger instance. At least when it isn’t running some critical real-time application.

This is probably easy to do in Python, if the dumped network state (which I assume your suite can produce) can be parsed back in, rearranged, and reallocated offline.
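
Something like the following toy sketch, where the dumped state is reduced to a plain dict (whatever format the real dump uses, the idea is the same):

```python
import pickle

# Toy illustration of offline "nap break" deepening; the dict-of-lists
# stands in for whatever state format a real suite actually dumps.
def deepen(state: dict, new_depth: int) -> dict:
    """state maps column index -> list of learned cells; pad with free slots."""
    return {col: cells + [None] * (new_depth - len(cells))
            for col, cells in state.items()}

with open("net.pkl", "wb") as f:          # 1. dump the network to a file
    pickle.dump({0: ["cellA"], 1: ["cellB", "cellC"]}, f)

with open("net.pkl", "rb") as f:          # 2. reload the saved state...
    reloaded = pickle.load(f)

bigger = deepen(reloaded, new_depth=4)    # 3. ...into a deeper instance
```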

1 Like

In practice that’s never going to be an issue.

For example, let’s say you have 400 columns, 10 cells per column, and 20 active columns.
In this example the TM will activate between 20 and 200 cells each cycle.
The TM represents each thing that it can detect using a group of 20 cells.

How many different things can the TM detect?
Naively, you might think 200 cells total divided by 20 cells per thing equals 10 things.
However, the answer is actually 200 choose 20, which equals 1,613,587,787,967,350,073,386,147,640 things.
Now obviously discriminating between all of those cell-patterns is impossible, because some of them differ by only one cell, so the true capacity is somewhere between 10 and 10^27.
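
These counts are one-liners in Python, for anyone who wants to check:

```python
import math

# 20 active cells drawn from the 200 cells of the 20 active columns
print(math.comb(200, 20))  # 1613587787967350073386147640  (~1.6 * 10**27)

# the same logic at the mini-column level, used later in this thread:
print(math.comb(400, 20))  # 2788360983670896737872851072994080  (~2.8 * 10**33)
```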

2 Likes

Well, it’s fixed. Any kind of hypothetical counting seems to make sense until you hit practical limits.
Let’s speculate that a human learns one thing every second, for 50k seconds per day, for 10k days: that’s 500M things in 27 years.
(Which we know is grossly exaggerated.)
According to your computation, “we’re fine, that’s infinitesimal, we can represent 10**27 things with our network”.

Yet natural brains have a few orders of magnitude more cells than “things” they can learn.
Wonder why?

1 Like

I don’t think this is true, and I’m not sure how you could prove it either way.

1 Like

That is on the input side, which is fair enough, but what about the real aspect of what a column is actually identifying: an abstract concept. Hence why scans show localisation of concept stimuli rather than those concepts being fully scattered throughout the cortex.

If we take the math of an unlimited combination of inputs, you are still using the columns for conceptual identification, and it is the bound on how many concepts you can handle that is more critical.

Either way, your system will never be defined or constrained by the input sensory combinations that are experienced, but rather by the number of abstract concepts those sensory streams identify or are expected to deal with. Also note that networks more complex than their biological equivalents will start to identify concepts that we can’t recognise or did not know existed.

The limiting factor is more likely the number of columns, so you either need a very large number to start with and do not add any more (close to a human cortex), or you add more as your system becomes conceptually constrained.

Is it more about looking at the outputs rather than the input combinatorial explosion (we can experience an almost infinite combination of sensory inputs)?

2 Likes

No, the same logic also applies to the patterns of active mini-columns.
If there are 400 mini-columns, and 20 of them are active, how many different patterns are there?
400 choose 20 is 2,788,360,983,670,896,737,872,851,072,994,080, or about 10^33.

1 Like

Yes, you can identify a lot of patterns, but in the end you can identify 10,000 variations of the number 1, and all you’re doing and defining is a hierarchical pattern process of identifying the concept of 1. The output from those full columns (not mini-columns) is the limiting factor, i.e. the output that interplays with the thalamus, not the lateral column-column connections.

Column-column connections I understand as temporal variance of pattern recognition, i.e. an allowance for differing temporal sizes of the sensory pattern to be detected.

The perspective I am also looking at this from is the corpus callosum cross-hemisphere synchronisation/messaging, where those interconnects are concepts via cortical output and not necessarily primary sensory (occipital crossover). The brain is synchronised/crosslinked via hierarchical concepts and not necessarily primary sensory stimuli.

That’s just my perspective and understanding, and I still have a lot to learn.

1 Like

There is no reason to tie the number of concepts remembered to the number of cells. If you accept the HTM, SM and SDR concepts, then a million columns, 1000 neurons per column, 100 synapses per neuron, and 10 bits per SDR give 10^12 possible SDRs (or thereabouts). Probably enough.

2 Likes

OK, if you think you can learn more than one pattern per second, then you should be able to remap the alphabet’s 26 letters, followed by a 27th question-mark symbol, to the following images in less than 30 seconds:
[image: pattern_challenge]

1 Like

You know I did that whole video lecture on this topic…

1 Like

Yes, that’s pretty much what a less-distracting paper or a couple of the HTM School videos explain.

Both seem to imply that a large representational capacity (of a given-size SDR) automagically translates into an ability to actually learn the same number of patterns, or (earlier) to compute.
A 640 KB memory can have 256 ** 640000 states, so obviously 640K should be more than anyone will ever need.

PS: the actual learning capacity of a TM can be measured; it is not something “we can only speculate about”. Take 128 random SDRs, one for each ASCII character, and feed the TM literature. How much text can it learn before forgetting? Nothing fancy like MDPs or transformers, just sequence learning and prediction.

I would be curious to see a test like that, or equivalent.
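
A skeleton of such a test might look like the sketch below. The ToyFirstOrderMemory is just a runnable placeholder that remembers only the last successor of each character (so it forgets trivially); the point is the measurement protocol, and a real TM would be swapped in place of the toy class:

```python
import random

SDR_SIZE, SDR_BITS = 1024, 20
rng = random.Random(42)

def make_sdr() -> frozenset:
    return frozenset(rng.sample(range(SDR_SIZE), SDR_BITS))

codes = {chr(c): make_sdr() for c in range(128)}  # one random SDR per ASCII char

class ToyFirstOrderMemory:
    """Runnable stand-in for a real TM: remembers only the last successor SDR
    seen after each SDR. Swap in an actual TM for a meaningful result."""
    def __init__(self):
        self.successor = {}
    def train(self, text: str):
        for prev, nxt in zip(text, text[1:]):
            self.successor[codes[prev]] = codes[nxt]
    def predict(self, ch: str):
        return self.successor.get(codes[ch])

def recall(model, text: str) -> float:
    """Fraction of next-character predictions the model still gets right."""
    pairs = list(zip(text, text[1:]))
    hits = sum(model.predict(p) == codes[n] for p, n in pairs)
    return hits / max(1, len(pairs))

model = ToyFirstOrderMemory()
early = "the quick brown fox jumps over the lazy dog. "
model.train(early)
print(recall(model, early))  # recall of the earliest chunk
# Keep feeding more literature and re-measure recall(model, early);
# the amount of text at which it collapses is the capacity estimate.
```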

1 Like

If you so doubt these ideas then I urge you to experiment with them and find the truth of the matter. Hands-on experience is a great way to build an intuitive understanding of how these things work. And good luck growing your TM :slight_smile:

2 Likes

It’s not about doubting the usefulness of SDRs as representations or embeddings. What I have an issue with is the assertion that one shouldn’t care about the depth of a TM (or any network) since SDRs are so powerful.
But they are just representations, inputs or outputs, not computational units as dendrites are.

As you already know, you need a ridiculously large SDR (or SP providing it) in order to touch a meager 97% on MNIST classification, achievable by a simple kNN or a mediocre (in size and compute) MLP.

Why can’t a thing as powerful as 10^NNN possibilities help with MNIST?

2 Likes

@cezar_t I actually tried working on solving the representational capacity of an SDR-space analytically. I called it the gnome-packing problem.

I stopped using SDR as a term, since they are not always sparse or distributed, and instead adopted the term “gnome” (as in gnomic meaning). I defined it as any instance of the set of all binary representations for a given N bits. However, we interpret the bits as follows: ‘1’ indicates the presence of meaning, and ‘0’ indicates the absence of meaning, but not its negation.

Sometimes I just use the term “binary pattern” to mean the same thing, if a person finds the term “gnome” confusing.

The gnome-packing problem is analogous to the general set-packing problem. To solve it, you need to settle on the number of bits N in your representational space, the size w of each gnome in number of ‘1’ bits, and the maximum overlap r, i.e. the number of intersecting ‘1’ bits allowed between any pair of gnomes in the packing. The goal is to find the maximum number of gnomes P, of weight w, that can fit into an N-bit string with at most r common set bits between any pair of gnomes.

A simple case would be N=4, w=2, and r=0. The maximum packing of gnomes is P=2, for example:

0011
1100

There are other combinations to be found, but the maximum is always the same.

A slightly less simple case would be N=4, w=2, and r=1. One gnome-packing that fits the parameters is:

0011
0110
1100
1001

This gives P=4. You can verify correctness by comparing any two rows and finding at most one common bit between them. You can hand-build your own gnome-packing with an enumeration like this, making sure the conditions hold. In some, you will find that the distribution of bits is uneven, resulting in lopsided over- or under-used bits.

I did a number of larger examples by hand to get a better intuition. You can change the problem: instead of maximizing P, specify P and then determine the size N you would need for a given w and r. Or you could try to solve for w or r if any 3 of the 4 parameters are given.

You could also solve for a packing by specifying the maximum representational load of any single bit. That is, no bit can be shared by more than m gnomes. We call m a bit’s multiplicity. This one in particular is interesting because it examines how much multitasking a single neuron would do when representing multiple things.

It turns out that this type of problem is actually an established area of mathematics called block design, or design theory. I haven’t had much chance to go into it, but it seems more art than science. Regardless, there should be tools here to measure an SP’s or TM’s representational capacity if you establish some constraints like the parameters above. Given that 0’s don’t have the same representational parity as 1’s, solving this packing problem is more like set-packing or combinatorics than pure base-2 representations.
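
As a sanity check, tiny instances can be brute-forced. The sketch below verifies a packing and greedily builds one; note that for the N=4, w=2, r=1 example above it finds P=6, since any two distinct 2-subsets of a 4-element set share at most one element:

```python
from itertools import combinations

def valid_packing(gnomes, r: int) -> bool:
    """True if every pair of gnomes (sets of '1' positions) overlaps in <= r bits."""
    return all(len(a & b) <= r for a, b in combinations(gnomes, 2))

def greedy_pack(N: int, w: int, r: int):
    """Greedily build a packing over all w-subsets of N bits.
    Gives a lower bound on the maximum P; only feasible for small N."""
    packing = []
    for cand in combinations(range(N), w):
        cand = set(cand)
        if all(len(cand & g) <= r for g in packing):
            packing.append(cand)
    return packing

assert valid_packing(greedy_pack(4, 2, 1), 1)
print(len(greedy_pack(4, 2, 0)))  # 2, matching the first example above
print(len(greedy_pack(4, 2, 1)))  # 6: any two distinct 2-subsets of a
                                  #    4-set share at most one element
```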

4 Likes