How are cells per column and sequence length related?

Thank you and yes, that’s exactly the kind of calculation I was looking for. Obviously I need to read that paper again and/or more carefully.

BTW if that 2% means 20 on-bits, that implies 1000 columns and a total of 32000 cells, so about 10 transitions per cell. A simple melody might be 100 notes, but a classical piano sonata is more than 10,000. A full piano repertoire is going to take quite a few columns and cells to memorise.
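A quick sanity check of that arithmetic in code, using the capacity formula quoted from the paper later in this thread (variable names are mine):

```python
# Back-of-envelope check, assuming 2% sparsity means 20 on-bits
sparsity = 0.02              # 2% of columns active
on_bits = 20                 # active columns per input
cells_per_column = 32
patterns_per_neuron = 200    # "each cell recognizes 200 patterns" (paper)

columns = on_bits / sparsity                  # 1,000 columns
total_cells = columns * cells_per_column      # 32,000 cells
transitions = (cells_per_column / sparsity) * patterns_per_neuron  # 320,000

print(transitions / total_cells)              # -> 10.0 transitions per cell
```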

There may be a good deal of nesting between the layers resulting in a dramatic reduction in storage requirements.

This layering of parsing/nesting is part of why the brain has such a vast storage capacity.


I don’t see how one can possibly deduce the number of transitions per cell without knowing some other parameters besides sparsity, total minicolumns, and cells per minicolumn. The range of possibilities is enormous.

I’ll draw up some visualizations so someone can correct me if there is a flaw in my understanding of the theory (I readily admit that I am not a mathy person).

But before I do that, let’s just take the name of that paper at face value – neurons have “thousands of synapses”. Let’s pick 2,000 (that being the smallest number which can be called “thousands”).

Let’s take the layer dimensions that you described, where there are 1,000 minicolumns, 20 minicolumns per input, and 32 cells per minicolumn. If each cell has a max of 2,000 synapses, and if the activation threshold is set at 1 (in reality this would obviously be higher – I’m picking the lowest value because it results in the lowest capacity), then the cell will become predictive when any of some 2,000 cells become active. If we divide that by 20 (the number of minicolumns per input), then each cell can learn up to 100 transitions. This of course is already 10 times more than what you calculated, and it becomes astronomically higher as you increase the activation threshold.
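The same estimate as arithmetic (the threshold of 1 is a deliberate worst case, as noted above):

```python
# Lower-bound estimate of patterns per cell, with activation threshold = 1
max_synapses_per_cell = 2000   # "thousands of synapses", smallest case
minicolumns_per_input = 20

patterns_per_cell = max_synapses_per_cell // minicolumns_per_input
print(patterns_per_cell)       # -> 100
```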


I don’t see how one can possibly deduce the number of transitions per cell without knowing some other parameters besides sparsity, total minicolumns, and cells per minicolumn. The range of possibilities is enormous.

I just did that, quoting directly from the paper. Sparsity is 2%, cells per minicolumn is 32, so the expected duty cycle of an individual neuron is 32/0.02 = 1,600. Just reading the paper. No magic.

I’ll draw up some visualizations so someone can correct me if there is a flaw in my understanding of the theory (I readily admit that I am not a mathy person).

I look forward to it.

But before I do that, let’s just take the name of that paper at face value – neurons have “thousands of synapses”. Let’s pick 2,000 (that being the smallest number which can be called “thousands”).

Let’s take the layer dimensions that you described, where there are 1,000 minicolumns, 20 minicolumns per input, and 32 cells per minicolumn. If each cell has a max of 2,000 synapses, and if the activation threshold is set at 1 (in reality this would obviously be higher – I’m picking the lowest value because it results in the lowest capacity), then the cell will become predictive when any of some 2,000 cells become active. If we divide that by 20 (the number of minicolumns per input), then each cell can learn up to 100 transitions. This of course is already 10 times more than what you calculated, and it becomes astronomically higher as you increase the activation threshold.

That’s wrong. No single cell learns a transition; it takes 20 cells per pattern to yield 10 active synapses and 2:1 noise immunity. Your estimate is 10x what the paper says.


You’re correct, I was using the word “transition” incorrectly. What I calculated is recognized patterns. That said, 200 patterns is incredibly low compared to what is possible (and even normal) with the algorithm. I’ll post some visualizations and hopefully get to the bottom of where I am going wrong.

Visualizations may take a while to draw, so let me start with a quick analysis of how one input A,A,A,A,… repeating can be represented. This may be enough to demonstrate if and where I am going wrong in my understanding of the algorithm with respect to capacity.

In the TM algorithm, when an input is completely unpredicted, all the cells in that input’s minicolumns activate (the algorithm calls this bursting), and one winning cell per minicolumn is chosen to represent the input in that context. The winner in each minicolumn is selected from the cell(s) with the fewest distal segments, using a random tie-breaker.

Suppose we have trained on the first three inputs of A, and all cells now have one distal segment each. Let’s call these representations A(1), A(2), and A(3). If your layer dimensions were 4 minicolumns per input and 3 cells per minicolumn (and assuming SP boosting is not enabled), the representations might be something like this (we do not need to consider the 196 other inactive minicolumns, since they will never have active cells in them in this scenario):

[image: the representations A(1), A(2), and A(3), drawn as grids of 4 active minicolumns (columns) × 3 cells (rows), with the winning cell in each minicolumn highlighted]

If “A” is input a fourth time, and a representation is chosen for A(4), the chance of the four random tie-breakers resulting in exactly one of these three representations is 3 / 3^4 (“number of repeated inputs” / “all possible representations”). Then for A(5) it would be 4 / 81, then for A(6) it would be 5 / 81, etc. The numerator increases by one for each new set of tie-breakers (i.e., the longer the sequence, the higher the likelihood of randomly selecting a representation that has already been used in the sequence).

What are the chances of this happening in the layer dimensions you mentioned earlier? The number of possible representations for 20 minicolumns containing 32 cells each is 32 ^ 20. Thus, in a layer of this size, it is infinitesimally likely that any element in a sequence of repeating inputs would by random chance happen to be exactly the same as a previous representation, until that sequence has become astronomically long.
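A minimal sketch of those odds (the function name is mine, purely illustrative):

```python
# Chance that a fresh set of random tie-breakers reproduces one of the
# representations already used in the sequence (toy dimensions from above)
def p_repeat(reps_so_far, minicolumns=4, cells=3):
    return reps_so_far / float(cells ** minicolumns)

print(p_repeat(3))   # A(4): 3/81, about 0.037
print(p_repeat(4))   # A(5): 4/81, about 0.049

# At the dimensions you mentioned (20 minicolumns, 32 cells each):
print(32 ** 20)      # ~1.27e30 possible representations
```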

With this in mind, let’s consider an HTM system with the dimensions that you mentioned earlier, trained on a repeating input A, A, A, A, A…

Let’s train it like this, where we start with some other input Z after a reset, then increase the number of A’s one at a time. This will ensure that we always start with the A(1) representation (versus a burst in the A minicolumns, which requires multiple iterations to disambiguate). Training the system this way will speed up the training process in case you want to repeat this experiment with NuPIC. The different representations involved will be learned like so (a rough code sketch follows the list):

(reset) Z → A(1)
(reset) Z → A(1) → A(2)
(reset) Z → A(1) → A(2) → A(3)
(reset) Z → A(1) → A(2) → A(3) → A(4)

(reset) Z → A(1) → A(2) → (…) → A(31) → A(32)
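A rough sketch of that training loop for NuPIC might look like the following (the import path and parameter names are from memory and may differ between versions, and the column sets for Z and A are made up for illustration – treat this as pseudocode):

```python
# Rough sketch only -- assumes NuPIC's TemporalMemory API
from nupic.algorithms.temporal_memory import TemporalMemory

tm = TemporalMemory(columnDimensions=(1000,), cellsPerColumn=32,
                    activationThreshold=1, maxNewSynapseCount=20)

Z = range(0, 20)     # hypothetical active minicolumns for input Z
A = range(20, 40)    # hypothetical active minicolumns for input A

for n in range(1, 33):                 # grow the run of A's one at a time
    tm.reset()
    for active_columns in [Z] + [A] * n:
        tm.compute(sorted(active_columns), learn=True)
```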

Now, let’s assume we set the activation threshold to 1 (lowest capacity). When we begin inputting the next sequence into the system:

(reset) Z → A(1) → A(2) → (…) → A(32) → A(33)

The representation for A(33) will consist of random bits that are each contained in one of the previous 32 representations for A (since, at this point, all cells in the A minicolumns will have been used exactly once). Thus, when A(33) becomes active, some combination of the previous A representations will become predictive, and the A minicolumns will never burst again (and thus no new representations for A will be chosen).

Thus the capacity of a system of this size and configuration (for a single repeating input) is 32 transitions. Hopefully it is intuitive that if the sequence has more diversity than just one repeating input (or is using SP boosting), the number of transitions would be larger than that. So 32 (i.e. the number of cells per minicolumn) represents a lower bound of capacity for this size of system, when the activation threshold is equal to one.
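Here is a toy simulation of that lower bound (plain Python, my own function names; it models only the winner-selection rule, not the full TM):

```python
import random

def simulate_repeating_input(minicolumns=20, cells=32):
    # segments[c][i] = number of distal segments on cell i of minicolumn c
    segments = [[0] * cells for _ in range(minicolumns)]
    reps = []
    for _ in range(cells):  # after `cells` bursts, every cell has one segment
        rep = []
        for col in segments:
            fewest = min(col)
            winner = random.choice(
                [i for i, n in enumerate(col) if n == fewest])
            col[winner] += 1
            rep.append(winner)
        reps.append(tuple(rep))
    return reps

reps = simulate_repeating_input()
# 32 distinct representations; with threshold 1, every cell has now been
# used, so further A inputs are predicted and the minicolumns never burst
print(len(reps), len(set(reps)))   # -> 32 32
```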

Now consider the other extreme, and assume the activation threshold is 20 (the number of minicolumns per input) and that max synapses per segment is at least 20. Now each segment fully connects to each representation and can uniquely distinguish them, so we are bounded only by the number of distal segments that a cell is allowed to grow. This can be set to any arbitrary value (setting aside biological plausibility).

Thus the capacity of a system of this size and configuration (for a single repeating input) with activation threshold 20 is 32 ^ 20 transitions (in practice it will be somewhat lower, because the unlikely random event I mentioned earlier – a previous representation being chosen – will happen at some point before then). This astronomical capacity of course comes at the cost of zero noise tolerance.

Adding in biological constraints, or configuring your system for various properties like noise tolerance, one-shot learning, etc. could place the capacity of a given HTM system with these same dimensions anywhere within this vast range of possibilities.

Does anyone see if and where I am going astray in my understanding of capacity?


I’m a bit confused by this graphic. Are the columns in each representation supposed to be minicolumns, and the rows the cells in them? If the spatial input A is seen in a different context, the same minicolumns should be activated, but different cells within them should be active. Am I missing something? Or maybe you are just showing the active minicolumns and excluding inactive ones… in which case this makes sense, but is a bit confusing.


Yes, I am showing the active minicolumns and excluding the inactive ones. In this case, there are 4 minicolumns per input and 3 cells per minicolumn. Since I am repeating the same input (and SP boosting is not enabled), there is no need to draw the other minicolumns, since they would never be active.

I’ll see if I can draw that a little better and will update my post.

Updated


To summarize my argument, what the Neuron Paper identifies as “the number of patterns recognized by the basal synapses of each neuron” is highly impacted (over many orders of magnitude) by several configurable parameters. These include the activation threshold, max synapses per segment, max segments per cell, SP boosting, and the diversity of the sequences being learned. Thus I do not think the original question, as stated, has a useful answer without also considering these other factors.


Do you accept the calculation in the ‘1000 synapses’ paper? Here it is (again):

This can be calculated as the product of the expected duty cycle of an individual neuron (cells per column/column sparsity) times the number of patterns each neuron can recognize on its basal dendrites. For example, a network where 2% of the columns are active, each column has 32 cells, and each cell recognizes 200 patterns on its basal dendrites, can store approximately 320,000 transitions ((32/0.02)*200). The capacity scales linearly with the number of cells per column and the number of patterns recognized by the basal synapses of each neuron.

Seems pretty clear to me, and directly answers the question I asked, which was:

Specifically, does the number of cells per column correlate with the length of sequence that can be recognised? Or with the number of different sequences recognised?

So what’s the issue, exactly?


I suppose nothing, really – except that without knowing how “the number of patterns recognized by the basal synapses of each neuron” is determined, does the answer really help you understand how cells per column and length of sequence are related? The original question seems to imply that these two factors are closely linked. I am just pointing out that there are additional factors to consider.


Just to toss in a non-Numenta confounding factor …
There is an issue that gets tossed around from time to time here: repeating sequences.

You can search for it and see some of the discussions.
I proposed habituation as a possibility in the solution set.

I only mentioned them here because I have been working with them for a while and it was easy to use them to demonstrate my point (they are also an extreme example of where diversity, or the lack thereof, in the sequences being learned has an impact on actual capacity).

The real point here is that if one were to find themselves in a situation where their HTM configuration did not have enough capacity for the problem at hand, the best answer might not be to add more cells per minicolumn. Adjusting another property, such as the activation threshold, may be a better option, depending on the use case.


Can you elaborate on the “nesting” behaviour of TM? How does it happen and how does it work?

I’m asking because I was thinking from another angle and came up with the same requirement for TM. As far as I understand the TM algorithm, I can’t see how this will happen – yes for detecting variable-order sequences, but not for “nesting”.

See: temporal pooling.

By coincidence, I just read about temporal pooling earlier today… but this does not seem like nesting, more just like “labeling” sequences.

The output layer learns representations corresponding to objects. When the network first encounters a new object, a sparse set of cells in the output layer is selected to represent the new object. These cells remain active while the system senses the object at different locations. Feed forward connections between the changing active cells in the input layer and unchanging active cells in the output layer are continuously reinforced. Thus, each output cell pools over multiple feature/location representations in the input layer. Dendritic segments on cells in the output layer learn by forming lateral modulatory connections to active cells within their own column, and to active cells in nearby columns. During training, we reset the output layer when switching to a new object.


That is an initial, naive implementation of TP (if you can even call it that yet), IMO. When we reach a level of sophistication where those pooled representations encode proper semantics and they are used themselves as components of other objects, then it really won’t be simple labeling anymore.


When considering temporal pooling, I typically come down in one of two places.

The first way of looking at it is that the pooling cells are acting like low-pass filters. They perform a sort of temporal averaging in order to maintain a persistent representation of features in the domain that might be composed of smaller, more transient features on the input sensors.

The second way I’ve thought about them is to imagine the TP representations as closed loops. That is to say, a persistent representation is formed by establishing a sequence of SDRs that repeat in a loop. I’m working on an implementation of this form of TP now.

For each of these, nesting occurs by associating transient inputs with the more stable representations. If one can assume that some of the more stable attributes of the sensed object/feature are encoded by the TP representation, then all the lower layers need to worry about is tracking the perturbations of the input from the mean expected behavior.
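To make the first (low-pass) view concrete, here is a toy sketch – my own code, not Numenta’s union pooler; the function name, decay scheme, and defaults are invented for illustration:

```python
import numpy as np

def pool(input_sdrs, n_cells=2048, decay=0.9, sparsity=0.02):
    """Leaky accumulation of input activity plus a top-k cut: transient
    inputs wash out, persistent ones dominate the pooled output."""
    k = int(n_cells * sparsity)        # pooled representation stays sparse
    score = np.zeros(n_cells)
    pooled = []
    for sdr in input_sdrs:             # sdr: array of active cell indices
        score *= decay                 # slowly forget old activity
        score[sdr] += 1.0              # accumulate current activity
        pooled.append(np.sort(np.argsort(score)[-k:]))  # k most persistent
    return pooled
```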


Sorry, I was talking about nesting in sequences, not nesting of “labeled” objects. So in a sense you should have nesting on both levels.

For example, if you have the following sequences:

 1. ABCDEF
 2. GBCHBCDXY

In effect, the common parts get “compressed” on the fly (or maybe while we sleep):

R1: B,C
R2: R1,D

so they become :

1. A,R2,E,F
2. G,R1,H,R2,X,Y

The TM variable-order sequence (VOSeq) algorithm does not do that … it always “records” the full sequence.


Why does TM have to be able to do that?

The first, minor benefit is that the capacity of the TM will grow: repetitive sequences will take almost no space.

The major benefit is that it will simultaneously encode all encountered sub-sequences too. Partial matches can then happen automatically.

The drawback is that the branching (bursting) logic will be more complex, or we need a sleep-consolidation process.

BTW: There are online algorithms to do this type of compression.
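Sequitur is probably the best-known online example. As a rough illustration of the idea, here is a naive offline, Re-Pair-style sketch (my own code) that reproduces the R1/R2 rules above:

```python
from collections import Counter

def replace_pair(seq, pair, name):
    # Rewrite one sequence, substituting `name` for each occurrence of `pair`
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compress(seqs):
    # Repeatedly replace the most frequent adjacent pair (across all
    # sequences) with a new rule symbol, until no pair repeats
    rules, next_id = {}, 1
    while True:
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = "R%d" % next_id
        next_id += 1
        rules[name] = pair
        seqs = [replace_pair(s, pair, name) for s in seqs]
    return seqs, rules

seqs, rules = compress([list("ABCDEF"), list("GBCHBCDXY")])
print(seqs)   # [['A', 'R2', 'E', 'F'], ['G', 'R1', 'H', 'R2', 'X', 'Y']]
print(rules)  # {'R1': ('B', 'C'), 'R2': ('R1', 'D')}
```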


From my perspective, sequences are the same thing as objects (the only difference is where the distal signal is coming from).

I do not think that TP is part of TM (in my current understanding, these two processes must run in different populations of cells, because they require a temporal differential – this separation also matches TBT as it currently stands).

In any case, there is a lot of evidence that the brain does this sort of chunking, and HTM is ultimately intended to faithfully model biology. And really, just from observing myself how I replay music in my head, I know that I construct the “object” of a song in components (especially sections that repeat themselves – I don’t think of them as different, but as semantically identical other than their position).

This is very different from the way the TM algorithm alone currently functions, where, for a given sequence, each iteration through its sub-sequences involves a completely different (semantically dissimilar) set of representations with virtually no overlap.

BTW, I posted on this thread a while back how I see this sort of thing working in a hierarchy, where abstractions work their way down the hierarchy the more frequently their components are encountered. I still see this as one of the requirements for a “good” TP algorithm.
