Alternative sequence memory

Hi Paul, I’m not sure what you mean there. Would you mind explaining it further?

But you’re right, there are other ways of using shifts. It has occurred to me that shifts could encode many things into a single SDR. In the case above it just encodes the past with right-shifts (though they should be left shifts, my mistake). You could encode past, present and future in one SDR: past is a left shift, the present is unshifted, and the future is a right shift, with the future being predictions. To maintain sparsity, the union would need to be sparsified.
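
To make that concrete, here is a rough sketch of what I mean (just an illustration; the vector size, shift amounts and the way the union is re-sparsified are arbitrary placeholder choices):

```python
# Rough sketch only: encode past / present / future SDRs into one vector
# with shifts, then re-sparsify the union. Vector size, shift step and
# the sparsification rule (random subsampling) are placeholder choices.
import numpy as np

rng = np.random.default_rng(0)
N = 256   # SDR width
W = 8     # active bits per SDR

def random_sdr():
    sdr = np.zeros(N, dtype=bool)
    sdr[rng.choice(N, W, replace=False)] = True
    return sdr

def shift(sdr, t):
    # negative t = left shift (past), positive t = right shift (future);
    # np.roll wraps around, standing in for a plain shift
    return np.roll(sdr, t)

past, present, future = random_sdr(), random_sdr(), random_sdr()
union = shift(past, -1) | present | shift(future, +1)

# Re-sparsify by keeping only W randomly chosen active bits
active = np.flatnonzero(union)
keep = rng.choice(active, min(W, active.size), replace=False)
sparse_union = np.zeros(N, dtype=bool)
sparse_union[keep] = True

print(union.sum(), sparse_union.sum())   # e.g. ~24 active bits reduced back to 8
```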

Sure. Let’s say we have a coffee mug like this:

image

It consists of a number of features. These can be encoded in SDRs. For example, let’s consider these three:

image

Now we can imagine a grid of locations that encompasses the object. For example:

image

Then we need to convert each 3D coordinate into a scalar value to use as the shift. For a 3D object like this, with grid width W and height H:

S = X + (Y * W) + (Z * W * H)

So for example, the feature at (1,1,0) would have shift 3.
image

A feature at (1,0,1) would have shift 5.
image

And a feature at (0,0,1) would have shift 4.
image

The same concept could be used for any number of dimensions.
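
Putting the shift calculation into code, it could look something like this (a sketch only, assuming a 2x2 grid so W = H = 2, which matches the shifts in the examples above; the SDR width and the random feature SDRs are placeholders):

```python
# Sketch: convert 3D feature locations to shift amounts and union the
# shifted feature SDRs into one object SDR. Grid size, SDR width and the
# random feature SDRs are placeholder assumptions.
import numpy as np

W, H = 2, 2   # grid width and height (assumed)

def location_to_shift(x, y, z):
    return x + (y * W) + (z * W * H)

print(location_to_shift(1, 1, 0))   # 3
print(location_to_shift(1, 0, 1))   # 5
print(location_to_shift(0, 0, 1))   # 4

N = 64   # SDR width (arbitrary)
rng = np.random.default_rng(1)

def feature_sdr():
    sdr = np.zeros(N, dtype=bool)
    sdr[rng.choice(N, 4, replace=False)] = True
    return sdr

# Union of features shifted by their location scalar
# (np.roll wraps around, standing in for a plain shift)
features = {(1, 1, 0): feature_sdr(), (1, 0, 1): feature_sdr(), (0, 0, 1): feature_sdr()}
object_sdr = np.zeros(N, dtype=bool)
for (x, y, z), sdr in features.items():
    object_sdr |= np.roll(sdr, location_to_shift(x, y, z))
```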


I think this is fascinating, Sebastian. Keep going with it!

On the note of object recognition, I’ve been toying with the idea of using a deep learning network (the first few layers of a convolutional neural network anyway) to do edge detection, and feed that into an SDR based system.

(Background on convolutional neural networks)
http://cs231n.github.io/convolutional-networks/

The point would be to semi-reproduce what our brains seem to do with vision, where corners, edges, and motion seem to be processed in the cerebral cortex before sending data into the neocortex… the cerebral part has been trained and honed over millions of years of evolution, while our interpretation of what we see varies from culture to culture, I assert, based on experiences stored up in the neocortex. The benefit of using the DL model for edge detection is that it’s already trained to see those parts (I’d rip out the layers associated with classification). Then you would simply feed those detected edge/corner values into an SDR-based system (which is also aware of location/time/temperature-based context through synchronous distal connections to other senses’ pooling layers). The hope would then be a more contextual, higher-level association of time/location/senses, in the same way that a smell, sight, or feel of morning sun can trigger memories… randomized, fluctuating weights of those distal connections between different pooling layers could work as a rudimentary seed of creativity (i.e. the ability to make abstract connections between seemingly unrelated ideas). Another goal would be the ability to more quickly learn/classify contexts and situations.

Since I’m more focused on earning income at this time of the year (gotta pay my bills and all), I haven’t put more than a few pages’ worth of notes into the concept. But if somebody’s looking for a project or inspiration, by all means feel free to take this idea and run with it.

I get you now, cheers. Were you then thinking it would be combined into a union?

In these examples we’ve looked at the shifts using a 2D grid of cells (I suppose for visual clarity). In practice they would likely be done using 1D vectors of indices. However, 2D vectors have the advantage that you can exploit both the X and Y dimensions for encoding (in 1D you can only use X - shift left and right). It might be possible for an SDR to also exist in a 3D vector where shifts also have Z to exploit. All operations could still be performed on 2D and 3D SDRs. I’m not sure if this would be necessary for your scenario, as spatial dimensions can be translated into 1D shifts. However, the 2nd dimension could encode how the object changes over time. So X encodes features in space, Y encodes how those features move/change through time, and Z could encode a finite number of alternate paths through time. This could be highly valuable for selecting a prediction that is related to a ‘goal state’, for example. As we look into the future we don’t just think linearly (a single path through time), we also think laterally (many different paths through time), then select & combine paths in this space to produce a strategy (… in theory). So all this could be done in parallel.


I suppose the advantages of using shifts could also be achieved using partitions. Essentially it is all about spreading the individual SDRs out in a union to avoid ambiguity.

With partitions, individual SDRs are delegated a partition within a larger SDR space. If an individual SDR has a range of 256 bits with a population of 8 active bits, and you concatenate that 4 times, you get 4 partitions within a master 1024-bit space with 32 active bits. To encode a sequence like before, instead of shifting an SDR by t, the partition it is delegated to is equal to t.

Using the same 2D grid visuals as before, it would look like this.

concat

The ideas from before are all the same: X, Y and Z can still encode all the same things, and the operations will work all the same. The difference is that it is simpler, but you need a larger SDR range to encode more things.
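
A quick sketch of the partition idea using the numbers above (the random SDRs are just placeholders):

```python
# Sketch: 256-bit SDRs with 8 active bits, concatenated into 4 partitions,
# giving a 1024-bit master SDR with up to 32 active bits. The partition
# index plays the role of t.
import numpy as np

RANGE, POP, PARTITIONS = 256, 8, 4
rng = np.random.default_rng(2)

def random_sdr():
    sdr = np.zeros(RANGE, dtype=bool)
    sdr[rng.choice(RANGE, POP, replace=False)] = True
    return sdr

def encode_sequence(sdrs):
    # Place the SDR at time t into partition t of the master SDR
    master = np.zeros(RANGE * PARTITIONS, dtype=bool)
    for t, sdr in enumerate(sdrs):
        master[t * RANGE:(t + 1) * RANGE] = sdr
    return master

A, B, C, D = (random_sdr() for _ in range(4))
abcd = encode_sequence([A, B, C, D])
print(abcd.size, abcd.sum())   # 1024 32
```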


The brain is unlike a computer in that a computer is very precise (i.e. if you change one bit in a byte in computer memory it changes the whole meaning of the representation, whereas in the brain it only slightly changes the meaning). The brain tends to be imprecise because that imprecision facilitates flexibility. SDRs are a great example of flexibility. However, sequence memory in computers tends to be quite rigid. The shifting example above encodes ABCD in that exact sequence. ABCD can be content-accessible by providing incomplete sequences like A-C-, AB--, -B-D, A--D, etc. (where - is a placeholder for any symbol), but that exact order needs to be maintained (A then B then C then D). ABCD will have little overlap when compared to AABCCD or A-BC-D even though our brains still perceive them as being similar. How could this invariance be applied when processing a sequence of SDRs?

A sequence can be spread out across a binomial tree, breaking the sequence into parts.

image

In first-order memory A activates (or ‘maps to’) B, then B activates C, then C activates D. Then in the next level of the tree AB activates BC, then BC activates CD. The same applies further up the tree. This is lateral activation. There is also vertical activation, wherein if A & B activate in that exact sequence then the node above (AB) will be activated. Then if AB & BC are activated in that exact order then the ABC node above will be activated. The core idea here is that a sequence is made up of discrete sub-sequences.

The sub-sequences in the tree can be compared to those in another tree to find overlap.

There is enough overlap in AABCCD to fulfil the ABCD sequence.

image

As BC & CD in the ABCD tree overlap with BC and CD in the AABCCD tree, they activate BCD, which in turn activates ABCD.

Let’s take a less similar sequence - ABVVVCD. ABCD does exist in this sequence; it’s just that AB and CD are separated by a sub-sequence of VVV. (Even upon reading the sequence ABVVVCD as a human, it might not be immediately obvious that ABCD is in the sequence. This is simply because there is slightly less sequential similarity/overlap.)

Overlaps are only found on the first row of the tree. On the ABCD tree there is not enough vertical activation to fully activate ABC and BCD; however, they could be partially activated, then combine vertically to partially activate ABCD. This partial activation is ideal as it shows that it is not a 100% match. This is analogous to the activation frequency of an SDR - if it is highly similar it activates at a higher frequency, while if it has low similarity it activates at a low frequency. In this case, if VVV was already a known sub-sequence then it will activate with a higher frequency than ABCD. In a competitive population of cells with lateral inhibition, the VVV representation will inhibit the ABCD representation as it will have a higher frequency. However, if you had an attentional bias for ABCD (i.e. if you were looking for it) then top-down activation would increase the frequency of ABCD, allowing it to inhibit VVV.

image

I will compare one more sequence that is very dissimilar to ABCD. The sequence AVBFCSD contains ABCD even though it is not very obvious. For me (as a human) I would not have spotted ABCD immediately, even if I was looking for it. However, using another method ABCD could still be detected.

As seen above, there is no direct overlap in any of the nodes. However, take the first row of the AVBFCSD tree and apply each node into a union using an OR operation. Then build another union from the first row of the ABCD tree. Then compare the unions by applying an overlap/AND operation to reveal the AB, BC, CD nodes. This same procedure can be applied to all the rows of multiple trees simultaneously. Using this method should ideally result in a low-frequency activation of the ABCD nodes.

I have yet to work out the implementation details of all this. The basic idea is that each node of the tree will encode the child nodes by bit-shifting them into a union. This will make comparisons with other nodes/trees quite simple as nodes can reverse-shift back to their original SDRs - maintaining full semantic value.
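
Here is a rough sketch of that row-union comparison (my interpretation only; each first-row node ORs its two child SDRs with the second child bit-shifted by one position, as just described, and the random symbol SDRs are placeholders):

```python
# Sketch: compare the first rows of two trees by OR-ing each row into a
# union, then AND-ing the unions. Each node encodes its two children with
# the second child shifted by one position (np.roll stands in for a shift).
import numpy as np

N = 128
rng = np.random.default_rng(3)
symbols = {s: np.zeros(N, dtype=bool) for s in "ABCDVFS"}
for sdr in symbols.values():
    sdr[rng.choice(N, 4, replace=False)] = True

def node(first, second):
    return symbols[first] | np.roll(symbols[second], 1)

def first_row(seq):
    return [node(a, b) for a, b in zip(seq, seq[1:])]

def row_union(nodes):
    return np.logical_or.reduce(nodes)

overlap = row_union(first_row("ABCD")) & row_union(first_row("AVBFCSD"))

# Which ABCD nodes are fully contained in the overlap?
for name, n in zip(["AB", "BC", "CD"], first_row("ABCD")):
    print(name, bool(np.all(overlap[n])))   # all three should print True
```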


Do you mean HTM, or other sequence memory algorithms? I would argue that HTM is not so rigid. Assuming an HTM layer has learned the sequence ABCD, here is how HTM would handle some of your examples:

AABCCD

The elements that would burst here are the first A, the second A, and the second C.

Assuming a union of active cells in the TM layer (not the best TP algorithm, but an easy one to visualize), the union would contain all of the same cells as a union of the ABCD sequence, plus some extras in the C minicolumns. It thus shares the semantics of “bursting A”, “B after A”, and “D after C after B after A”.

ABVVVCD
The union for this one would be the same as the previous one, plus all of the cells in the V minicolumns. This is of course more dissimilar, but like the previous one, still shares the semantics of “bursting A”, “B after A”, and “D after C after B after A”.

AVBFCSD
This one of course has the most trouble. While technically it still contains all of the cells from the ABCD sequence, there is considerably more noise. Coincidentally, this is also the most difficult one for a brain to recognize (made easier with “voting” from other regions as we are discussing the sequence ABCD…)


I wasn’t specifically referencing HTM, as there are other sequence memory algorithms I know of. But you’ve shown that there’s more I need to learn about the HTM TM, as I was not aware it could do what you have shown.

No problem. Of course the concept of “voting” is still being explored (for SMI, but I believe a similar concept could be used in TM), so I shouldn’t overstate HTM’s capabilities. Your last example would not perform well in the current algorithms. The ABCD sequence would be predicted, but many other sequences containing the same letters would be as well.

In case it helps, here are some visualizations of what I posted above. Let’s assume the following minicolumns represent A-D (keeping it small and dense for the sake of demonstration – in practice, of course, there would be a lot more minicolumns involved to provide sparsity)

image

And the additional letters needed for the examples:

image

Now a trained HTM system that has learned ABCD will represent a union of the active cells in the TM layer like so (assuming sequence starts with a reset, so A minicolumns always burst):

image

Representations of the other sequences would look like so (orange highlighted cells indicate overlap with the above ABCD representation)

image


Cheers! As you have probably noticed, I’m a very visual person 🙂

I’ve realized I have not fully watched all of Matt’s latest TM videos, where he goes into more detail on TM than I was aware of before. So I’m checking them out now.

In the ABVVVCD example, could TM concurrently detect ABCD & VVV each as their own sub-sequences?


Let’s use “VFS” instead of “VVV” (repeating inputs are a special case, and would muddy the waters for this conversation). If we were to start with an HTM system that has learned two sequences, ABCD and VFS, when it encountered the first “V” in ABVFSCD, that “V” would be unexpected and its minicolumns would burst. This burst would put the “F” into a predictive state, and the “S” would be correctly predicted in the next timestep. Then the “C” would be unexpected and its minicolumns would burst. This burst would put the “D” into a predictive state.

Visualization of cell states through the ABVFSCD sequence:

image

Now if ABCD were represented as a union like in my previous post:

image

And VFS were represented like this:

image

The representation that came out of ABVFSCD would look like so:

image

So both sequences fully overlap (plus some extra cells in the “C” minicolumns). Both sequences would be predicted by this representation.

It is also worth noting that the two separate sequences could become merged into one new sequence if learning was enabled and the new pattern was encountered enough times (depending on the configuration).


I may have noticed a few issues with this (if I understand it correctly):

  • In the ABVVVCD example, C will burst only if a sequence that starts with VC has not already been learned. If it has, then C will not burst, and therefore D in the context of bursting C will not activate - C in the context of V will activate instead.

  • If D did activate in the context of C bursting it would not be in the context of AB occurring before CD. The sequence CDVVVAB will have the same representation as ABVVVCD as there is no indication of AB coming first. This causes AB…CD and CD…AB to be ambiguous.

    image

  • If ABVVVCD was a pattern that was already learned, then the ABCD pattern would not have any overlap, as C & D would have a different context and therefore would not activate the same cells in the C and D columns.

I’m just playing devil’s advocate here. Please correct me on any of the points I’ve made.

Devil’s advocate again - from this union I don’t know what came first. Was it VFS, AB, or CD? Or what came second or last? This union could represent either ABVFSCD, CDVFSAB, ABCDVFS, CDABVFS, VFSABCD or VFSCDAB.

Correct. In that case, the burst would happen later in the sequence (once it varied from the hypothetical sequence that started with VC). These are small sequences here, but for larger sequences you can see how HTM quickly recovers after switching from one sequence to another and back again.

Correct, from this union you cannot tell them apart. That is one reason I mentioned that union isn’t the best TP strategy (also because it quickly becomes saturated, leading to false positives)

I’m guessing the burst will occur on D as VC will have ended? I suppose this issue is similar to the last point I mentioned:

The more sequences that are learned that contain the same SDRs, the less bursting there will be, and therefore the fewer transitions between sequences occur. The more it learns, the harder it becomes to determine the starting point of any sub-sequence.

I’ve heard there is another TP algorithm. Does it resolve this issue?

That is a good analysis. I don’t personally see this as an indication that HTM is on the wrong path, though, since I believe brains suffer from the same problem. Suppose you as a human have learned a few sequences equally well, such that they are immediately recognizable to you. Also assume these are long sequences… I’m just showing a few letters of them:

…ABCDEFG…
…BCDEFAB…
…CDEFBCD…
…SEFCDEFS…

And then out of the blue, you unexpectedly start to hear “DEF…” (assume there is no other input providing context). There is no way you could tell me which of the above sequences is playing, but you would certainly expect that it is one of them since you know them so well. As soon as the next letter comes in, you will immediately know which sequence you are in. My point is that even for a brain, it can take a few samples to clear up ambiguity when context has been lost.

I have experimented with several. One of the easier ones to visualize which would address this problem is based on logarithmic decay, something like the concept of eligibility traces from RL. The idea is that inputs that happen most recently would be more heavily represented than inputs which happened further back in time. Thus more cells representing “D after C after B after A” would be active than cells representing “S after F after V”, which would be more than the number of cells representing “B after A”, and so on. With a representation like that, one could tell that “VFS” happened after “AB” but before “CD”.
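
To make that concrete, here is a minimal sketch of the decay idea (not the actual algorithm I experimented with; the exponential decay rate and random SDRs are placeholders):

```python
# Sketch: a decay-weighted pooled representation where older inputs keep
# fewer of their active cells, so relative recency can be read back out
# of the union. Decay rate and SDR parameters are arbitrary choices.
import numpy as np

N = 512
rng = np.random.default_rng(4)

def sdr(n_active=32):
    out = np.zeros(N, dtype=bool)
    out[rng.choice(N, n_active, replace=False)] = True
    return out

def pooled(inputs, decay=0.5):
    union = np.zeros(N, dtype=bool)
    for age, x in enumerate(reversed(inputs)):          # age 0 = most recent
        active = np.flatnonzero(x)
        keep = max(1, int(len(active) * decay ** age))  # fewer cells the older the input
        union[rng.choice(active, keep, replace=False)] = True
    return union

letters = {c: sdr() for c in "ABVFSCD"}
rep = pooled([letters[c] for c in "ABVFSCD"])

# Estimate recency by counting how many of each input's cells survive
for c in "ABVFSCD":
    print(c, int((rep & letters[c]).sum()))   # counts roughly increase toward "D"
```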

That’s super interesting. I came to a somewhat similar idea the other week (although from a different perspective), which I described in the original post under ‘Graded SDRs’, where the SDRs in a union within an arbitrary time-window are ‘weighted’ depending on how recent they are. However, that idea was replaced with bit-shifting because it avoids using graded SDRs while maintaining the use of binary SDRs and all the ordinary operations to process them. Though I find it interesting that you could quantify the t+n of an SDR based on the number of cells representing it - I’d not thought of that.

When you think of plugging into a hierarchy (theoretically – I haven’t actually tried it yet), you can imagine that this will impact which minicolumns win out during SP in the next higher level. You should be able to get a range of outputs that share different degrees of overlap, because the receptive fields of some minicolumns will favor a particular element in a sequence being more heavily represented, while other minicolumns will favor other elements of the sequence being more heavily represented.

I’ve been thinking about this TP issue and thought I’d try the other TP method we briefly discussed in another thread.

The code I use for the following examples can be found in this gist.

To see it in action, check out this codepen.

In the above gist I’ve written a network that does first-order-memory (layer 4) and temporal pooling (layer 3). I have used two sequences for demonstration - ABCD & XBCY. I have not implemented learning as I just want to demonstrate recall, so the weights are hard-coded. The symbols (XABCDY) are represented as single cells instead of SDRs for simplicity.

Layer 4 is simple, as it just does first-order memory. So A or X activates B, B activates C, and C activates D and Y. All the weights in the matrix are binary, so the pre-synaptic neuron will activate the post-synaptic neuron instead of just depolarising/predicting it; I’ll explain why later. All the synapses between the neurons in layer 4 are in the distal weight matrix.

Layer 3 has two cells to represent the two sequences. I have labelled them N & K. N represents ABCD, and K represents XBCY. While either sequence from layer 4 is playing out, the associated cell in layer 3 remains active. The synapses between layer 4 and 3 are in the proximal weight matrix. The weights are continuous. They simply represent the order in which the inputs arrive. For example, for N: A=4, B=3, C=2, D=1; for K: X=4, B=3, C=2, Y=1. This simple encoding allows for competition between N & K during recall. If A was fed in from layer 4 then N will have a value of 4, and K will have a value of 0. Feed in B from layer 4, and feed back in the output from layer 3, then N will equal 43 and K will equal 3. The value for N is higher than K, and will remain that way for the duration of the sequence if C and D follow. I’ll explain later how layer 3 cells have their activation calculated.

The output from layer 3 is fed back into layer 4 via the apical weights. By summing the inputs and weights from the distal and apical synapses, the cell with the highest value is selected to be active. This again is based on the idea of competition. After C has been fed in, N will equal 432 and K will equal 32. C activates D & Y, but as D has apical synapses from N, D will have a value equal to N, while Y will have a value equal to K, so D wins out.

The proximal synapses in layer 3 detect sequences in a specific order. A cell that represents a particular sequence from layer 4 will increase in value as the sequence plays out. For ABCD, N will increment like so: N=4, N=43, N=432, N=4321. If the sequence from layer 4 is incorrect the layer 3 cell will still increment, but just not as much. So for ACBD, N will increment like so: N=4, N=42, N=423, N=4231. For DCBA, N will increment: 1, 12, 123, 1234. The value of a cell in layer 3 is calculated: layer3[i] = layer3[i] * 10 + layer4[j]. So cells in layer 3 that have similar representations will have closer values than those with dissimilar representations.

In the code above I just input the beginning of the sequence (A or X) and let the memory unfold automatically over time. The interesting thing is that if I input X, the layer 3 cell K has already predicted the whole sequence. The same goes for input A.

image
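
For anyone who doesn’t want to dig through the gist, here is a simplified sketch of the same idea (a reconstruction in Python, not the actual code; the weights are the hard-coded ones described above):

```python
# Simplified sketch of the recall mechanism described above (not the gist
# code). Layer 4 does first-order memory over single-cell symbols; layer 3
# cells N and K accumulate ordered evidence for ABCD and XBCY and feed
# back apically to break the tie at C -> D/Y.

# Layer 4 distal weights: first-order transitions (binary)
distal = {"A": ["B"], "X": ["B"], "B": ["C"], "C": ["D", "Y"]}

# Layer 3 proximal weights: position-coded, higher = earlier in the sequence
proximal = {
    "N": {"A": 4, "B": 3, "C": 2, "D": 1},   # N pools ABCD
    "K": {"X": 4, "B": 3, "C": 2, "Y": 1},   # K pools XBCY
}

# Apical feedback from layer 3 to the final element of each sequence
apical = {"D": "N", "Y": "K"}

def recall(start, steps=4):
    layer3 = {"N": 0, "K": 0}
    active = start
    sequence = [active]
    for _ in range(steps):
        # Layer 3: accumulate ordered evidence for the currently active symbol
        for cell in layer3:
            layer3[cell] = layer3[cell] * 10 + proximal[cell].get(active, 0)
        # Layer 4: distal candidates scored with apical feedback; highest wins
        candidates = distal.get(active, [])
        if not candidates:
            break
        scores = {c: 1 + layer3.get(apical.get(c), 0) for c in candidates}
        active = max(scores, key=scores.get)
        sequence.append(active)
    return sequence, layer3

print(recall("A"))   # (['A', 'B', 'C', 'D'], {'N': 4321, 'K': 320})
print(recall("X"))   # (['X', 'B', 'C', 'Y'], {'N': 320, 'K': 4321})
```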


Cool, I’ve been meaning to put together a test of this myself, but hadn’t gotten around to it. Of course the main deviation in your case is using the same cell for “B after A” and “B after X”, as well as “C after B after A” and “C after B after X”. It is interesting that having the pooling layer makes up for the ambiguity that this would normally lead to.

BTW, the sequence inputting to K in your diagram is out of order (XYCB) 😜