Alternative sequence memory

sequence-memory
temporal-memory

#1

Note: I am not suggesting the following method is a biologically plausible mechanism

How could a sequence of SDRs be stored in memory and be retrieved from any point? How could SDRs be stored in a sequence while maintaining their semantic value? These are the problems I will try to resolve here.

Mapping (associative memory)

The cortex could be thought of as a collection of connected maps. Each map assigns SDRs to other SDRs. If two SDRs are active simultaneously (‘spatial contact’) or they occur in close succession (‘temporal contact’) they form synaptic binding. This binding is the mapping that associates two or more SDRs together. If SDR A frequently activates prior to B, then A can be thought of as ‘mapping to B’. Relationships can go beyond one-to-one. A can map to B & C (one-to-many). A & B can map to C (many-to-one). A & B can map to C & D (many-to-many).

In a simple sequence this is very simple: [A,B,C,D] - A maps to B, B maps to C, C maps to D. These mappings are contained within a single map. Yet another sequence is mapped into the map: [X,B,C,Y]. With both [A,B,C,D] and [X,B,C,Y] existing within the same map, there are now two additional relationships: (many-to-one: A => B, X => B), (one-to-many: C => D, C => Y). If A & B were inputs to the map then C would be the output. If C was fed back into the map then both D & Y would be the output. This is problematic if the desired sequence was [A,B,C,D]. When C was fed back into the map it had no context so it simply returned all the mappings from C. This is a problem of ambiguity.

Contextual Sequences

A method (that has some resemblance with HTM temporal memory algorithm) solves the ambiguity problem by essentially re-encoding the output (based on the inputs) then feeding it back into the map with the next input. Given [A,B,C,D] A will map to K (where K is a random SDR - the ‘re-encoding’ of A), B & K maps to F, C & F maps to D.

image
image

The re-encoded outputs ((K,F), (S,R)) act as context for the next input. If C was fed into the map with the context F, then D will output. Else if C was fed into the map with the context R it will output Y. F represents the context B & K and K represents the context A. So C is fed into the map within the context of [A,B].

However, this is also problematic. The whole sequence is completely dependant upon the preceding SDRs. C cannot map to D without A being at the beginning of the sequence. Every SDR must be in the exact same order, leaving no room for invariance. For the same reason a sequence cannot be activated (or ‘accessed’) from any point, it must always start from the beginning of the sequence. Put another way - D cannot be retrieved without knowing what F is, but F cannot be known without knowing what K is, unless you backup all the way to the start at A.

Random Access Sequences

In order to access a sequence from any point, multiple SDRs that represent the context must be fed in simultaneously. This general idea can be found near the beginning of the book T.Kohonen Self-Organization and Associative Memory where the output is fed back into the map through multiple delays.

image

Given a sequence of [A,B,C,D,E,F], E is mapped along with the previously active SDRs (D,C,B,A) to F. In order to access F again, all that is needed is the same inputs in the same delay order. The inputs can be noisy or incomplete due to the flexible nature of associative memory. Each delay SDR must be re-encoded with a constant t+n SDR, otherwise the union of all the delay inputs would not encode their order.

image

The disadvantage this has compared to the previous method is that the contextual memory is limited. The previous method’s context sustained throughout the whole sequence, while this method’s context only sustains to the number of delay inputs. However, this is the same limitation humans face. When learning a sequence we can only hold a number of items of the sequence in mind at any one time.

Graded SDRs

There is another way to use previous SDRs as context. Instead of delays, ‘graded’ SDRs can encode the order of the previous SDRs. ‘Graded’ means that the active bits of an SDR can have a value greater than 1. This can represent the spike frequency of a neuron. If a neuron is highly active (high spiking frequency) it might have a value of 4 (the max is arbitrary), while a neuron with the lowest frequency will always be 1. (See here for how an SDR can have a graded representation.)

The current SDR (t+0) within a sequence could have a high spike frequency which could be represented with a value of 4. The value of the current SDR degrades over time as its frequency decays. t+1 has a value of 3, t+2 = 2, t+3 = 1, t+4 = 0. At any point in the sequence there is a union of previous SDRs with varying values representing their order in time.

image

In the example on the left, D is the current SDR, thus has a value of 4. As A was the current SDR 3 time steps ago it has decayed to a value of 1. This union of graded SDRs maps to the next SDR (E) without the use of re-encoding the inputs with t+n constants. For this reason the union fed into the map has high semantic value. Again, this method is robust to noise. As long as t+1 has a greater value than t then the encoding of order stands.

I’ve implement a working test to validate this theory today using my SDR library SDRKit. I used the SDRRepository & Buffer to achieve the above method. I am happy with the outcome as it seems quite robust. Even if the input union is incorrect (i.e t & t+1 are reversed) it seems to recover after a few steps, though I’m not sure how this works. Using the graded union, subsequences can be abstracted (ie [A,B,C,A,B,C,A,B,C] has a [A,B,C] subsequence) fairly easily (in theory) which will be a big part of the temporal hierarchy and higher-order context generation.

I’d like to read any thoughts/feedback on the ideas I’ve presented here as I hope they might be constructive in my venture :slight_smile: As SDRKit is written in Javascript I could post some working examples later on.


#2

Can’t say I understood how you see this working, but it’s definitely a very interesting goal to reach.


#3

It sure is. I didn’t mention how I’d go about doing it in the original post but involves buffering the graded unions over time and performing AND operation over the union to abstract the overlaps. Though I feel there’s probably a more elegant way of doing it.


#4

I’ve been thinking a lot about sequence memory recently. I’m thinking outside of HTM because I think SDRs can be processed generically without any particular framework. SDRs is the “language of the brain” and they can be processed in many different ways using many different operators (ie. or, and, xor, shift, add, sub, sum, mult, filter, etc.) on the computer. These operators can be linked together into a computation graph. TensorFlow does the same thing with tensors, but although they have a SparseTensor, they lack the special operations and helpers that make SDR processing practical (as far as I know…). So although the cortex has its own implementation on how to process and store SDRs, the computer can do it differently, but still reach the same goals.

A couple of these SDR operators (shift, sum, and) could be used to compute an SDR of an arbitrarily long sequence and abstract sub-sequence SDRs.

Given a sequence (ie. [A,B,C,D] where each letter is represented by an SDR) each SDR in the sequence is shifted equal to its time-step. If A was the current SDR it will have a time-step of t-0, then when B is fed in B will have a time-step of t-0 while A has a time-step of t-1. In this current state A’s bits have shifted equal to its time-step (-1). This continues until C = t-0 = shift-0, and A = t-3 = shift-3.

image

It is almost like an SDR lands onto a ‘grid of cells’ then incrementally drifts off to the right on each time-step. If the SDRs of all time-steps were summed (union) then the resulting SDR will represent [A,B,C,D] in that exact order. If the sequence was in any other order (ie flip the positions of A and D) then the summed representation would be different because the reordered SDRs will have a different shifts (A & D will have different shifts but B & C will remain the same - maintaining semantics).

image

Given another sequence [X,B,C,Y] where it is similar to [A,B,C,D] with the overlap [B,C]. The overlap [B,C] is an abstraction because it is a similarity between different sequences. It is one way of defining a sub-sequence. This abstraction could be computed by using a logical AND operator on the summed time-step shifted unions of both sequences to yield their overlap.

Although I have not yet explored it - a similar process could be used to abstract sub-sequences within a master sequence (ie. [A,B,C,D,A,B,C,D]).

So far in this thread I’m just throwing ideas about. I’d like to demonstrate them using a computation graph if I get the time next week.


#5

I like the idea. This could be extended to object recognition if you calculated evenly distributed shift factors over the range of locations you wanted to encode. Wider layers would be able to represent feature/locations with higher granularity.

The main problem with a pure SDR implementation of HTM (which I wasn’t able to figure out myself, unfortunately) is how to model generalization without the benefit of distal synapse learning. I’m definitely looking forward to seeing where your experiments go.


#6

Hi Paul, I’m not sure what you mean there. Would you mind explaining it further?

But you’re right, there are other ways of using shifts. It has occurred to me that shifts could encode many things into a single sdr. In the case above it just encodes the past with right-shifts (though they should be left shifts, my mistake). You could encode both past, present and future in one sdr. Past is left shift, present unshifted, and future right shift, with the future being predictions. To maintain sparsity the union will need to be sparsified.


#7

Sure. Lets say we have coffee mug like this:

image

It consists of a number of features. These can be encoded in SDRs. For example, lets consider these three:

image

Now we can imagine a grid of locations that encompass the object. For example:

image

Then we need to convert each of the 3D coordinates into a scalar value to use for shift. For a 3D object like this:

S = X + (Y * W) + (Z * W * H)

So for example, the feature at (1,1,0) would have shift 3.
image

A feature at (1,0,1) would have shift 5.
image

And a feature at (0,0,1) would have shift 4.
image

The same concept could be used for any number of dimensions.


#8

I think this is fascinating, Sebastian. Keep going with it!

On the note of object recognition, I’ve been toying with the idea of using a deep learning network (the first few layers of a convolutional neural network anyway) to do edge detection, and feed that into an SDR based system.

(Background on convolutional neural networks)
http://cs231n.github.io/convolutional-networks/

The point would be to semi-reproduce what our brains seem to do with vision, where corners, edges, and motion seem to be processed in the cerebral cortex, before sending data into the neocortex… the cerebral part has been trained and honed over millions of years of evolution, while our interpretation of what we see varies from culture to culture, I assert, based on experiences stored up in the neocortex. The benefit of using the DL model for edge detection is that it’s already trained to see those parts (I’d rip out the layers associated with classification). Then you would simply feed those detected edge/corner values into an SDR-based system (which is also aware of location/time/temperature based context through synchronous distal connections to other to other senses’ pooling layers). The hope would then be a more contextual, higher-level association of time/location/senses, in the same way that a smell, sight, or feel of morning sun can trigger memories… randomized, fluctuating weights of those distal connections between different pooling layers could work as a rudimentary seed of creativity (i.e. the ability to make abstract connections between seemingly unrelated ideas). Another goal would be the ability to more quickly learn/classify contexts and situations.

Where I’m more focused on earning income at this time of the year (gotta pay my bills and all), I haven’t put more than a few pages worth of notes into the concept. But if somebody’s looking for a project or inspiration, by all means feel free to take this idea and run with it.


#9

I get you now, cheers. Were you then thinking it would be combined into a union?

In these examples we’ve looked at the shifts using a 2d grid of cells (I suppose for visual clarity). In practise they would likely be done using 1d vectors of indices. However, 2d vectors have an advantage that you can exploits the X and Y dimensions for encoding (in 1d you can only use X - shift left and right). It might be possible for an SDR to also exist in a 3d vector where shifts also have Z to exploit. All operations could still be performed on a 2d & 3d SDR. I’m not sure if this would be necessary for your scenario, as spatial dimensions can be translated into 1d shifts. However, the 2nd dimension could encode how the object changes over time. So X encodes features in space, Y encodes how those features move/change through time. Z could encode a finite number alternate paths through time. This could be highly valuable for selecting a prediction that is related to a ‘goal state’ for example. As we look into the future we don’t just think linearly (single path through time), we also think laterally (many different paths through time) then select & combine paths in this space to produce a strategy (… in theory). So all this could be done in parallel.


#10

I suppose the advantages of using shifts could also be done using partitions. Essentially it is all about spreading the individual SDRs out in a union to avoid ambiguity.

With partitions, individual SDRs are delegated a partition within a larger SDR space. If an individual SDR has a range of 256 bits with a population of 8 active bits, then concatenate that 4 times, that’ll be 4 partitions within a master 1024 bit space with 32 active bits. To encode a sequence like before, instead of shifting them by t, the delegated partition is equal to t.

With using the same 2d grid visuals as before it would look like this.

concat

The ideas before are all the same. X Y Z can still encode all the same things and operations will work all the same. The difference is that it is simpler and you need larger SDR range to encode more things.


#11

The brain is unlike a computer in that a computer is very precise (ie if you change one bit in a byte in computer memory it changes the whole meaning of the representation, whereas in the brain it only slightly changes the meaning). The brain tends to be unprecise because it facilitates flexibility. SDRs are a great example of flexibility. However it seems sequence memory in computers tend to be quite rigid. The shifting example above - it encodes ABCD in that exact sequence. ABCD can be content-accessible by providing incomplete sequences A-C- or AB–, -B-D, A–D, etc. (where _ is a placeholder for any symbol) but that exact order needs to be maintained (A then B then C then D). ABCD will have little overlap when compared to AABCCD or A-BC-D even though our brains still perceive them as being similar. How could this invariance be applied when processing a sequence of SDRs?

A sequence can be spread out across a binomial tree, breaking the sequence into parts.

image

In first order memory A activates (or ‘maps to’) B, then B activates C, then C activates D. Then in the next level of the tree AB activates BC, then BC activates CD. The same applied further up the tree. This is lateral activation. There is also vertical activation wherein if A & B activate in that exact sequence then the node above (AB) will be activated. Then if AB & BC are activated in that exact order then the above ABC node will be activated. The core idea here is that a sequence is made up of discrete sub-sequences.

The sub-sequences in the tree can be compared to those in another tree to find overlap.

There is enough overlap in AABCCD to fulfil the ABCD sequence.

image

As BC & CD in the ABCD tree overlaps with BC and CD in the AABCCD tree it activates BCD which in turn activates ABCD.

Let’s take a less similar sequence - ABVVVCD. ABCD does exist in this sequence, it’s just that AB and CD are separated by a sub-sequence of VVV. (Even upon reading the sequence ABVVVCD as a human, it might not be immediately obvious that ABCD is in the sequence. This is simply because there slightly less sequential similarity/overlap).

Overlaps are only found on the first row of the tree. On a ABCD tree there is not enough vertical activation to fully activate ABC and BCD, however they could be partially activated, then combine vertically to partially activate ABCD. This partial activation is ideal as it shows that it is not a 100% match. This is analogous to the activation frequency of an SDR - if it is highly similar it has a higher frequency, else if it has a low similarity it has a low frequency. In this case if VVV was already a known sub-sequence then it will activate with a higher frequency than ABCD. In a competitive population of cells with lateral inhibition the VVV representation will inhibit the ABCD representation as it will have a higher frequency. However, if you had an attentional bias for ABCD (ie if you were looking for it) then top-down activation will increase the frequency of ABCD allowing it to inhibit VVV.

image

I will compare one more sequence that is very dissimilar to ABCD. The sequence AVBFCSD contains ABCD even though it is not very obvious. For me (as a human) I would not have spotted ABCD immediately, even if I was looking for it. However, using another method ABCD could still be detected.

As seen above, there is no direct overlap in any of the nodes. However, take the first row of AVBFCSD tree and apply each node into a union using an OR operation. Then apply another union of the first row of the ABCD tree. Then compare the unions by applying an overlap/AND operation to reveal the AB, BC, CD nodes. This same procedure can be applied to all the rows of multiple trees simultaneously. Using this method should ideally result in a low-frequency activation of the ABCD nodes.

I have yet to work out the implementation details of all this. The basic idea is that each node of the tree will encode the child nodes by bit-shifting them into a union. This will make comparisons with other nodes/trees quite simple as nodes can reverse-shift back to their original SDRs - maintaining full semantic value.


#12

Do you mean HTM, or other sequence memory algorithms? I would argue that HTM is not so rigid. Assuming an HTM layer has learned sequence ABCD. Here is how HTM would handle some of your examples:

AABCCD

Highlighting the elements that would burst:

AABCCD

Assuming a union of active cells in the TM layer (not the best TP algorithm, but an easy one to visualize), The union would contain all of the same cells as a union of the ABCD sequence, plus some extras in the C minicolumns. It thus shares the semantics of “bursting A”, “B after A”, and “D after C after B after A”.

ABVVVCD
The union for this one would be the same as the previous one, plus all of the cells in the V minicolumns. This is of course more dissimilar, but like the previous one, still shares the semantics of “bursting A”, “B after A”, and “D after C after B after A”.

AVBFCSD
This one of course has the most trouble. While technically it still contains all of the cells from the ABCD sequence, there is considerably more noise. Coincidentally, this is also the most difficult one for a brain to recognize (made easier with “voting” from other regions as we are discussing the sequence ABCD…)


#13

I wasn’t specifically referencing HTM as there are others I know of. But you’ve shown that there’s more I need to learn about the HTM TM as I was not aware it could do what you have shown.


#14

No problem. Of course the concept of “voting” is still being explored (for SMI, but I believe similar concept could be used in TM), so I shouldn’t overstate HTM capabilities. Your last example would not perform well in the current algorithms. The ABCD sequence would be predicted, but many other sequences containing the same letters would be as well.

In case it helps, here are some visualizations of what I posted above. Let’s assume the following minicolumns represent A-D (keeping it small and dense for the sake of demonstration – in practice, of course, there would be a lot more minicolumns involved to provide sparsity)

image

And the additional letters needed for the examples:

image

Now a trained HTM system that has learned ABCD will represent a union of the active cells in the TM layer like so (assuming sequence starts with a reset, so A minicolumns always burst):

image

Representations of the other sequences would look like so (orange highlighted cells indicate overlap with the above ABCD representation)

image


#15

Cheers! As you have probably noticed I’m a very visual person :slight_smile:

I’ve realized I have not fully watched all of Matt’s latest TM videos where he seems to go into more detail of TM than I was aware of before. So I’m checking them out now.

In the ABVVVCD example, could TM concurrently detect ABCD & VVV each as their own sub-sequences?


#16

Lets use “VFS” instead of “VVV” (repeating inputs are a special case, and would muddy the waters for this conversation). If we were to start with an HTM system that has learned two sequences ABCD and VFS, when it encountered the the first “V” in ABVFSCD, that “V” would be unexpected and its minicolumns would burst. This burst would put the “F” into a predictive state, and the “S” would be correctly predicted in the next timestep. Then the “C” would be unexpected and its minicolumns would burst. This burst would put the “D” into a predictive state.

Visualization of cell states through the ABVFSCD sequence:

image

Now if ABCD were represented as a union like in my previous post:

image

And VFS were represented like this:

image

The representation that came out of ABVFSCD would look like so:

image

So both sequences fully overlap (plus some extra cells in the “C” minicolumns). Both sequences would be predicted by this representation.

It is also worth noting that the two separate sequences could become merged into one new sequence if learning was enabled and the new pattern was encountered enough times (depending on the configuration).


#17

I may have noticed a few issues with this (if I understand it correctly):

  • In the ABVVVCD example C will burst only if a sequence that starts with VC has not already been learned. If It has then C will not burst so therefore D in the context of bursting C will not activate - C in the context of V will activate instead.

  • If D did activate in the context of C bursting it would not be in the context of AB occurring before CD. The sequence CDVVVAB will have the same representation as ABVVVCD as there is no indication of AB coming first. This causes AB…CD and CD…AB to be ambiguous.

    image

  • If ABVVVCD was a pattern that was already learned then the ABCD pattern would not have any overlap as C & D will have a different context so therefore will not activate the same cells in the C D columns.

I’m just playing devil’s advocate here. Please correct me on any of the points I’ve made.


#18

Devils advocate again - from this union I don’t know what came first. Was is it VFS, AB, CD? Or what came second or last? This union could represent either ABVFSCD, CDVFSAB, ABCDVFS, CDABVFS, VFSABCD or VFSCDAB.


#19

Correct. In that case, the burst would happen later in the sequence (once it varied from the hypothetical sequence that started with VC). These are small sequences here, but for larger sequences you can see how HTM quickly recovers after switching from one sequence to another and back again.

Correct, from this union you cannot tell them apart. That is one reason I mentioned that union isn’t the best TP strategy (also because it quickly becomes saturated, leading to false positives)


#20

I’m guessing the burst will occur on D as VC will have ended? I suppose this issue is similar to the last point I mentioned:

The more sequences that are learned that have the same SDRs, the less bursting there will be so therefore the less transitions between sequences occur. The more it learns the harder it becomes to determine the starting point of any sub-sequence.

I’ve heard there is another TP algorithm. Does it resolve this issue?