Alternative sequence memory

sequence-memory
temporal-memory

#1

Note: I am not suggesting the following method is a biologically plausible mechanism

How could a sequence of SDRs be stored in memory and be retrieved from any point? How could SDRs be stored in a sequence while maintaining their semantic value? These are the problems I will try to resolve here.

Mapping (associative memory)

The cortex could be thought of as a collection of connected maps. Each map assigns SDRs to other SDRs. If two SDRs are active simultaneously (‘spatial contact’) or they occur in close succession (‘temporal contact’) they form synaptic binding. This binding is the mapping that associates two or more SDRs together. If SDR A frequently activates prior to B, then A can be thought of as ‘mapping to B’. Relationships can go beyond one-to-one. A can map to B & C (one-to-many). A & B can map to C (many-to-one). A & B can map to C & D (many-to-many).

In a simple sequence this is very simple: [A,B,C,D] - A maps to B, B maps to C, C maps to D. These mappings are contained within a single map. Yet another sequence is mapped into the map: [X,B,C,Y]. With both [A,B,C,D] and [X,B,C,Y] existing within the same map, there are now two additional relationships: (many-to-one: A => B, X => B), (one-to-many: C => D, C => Y). If A & B were inputs to the map then C would be the output. If C was fed back into the map then both D & Y would be the output. This is problematic if the desired sequence was [A,B,C,D]. When C was fed back into the map it had no context so it simply returned all the mappings from C. This is a problem of ambiguity.

Contextual Sequences

A method (that has some resemblance with HTM temporal memory algorithm) solves the ambiguity problem by essentially re-encoding the output (based on the inputs) then feeding it back into the map with the next input. Given [A,B,C,D] A will map to K (where K is a random SDR - the ‘re-encoding’ of A), B & K maps to F, C & F maps to D.

image
image

The re-encoded outputs ((K,F), (S,R)) act as context for the next input. If C was fed into the map with the context F, then D will output. Else if C was fed into the map with the context R it will output Y. F represents the context B & K and K represents the context A. So C is fed into the map within the context of [A,B].

However, this is also problematic. The whole sequence is completely dependant upon the preceding SDRs. C cannot map to D without A being at the beginning of the sequence. Every SDR must be in the exact same order, leaving no room for invariance. For the same reason a sequence cannot be activated (or ‘accessed’) from any point, it must always start from the beginning of the sequence. Put another way - D cannot be retrieved without knowing what F is, but F cannot be known without knowing what K is, unless you backup all the way to the start at A.

Random Access Sequences

In order to access a sequence from any point, multiple SDRs that represent the context must be fed in simultaneously. This general idea can be found near the beginning of the book T.Kohonen Self-Organization and Associative Memory where the output is fed back into the map through multiple delays.

image

Given a sequence of [A,B,C,D,E,F], E is mapped along with the previously active SDRs (D,C,B,A) to F. In order to access F again, all that is needed is the same inputs in the same delay order. The inputs can be noisy or incomplete due to the flexible nature of associative memory. Each delay SDR must be re-encoded with a constant t+n SDR, otherwise the union of all the delay inputs would not encode their order.

image

The disadvantage this has compared to the previous method is that the contextual memory is limited. The previous method’s context sustained throughout the whole sequence, while this method’s context only sustains to the number of delay inputs. However, this is the same limitation humans face. When learning a sequence we can only hold a number of items of the sequence in mind at any one time.

Graded SDRs

There is another way to use previous SDRs as context. Instead of delays, ‘graded’ SDRs can encode the order of the previous SDRs. ‘Graded’ means that the active bits of an SDR can have a value greater than 1. This can represent the spike frequency of a neuron. If a neuron is highly active (high spiking frequency) it might have a value of 4 (the max is arbitrary), while a neuron with the lowest frequency will always be 1. (See here for how an SDR can have a graded representation.)

The current SDR (t+0) within a sequence could have a high spike frequency which could be represented with a value of 4. The value of the current SDR degrades over time as its frequency decays. t+1 has a value of 3, t+2 = 2, t+3 = 1, t+4 = 0. At any point in the sequence there is a union of previous SDRs with varying values representing their order in time.

image

In the example on the left, D is the current SDR, thus has a value of 4. As A was the current SDR 3 time steps ago it has decayed to a value of 1. This union of graded SDRs maps to the next SDR (E) without the use of re-encoding the inputs with t+n constants. For this reason the union fed into the map has high semantic value. Again, this method is robust to noise. As long as t+1 has a greater value than t then the encoding of order stands.

I’ve implement a working test to validate this theory today using my SDR library SDRKit. I used the SDRRepository & Buffer to achieve the above method. I am happy with the outcome as it seems quite robust. Even if the input union is incorrect (i.e t & t+1 are reversed) it seems to recover after a few steps, though I’m not sure how this works. Using the graded union, subsequences can be abstracted (ie [A,B,C,A,B,C,A,B,C] has a [A,B,C] subsequence) fairly easily (in theory) which will be a big part of the temporal hierarchy and higher-order context generation.

I’d like to read any thoughts/feedback on the ideas I’ve presented here as I hope they might be constructive in my venture :slight_smile: As SDRKit is written in Javascript I could post some working examples later on.


#2

Can’t say I understood how you see this working, but it’s definitely a very interesting goal to reach.


#3

It sure is. I didn’t mention how I’d go about doing it in the original post but involves buffering the graded unions over time and performing AND operation over the union to abstract the overlaps. Though I feel there’s probably a more elegant way of doing it.


#4

I’ve been thinking a lot about sequence memory recently. I’m thinking outside of HTM because I think SDRs can be processed generically without any particular framework. SDRs is the “language of the brain” and they can be processed in many different ways using many different operators (ie. or, and, xor, shift, add, sub, sum, mult, filter, etc.) on the computer. These operators can be linked together into a computation graph. TensorFlow does the same thing with tensors, but although they have a SparseTensor, they lack the special operations and helpers that make SDR processing practical (as far as I know…). So although the cortex has its own implementation on how to process and store SDRs, the computer can do it differently, but still reach the same goals.

A couple of these SDR operators (shift, sum, and) could be used to compute an SDR of an arbitrarily long sequence and abstract sub-sequence SDRs.

Given a sequence (ie. [A,B,C,D] where each letter is represented by an SDR) each SDR in the sequence is shifted equal to its time-step. If A was the current SDR it will have a time-step of t-0, then when B is fed in B will have a time-step of t-0 while A has a time-step of t-1. In this current state A’s bits have shifted equal to its time-step (-1). This continues until C = t-0 = shift-0, and A = t-3 = shift-3.

image

It is almost like an SDR lands onto a ‘grid of cells’ then incrementally drifts off to the right on each time-step. If the SDRs of all time-steps were summed (union) then the resulting SDR will represent [A,B,C,D] in that exact order. If the sequence was in any other order (ie flip the positions of A and D) then the summed representation would be different because the reordered SDRs will have a different shifts (A & D will have different shifts but B & C will remain the same - maintaining semantics).

image

Given another sequence [X,B,C,Y] where it is similar to [A,B,C,D] with the overlap [B,C]. The overlap [B,C] is an abstraction because it is a similarity between different sequences. It is one way of defining a sub-sequence. This abstraction could be computed by using a logical AND operator on the summed time-step shifted unions of both sequences to yield their overlap.

Although I have not yet explored it - a similar process could be used to abstract sub-sequences within a master sequence (ie. [A,B,C,D,A,B,C,D]).

So far in this thread I’m just throwing ideas about. I’d like to demonstrate them using a computation graph if I get the time next week.


#5

I like the idea. This could be extended to object recognition if you calculated evenly distributed shift factors over the range of locations you wanted to encode. Wider layers would be able to represent feature/locations with higher granularity.

The main problem with a pure SDR implementation of HTM (which I wasn’t able to figure out myself, unfortunately) is how to model generalization without the benefit of distal synapse learning. I’m definitely looking forward to seeing where your experiments go.


#6

Hi Paul, I’m not sure what you mean there. Would you mind explaining it further?

But you’re right, there are other ways of using shifts. It has occurred to me that shifts could encode many things into a single sdr. In the case above it just encodes the past with right-shifts (though they should be left shifts, my mistake). You could encode both past, present and future in one sdr. Past is left shift, present unshifted, and future right shift, with the future being predictions. To maintain sparsity the union will need to be sparsified.


#7

Sure. Lets say we have coffee mug like this:

image

It consists of a number of features. These can be encoded in SDRs. For example, lets consider these three:

image

Now we can imagine a grid of locations that encompass the object. For example:

image

Then we need to convert each of the 3D coordinates into a scalar value to use for shift. For a 3D object like this:

S = X + (Y * W) + (Z * W * H)

So for example, the feature at (1,1,0) would have shift 3.
image

A feature at (1,0,1) would have shift 5.
image

And a feature at (0,0,1) would have shift 4.
image

The same concept could be used for any number of dimensions.


#8

I think this is fascinating, Sebastian. Keep going with it!

On the note of object recognition, I’ve been toying with the idea of using a deep learning network (the first few layers of a convolutional neural network anyway) to do edge detection, and feed that into an SDR based system.

(Background on convolutional neural networks)
http://cs231n.github.io/convolutional-networks/

The point would be to semi-reproduce what our brains seem to do with vision, where corners, edges, and motion seem to be processed in the cerebral cortex, before sending data into the neocortex… the cerebral part has been trained and honed over millions of years of evolution, while our interpretation of what we see varies from culture to culture, I assert, based on experiences stored up in the neocortex. The benefit of using the DL model for edge detection is that it’s already trained to see those parts (I’d rip out the layers associated with classification). Then you would simply feed those detected edge/corner values into an SDR-based system (which is also aware of location/time/temperature based context through synchronous distal connections to other to other senses’ pooling layers). The hope would then be a more contextual, higher-level association of time/location/senses, in the same way that a smell, sight, or feel of morning sun can trigger memories… randomized, fluctuating weights of those distal connections between different pooling layers could work as a rudimentary seed of creativity (i.e. the ability to make abstract connections between seemingly unrelated ideas). Another goal would be the ability to more quickly learn/classify contexts and situations.

Where I’m more focused on earning income at this time of the year (gotta pay my bills and all), I haven’t put more than a few pages worth of notes into the concept. But if somebody’s looking for a project or inspiration, by all means feel free to take this idea and run with it.


#9

I get you now, cheers. Were you then thinking it would be combined into a union?

In these examples we’ve looked at the shifts using a 2d grid of cells (I suppose for visual clarity). In practise they would likely be done using 1d vectors of indices. However, 2d vectors have an advantage that you can exploits the X and Y dimensions for encoding (in 1d you can only use X - shift left and right). It might be possible for an SDR to also exist in a 3d vector where shifts also have Z to exploit. All operations could still be performed on a 2d & 3d SDR. I’m not sure if this would be necessary for your scenario, as spatial dimensions can be translated into 1d shifts. However, the 2nd dimension could encode how the object changes over time. So X encodes features in space, Y encodes how those features move/change through time. Z could encode a finite number alternate paths through time. This could be highly valuable for selecting a prediction that is related to a ‘goal state’ for example. As we look into the future we don’t just think linearly (single path through time), we also think laterally (many different paths through time) then select & combine paths in this space to produce a strategy (… in theory). So all this could be done in parallel.


#10

I suppose the advantages of using shifts could also be done using partitions. Essentially it is all about spreading the individual SDRs out in a union to avoid ambiguity.

With partitions, individual SDRs are delegated a partition within a larger SDR space. If an individual SDR has a range of 256 bits with a population of 8 active bits, then concatenate that 4 times, that’ll be 4 partitions within a master 1024 bit space with 32 active bits. To encode a sequence like before, instead of shifting them by t, the delegated partition is equal to t.

With using the same 2d grid visuals as before it would look like this.

concat

The ideas before are all the same. X Y Z can still encode all the same things and operations will work all the same. The difference is that it is simpler and you need larger SDR range to encode more things.