The Principle of Temporal Memory

I’ve been so busy researching and testing so many things that I forgot to share my greatest INSIGHT about TM.

What is it? It is the basic principle on which TM works!

Let me state it first and then I will elaborate.

Temporal Memory works by UPSCALING the input when learning (storing it in a higher-dimensional memory) and then DOWNSCALING when predicting.

This insight helped me implement TM without implementing exact models of HTM neurons with dendrites and synapses.
In essence, TM at its simplest, crudest abstraction can be thought of as a higher-dimensional UNION of smaller, interconnected SDRs, sort of.

The whole process of bursting and the rest can be implemented as UPSCALING the input SDR into an intermediary higher-dimensional buffer and storing the result in the higher-dimensional TM.
Then, when predicting, we use another buffer to DOWNSCALE from the higher-dimensional version stored in TM.

Why does it work?

The reason it works is that, as we know, SDR capacity grows exponentially while SDR size grows only linearly.

Storing a single 2000-bit/2% SDR in a 20-row × 2000-column (i.e. 40,000-bit) SDR drops the sparsity to 0.1%.
So if we pack them in an orderly way we can fit many SDRs. In addition, in the original TM activating a bit requires only 8-20 synapses instead of 40. Plus, the context dependency allows neurons to be reused across patterns.
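
A quick check of that arithmetic:

```python
n_on = int(2000 * 0.02)      # 40 ON bits in the input SDR
upscaled = 20 * 2000         # 20 rows x 2000 cols = 40,000 bits
print(n_on / upscaled)       # 0.001, i.e. 0.1% sparsity
```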

TM without SDRs!!!

This opens the possibility of implementing TM with other data representations.

Integers

For example, using pure integer values! Yes, that's what I said: TM with simple integers :slight_smile:

That is the second insight you get in a single day :), and there is a third.

What capability of integers can we use to implement UPSCALING and DOWNSCALING?

There is this operation called modulo; using it we can divide the integer range into sub-ranges.

What is the representational capacity of an integer using modulo?

Here is a quick calculation for different integer types:

uint16 : 65535 / 10 => ~6,500 distinct up-scalings for 10 distinct items/categories/classes
uint32 : np.iinfo(np.uint32).max / 10_000 => ~430,000 up-scalings for representing 10,000 distinct items
uint64 : np.iinfo(np.uint64).max / 10_000 => ~1.8e15 up-scalings for representing 10,000 distinct items
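
A quick sanity check of these numbers:

```python
import numpy as np

for dtype, n_items in [(np.uint16, 10), (np.uint32, 10_000), (np.uint64, 10_000)]:
    print(dtype.__name__, np.iinfo(dtype).max // n_items)
# uint16 6553
# uint32 429496
# uint64 1844674407370955   (~1.8e15)
```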

Keep in mind that a plain MLP NN tends to struggle when classifying across hundreds of classes/categories.

BTW, if you don't use up-scaling, you can use this idea to build a classifier instead of a TM.

So how do we use this (for example, using mod 10)?
Here is how we can record multiple sequences:

1,2,3,4,5  =>  11,12,13,14,15
6,2,7,1,2  =>  16,22,17,21,32
7,3,9,1,2  =>  27,23,19,31,42

First we use base-num + 10; then on every new encounter of the same value we bump it by 10.
E.g. the first 2 becomes 12, the second encounter of 2 in any sequence becomes 22… and so on…
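
A minimal sketch of this encoding (the function name `encode_sequences` is mine): keep one counter per base value across all sequences and bump by the modulus on every encounter.

```python
from collections import defaultdict

def encode_sequences(sequences, mod=10):
    counts = defaultdict(int)   # encounters per base value, across all sequences
    encoded = []
    for seq in sequences:
        out = []
        for v in seq:
            counts[v] += 1                    # new encounter of this value
            out.append(v + counts[v] * mod)   # first 2 -> 12, second 2 -> 22, ...
        encoded.append(out)
    return encoded

seqs = [[1, 2, 3, 4, 5], [6, 2, 7, 1, 2], [7, 3, 9, 1, 2]]
print(encode_sequences(seqs))
# [[11, 12, 13, 14, 15], [16, 22, 17, 21, 32], [27, 23, 19, 31, 42]]
```

Downscaling is just `x % mod`.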

Why do we encode it this way? Let's play it out.

  • if we start with 6, we latch onto 16 (which carries us along the second sequence);
    then 16 predicts 22. If the second value we get is 2, we predicted correctly and use 22 to predict 17, i.e. 7.
    You see, if the first value had been 1, i.e. 11, we would expect the second value to be 12, not 22; that would indicate the first sequence.

    The reason we have buffers is to keep the upscaled prediction around, so that we don't lose track.

  • if we start with 2 as the first value, we have 4 options to choose from (a second visits/probability array may help us choose).

We do it this way to save up-scalings and to avoid having to make a decision when a number repeats (look at the 22 in the second sequence). Otherwise we could have done it the following way, but then we would sometimes lose track:

1,2,3,4,5  =>  11,12,13,14,15
6,2,7,1,2  =>  26,22,27,21,22
7,3,9,1,2  =>  37,33,39,31,32

We can store the sequence (in the former format) in an array, and that is equivalent to an SDR TM. Cute, huh :wink:
Of course, we can also be more TM-like and store the transitions instead.

seq2 ::  16:22,22:17,17:21,21:32

We can also mimic integer neurons. What predicts 2? The following list plays the role of a pack of synapses:

2 <= 11,21,31,16
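
In code, the transition store and the "integer neuron" view are just two small lookups (a sketch; the layout is mine):

```python
transitions = {11: 12, 12: 13, 13: 14, 14: 15,   # seq 1
               16: 22, 22: 17, 17: 21, 21: 32,   # seq 2
               27: 23, 23: 19, 19: 31, 31: 42}   # seq 3

state = 16                  # we latched onto the second sequence
nxt = transitions[state]    # 22: the upscaled prediction stays in the buffer
print(nxt % 10)             # 2: DOWNSCALE only when reporting the prediction

# the "integer neuron" for 2: every state whose successor downscales to 2
print(sorted(k for k, v in transitions.items() if v % 10 == 2))
# [11, 16, 21, 31]
```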

Vectors

We can also use good old vectors for more complex data, where the first element is the upscale-index.

1,3,3 => <1,1>,<1,3>,<2,3>
grid location seq : <1,(1,1)>,<1,(1,2)>
grid loc + cmd seq : <1,(1,1,up)>,<1,(1,2,down)>,<2,(1,1,end)>
SA Transitions :  <1,(1,1,start)>:<1,(1,1,up)>, <1,(1,1,up)>:<1,(1,2,down)>, <1,(1,2,down)>:<2,(1,1,end)>
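
The same counter trick works here; a sketch producing the <index, value> pairs as tuples:

```python
from collections import defaultdict

def encode_vector_seq(seq):
    counts = defaultdict(int)
    out = []
    for v in seq:
        counts[v] += 1
        out.append((counts[v], v))   # first element is the upscale-index
    return out

print(encode_vector_seq([1, 3, 3]))          # [(1, 1), (1, 3), (2, 3)]
print(encode_vector_seq([(1, 1), (1, 2)]))   # [(1, (1, 1)), (1, (1, 2))]
```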

to be continued …


Would this only work for sequences of semantically dissimilar inputs? That would be the case when switching to other data representations like integers, for example. What about in the case of SDRs, though? I didn’t quite follow the mechanism behind upscaling and downscaling. Would that still support prediction from partial input SDR or an input SDR with errors (some percentage of randomly flipped bits)?

Yep, it works for SDRs too… that is where I figured it out.
You have the TM memory and 3 buffers… data comes in at the normal SDR size, is mixed with the PREDICTED higher-dimensional buffer, and is stored in the NOW buffer… then, based on the NOW and BEFORE buffers, the upscaled SDR is stored in the MEMORY… next, based on BEFORE and the MEMORY, PREDICTED is set… next NOW => BEFORE… rinse and repeat…

If you follow, you can see that within the system the SDR always circulates in the higher dimension.

In this loop only DATA is a lower-dimensional SDR.

The temporal memory in this case is an array of 2000 rows (output neurons, corresponding to HTM columns) and 40 columns (storing upscaled-SDR indexes, i.e. selectors)…

It could be more than 40, to do Unions and such; I use Thinning by default.

The clever part is that because it is an indexed SDR, for any SDR size the memory used depends only on how many bits are ON… 40/2000 and 40/10000 take the same space ;), just different indexes.
So the whole memory takes only (2000 * 40 * 2) / (1024 * 1024) ≈ 0.15 MB.
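
Concretely, a sketch of that memory as a numpy array of ON-bit indexes:

```python
import numpy as np

memory = np.zeros((2000, 40), dtype=np.uint16)   # 40 ON-bit indexes per row
print(memory.nbytes / 2**20)                     # ~0.15 MB, regardless of SDR width
```

Note: uint16 indexes cover SDRs up to 65,536 bits, so 40/2000 and 40/10000 really do cost the same.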

Diagram (can't upload .svg here, click to view):
https://raw.githubusercontent.com/vsraptor/ihtm/main/docs/jupyter/tm.svg

Here is a detailed description:


You set up the semantics… let's say you decide to encode 10 colors/classes/categories… i.e. your encoder is just a map from color to number.

I'm not sure about all cases, but in many situations storing integer sequences will require much less space… I've played with some variants of the memory but haven't settled on one.
It will most probably require a shadow array with statistics about the transitions, so you can pick the correct next step… I tried probabilistic, S-curve, and Q-value, in combination with exponential decay, but have no measure by which to decide which is best.


I imagine using integers comes at a cost of no tolerance for noise or errors.

Although I don’t see that SDRs would have noise or error tolerance in the framework you are describing (unless I am misunderstanding it, which is likely). It seems like you are describing a stack or map from one encoding to another, and that map would require exactly the same first encoding to predict the second one?


Yep, with integers and those vectors there is no concept of noise.

For SDRs there is noise tolerance. The selection/prediction process uses winner-takes-all based on the overlap between the source SDR and the memory.
How noise tolerant it is, is up to you. If you pick 40 columns, Thinning can accommodate some resolution if you store/merge at most 3-7 SDRs…
You can make it 1000 bits if you want and use Union and/or Thinning…
You can even segment it and emulate dendrites… of course the memory requirement grows and the selection algorithm becomes more complicated.

You can think of it as an HTM TM map with 1 neuron with 1 dendrite segment, with a resolution of 3-7 SDRs. You always keep 1 neuron per mini-column; there is no need for many, because of the feed-forward characteristics.
But you can always add more resolution/bits to the single segment, or add as many segments as you want.
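
A minimal sketch of that winner-takes-all selection, with each row stored as a set of ON-bit indexes (names are mine):

```python
def wta_select(query, memory):
    """Return the row whose stored indexes overlap the query SDR the most."""
    best_row, best_overlap = None, -1
    for row, stored in memory.items():
        overlap = len(stored & query)    # noise tolerance: partial matches count
        if overlap > best_overlap:
            best_row, best_overlap = row, overlap
    return best_row, best_overlap

memory = {0: {3, 7, 19}, 1: {3, 8, 21}}
print(wta_select({3, 7, 20}, memory))    # (0, 2): row 0 wins despite a flipped bit
```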


Ah, of course. Basic SDR operations on the higher-order memory. Yes, this could provide some significant resource savings over the vanilla TM algorithm.

Not being great with math myself, I'm not sure if long-term capacity is an issue in the upscaled SDR, but if it is, there is actually a simple way you could drastically increase its capacity (at the cost of some extra noise, which would virtually never be high and is easily handled with an overlap threshold). Rather than an orderly packing strategy, use a reversible hash to map a bit in the downscaled SDR and its position to a repeatable random location in the upscaled SDR (and the reverse of the hash to decode it later). Any two random SDRs, when upscaled, will have a chance of some bits overlapping, but the chance that they have a lot of bits overlapping would be infinitesimal.
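
A sketch of one possible reading of this (the pool layout and helper names are my assumptions): a seeded permutation is "repeatably random" and exactly reversible via its inverse.

```python
import numpy as np

def make_upscale_map(n_down, pool, seed=42):
    # repeatable random layout over n_down * pool upscaled positions
    perm = np.random.default_rng(seed).permutation(n_down * pool)
    return perm, np.argsort(perm)          # the map and its exact inverse

def upscale(active_bits, context, perm, pool):
    # bit i in context c lands at a random-looking but repeatable position
    return {int(perm[i * pool + context]) for i in active_bits}

def downscale(up_bits, inverse, pool):
    return {int(inverse[b]) // pool for b in up_bits}

perm, inv = make_upscale_map(2000, 4)
up = upscale({3, 7, 19}, context=2, perm=perm, pool=4)
print(downscale(up, inv, pool=4))          # recovers {3, 7, 19}
```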


I assume with what you are describing you don't even have to mimic TM columns.
If I understood, you are describing using hashing and de-hashing to pick a spot where to store the lower-dimensional SDR.

Good idea… because you are not stuck with 2000 rows; you can just have a flat higher-dimensional memory SDR.

I mentioned this, but until you said it I hadn't thought about it… this could be great for a declarative SDR memory… TM requires connections between pairs because it stores transitions… unless hashing can accommodate that.


Hmm… if you think about it, a neuron can eventually connect to probably a million other neurons… from this point of view a neuron can participate in a 1-million-bit virtual SDR!!


It could be, if there are just a couple hundred minicolumns rather than 2000, like in some cortical columns.
I don't understand the context of this, so maybe that doesn't make sense.


Sorry, sometimes it takes me a few iterations to grasp a good idea :wink:

I’ll explain this point by comparing a (probably naive) orderly packing strategy with a randomly distributed one.

Say each bit in the downscaled SDR is given its own pool of bits in the upscaled SDR. And say for the first context in which you want to store a particular input, you select the first position in the pool; the second time, you select the second position in the pool, etc. Thus the maximum number of contexts an input could be in before you start getting false positives is the pool size. For a pool size of 4, you could store the input in up to 4 different contexts.

Now rather than that packing strategy, say you instead select a position from the pool randomly for each bit (in a way that can later be reversed). If each input is 40 bits, you still get to use the above 4 contexts, plus some orders of magnitude more where a few of the bits overlap with previous contexts, but that overlap is below a configured threshold.
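
Side by side, hedged sketches of the two placement strategies (the pool size and blake2b keying are illustrative choices, not from the thread):

```python
import hashlib

POOL = 4

def orderly_pos(bit, context):
    # context-th slot of this bit's pool: caps out at POOL distinct contexts
    return bit * POOL + context

def random_pos(bit, context, key=b"tm-upscale"):
    # repeatable pseudo-random slot: far more contexts fit, at the price of a
    # small chance that two contexts collide on some bits
    h = hashlib.blake2b(f"{bit}:{context}".encode(), key=key, digest_size=4)
    return bit * POOL + int.from_bytes(h.digest(), "big") % POOL

print([orderly_pos(3, c) for c in range(4)])   # [12, 13, 14, 15]
print([random_pos(3, c) for c in range(4)])    # four slots from 12..15, may repeat
```

Decoding with `random_pos` means recomputing the hash per candidate (bit, context) rather than inverting it directly.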

A few other differences with this algorithm come to mind:

  1. It seems like it would support one-shot learning only

There are some cases where multi-shot learning is preferable, though. A way around this would be to keep a map from each upscaled SDR to a connection-strength variable, and increase it each time a transition is encountered (see the sketch after this list).

  2. No obvious method for bursting

This is an essential property of TM, for recovering from point errors, disambiguating input, etc. You might have already come up with an implementation, but it seems like this would require being able to quickly find all SDRs in the high-order memory that have bits in particular areas, and to return them as predictions if they have enough overlap.

  3. No partial predictions

In vanilla TM, if you input an SDR with, for example, 25% of its bits overlapping with a learned element in a sequence, the SDR of predicted cells is only partially filled in (depending on the connection threshold). The new strategy makes all-or-nothing predictions. I'd have to think about which use-cases this might cause issues for, though. It's just a difference that stands out to me.
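
For point 1, a sketch of the connection-strength workaround mentioned above (names and threshold are mine):

```python
from collections import defaultdict

strength = defaultdict(int)    # (from_state, to_state) -> connection strength

def learn(prev, nxt):
    strength[(prev, nxt)] += 1    # reinforce on every repeated encounter

def predict(prev, threshold=2):
    # only transitions seen often enough count as predictions (multi-shot)
    return [b for (a, b), s in strength.items() if a == prev and s >= threshold]

for _ in range(3):
    learn(16, 22)
learn(16, 25)        # seen once: stays below the threshold, treated as noise
print(predict(16))   # [22]
```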


In my implementation I don't burst the column (in the past when I did, it was expensive, because after bursting the column in the buffer I had to un-burst the unselected bits afterwards)… I do a "pre-burst" instead, i.e. I either find the bit OR pick one that is available.


Well that went right over my head :crazy_face:

I’ll read through your code a bit before commenting further :sweat_smile:

It is easy to understand, hard to make work right…
From the BEFORE buffer you have the rows (1 … 2000) where to store the input SDR (NOW).
Now, if the overlap is > 90%, just merge it with what is already there; otherwise pick one of the other bits in the same column, preferably an empty one… so instead of bursting you PICK in advance… when it has to predict, it will select the row you picked in advance, because of its content.
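
A heavily hedged sketch of that pre-burst rule as I read it (`memory` maps a row to a set of ON-bit indexes; `candidate_rows` is a hypothetical helper that yields the other rows of the same column):

```python
def preburst_store(memory, row, sdr, candidate_rows, thresh=0.9):
    stored = memory.get(row)
    if stored and len(stored & sdr) >= thresh * len(sdr):
        memory[row] = stored | sdr        # >90% overlap: just merge
        return row
    # pre-burst: pick an alternative slot NOW, preferably an empty one, so
    # prediction later selects it by content and nothing needs un-bursting
    for alt in candidate_rows(row):       # hypothetical helper
        if alt not in memory:
            memory[alt] = set(sdr)
            return alt
    memory[row] = set(sdr)                # no free slot: overwrite
    return row
```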


Ok, I think I understand part of it. Basically, “Something I wasn’t predicting just happened, and I need to remember it”. But what about the following step? What is being predicted next? In vanilla TM, the predictions following a burst are every input that comes after the current input in all previously learned contexts. If I only store the new input but don’t do predictions, then the next input will also be considered unexpected, and the next one, etc.

Let me give an example to hopefully point out where I am misunderstanding. Say I have learned the following sequences:

A B’ C’ D’ E’ F’ G’ H’ I’
and
1 2’ C’’ 4’ 5’ 6’ 7’ 8’ 9’

Now I input the following sequence with a point error:

A B X C D E F G H I

When I reach the “C”, the vanilla TM algorithm would burst the “C” minicolumns (thus activating both C’ and C’’), resulting in cells for both D’ and 4’ in predictive state. When the D comes in next, the ambiguity would be resolved, and it would continue on correctly predicting the rest of the sequence from that point.

Does this implementation have a mechanism for recovering from a point error like above?


Let me first address this…
BEFORE is time t-1
NOW is time t

When you learn, BEFORE provides the content to be stored and NOW provides the rows where it is stored… this guarantees the sequence is followed, if not overwritten.
When you have to predict, you use NOW as the content to search the memory… WTA decides what is selected into PREDICTED…
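
Putting this together with the earlier loop description, a minimal sketch of one step (my reading, with sets of ON-bit indexes standing in for the buffers; not the author's actual code):

```python
def tm_step(data, memory, before, predicted, theta=1):
    now = data | predicted          # DATA mixed with PREDICTED (the real code
                                    # maps DATA up into the higher dimension first)
    for row in now:                 # learn: NOW supplies the rows,
        memory.setdefault(row, set()).update(before)   # BEFORE the content
    # predict: rows whose stored content matches NOW become PREDICTED
    # (an overlap threshold stands in for the per-column winner-takes-all)
    predicted = {r for r, stored in memory.items() if len(stored & now) >= theta}
    return now, predicted           # NOW becomes the next BEFORE
```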


If you get D' in DATA and PREDICTED predicted wrong… DATA will decide what goes into NOW… on the next predict it will use D' to predict the correct SDR… I think… I have to play it in my head several times to be sure :wink:

As long as you get correct input in NOW and the memory was not overwritten, you will get the correct prediction.

It is good that you challenge me to rethink… I have not looked at what I wrote in a while.


I have a variation on your basic idea that I'm considering implementing, just to challenge myself to think about the TM algorithm from a completely different perspective and find ways to address the differences (sort of like I've been waiting impatiently for Rain Neuromorphics to publish an emulator for their chip so I can write a TM algorithm for it :grin:). It would be fun to see if the idea could be expanded into components of TBT as well.

It seems these are point-based neurons, are they?

Well, they aren't really neurons out of the box; they are just an array of memristors with random sparse connections between them. One could program all sorts of algorithms on top of that substrate. One way they could be used might be to represent a bunch of potential synapses, with the resistance value representing permanence. Or maybe they could be used for randomly sub-sampling SDRs. Lots of interesting possibilities.