SDR Value Maps

What does it mean to attach value to a SDR? the best way I can put it is with an example.

Let’s say we have a reinforcement learning agent that uses SDRs to encode observed state in its environment. It can be based on HTM or other type of predictor that uses SDRs as an underlying representation.

Now in the RL environment, in certain situations the agent encounters either positive or negative rewards. One problem with TM or other sequential state predictors is that rewards/penalties usually are received within few states, which can be very sparse within time domain.

A value map is useful to infer value from current and predicted states.
More precisely, for an agent that can predict(current_state, action) -> next_state
it would be useful to learn a notion of how valuable or important might be an immediate future state that can be predicted from current state and the palette of available actions.

A more concrete example is gym’s CartPole - every time the cart falls off the table or the pole tilts more than 20 degrees, the current round is ended with high negative reward.

An HTM-like memory can be pretty good at modelling physics with short-term consequences of small actions, but it could be quite difficult to learn longer-term ones.

The fact that a finer-grained time encoding (e.g. double observations/second) benefits accuracy at the physical model is counter balanced by the consequence of doubling the number of steps between relevant events.

Moving further with cart pole example:

If the reward of falling is -100, then we can mark the previous 10 steps with an increasing absolute value level, so “danger score” for state SDR observed 5 steps before falling would be -50

Then we can write these scores into a value map and at every future time T, the TM knowing “physics” can predict possible states at T+1 for every available action, and the agent algorithm can query the value map to pick the least dangerous action.

An interesting property of a value map is its permanency. Once the agent learns to stay upright the danger map remains unchanged and “vigilant” even after a million successful rounds.

Even more interesting is that for a highly “dangerous” (or valuable) state, agent can figure out which of the state bits represent a specific value or danger and this is an amazingly powerful feature because allows it to rewrite an automatic policy that both identifies particular “dangers” and directly maps them to “recommended actions” in a simpler, quick-reaction specialist “autopilot”.

Another use for a value map could be as guides to exploring/learning environments - if a value map for “novelty” is simply incremented for every new state, then later the agent can use it both as a sensor for familiarity of current states and search for unexplored spaces within its environment.

1 Like

Enough talk, how it is done, using 1000/20 bit SDR as example:

  • the value map is a large array of integers - e.g. 1M or 100 million integers. Lets call this size M
  • the 20 on bits of the SDR are grouped in 1140 unique ordered triples, such every triple contains 3 different bits in ascending order: b1, b2, b3
  • an address is computed by
    A = (b1 * 1000000 + b2 * 1000 + b3) % M
  • The integer value in the map at position A is incremented by 1 or the specified value increment.

So each unique triplet of bits is mapped to an address point, The number of possinle address points for a SDR with P on bits is
Np = P*(P-1)*(P-2)/6

Since this virtual space grows with the cube of SDR size a smaller space can be used together with division modulo space size addresses.

To later check the alleged value of any future SDR there can be several query functions:

  • one that computes an average of the triplet values computed for all address points
  • for dense SDRs (large P-s): one that picks a random subset of triplet bits and computes their average value. This could be used for faster, less precise “sniffing” for value of a state
  • one that returns most high valued individual triplets - this can be used to “decode” which bit correlations within the SDR encode the value. Note that this later information will point not only to individual bits but also to correlations between several pairs or triplets of bits.
    In the cart pole example it might point to certain combinations of speed and position that are “dangerous”.
1 Like

I like your toy examples. Thanks for sharing.

On the last bullet point of your last post, I’m not sure how to calculate correlation and what kind of method is used to only update the value of the relevant triplets while not affecting triplets uncorrelated with current value/state. E.g. if only 30% of the triplets are strongly correlated to a dangerous state with huge negative value, the other 70% of triplets shouldn’t be updated much as they aren’t correlated to the danger state with huge negative value.

I’m thinking that we could append a Value SDR (where the value or reward is encoded into an SDR) and combine it with the state SDR. The values are not stored in the triplets instead the triplets will store information like the reverse indexer example you did previously where we can then get back the Value SDR solely from the State SDR. Just brainstorming some ideas to address the confusion I wrote in the above paragrpah.

Thanks. The idea is flexible, my example was with triple points, that gets expensive, a 40 bit sdr would expand to almost 10000 addresses.
A smaller, faster map could use 2 bit positions addresses.

Regarding collisions there are two types:

  • the ones provoked by not using a large enough map.
    A 3 bit position map has a virtual space of N*(N-1)*(N-2)/6 points, where N is sdr size - if one is restricted in memory addresses will start to collide.
    BUT for a 40 ON bits the number of updated addresses is 10000 - so even with e.g. 20% collision rate, the average value should be significant.
  • the second type of collisions between two SDRs is those caused by the fact they actually share the same 3 bits. This is the type of collisions that we are interested to find out because they may reveal correlation within underlying data. In case of our cart-pole example, is pole angle is encoded as 4 bits, all scenarios in which pole drops will show very high value of the associated bit triplets - in this case 4 triplets with excessive high penalty because all deaths by falling to the left will touch the same specific area in the value map.

Here-s an update, starting with a cartpole result which I find very interesting:

27: reward: 150, length:150
28: reward: 145, length:145
29: reward: 182, length:182
30: reward: 23, length: 23
Hooray we won!!! 31: reward: 500, length:500
Hooray we won!!! 32: reward: 500, length:500
Hooray we won!!! 33: reward: 500, length:500

That above is an agent’s output while learning to balance the cartpole. After first 30 rounds of learning and failing she didn’t failed till the 1000th episode.

More precisely the learning stage needed 2304 time steps, and 165ms.

I admit I cherry picked it, but on average its failures are below 50 at the beginning of the training.


The SDR value map I made for this is two-dimensional, mapping only bit pairs not triplets as I fancied above.
That makes it quite simple, small and fast.

What I also found interesting was a relatively high density small SDR (28/200) is much better than a large, sparse SDR (16/400)
Simply because neighbor states overlap, the map is filled quickly while encoding sufficiently smooth reward gradients.
Will follow with code…

PS also beware the agent ignores gym’s reward and runs its own… “survival” policy

Another detail:

since the agent updates its value maps using only past 18 steps preceding death, the number of actual time steps used to inform its unfailing policy is well under 1000.

Here-s the example code, needs numba
https://raw.githubusercontent.com/Blimpyway/sdr-machine/main/sdr_value_map.py
https://raw.githubusercontent.com/Blimpyway/sdr-machine/main/cartpole_play.py
https://raw.githubusercontent.com/Blimpyway/sdr-machine/main/sdr_util.py

1 Like

Here-s an update:

  • changing ValueMap to accept/read a whole column of values.
  • which makes it quite performant when using it as a classifier.

Here are some preliminary results on Fly-hashed MNIST digits:

Fly-hashing test and train digits to 50/2000 SDRs..  18.16s, done!

Training for 50 batches:
 1 Training with 10000 random digits..........1/2 bits accuracy: 84.00%
 2 Training with 10000 random digits..........1/2 bits accuracy: 90.65%
 3 Training with 10000 random digits..........1/2 bits accuracy: 92.00%
...
48 Training with 10000 random digits..........1/2 bits accuracy: 99.07%
49 Training with 10000 random digits..........1/2 bits accuracy: 98.90%
50 Training with 10000 random digits..........1/2 bits accuracy: 99.15%

Trained in  30.73s, a total of 12900 value map updates were done

Test accuracy:  96.71% ,   1.21s
Real train accuracy...  99.98%

Training for only 12 batches ends in under 9 seconds with pretty close (~96.5%) test accuracy

Larger SDR sizes/densities push the performance towards mid 97% in 3-4 minutes.

Notes:

  1. 1/2 bits accuracy during training means that each train SDR is searched with a random sample of only 1/2 bits (25 instead of 50). That has two positive effects - speeds up the training time significantly and prevents early overfitting.
  2. Fly-hashing is a simplified alternative to Spatial Pooler which I use to transform digit images into SDRs with fixed size and solidity.