What does it mean to attach a value to an SDR? The best way I can put it is with an example.
Let’s say we have a reinforcement learning agent that uses SDRs to encode the state it observes in its environment. It can be based on HTM or another type of predictor that uses SDRs as an underlying representation.
Now, in the RL environment, the agent encounters positive or negative rewards only in certain situations. One problem with TM or other sequential state predictors is that rewards and penalties are usually received in only a few states, so they can be very sparse in the time domain.
A value map is useful to infer value from current and predicted states.
More precisely, for an agent that can
predict(current_state, action) -> next_state
it would be useful to learn how valuable or important an immediate future state might be, given that it can be predicted from the current state and the palette of available actions.
A more concrete example is gym’s CartPole: every time the cart leaves the track or the pole tilts more than about 12 degrees from vertical, the current round ends, and the agent can treat that as a high negative reward.
An HTM-like memory can be pretty good at modelling physics with the short-term consequences of small actions, but it may find longer-term consequences quite difficult to learn.
The benefit that a finer-grained time encoding (e.g. twice as many observations per second) brings to the accuracy of the physical model is counterbalanced by the doubling of the number of steps between relevant events.
Moving further with the CartPole example:
If the reward for falling is -100, then we can mark the previous 10 steps with scores whose absolute value increases toward the fall, so the “danger score” for the state SDR observed 5 steps before falling would be -50.
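The back-marking step can be sketched as a simple linear ramp. This is only one possible reading of the description: the function name `ramp_scores` and the exact ramp shape are assumptions, chosen so that the state 5 steps before a -100 fall gets -50.

```python
def ramp_scores(final_reward, horizon):
    """Scores indexed by k, the number of steps before the terminal event.

    k == 0 is the terminal state itself (full reward); the score shrinks
    linearly and reaches zero at k == horizon. The linear shape is an
    assumption, not something the description prescribes.
    """
    return [final_reward * (horizon - k) / horizon for k in range(horizon + 1)]


scores = ramp_scores(-100.0, 10)
print(scores[5])  # score for the state observed 5 steps before falling
```

Other shapes (for example an exponential decay) would fit the same idea; the linear one is just the easiest to check by hand.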
Then we can write these scores into a value map. At every future time T, the TM, knowing the “physics”, can predict the possible states at T+1 for every available action, and the agent algorithm can query the value map to pick the least dangerous action.
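A minimal sketch of that write/query loop follows. Everything here is hypothetical: `ValueMap` keeps a running mean score per SDR bit (just one simple way to build such a map), SDRs are plain sets of active bit indices, and `toy_predict` is a lookup table standing in for a real TM.

```python
class ValueMap:
    """Toy value map: a running mean score for every SDR bit (an assumption)."""

    def __init__(self):
        self.total = {}  # sum of scores written per bit
        self.count = {}  # number of writes per bit

    def write(self, sdr, score):
        for bit in sdr:
            self.total[bit] = self.total.get(bit, 0.0) + score
            self.count[bit] = self.count.get(bit, 0) + 1

    def query(self, sdr):
        # Average the per-bit means over the active bits seen before;
        # an SDR made only of unseen bits is treated as neutral (0.0).
        known = [self.total[b] / self.count[b] for b in sdr if b in self.count]
        return sum(known) / len(known) if known else 0.0


def pick_action(value_map, predict, state, actions):
    """Choose the action whose predicted next state has the highest value."""
    return max(actions, key=lambda a: value_map.query(predict(state, a)))


# Toy stand-in for the predictor: each action leads to a fixed next state.
safe = frozenset({1, 2, 3})
risky = frozenset({7, 8, 9})
toy_model = {"left": risky, "right": safe}


def toy_predict(state, action):
    return toy_model[action]


vm = ValueMap()
vm.write(risky, -50.0)  # e.g. a state seen 5 steps before a fall
best = pick_action(vm, toy_predict, frozenset({4, 5, 6}), ["left", "right"])
print(best)
```

Because scores are stored per bit, states that share bits with a known dangerous state inherit part of its score, which is the kind of generalisation one would hope to get from SDR overlap.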
An interesting property of a value map is its permanence. Once the agent learns to stay upright, the danger map remains unchanged and “vigilant” even after a million successful rounds.
Even more interesting, for a highly “dangerous” (or valuable) state the agent can figure out which of the state bits carry that value or danger. This is an amazingly powerful feature because it allows the agent to write an automatic policy that both identifies particular “dangers” and maps them directly to “recommended actions” in a simpler, quick-reaction specialist “autopilot”.
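One way the bit-level attribution might work, assuming the map stores a mean score per bit, is to rank the active bits of a state by that mean. The helper names `per_bit_means` and `danger_bits` and the sample data are made up for illustration.

```python
def per_bit_means(samples):
    """Average the scores of all samples in which each bit was active."""
    total, count = {}, {}
    for sdr, score in samples:
        for bit in sdr:
            total[bit] = total.get(bit, 0.0) + score
            count[bit] = count.get(bit, 0) + 1
    return {bit: total[bit] / count[bit] for bit in total}


def danger_bits(sdr, bit_value, top=2):
    """Return the active bits with the lowest (most negative) mean score."""
    known = [b for b in sdr if b in bit_value]
    return sorted(known, key=lambda b: bit_value[b])[:top]


# Hypothetical (state SDR, score) pairs collected from past rounds.
samples = [
    ({1, 2, 3}, -100.0),
    ({1, 4, 5}, 0.0),
    ({3, 6, 7}, -80.0),
    ({1, 6, 8}, 0.0),
]
bit_value = per_bit_means(samples)
worst = danger_bits({1, 2, 3}, bit_value)
print(worst)
```

Bit 1 also appears in harmless states, so its mean is diluted and it drops out of the top of the ranking. A small table mapping such bits to actions would then be a candidate for the “autopilot”.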
Another use for a value map could be as a guide for exploring and learning environments. If a value map for “novelty” is simply incremented for every new state, then later the agent can use it both as a sensor for the familiarity of the current state and to search for unexplored spaces within its environment.
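A rough sketch of such a novelty map, assuming per-bit visit counters and SDRs given as sets of active bit indices. The class `NoveltyMap` and its methods are hypothetical names, not part of any HTM library.

```python
from collections import Counter


class NoveltyMap:
    """Toy novelty map: counts how often each SDR bit has been active."""

    def __init__(self):
        self.visits = Counter()

    def observe(self, sdr):
        # Increment the counter of every active bit by one.
        self.visits.update(sdr)

    def familiarity(self, sdr):
        # Mean visit count over the active bits; 0.0 means never seen.
        if not sdr:
            return 0.0
        return sum(self.visits[b] for b in sdr) / len(sdr)


nm = NoveltyMap()
for _ in range(3):
    nm.observe({1, 2, 3})

candidates = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8, 9})]
# To explore, prefer the predicted state that looks least familiar.
target = min(candidates, key=nm.familiarity)
print(nm.familiarity({2, 3, 4}), sorted(target))
```

The same counter serves both purposes from the text: a low reading on the current state signals unfamiliar territory, and a low reading on a predicted state marks it as a place worth visiting.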