# Reinforcement Learning & Planning without NUMBERS?

I’ve been wondering for some time how you would implement RL if you only had at your disposal SDRs and other brain-like processes and techniques, i.e. no access to numbers, Q-values, numeric rewards, max/min algorithms, and other math shenanigans.

There is clearly some process resembling RL “inside” our brain, not to mention planning too.

So what are your ideas?
Do you have concrete algorithms, transformations, processes, and examples?

My idea so far is to first think of REPRESENTATION… the same way TM uses SDR-based variable-order Markov chains, there has to be a sparse representation algorithm that allows RL & planning to be expressed in a “native” lingo, enabling an easier, more scalable approach.

Pure SDRs don’t seem to fit the bill: although similarity can be measured as 1–40 bits of overlap, the usable similarity achievable by mixing orthogonal SDRs is at most 2–3 levels (the same goes for 50%-dense Kanerva hypervectors), instead of the full 1–40 range (which wouldn’t be enough either).

```
In [14]: (x.a + x.b) / x.a
Out[14]: 20   <=== overlap

In [15]: (x.a + x.b + x.c) / x.a
Out[15]: 13

In [17]: (x.a + x.b + x.c + x.d) / x.a
Out[17]: 4

In [18]: (x.a + x.b + x.c + x.d + x.e) / x.a
Out[18]: 5

In [19]: (x.a + x.b + x.c + x.d + x.e + x.f) / x.a
Out[19]: 3
```
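To make the dilution effect concrete, here is a minimal sketch of my own (not the session above): SDRs as sets of active-bit indices, with `+` taken as union-then-thin back to fixed sparsity and `/` as overlap count. The values it prints shrink roughly in line with the trend shown above.

```python
import random

N, W = 2048, 40  # SDR width and number of active bits (typical HTM-ish values)

def sdr():
    """A random SDR, represented as a set of active-bit indices."""
    return set(random.sample(range(N), W))

def bundle(*sdrs):
    """'+' above: union the active bits, then thin back to W bits
    so the result keeps the same fixed sparsity."""
    union = sorted(set().union(*sdrs))
    return set(random.sample(union, W))

def overlap(a, b):
    """'/' above: similarity as the number of shared active bits."""
    return len(a & b)

random.seed(0)  # fixed seed so the run is repeatable
a, b, c, d, e = (sdr() for _ in range(5))

for mix in (bundle(a, b), bundle(a, b, c), bundle(a, b, c, d), bundle(a, b, c, d, e)):
    print(overlap(mix, a))  # shrinks toward chance level as more vectors mix in
```

After only a few bundled vectors the overlap with any one component approaches the chance overlap of two random SDRs (about `W*W/N`, i.e. under one bit here), which is the 2–3 usable levels problem.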

Similarity seems to be the only form of direct comparison that would allow a gradual approach toward a min/max target, to mimic Q-values and rewards.
(BTW, external rewards shouldn’t exist, in my opinion.)

So there has to be another way…


The below is definitely incomplete and full of holes, but it’s something towards RL and planning. Take it and run with it if you like.

My intuition around this includes the following:
Encoders need to be pure functions (programmatically/mathematically), so that you can move back and forth between input and output in either direction and have a direct 1:1 mapping between them. This is important.
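As a toy illustration of that “pure function” requirement, here is a hypothetical scalar encoder whose decode inverts encode to within one bucket (all names, parameters, and ranges are my own, not from any existing encoder):

```python
def encode(value, min_v=0.0, max_v=100.0, n=100, w=10):
    """Encode a scalar as a run of w contiguous active bits out of n."""
    start = round((value - min_v) / (max_v - min_v) * (n - w))
    return set(range(start, start + w))

def decode(bits, min_v=0.0, max_v=100.0, n=100, w=10):
    """Invert encode(): the run's position maps back to the scalar,
    exactly up to the encoder's resolution (one bucket)."""
    return min(bits) / (n - w) * (max_v - min_v) + min_v
```

Here `decode(encode(42.0))` comes back as roughly 42.2: the round trip is exact up to the encoder’s resolution, which is the property that lets output be translated back in the other direction.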

Now imagine you have a spatial pattern that gets recognized and learned well. Keep this in mind, as it’s also important for the next part.

Imagine that there’s a state machine sitting on the output from your pool, casually glancing at these SDRs while also tracking some objective/goal, e.g. “keep the ‘hunger’ value down, keep the ‘happiness’ value up”, etc.

Given the above two pieces: if on a certain timestep something happens that results in a change beyond a moving-average threshold, positive or negative, for one of those tracked goals, take the SDR and stash it away as either a positive or negative factor.

Now that we have these stored SDRs associated with either positive or negative impacts on our state goals, we have a very shallow memory. When needed, we can take those SDRs and reverse-fire those columns, lighting up the IO space. That in turn should produce an encoding similar (but not identical) to what caused our state to change in the desired manner, which can then be translated back into output.
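A minimal sketch of that stashing step, assuming a single tracked drive (the “hunger” value) and an exponential moving average as the baseline against which changes are measured (class and parameter names are hypothetical):

```python
class GoalTracker:
    """Watch pooled SDRs and stash those that coincide with a large
    shift in a tracked drive value (here: 'hunger', lower is better)."""

    def __init__(self, threshold=0.2, alpha=0.1):
        self.avg = 0.0              # moving-average baseline of the drive
        self.alpha = alpha          # moving-average update rate
        self.threshold = threshold  # change needed before stashing an SDR
        self.positive = []          # SDRs that coincided with improvement
        self.negative = []          # SDRs that coincided with deterioration

    def step(self, sdr, hunger):
        delta = hunger - self.avg
        self.avg += self.alpha * delta
        if delta < -self.threshold:    # hunger dropped sharply: good
            self.positive.append(frozenset(sdr))
        elif delta > self.threshold:   # hunger rose sharply: bad
            self.negative.append(frozenset(sdr))
```

Everything inside the threshold band is simply forgotten, which is what makes this a very shallow memory rather than a full value function.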

Expand this out with a limited buffer of previous timesteps’ SDRs, and you may start to have a more established prediction of outcomes on our state machine (which could have its own temporal memory per state goal, where only states that help the goal are strengthened and others are weakened). This would allow our state machine to potentially intercede in the IO block (in which encoding of input and translation to output take place) as needed to fulfill an objective.
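The timestep buffer could be as simple as a fixed-length deque, so that a whole short sequence (not just a single pattern) gets stashed when a goal value shifts. A sketch, with the history length an arbitrary choice of mine:

```python
from collections import deque

HISTORY = 4                        # arbitrary: how many timesteps to keep
buffer = deque(maxlen=HISTORY)     # rolling window of recent SDRs

def observe(sdr, goal_changed):
    """Append the current SDR; if a tracked goal just shifted,
    hand back the whole recent sequence for stashing."""
    buffer.append(frozenset(sdr))
    if goal_changed:
        return list(buffer)
    return None
```

The stashed sequence is then the unit that a per-goal temporal memory could strengthen or weaken.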

This is definitely not a complete idea, but I think it’s worth sharing and exploring.


So you are saying: keep a buffer of SDRs generated based on the goal, and use them as a “thing” to compare against? Like a measuring stick?


Not only as something to compare against, but also as instructions for generating system output toward the desired goal, as defined/determined by the state machine. The SDRs come in as input, then get recycled as output, i.e. “This pattern (or series of patterns) helped advance our goal. Let’s try this.”

If the agent encounters an SDR within this positive sequence again, and there’s a need to fulfill that goal (“reduce the hunger value”), then those SDRs start backtriggering output from the system.
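One way to sketch that backtriggering: match the current SDR against a stored goal-advancing sequence by overlap, and if it matches a step, replay the following step as output (`min_overlap` is an arbitrary match threshold of mine):

```python
def best_match(sdr, stored, min_overlap=10):
    """Index of the stored SDR most similar to the current one,
    or None if nothing clears the overlap threshold."""
    best, score = None, min_overlap - 1
    for i, s in enumerate(stored):
        o = len(sdr & s)
        if o > score:
            best, score = i, o
    return best

def backtrigger(current, positive_seq):
    """If the current SDR matches a step in a goal-advancing sequence,
    replay the following step as the system's output."""
    i = best_match(current, positive_seq)
    if i is not None and i + 1 < len(positive_seq):
        return positive_seq[i + 1]
    return None
```

Because matching is by overlap rather than equality, a noisy re-encounter with a known state can still trigger the remembered next step.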


But the agent’s input is state info, and the agent’s output is an action… maybe they are used internally by a State-Action-State model?

I’m breaking the mold here, but bear with me for a moment… pretend that SDRs are both data (state) and code (action), depending on which direction the data is flowing. They are both the result of input from an encoder, and an instruction to that same encoder for how to produce an output.

Sure… in essence, that’s what temporal memory already is in my mind. All that’s lacking is that we’re not designing anything (as far as I can see) that takes advantage of that to produce backwards output from the system.

The key here is that encoders are pure functions (in that input is semantically the same as output, and that output from an encoder can be translated back in the other direction), and that the learned spatial pattern can be reverse-fired back into the encoder to produce a translated output.

One implication of this is that using a neural network as an encoder might make it difficult to use in reverse (translating output from the NN to its given inputs).

Here are some TM models I’ve been thinking about, or combinations of them, e.g. S/A + SA/S chained…

E.g. S/A:
The State is stored in the TM at every Action row that is 1 (both are SDRs), so that later State predicts Action.

(In TM lingo, every column on the Y axis here is a row in a minicolumn; i.e. HTM people should rotate this 90 degrees.)

```
             +-----------+
             |           |
 Prediction  |           |  Action
 <-----------+           +<------
   Action    |           |
             +-----+-----+
                   ^
            STATE  |

             +-----------+
             |           |
 Prediction  |           |  State
 <-----------+           +<------
   State     |           |
             +-----+-----+
                   ^
             SA    |

             +-----------+
             |           |
 Prediction  |           |  Action
 <-----------+           +<------
   Action    |           |
             +-----+-----+
            State  ^
            Goal   |
```
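A naive sketch of the S/A model: store (State, Action) SDR pairs and let the most similar stored state predict its paired action. This is nearest-neighbor by overlap, which glosses over how a real TM would learn the association, but it shows the State-predicts-Action direction:

```python
class SAMemory:
    """S/A model sketch: associate a state SDR with the action SDR active
    at the same timestep, so that state later predicts action."""

    def __init__(self):
        self.pairs = []  # list of (state SDR, action SDR) pairs

    def learn(self, state, action):
        self.pairs.append((frozenset(state), frozenset(action)))

    def predict(self, state):
        """Return the action paired with the most similar stored state."""
        if not self.pairs:
            return None
        return max(self.pairs, key=lambda p: len(p[0] & state))[1]
```

Chaining this with an SA/S memory (as in the second diagram) would give the State → Action → next-State loop the thread is circling around.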
