Branching predictions

This will be the first of a series of threads focused on tweaking existing HTM concepts for use in implementing some of the capabilities involved in a reinforced learning system. The methods here will deviate from how these types of systems are designed in biology – the purpose is to see what can be done with the concepts that have been distilled so far, and hopefully gain some useful information in the process.

One of the core capabilities that must exist in a reinforced learning system, is the ability to differentiate between possible next steps, so that they can be evaluated and a choice can be made on which to execute. For the purpose of this thread, we will assume the logic behind making a choice and executing on that choice already exists (I’ll go into those specific concepts in another thread). Instead, I would like to discuss how to identify and differentiate between next possible actions.

To put this in terms of the normal temporal memory process that we are all familiar with, imagine you have an HTM layer has been reset (all cells and their associated structures are inactive). This layer has previously learned the following simple sequences:

A-B-C-D
A-C-D-F

At this point, you input “A”, and cells for both “B that follows A” and “C that follows A” are now predicted. What you have is basically a branch with two predictions about what will come next

 A
/ \
B C

You now want to determine which of the predictive cells are associated with different potential next inputs (versus just having a bag of predictive cells without any knowledge of which of them go with which input). You need to be able to make this distinction if you want to evaluate the value of one potential next input versus the other.

Of course the above is a simple case, and could be handled by simply labeling which columns go with B and which go with C. However, if you look at some more interesting applications, each of the next steps will probably involve a combination of actions, and simply distinguishing the individual actions from each other is not enough to be able to identify what are the possible next steps. Let me give a more interesting example.

Let’s say we are trying to design a system that uses HTM concepts and reinforced learning to play Super Mario Bros. The actions that are taken in this game must be done in combination. For example, to run left, you must press both the B button and the Left arrow. To jump right, you must press both the Right arrow and the A button. Simply knowing the difference between A, B, Left and Right is not enough to take an intelligent action in the game. It must be possible to learn combinations of actions.

Say for example, we happen to be in a state where Mario is standing on a narrow ledge with a pit on either side:

Let say from past experience, the system has learned that there are a couple of possible next actions that can happen in this scenario – Mario can jump to the right (Right arrow plus A) or he can jump to the left (Left arrow plus A). Before the system can evaluate which is the better option (jumping left to get the mushroom, for example), it first must be able to distinguish between the two options.

In an HTM system, the relevant columns in this state might look something like this:

As you can see from this, it is clear that the next possible steps involve the Left arrow, Right arrow, and A. But it is not clear how those actions are related to each other. What we need is some way of associating the predictive cells such that we can see that there are two predicted next actions:

This could be even a bit more complicated than the above, in that some of the cells might be shared between the two options (such as if the system were using one cell per column for the layer which is predicting the next actions). For this reason, the association should probably be between segments (versus associating just the cells)

One way this could be done is to add a new entity to the HTM system, for associating all active segments in the system at each step. We reach the point in the temporal memory process where cells that were in predictive state transition into the active state, any new segments have been added, and permanence values have been adjusted. At this point, we can inject an additional step, where we check if some percentage of the active segments are already associated with each other. If not we add a new association. If there is already a close enough match, that one is picked and adjusted.

Within an association, we can adjust a new set of permanence values to allow the association to learn and adjust over time to better match the input stream. At each time step, we will also know which association was chosen in the previous step to get to the current one. That then allows us to adjust the actual value of making a decision (the logic behind how that value is determined will be the topic for another thread). The value might be stored on the association entity itself, or it might be used to drive predictions in another layer of cells that encode the value (either of which can be used in future choices when a similar state is encountered again in the future).

Obviously this is still a very rough theory, so looking for folks to poke holes in it as a way of hopefully distilling out something useful.

From your example of Mario’s current situation, I think the relevant columns in that state would actually look something like this:

With my understanding of HTM systems, inputting “A”, “Left Arrow” and “A and Left Arrow” are all distinct patterns of actions at one moment in time and cells for “A” and “Left Arrow” wont necessarily be shared with “A and Left Arrow”.

However, I may be misunderstanding you. Are you referring to at one time step “Left Arrow” is inputted and immediately after “A” is inputted and the union of these two choices are the options of the system?

Sorry, I think in my attempt to simplify the idea and avoid bringing up another topic that I want to dive into a more detailed discussion on another thread, I don’t think I really captured how I was practically considering using this system. The main problem with this depiction here is that it implies the columns contain cells depicting the current state, intermixed with cells depicting motor commands – i.e. that the current context is a combination of both inputs and outputs.

It is possible that is how how biological systems operate (I don’t know a lot about biology unfortunately), but without any further knowledge, I would connect things a little bit differently. Combining inputs and outputs into a single layer makes it difficult to deconstruct just the actions later if I want to use what is encoded in temporal memory to help drive motor commands. The system I am envisioning would use 3 separate HTM layers. One of the layers would be used to predict the “goodness/badness” of each state, to use in reinforced learning decisions. I’ll ignore that third layer for now, and focus just on the the first two layers.

The first layer would be a typical HTM layer. This layer would learn temporal sequences about the environment. In the case of Super Mario Bros, it might need to capture things like the position of Mario, relative location of map features like pits, blocks, and platforms, and relative position and movement direction of mushrooms and enemies. How all this would be encoded to capture semantic meaning is another topic.

The second layer would be set up a bit differently. It could be a single cell per column (or have all cells in the column always burst), because it would only need to learn low-order sequences. It could also be a lot shorter than 2048 columns in the case of Super Mario Bros, because the number of unique motor commands is so small. The columns in this layer would have proximal connections with the motor commands being input into the system. When I press any combination of buttons, some columns in the second layer would become active (i.e. controller button input is fed through the second layer’s spatial pooler). The sparsity would not be fixed (pressing two buttons would be twice as dense as pressing one button). The purpose of the proximal connections is that they would allow me to set up the system so that it can watch and learn as I play Super Mario Bros.

Another difference between this second layer and a typical HTM layer, is that the cells will not grow distal connections with other cells in their same layer. Instead they will grow distal connections with cells in the first layer. Thus when controller input happens, cells in the second layer become active, and they grow distal connections with cells that were previously active in the first layer. Thus, as the first layer is predicting what context is going to occur next, the second layer is simultaneously predicting what controller actions are going to occur. This is the reason why I say it will only need to learn low-order sequences (Context --> Motor Commands).

Here is a graphical depiction of the system described above:

Hopefully that clarifies things a bit. Because I was trying to avoid bringing up multiple layers, I think I ended up creating more confusion with my depiction of active and predictive cells.

Here is a new version of my earlier drawings, closer to how I am hoping to use this idea in practice.

I think I misunderstood your point when I answered you yesterday. What you are suggesting is that the combinations of inputs be semantically dissimilar to the individual inputs by themselves. Input “A” would not share any overlap with input “A + B” for example.

The only issue I see with that idea is it would require devising something which converts inputs to columns but intentionally scrambles any semantic similarities (versus preserving them like the spatial pooler). Given an input with 50% similar active cells compared to another input, the goal would be to generate a new, reproducible representation that had no overlap with the other input.

Of course in the case of Super Mario Bros, there are only 64 possible different input combinations (if we ignore “Start” and “Select” buttons, and the Player 2 controller), so they could probably just be built out manually. That would probably be the simplest solution for a system that plays NES games specifically, but in general I like to think about solutions that aren’t so bound to a specific problem (one of the reasons I got hooked on HTM).

Another way would be to encode “A that occurs simultaneously with Left Arrow” as different set of cells in the “A” columns than an “A that occurs by itself” or an “A that occurs simultaneously with B and Left Arrow”. To accomplish that, you would need to use a something like the system I talked about in my thread Learning a collection of features where order doesn’t matter. I think I’ll look into this idea as well… seems like it could work.