This will be the first in a series of threads focused on tweaking existing HTM concepts to implement some of the capabilities involved in a reinforcement learning system. The methods here will deviate from how these systems are built in biology; the purpose is to see what can be done with the concepts that have been distilled so far, and hopefully gain some useful information in the process.
One of the core capabilities that must exist in a reinforcement learning system is the ability to differentiate between possible next steps, so that they can be evaluated and a choice can be made about which to execute. For the purposes of this thread, we will assume the logic for making a choice and executing it already exists (I’ll go into those concepts in another thread). Instead, I would like to discuss how to identify and differentiate between the possible next actions.
To put this in terms of the normal temporal memory process that we are all familiar with, imagine an HTM layer that has been reset (all cells and their associated structures are inactive). This layer has previously learned the following simple sequences:
A-B-C-D
A-C-D-F
At this point, you input “A”, and cells for both “B that follows A” and “C that follows A” become predictive. What you have is basically a branch with two predictions about what will come next:
```
  A
 / \
B   C
```
You now want to determine which of the predictive cells are associated with which potential next input (versus just having a bag of predictive cells with no knowledge of which of them go with which input). You need to be able to make this distinction if you want to evaluate the value of one potential next input versus the other.
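To make the distinction concrete, here is a minimal sketch in plain Python (not actual HTM library code) of the difference between a flat bag of predictive cells and cells grouped by the input they represent. The column indices and cell tuples are made-up values for illustration only.

```python
# Columns assumed to belong to each input's SDR (hypothetical values).
columns_for_B = {3, 7, 12}
columns_for_C = {5, 9, 14}

# After feeding "A", temporal memory leaves us with a flat set of predictive
# cells (column, cell_index), with no grouping by which input they represent.
predictive_cells = {(3, 1), (7, 0), (12, 2),   # cells that encode "B after A"
                    (5, 3), (9, 1), (14, 0)}   # cells that encode "C after A"

def group_by_input(cells, input_columns):
    """Return the predictive cells whose column belongs to a given input's SDR."""
    return {c for c in cells if c[0] in input_columns}

option_B = group_by_input(predictive_cells, columns_for_B)
option_C = group_by_input(predictive_cells, columns_for_C)
print("cells predicting B:", option_B)
print("cells predicting C:", option_C)
```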
Of course, the above is a simple case, and could be handled by simply labeling which columns go with B and which go with C. However, if you look at more interesting applications, each of the next steps will probably involve a combination of actions, and simply distinguishing the individual actions from each other is not enough to identify the possible next steps. Let me give a more interesting example.
Let’s say we are trying to design a system that uses HTM concepts and reinforcement learning to play Super Mario Bros. The actions in this game must be taken in combination. For example, to run left, you must press both the B button and the Left arrow. To jump right, you must press both the Right arrow and the A button. Simply knowing the difference between A, B, Left, and Right is not enough to take an intelligent action in the game. It must be possible to learn combinations of actions.
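As a rough illustration (the encoding here is an assumption on my part): suppose each button is encoded as its own small set of active bits, and a combined action is the union of its buttons’ bits.

```python
# Each button gets its own (made-up) set of active bits; a combined action is
# modeled here as the union of its buttons' bits.
sdr_a     = {0, 1, 2, 3}
sdr_b     = {10, 11, 12, 13}
sdr_left  = {20, 21, 22, 23}
sdr_right = {30, 31, 32, 33}

jump_left  = sdr_left  | sdr_a   # Left arrow + A
jump_right = sdr_right | sdr_a   # Right arrow + A

# If both actions are predicted at once, the active bits cover Left, Right, and
# A, and a system that only distinguishes individual buttons cannot recover the
# two combinations from that union.
both_predicted = jump_left | jump_right
print(sorted(both_predicted))
```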
Say, for example, we happen to be in a state where Mario is standing on a narrow ledge with a pit on either side.
Let’s say from past experience the system has learned that there are a couple of possible next actions in this scenario: Mario can jump to the right (Right arrow plus A) or he can jump to the left (Left arrow plus A). Before the system can evaluate which is the better option (jumping left to get the mushroom, for example), it first must be able to distinguish between the two options.
In an HTM system, the relevant columns in this state would include predictive cells for the Left arrow, the Right arrow, and A. So it is clear that the next possible steps involve those three actions, but it is not clear how the actions are related to each other. What we need is some way of associating the predictive cells so that we can see there are two predicted next actions: Left arrow plus A, and Right arrow plus A.
This could be even a bit more complicated than the above, in that some of the cells might be shared between the two options (such as if the system were using one cell per column in the layer which is predicting the next actions). For this reason, the association should probably be between segments rather than between the cells themselves.
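Here is a small sketch of that case, assuming a one-cell-per-column action layer where the A cell is shared by both options but each option grew its own distal segment on it (all ids are hypothetical):

```python
from collections import namedtuple

# One active distal segment: the cell it lives on, plus the (made-up) subset of
# state cells it connects to.
Segment = namedtuple("Segment", ["cell", "presynaptic_cells"])

# One-cell-per-column action layer: the A cell is shared by both options.
LEFT, RIGHT, A_BUTTON = "left_arrow", "right_arrow", "a_button"

# Segments active while Mario stands on the ledge. Each segment samples its own
# subset of the cells representing the current state.
seg_left        = Segment(LEFT,     frozenset({101, 102, 103}))
seg_a_for_left  = Segment(A_BUTTON, frozenset({101, 103, 105}))  # grown while learning "jump left"
seg_right       = Segment(RIGHT,    frozenset({101, 102, 104}))
seg_a_for_right = Segment(A_BUTTON, frozenset({102, 104, 106}))  # grown while learning "jump right"

# Associating segments keeps the two options distinct even though they share a cell.
jump_left_option  = {seg_left,  seg_a_for_left}
jump_right_option = {seg_right, seg_a_for_right}

shared_cells = {s.cell for s in jump_left_option} & {s.cell for s in jump_right_option}
print("cell shared by both options:", shared_cells)  # {'a_button'}
```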
One way this could be done is to add a new entity to the HTM system for associating all of the active segments at each time step. We reach the point in the temporal memory process where cells that were in the predictive state have transitioned into the active state, any new segments have been added, and permanence values have been adjusted. At this point, we can inject an additional step in which we check whether some percentage of the active segments are already associated with each other. If not, we add a new association; if there is already a close enough match, that association is picked and adjusted.
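Here is a very rough sketch of how that extra step might look in code. Segments are treated as opaque hashable ids, and the initial permanence, thresholds, and increment/decrement values are all placeholder numbers, not anything tuned or taken from an existing implementation:

```python
class Association:
    """Groups the segments that were active together at one time step."""

    def __init__(self, segments):
        # One permanence per segment, analogous to synapse permanences.
        self.permanences = {seg: 0.21 for seg in segments}  # initial value is a placeholder
        self.value = 0.0  # estimated value of taking this next step (topic for another thread)

    def connected(self, threshold=0.5):
        return {s for s, p in self.permanences.items() if p >= threshold}

    def overlap(self, active_segments):
        # Fraction of the currently active segments that this association
        # already holds with a connected permanence.
        if not active_segments:
            return 0.0
        return len(self.connected() & active_segments) / len(active_segments)

    def reinforce(self, active_segments, inc=0.05, dec=0.03):
        # Strengthen segments that are active again, weaken the ones that are
        # not, and add entries for newly active segments.
        for seg in list(self.permanences):
            if seg in active_segments:
                self.permanences[seg] = min(1.0, self.permanences[seg] + inc)
            else:
                self.permanences[seg] = max(0.0, self.permanences[seg] - dec)
        for seg in active_segments - set(self.permanences):
            self.permanences[seg] = 0.21


def associate_active_segments(associations, active_segments, match_threshold=0.6):
    """Match the currently active segments to a stored association, or create one."""
    best = max(associations, key=lambda a: a.overlap(active_segments), default=None)
    if best is not None and best.overlap(active_segments) >= match_threshold:
        best.reinforce(active_segments)
        return best
    new_association = Association(active_segments)
    associations.append(new_association)
    return new_association
```

The matching and reinforcement here are deliberately modeled on the way segments and synapses are matched and adjusted in temporal memory, just one level up (over whole segments instead of individual synapses).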
Within an association, we maintain a new set of permanence values, which allows the association to learn and adjust over time to better match the input stream. At each time step, we also know which association was chosen in the previous step to get to the current one. That allows us to adjust the actual value of making that decision (the logic behind how the value is determined will be the topic of another thread). The value might be stored on the association entity itself, or it might be used to drive predictions in another layer of cells that encode the value (either of which can be used in future choices when a similar state is encountered).
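Continuing the same sketch, once we know which association was chosen in the previous step and have some reward signal for the outcome, the stored value can be nudged toward that outcome. The learning rate and the reward signal itself are placeholders here; how the value is actually determined is left for the other thread.

```python
def update_value(association, reward, learning_rate=0.1):
    # Simple running average pulling the stored value toward the observed reward.
    association.value += learning_rate * (reward - association.value)
```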
Obviously this is still a very rough theory, so I’m looking for folks to poke holes in it as a way of hopefully distilling out something useful.