I have been working on an experiment to implement reinforcement learning using HTM concepts. I am still debugging some problems (with the implementation, not necessarily with the theory itself), so I am not ready to release the code just yet. But I thought I would discuss the approach I am using, and maybe get some feedback and ideas. There are a few aspects which are not biologically plausible, but I am hoping some useful information might be distilled from this exercise.
When I refer to Layer X “projecting to” Layer Y below, what I mean is that cells in Layer Y grow distal connections to cells in Layer X, rather than to cells within the same layer. I may be using this terminology wrong, so please correct me if this causes confusion.
At a high level, the basic idea is to have three high-order sequence memory layers (multiple cells per column). The first layer learns patterns and context from sensory input, including positional information (see some of my other theories on the forum here for my thoughts on that). This layer projects to the second layer. The second layer receives input from motor commands, and projects to the third layer. The third layer receives reinforcement input (reward/punishment).
Each layer’s specific inputs pass through its own spatial pooler to select active columns. In other words, different columns are active in each layer at any given time. In the second layer, the columns represent the combination of all motor commands at a given point in time. In the third layer, rewards and punishments are simply encoded as scalars.
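To give a concrete picture of what I mean by encoding reinforcement as scalars, here is a rough sketch of one way the scalar could be turned into an SDR before being fed to the third layer's spatial pooler. The bit counts, value range, and active width are just placeholder numbers for illustration, not the values from my actual implementation:

```python
import numpy as np

def encode_scalar(value, min_val=-1.0, max_val=1.0, n_bits=400, w=21):
    """Encode a scalar (e.g. a reward in [-1, 1]) as a simple SDR:
    a contiguous run of `w` active bits whose position tracks the value.
    All parameters here are illustrative placeholders."""
    sdr = np.zeros(n_bits, dtype=bool)
    # Clamp the value and map it onto the range of valid start positions
    frac = (np.clip(value, min_val, max_val) - min_val) / (max_val - min_val)
    start = int(round(frac * (n_bits - w)))
    sdr[start:start + w] = True
    return sdr

# Example: a mild punishment and a strong reward produce distinct SDRs
punishment_sdr = encode_scalar(-0.3)
reward_sdr = encode_scalar(0.9)
```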
The columns and cells in the three layers are connected as follows:
With this setup, the first layer can make inferences about what sensory information will come next based on the current context. Columns represent the input, and cells within the columns represent the context.
The second layer makes inferences about what motor commands will come next based on the current context from the first layer. Columns in this layer represent the motor commands, and cells within the columns represent the sensory context.
The third layer makes inferences about rewards or punishments that will come next based on the current context from the second layer. Columns represent the reinforcement, and cells within the columns represent the sensory-motor context. The columns in this layer are divided into three groups: reward, punishment, and novelty (I’ll explain novelty in a bit).
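To make the wiring concrete, here is a rough sketch of the three-layer arrangement in code. The class, the column/cell counts, and the equal split of the third layer's column groups are just illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """One high-order sequence memory layer: its own spatial pooler selects
    active columns, and its cells grow distal segments to cells in the layer
    named by `context_layer` (rather than to cells within the same layer)."""
    name: str
    n_columns: int
    cells_per_column: int
    context_layer: "Layer | None" = None  # layer this one grows distal connections to

# Layer 1: columns = sensory input, cells = context
sensory = Layer("sensory", n_columns=2048, cells_per_column=32)

# Layer 2: columns = combinations of motor commands, cells = sensory context
motor = Layer("motor", n_columns=1024, cells_per_column=32, context_layer=sensory)

# Layer 3: columns = reinforcement, cells = sensory-motor context
reinforcement = Layer("reinforcement", n_columns=600, cells_per_column=32,
                      context_layer=motor)

# Column groups in the third layer (the equal split is my assumption)
REWARD_COLS = range(0, 200)
PUNISH_COLS = range(200, 400)
NOVELTY_COLS = range(400, 600)
```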
The reason for having separate columns (and cells) for reward versus punishment is to allow for both positive and negative reinforcement (versus positive reinforcement only). When evaluating a particular action, the system must be able to predict both the positive and the negative future outcomes (otherwise the union of predicted outcomes would always be a purely positive representation, and the system could never learn to avoid negative experiences).
How good or bad a particular set of motor commands is in a given context depends not just on the immediate reward/punishment in that state, but also on the predicted rewards/punishments of possible next actions. This allows the system to take actions which might have an immediate negative result in order to achieve a future positive result.
Active cells depict the immediate reward or punishment, and predictive cells depict future rewards and punishments from a given context. The density of predictive cells indicates the level of rewards and punishments. This is roughly equivalent to Backward View TD(λ) (not exactly, but the concept is similar). For more info about Backward View TD(λ) from a mathematical perspective, see https://youtu.be/PnHCvfgC_ZA starting at 1:30:26 (and subsequent videos in the lecture series go into it in further depth).
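For reference, the standard Backward View TD(λ) update from that lecture looks like this, where V is the value estimate, E the eligibility trace, α the learning rate, γ the discount factor, and λ the trace decay (my layer approximates this only loosely, via predictive-cell density rather than an explicit value table):

```latex
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \\
E_t(s)   &= \gamma \lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s) \\
V(s)     &\leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \quad \text{for all } s
\end{aligned}
```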
Besides rewards and punishments, I have also introduced the concept of “novelty”. These columns represent the level of unknown outcomes a particular action might lead to (i.e. future actions down a particular path that the system has not yet tried). The purpose of this is to allow the system to explore actions it hasn’t tried yet, rather than always settling on the very first positive action it has found in a particular context.
The system will have a curiosity level that grows over time, and is reduced any time it does something novel. The more novel a path is, the more the system’s curiosity is satisfied. A combination of novelty score and curiosity level can eventually outweigh punishments that the system has encountered in the past, and cause it to try a particular action again in order to explore subsequent actions down that negative path that it hasn’t tried yet (and which could lead to rewards).
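Here is a minimal sketch of the curiosity bookkeeping I have in mind. The growth rate and how much a novel action satisfies curiosity are placeholder numbers, not the actual values I'm using:

```python
class Curiosity:
    """Tracks a curiosity level that grows a little every timestep and is
    reduced whenever the system does something novel (the more novel the
    path, the more curiosity is satisfied). Rates are placeholders."""

    def __init__(self, growth=0.01, satisfaction=0.5):
        self.level = 0.0
        self.growth = growth              # added to the level every timestep
        self.satisfaction = satisfaction  # fraction of a novelty score that satisfies curiosity

    def step(self):
        self.level += self.growth

    def on_novel_action(self, novelty_score):
        # More novel actions satisfy more curiosity; never drop below zero
        self.level = max(0.0, self.level - self.satisfaction * novelty_score)
```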
A breakdown of the process:
- SP process to select active columns in the first layer
- TM process (Activate, Predict, Learn) for cells in the first layer
- Increment curiosity level
- Upon motor commands, SP process to select active columns in the second layer
- TM process for cells in the second layer (distal connections with cells in the first layer)
- Imagine possible actions (simulate activating input combinations without learning and compare predicted reinforcement)
- Novelty weight increases with curiosity level
- Choose action with highest score (novelty * curiosity + reward - punishment; see the sketch after this list)
- Upon reinforcement, SP process to select active columns in the third layer
- TM process for cells in the third layer (distal connections with cells in the second layer)
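Putting the scoring step together, here is a small self-contained sketch of the action selection formula from the list above. The candidate densities are made-up numbers, and reading the reward/punishment/novelty predictive-cell densities out of the third layer is assumed to happen elsewhere:

```python
def choose_action(candidate_scores, curiosity_level):
    """Pick the action with the highest combined score, per the formula in
    the list above: novelty * curiosity + reward - punishment.
    `candidate_scores` maps each imagined action to a
    (reward, punishment, novelty) tuple of predicted-cell densities."""
    def score(rpn):
        reward, punishment, novelty = rpn
        return novelty * curiosity_level + reward - punishment
    return max(candidate_scores, key=lambda a: score(candidate_scores[a]))

# Example: three imagined actions with (reward, punishment, novelty) densities
candidates = {
    "turn_left":  (0.10, 0.05, 0.0),
    "turn_right": (0.00, 0.30, 0.8),   # previously punished, but largely unexplored
    "move_ahead": (0.25, 0.00, 0.1),
}
print(choose_action(candidates, curiosity_level=0.2))  # -> "move_ahead"
print(choose_action(candidates, curiosity_level=1.0))  # -> "turn_right" (curiosity outweighs the punishment)
```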
The system could be used in an online learning mode, where it initially chooses novel actions and receives reinforcement, then uses that to take better actions when it re-encounters semantically similar contexts.
The system could also be trained in a supervised mode, where it doesn’t take any actions of its own, but learns by observing (for example, observing a human user). A combination of the two could also be used (first being trained by a human user, then enabling the ability to take actions on its own).