So how is this progressing? Are you able to test your theories on some test bed? I’ve been away for a couple of weeks with my head down, attacking similar problems.
This is what I am going with at the moment. You have the distal depolarization caused by the sensory context and the apical depolarization caused by the reward circuitry (reinforcement layer). There is a main layer in my architecture that takes apical input from the reinforcement layer and distal input from the sensory layer. The reinforcement layer is in turn fed by this main layer (on both distal and proximal dendrites), and TD(lambda) values are computed on the resulting neurons of the reinforcement layer. So the activation on the reinforcement layer represents the “value” of the activation in the main layer, using its context too. Unlike normal temporal memory, the apical dendrites between these two layers are modified by the TD error signal (strengthened/created if the error is positive, weakened/destroyed if it is negative).

The reinforcement layer depolarizes the main layer based on reward, and the sensory layer depolarizes the main layer based on the sensory information for the next step. Where both depolarizations overlap, those neurons are activated, and they are the ones connected to motor outputs.

Additionally, I have two reinforcement layers mimicking the Go and No-Go circuits in the Basal Ganglia (Striatum D1 and D2). Their synapse adaptation rules are exact opposites, so the resulting activation on the main layer avoids some motor commands and activates others simultaneously. The main idea was to also have the capability to avoid things. The experiments I conducted lead me to believe that without some sort of avoidance you cannot reduce the search space enough to try for better motor commands among all the unnecessary ones. As a result, the agent gets stuck in behavior loops, trying to get rid of learnt unnecessary (sometimes cyclic) actions and rebuilding them some time later.
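Roughly, the overlap rule and the apical learning rule look something like this. This is a simplified numpy sketch, not my actual code; the names, the dense-matrix representation, and the learning-rate value are placeholders just to make the idea concrete:

```python
import numpy as np

def main_layer_activation(apical_depol, distal_depol):
    # Cells depolarized by both the reward circuitry (apical) and the
    # sensory context (distal) become active; these are the cells that
    # drive the motor outputs.
    return np.logical_and(apical_depol, distal_depol)

def adapt_apical(permanences, pre_active, post_active, td_error,
                 lr=0.05, polarity=+1):
    # Strengthen/create apical synapses when the TD error is positive,
    # weaken/destroy them when it is negative. polarity=-1 flips the
    # rule for the No-Go (D2-like) reinforcement layer.
    delta = polarity * lr * td_error * np.outer(post_active, pre_active)
    return np.clip(permanences + delta, 0.0, 1.0)
```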
A potential pitfall with a “chosen” activation is trying to make it meaningful. A chosen activation would contain columns from a lot of different “real” activations. As a result, you either have to converge this “imaginational” activation to the closest “real” activation, or somehow resolve the conflicts within it, because there will be contradictory/implausible groups of columns on the same chosen activation.
The solution I went with is to use the reward signal (the error in TD(lambda)) in a different way. Rather than using state values to boost columns, the error signal can be used to create and modify synapses between the motor/sensory layer and the reinforcement layer. It becomes more complex, but it gets rid of the problems introduced by chosen activations.
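For reference, the error term I mean is the usual TD(lambda) one. Here is a tabular version just to pin down the quantity (in my setup the values live on the reinforcement layer neurons rather than in a table, and the error then feeds the synapse adaptation sketched above):

```python
import numpy as np

def td_lambda_step(values, traces, state, next_state, reward,
                   gamma=0.9, lam=0.8, alpha=0.1):
    # One TD(lambda) update with accumulating eligibility traces.
    # The returned error drives the creation/modification of synapses
    # instead of being used to boost columns.
    error = reward + gamma * values[next_state] - values[state]
    traces *= gamma * lam
    traces[state] += 1.0
    values += alpha * error * traces
    return error
```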
By the way, I could not really grasp your exact architecture from the diagrams. For example, does the reinforcement layer’s sole input come from the motor layer? If so, what is the reasoning?
Thanks for sharing your progress.