I’m going through these lectures https://www.youtube.com/watch?v=2pWv7GOvuf0 and was wondering if anybody had thought about how RL and NuPIC would work together? Is this for later?
Numenta is not investigating this because of our focus on biological algorithms. But I think it is a good idea, and I know that some of our forum members have been working in this direction. (I’ll let you identify yourselves if you are interested in talking about it.)
If we follow the biological algorithms path, I was wondering if reward and value estimates could be fed in as part of the SDR, the same way actions are fed back as part of the input SDRs in the motor portion of the HTM algorithm? I am not saying this would be equivalent to RL, but wondering if this could steer the prediction or anomaly detection in some way?
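Just to make the idea concrete, here is a rough sketch (my own naming, not NuPIC API, and untested): encode the reward/value as its own small SDR and concatenate it with the sensory SDR, so downstream layers just see it as extra input bits:

```python
import numpy as np

def encode_scalar(value, min_val=-1.0, max_val=1.0, n_bits=128, n_active=16):
    """Bucketed scalar encoder: nearby reward values share active bits."""
    value = float(np.clip(value, min_val, max_val))
    start = int((value - min_val) / (max_val - min_val) * (n_bits - n_active))
    sdr = np.zeros(n_bits, dtype=bool)
    sdr[start:start + n_active] = True
    return sdr

def combine(sensory_sdr, reward_sdr):
    """Concatenate, so the reward appears as part of the input SDR."""
    return np.concatenate([sensory_sdr, reward_sdr])

sensory = np.random.rand(1024) < 0.02        # stand-in for an encoded observation
full_input = combine(sensory, encode_scalar(0.5))
print(full_input.size, int(full_input.sum()))
```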
RL is an important concept to incorporate into HTM, but we’re not quite there yet. Sensorimotor comes first, then we can dive into how an agent is rewarded for motor behavior.
Without any knowledge of how it is done in biology, my thought is this shouldn’t be too difficult to implement with existing HTM concepts if there were a way to represent a sum of both positive and negative rewards in an SDR representation (such that from the representation I could determine a level of positivity/negativity – perhaps by its level of overlap with two “maximum positive” and “maximum negative” SDRs).
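Roughly what I have in mind (untested, the sizes and seed are arbitrary): keep two fixed anchor SDRs and read a positivity/negativity level off the overlaps:

```python
import numpy as np

rng = np.random.default_rng(0)
N, W = 1024, 40                               # SDR size and active-bit count (assumed)
max_positive = np.zeros(N, dtype=bool)
max_positive[rng.choice(N, W, replace=False)] = True
max_negative = np.zeros(N, dtype=bool)
max_negative[rng.choice(N, W, replace=False)] = True

def valence(sdr):
    """Score roughly in [-1, 1] from overlap with the two anchor SDRs."""
    pos = np.count_nonzero(sdr & max_positive)
    neg = np.count_nonzero(sdr & max_negative)
    return (pos - neg) / float(W)

print(valence(max_positive))                  # ~ +1 (minus any chance overlap)
```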
First, you would need a normal high-order sequence learning layer (your typical HTM layer, 2048 columns, 32 cells per column) which learns contexts (this is a large component of the “Markov state”).
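For example, using the TemporalMemory class from recent NuPIC releases (treat this as an untested sketch; the import path may differ in older versions):

```python
from nupic.algorithms.temporal_memory import TemporalMemory

context_layer = TemporalMemory(
    columnDimensions=(2048,),   # 2048 minicolumns
    cellsPerColumn=32,          # 32 cells per column -> high-order context
)

# active_columns: sorted indices of the columns the spatial pooler turned on
active_columns = [7, 53, 201, 1999]
context_layer.compute(active_columns, learn=True)
context_cells = context_layer.getActiveCells()   # the current "context" SDR
```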
Second, you would need a second layer to encode motor commands. The composition of this layer (or layers) is where Numenta is currently focusing its attention, but for a simple system I am working on, I am implementing it as a low-order sequence learning layer (i.e. a layer with one cell per column, or where all cells in a column always burst) that becomes active when motor commands are executed. It grows distal connections with the high-order sequence learning layer, such that when a particular point in a sequence is encountered, a state of motor functions is predicted.
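Here is a deliberately simplified stand-in for that motor layer (plain Python, not NuPIC’s actual distal-segment machinery; the class name and thresholds are made up):

```python
import random

class LowOrderMotorLayer:
    """One-cell-per-column motor layer that stores, per motor cell, samples of
    the context SDRs it was active in, and predicts motor cells whose stored
    context overlaps the current one enough."""

    def __init__(self, n_columns=256, match_threshold=10, sample_size=20):
        self.match_threshold = match_threshold
        self.sample_size = sample_size
        self.segments = {c: [] for c in range(n_columns)}  # motor cell -> context samples

    def learn(self, context_cells, active_motor_columns):
        """Associate the current context with the motor cells that fired."""
        sample = set(random.sample(sorted(context_cells),
                                   min(self.sample_size, len(context_cells))))
        for c in active_motor_columns:
            self.segments[c].append(sample)

    def predict(self, context_cells):
        """Motor cells predicted by the current context."""
        context = set(context_cells)
        return {c for c, segs in self.segments.items()
                if any(len(context & s) >= self.match_threshold for s in segs)}
```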
Then you would need to generate the aforementioned SDR for each time step, and encode it in a third, low-order sequence learning layer. The cells in this layer would grow distal connections to the other two layers, such that when a particular context is re-encountered with the motor commands that were executed in that context, a “positivity/negativity SDR” would be predicted. The low-order sequences that would need to be learned are “context -> motor commands -> positivity/negativity SDR”.
– EDIT – Forgot to mention: this layer would obviously need to grow connections over multiple time steps, so that the predicted SDRs encode a positivity/negativity score for multiple steps into the future, not just the next step.
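A toy version of that third layer (my own simplification using a lookup table instead of distal segments; the horizon and discount values are arbitrary):

```python
from collections import defaultdict

class RewardMemory:
    """Toy table keyed by (context, motor): discounted sum of later valence."""

    def __init__(self, horizon=5, discount=0.9):
        self.horizon = horizon
        self.discount = discount
        self.table = defaultdict(float)
        self.history = []                        # recent (context, motor) keys

    def step(self, context_sdr, motor_sdr, valence_now):
        key = (frozenset(context_sdr), frozenset(motor_sdr))
        # credit the last few (context, motor) pairs for the score seen now,
        # so the prediction covers multiple steps into the future
        for age, past_key in enumerate(reversed(self.history[-self.horizon:])):
            self.table[past_key] += (self.discount ** (age + 1)) * valence_now
        self.table[key] += valence_now
        self.history.append(key)

    def expected_valence(self, context_sdr, motor_sdr):
        return self.table[(frozenset(context_sdr), frozenset(motor_sdr))]
```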
Ultimately you would have a system where, given a context, you could predict various actions, each of which would predict a positivity/negativity SDR. That could be used to generate a score for each action and pick the one with the highest score.
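Putting it together, action selection could look like this (hypothetical names, continuing the toy sketches above):

```python
def select_action(context_sdr, candidate_motor_sdrs, expected_valence):
    """expected_valence(context, motor) -> predicted positivity/negativity."""
    return max(candidate_motor_sdrs,
               key=lambda motor: expected_valence(context_sdr, motor))

# Example with a stand-in scoring function.
scores = {frozenset({1, 2}): 0.8, frozenset({3, 4}): -0.2}
best = select_action(frozenset({10, 11}),
                     [frozenset({1, 2}), frozenset({3, 4})],
                     lambda ctx, motor: scores[motor])
print(sorted(best))                              # [1, 2]
```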
Obviously this is purely hypothetical, and needs some actual testing to see if it would work. I happen to be working on a project where I could use a system like this (a robot that can learn to navigate a maze). I was initially going with a very different approach for driving the robot to the goal (using synapse permanence increases/decreases to remember good behavior and forget bad behavior), but I think I’ll spend some time seeing if I can get something like this to work instead. RL looks like a much better solution (it’s probably not a good idea to simply forget bad experiences – remembering them might help to avoid them).
Here’s a naïve approach to integrating Q-learning I did a while ago:
http://htm-community.github.io/sanity/demos/q_learning_2d.html
It reinforces actions taken from a state as if they were associations (or transitions). One problem with that is that there is no remembered concept of a bad state (or bad action). Instead that association has been punished into oblivion, so there is no concept at all, no ability to imagine a bad path.
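For readers unfamiliar with it, this is the standard tabular Q-learning update the demo is loosely based on (not the demo’s actual code):

```python
from collections import defaultdict

Q = defaultdict(float)                  # (state, action) -> estimated value
alpha, gamma = 0.1, 0.95                # learning rate, discount factor

def q_update(state, action, reward, next_state, actions):
    """One step of the textbook Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```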
Anyway I agree with the others that there is not much point working on reinforcement learning until we have a good unsupervised model.
I read a blog where the author tried to combine HTM and RL. He wasn’t using NuPIC, but he was building on the ideas of HTM.
The first blog posts:
https://cireneikual.com/category/uncategorized/page/2/
Some related blog posts continue where the previous ones end.
Yes, that’s @ericlaukien. He won a prize for this at the HTM Challenge: