Reinforcement learning - Markov decision process

Salma · March 11, 2020, 11:32am

Hello,
I am trying to formulate my problem as a Markov decision process {S, A, T, R}.
I have some questions:

1- When defining states/actions: must the state St+1 depend only on the state St and the action at, or can it also depend on St-1 (when updating probabilities)?
2- Using Q-learning, if we use the policy when deciding the action, can we still consider the process an MDP (since here we use a memory of past actions)?

marty1885 · March 11, 2020, 12:44pm

Hi, welcome to the forum.

If you include the memory itself (the predictive and active cell states) as part of S, then yes, you can still consider it an MDP. But I’d suggest not doing so, as the entire point of memory is to remember things.

You are absolutely right. So pure Q-learning via HTM is not feasible. But that does not make HTM-based RL impossible, much like how you can use an LSTM/GRU within a Q-learning agent to improve performance.

Salma · March 11, 2020, 1:40pm

Thanks a lot. I appreciate your reply.
If I may, I have some follow-up questions. So when the system starts using the Q-table to decide the action (instead of random actions), is it no longer an MDP?
Because, as far as I know, with Q-learning the probability of an action will depend on previous states/actions.
In fact, I want to understand the MDP in the context of reinforcement learning (Q-learning): when does the system satisfy the required Markov property?

Must the property hold only in the exploration phase, using a random policy, or must it also hold when using the optimal policy (via the Q-table)?
Using the optimal policy means that the probabilities depend on previous states. Does this mean that the Markov property no longer holds?

Thanks

marty1885 · March 11, 2020, 3:32pm

I’ll try to answer both questions together.

In any RL method (Q-learning, deep Q-learning, actor-critic, etc.) we generally have a parameter called the exploration rate. The exploration rate is the probability of the agent acting randomly. So the agent has a set probability of acting optimally and a set probability of acting like an idiot (randomly). This allows the agent to both explore and exploit the environment at the same time.
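
As a rough illustration of the exploration rate, here is a minimal epsilon-greedy sketch in Python. The function name `epsilon_greedy`, the `q_row` values, and the rate of 0.1 are all made up for the example and not taken from any particular library.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q-value (first one wins ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for a single state with three actions.
q_row = [0.1, 0.7, 0.2]
rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(1000)]
```

With an exploration rate of 0.1, most entries in `picks` should be action 1 (the greedy choice), with the other actions showing up occasionally.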

The Markov part isn’t that important. Assuming a Markov process gives you convenient mathematical tools. But you won’t die without it.
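
For reference, the Markov property is usually stated as a condition on the environment's transitions. This is the standard textbook form, written with the St / at notation from the question:

```latex
P(S_{t+1} \mid S_t, a_t, S_{t-1}, a_{t-1}, \dots, S_0, a_0) = P(S_{t+1} \mid S_t, a_t)
```

Read this way, a greedy policy taken from a Q-table, \pi(s) = \arg\max_a Q(s, a), only looks at the current state when choosing an action. The Q-values were learned from past experience, but that is part of the agent's learning and not part of the environment's transition probabilities.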

BTW, this is a forum for the HTM algorithm and related subjects (neuroscience, etc.) - not a general ML forum. Please stay on topic.

rhyolight · March 11, 2020, 3:59pm

While it’s true that this forum is centered around HTM, this is the Machine Learning section of the forum. So I don’t mind topics purely about machine learning techniques as long as they stay here. In fact, I rather like the cross-pollination of ideas happening because of this sub-forum.

Bitking · March 12, 2020, 9:00pm

Back in the good old days (2009) Dileep George & Jeff Hawkins were using Markov chains in this paper:

Please note that Karl J. Friston was the editor.
