 # Reinforcement learning - Markov decision process

Hello,
I am trying to formulate my problem as a Markov decision process {S, A, T, R}.
I have some questions:

1. When defining states/actions: must the state S_t+1 depend only on the state S_t and action a_t, or could it also depend on S_t-1 (when updating probabilities)?
2. Using Q-learning, if we use the learned policy when deciding the action, can we still consider the process an MDP (since here we use a memory of past actions)?

Hi, welcome to the forum.

If you include the memory itself (the predictive and active cell states) as part of S, then yes, you can still consider it an MDP. But I’d suggest not doing so, as the entire point of memory is to remember things.
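To make that concrete, here is a minimal sketch (not HTM code; `augment_state` and the tuple encoding are made up for this example) of how folding memory into the state restores the Markov property: a process that depended on s_t-1 becomes Markov in the augmented state.

```python
from collections import deque

# Hypothetical sketch: a process whose next observation depends on the
# previous observation as well is not Markov in the raw observation alone,
# but it IS Markov in the augmented state (current obs + short history).

def augment_state(obs, history):
    """Bundle the current observation with the remembered history.
    Transitions that depended on s_{t-1} now depend only on this tuple."""
    return (obs, tuple(history))

history = deque(maxlen=1)          # remember one step back
s0 = augment_state("A", history)   # ("A", ())
history.append("A")
s1 = augment_state("B", history)   # ("B", ("A",))
```

The trade-off is exactly the one mentioned above: the augmented state space grows with the amount of history you fold in.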

You are absolutely right. So pure Q-learning via HTM is not feasible. But that does not make HTM-based RL impossible, just as you can use an LSTM/GRU within a Q-learning agent to improve performance.

2 Likes

If I may, I have some follow-up questions. When the system starts using the Q-table to decide the action (instead of random actions), is it no longer an MDP?
Because, as far as I know, with Q-learning the probability of an action will depend on previous states/actions.
In fact, I want to understand the MDP in the context of reinforcement learning (Q-learning): when must the system verify the Markov property?

1. Should the property be verified only in the exploration phase, using a random policy, or must it hold even when using the optimal policy (via the Q-table)?
2. Using the optimal policy means that the probabilities depend on previous states. Does this mean that the Markov property no longer holds?

Thanks

I’ll try to answer both questions together.

In any RL method (Q-learning, Deep Q-learning, Actor-Critic, etc.) we generally have a parameter called the exploration rate. The exploration rate is the probability of the agent acting randomly. So the agent has a set probability of acting optimally and a set probability of acting like an idiot (randomly). This allows the agent to both explore and exploit the environment at the same time.
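That exploration/exploitation mix is usually implemented as epsilon-greedy action selection. A minimal sketch over one row of a Q-table (the function name and table layout are illustrative, not from any library):

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest Q-value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

q_row = [0.1, 0.9, 0.3]        # Q-values for 3 actions in some state
action = epsilon_greedy(q_row, epsilon=0.0)   # epsilon=0 always exploits -> 1
```

Note that the chosen action depends only on the current state's Q-values (plus a coin flip), so the policy itself doesn't break the Markov property of the environment.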

The Markov part isn’t that important. Assuming a Markov process gives you convenient mathematical tools, but you won’t die without it.

BTW, this is a forum for the HTM algorithm and related subjects (neuroscience, etc.) - not a general ML forum. Please stay on topic.

While it’s true that this forum is centered around HTM, this is the Machine Learning section of the forum. So I don’t mind topics purely about machine learning techniques as long as they stay here. In fact, I rather like the cross-pollination of ideas happening because of this sub-forum.

5 Likes

Back in the good old days (2009), Dileep George & Jeff Hawkins were using Markov chains in this paper:

Please note that Karl J. Friston was the editor.

4 Likes