(breaking this discussion off from another topic)
Before I get into repeating inputs as a poor man’s timing mechanism, let me first give a simple example of where timing matters for a game agent.
Consider the case of a multi-coin block in Super Mario Brothers which is positioned two squares higher than Mario’s head (this is the environment I am using to test my own RL implementation, so it is an easy one for me to discuss). Each time Mario knocks a coin out of the block, there is a consistent cool down period before another coin can be accessed from the block. Additionally, there is a second overall cool down period on the block itself, starting from the first coin being accessed, after which no more coins can be accessed from the block.
If the game agent were to press the jump button immediately upon Mario touching the ground after the first jump, that second jump would result in no coin being accessed. Only every other jump would access a coin from the block, and the time spent on each non-productive jump would still count toward the overall block cool down period, resulting in fewer coins being accessed overall.
In order to maximize the reward from a multi-coin block, the agent must wait a short time after Mario touches the ground before jumping again. Without a mechanism for timing, there is simply no way for the agent to distinguish between the state where Mario first touched the ground versus the state a short time after Mario touches the ground.
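To make the cost of mistimed jumps concrete, here is a toy simulation of the block. It is only a sketch: the tick counts (`JUMP_TICKS`, `COIN_COOLDOWN`, `BLOCK_WINDOW`) are made-up numbers chosen for illustration, not values from the actual game, and the function name `coins_collected` is hypothetical. Each jump is assumed to hit the block at the moment it starts.

```python
# Hypothetical timings, in arbitrary ticks (NOT measured from the real game).
JUMP_TICKS = 4      # time from one landing-and-jump to the next landing
COIN_COOLDOWN = 6   # minimum gap between two coins from the same block
BLOCK_WINDOW = 47   # block stops giving coins this long after the first coin


def coins_collected(wait_ticks, max_ticks=200):
    """Count coins when the agent waits `wait_ticks` after landing before jumping."""
    coins = 0
    first_coin_at = None
    last_coin_at = None
    t = 0
    while t <= max_ticks:
        # The block is still live if no coin has been taken yet, or the
        # overall window since the first coin has not run out.
        block_open = first_coin_at is None or t - first_coin_at <= BLOCK_WINDOW
        # The per-coin cool down must have elapsed since the last coin.
        cooled = last_coin_at is None or t - last_coin_at >= COIN_COOLDOWN
        if block_open and cooled:
            coins += 1
            last_coin_at = t
            if first_coin_at is None:
                first_coin_at = t
        # Next hit happens after this jump finishes plus any deliberate wait.
        t += JUMP_TICKS + wait_ticks
    return coins


if __name__ == "__main__":
    for wait in (0, 2):
        print(f"wait={wait} ticks -> {coins_collected(wait)} coins")
```

With these assumed numbers, jumping immediately wastes every other jump, while a short wait lines each jump up with the end of the cool down, so the waiting agent collects more coins inside the same overall window.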
In order to give the agent at least some sense of timing, one solution is to use repeating inputs in sequence memory. This is done by feeding the TM (temporal memory) with inputs at a consistent rate (twice a second, for example). If the state hasn’t changed since the last cycle, the same input is fed to the TM again. So if input A is Mario falling toward the ground with a block over his head, input B is Mario touching the ground with a block over his head, and input C is Mario jumping with a block over his head, then Mario jumping immediately after touching the ground would be depicted as sequence A B’ C’. Mario waiting a bit after touching the ground before jumping would be depicted as sequence A B’ B’’ C’’. Here the primes mark the same input appearing in a different sequence context, so C’ and C’’ are distinct states even though the raw input C is identical. Since state C’’ leads to reward consistently, and state C’ less frequently, the agent can learn that it is better to wait than to jump immediately.
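The fixed-rate sampling described above can be sketched in a few lines. This is an illustration only, not my actual implementation: `sample_states` and the event lists are hypothetical, the labels A, B and C are the ones from the example, and the sampler stands in for whatever encoder would feed the sequence memory.

```python
def sample_states(events, period, n_samples):
    """Sample a state timeline at a fixed rate.

    `events` is a list of (time, label) pairs sorted by time, starting at
    time 0. Whenever the state has not changed since the previous sample,
    the same label is emitted again, which is what encodes elapsed time.
    """
    samples = []
    idx = 0
    for k in range(n_samples):
        t = k * period
        # Advance to the most recent event at or before the sample time.
        while idx + 1 < len(events) and events[idx + 1][0] <= t:
            idx += 1
        samples.append(events[idx][1])
    return samples


# Two samples per second, as in the example above.
PERIOD = 0.5

# Mario lands at 0.5 s and jumps immediately at the next sample.
jump_immediately = [(0.0, "A"), (0.5, "B"), (1.0, "C")]
# Mario lands at 0.5 s but waits until 1.5 s before jumping.
wait_then_jump = [(0.0, "A"), (0.5, "B"), (1.5, "C")]

if __name__ == "__main__":
    print(sample_states(jump_immediately, PERIOD, 3))  # ['A', 'B', 'C']
    print(sample_states(wait_then_jump, PERIOD, 4))    # ['A', 'B', 'B', 'C']
```

The repeated B is the only thing separating the two sequences, so a sequence memory that represents inputs in context can treat the final C differently in each case.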
This works OK for short timing requirements like the multi-coin block example above. Of course, it doesn’t scale well to scenarios where you need to wait a long time (you can really only learn three repetitions of the same input before it starts being indistinguishable again). This can be countered with pooling, but that increases the computational requirement of the system. I’m guessing biology has a more elegant solution, but being a programmer and not a neuroscientist myself, I can’t imagine what it is.