Repeating inputs as a timing mechanism

(breaking this discussion off from another topic)

Before I go into repeating inputs as a poor man’s timing mechanism, let me first give a simple example of where timing is important for a game agent.

Consider the case of a multi-coin block in Super Mario Brothers positioned two squares above Mario’s head (this is the environment I am using to test my own RL implementation, so it is an easy one for me to discuss). Each time Mario knocks a coin out of the block, there is a consistent cooldown period before another coin can be accessed from it. Additionally, there is a second, overall cooldown period on the block itself, starting from the first coin being accessed, after which no more coins can be accessed from the block at all.

If the game agent were to press the jump button immediately upon Mario touching the ground after the first jump, that second jump would produce no coin. Only every other jump would knock a coin out of the block, and the time spent on each non-productive jump would still count toward the overall block cooldown period, resulting in fewer coins being accessed overall.

In order to maximize the reward from a multi-coin block, the agent must wait a short time after Mario touches the ground before jumping again. Without a mechanism for timing, there is simply no way for the agent to distinguish the state where Mario has just touched the ground from the state a short time later.
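To make the timing concrete, here is a minimal sketch of the block mechanics as described above. The frame counts are made-up placeholders for illustration, not values from the actual game:

```python
class MultiCoinBlock:
    """Toy model of a multi-coin block: a per-coin cooldown plus an
    overall lifetime that starts counting from the first hit.
    Both constants are hypothetical, not taken from the SMB ROM."""
    COIN_COOLDOWN = 10    # frames before the next coin can be knocked out
    BLOCK_LIFETIME = 100  # frames from the first hit until the block closes

    def __init__(self):
        self.first_hit = None
        self.last_coin = None

    def hit(self, frame):
        """Returns True if hitting the block at this frame yields a coin."""
        if self.first_hit is not None and frame - self.first_hit >= self.BLOCK_LIFETIME:
            return False  # block has expired
        if self.last_coin is not None and frame - self.last_coin < self.COIN_COOLDOWN:
            return False  # still cooling down: a wasted jump
        if self.first_hit is None:
            self.first_hit = frame
        self.last_coin = frame
        return True

def coins_collected(jump_interval):
    """Total coins for an agent that hits the block every jump_interval frames."""
    block = MultiCoinBlock()
    return sum(block.hit(frame) for frame in range(0, 300, jump_interval))
```

With these numbers, an impatient agent that jumps every 6 frames wastes every other jump (`coins_collected(6)` gives 9 coins), while a patient one that waits out the cooldown and hits every 10 frames collects a coin on every jump before the block closes (`coins_collected(10)` gives 10).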

In order to give the agent at least some sense of timing, one solution is to use repeating inputs in sequence memory. This is done by feeding the TM inputs at a consistent rate (twice a second, for example); if the state hasn’t changed since the last cycle, the same input is fed in again. So if input A is Mario falling toward the ground with a block over his head, input B is Mario touching the ground with a block over his head, and input C is Mario jumping with a block over his head, then Mario jumping immediately after touching the ground would be depicted as the sequence A B’ C’, while Mario waiting a bit after touching the ground before jumping would be depicted as A B’ B’’ C’’. Since state C’’ leads to reward consistently, while state C’ does so less frequently, the agent can learn that it is better to wait than to jump immediately.
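To show why the repeated input makes B’ and B’’ distinguishable, here is a toy higher-order sequence model. It is not actual TM (it keeps perfect explicit context, whereas TM represents context with cells in mini-columns), but it captures the same idea: the same input in a different context is a different state, so each state can accumulate its own reward statistics:

```python
from collections import defaultdict

class ToySequenceMemory:
    """Toy stand-in for high-order sequence memory: each input is keyed by
    the full context of preceding inputs, so a repeated B (B'') is a
    different state than the first B (B')."""
    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])  # state -> [reward sum, count]

    def learn(self, sequence, reward):
        context = ()
        for symbol in sequence:
            state = (context, symbol)        # B after (A,) is distinct from B after (A, B)
            self.totals[state][0] += reward
            self.totals[state][1] += 1
            context += (symbol,)

    def value(self, context, symbol):
        total = self.totals[(tuple(context), symbol)]
        return total[0] / total[1] if total[1] else 0.0

tm = ToySequenceMemory()
for i in range(100):
    tm.learn("ABC", reward=i % 2)   # jumping immediately: C' pays off half the time
    tm.learn("ABBC", reward=1)      # waiting one extra step: C'' pays off consistently

print(tm.value("AB", "C"))   # C' -> 0.5
print(tm.value("ABB", "C"))  # C'' -> 1.0
```

The two values differ only because the repeated B was fed in; a first-order model that collapsed B’ and B’’ into one state could never tell the two jumps apart.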

This works OK for short timing requirements like the multi-coin block example above. It of course doesn’t scale well to scenarios where you need to wait a long time (you can really only learn about three repetitions of the same input before they start becoming indistinguishable again). This can be countered with pooling, but that of course increases the computational requirements of the system. I’m guessing biology has a more elegant solution, but being a programmer and not a neuroscientist myself, I can’t imagine what it is :smile:


For what it’s worth, when gradient-descent-based reinforcement learning systems like A3C use recurrent networks, they learn timing simply by repeating the input (the recurrent state contains information about the repetition, much as it does in HTM). However, they have it much easier, because backpropagation through time solves their temporal dependencies for them.

I should point out that my above example included an overly simplified explanation of a learned sequence. In practice, there is a bit more complexity (there is more than just TM involved).

Here is a visual representation of how this fits into my current system (only depicting a piece of the overall system here):


The feature inputs (encoded semantics like “ground under foot”, “block over head”, etc.) provide proximal input to both a sequence memory layer and a feature/location layer. The active bits from both of these layers provide distal input to the motor layer. Thus, in the case of a repeating input, the context for the motor layer will consist of 50% stable cells and 50% cells in a changing context (so the repetitions are semantically similar, but can be distinguished from each other and thus used for simple timing scenarios).
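A rough sketch of what that distal context looks like across two repeats of the same input. The layer sizes and sparsity here are invented for illustration, and SDRs are simplified to random sets of active cell indices:

```python
import random

random.seed(0)
N = 1024     # cells per layer (made-up size)
ACTIVE = 40  # active cells per layer (made-up sparsity)

def random_sdr():
    """Stand-in for a layer's set of active cells."""
    return frozenset(random.sample(range(N), ACTIVE))

feature_location = random_sdr()   # stable while the feature input is unchanged
seq_first_repeat = random_sdr()   # sequence layer: context changes on each repeat
seq_second_repeat = random_sdr()

# Distal context for the motor layer = union of both layers' active cells
# (the sequence layer's cells are offset so the two layers stay distinct)
context_1 = feature_location | {N + c for c in seq_first_repeat}
context_2 = feature_location | {N + c for c in seq_second_repeat}

overlap = len(context_1 & context_2) / len(context_1)
print(overlap)  # ~0.5: half the context is stable, half has changed
```

The two contexts overlap on roughly half their bits, which is exactly the property described above: similar enough to be recognized as the same situation, different enough to serve as a timing signal.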

Interesting scenario. I am not sure that you need to learn repeated inputs for this, though. If the agent just learned a sequence that involves doing unnecessary stuff that changes the state for the required number of frames, wouldn’t that work too? For example, instead of the agent idling for 4 frames, it could just repeatedly change its direction. Aesthetically it would look stupid, but functionally it would still learn the required task if there is a reward at the end. Repeated inputs don’t seem mandatory. What do you think?

Yes, you are correct in this particular scenario (and probably most others), but there will be cases where the “unnecessary stuff” either comes with a cost (for example, I could arbitrarily decide to impose a small negative feedback for every button state change the agent makes) or isn’t afforded by the particular environment (say, an extremely narrow platform that Mario would fall from if the agent moved him sporadically between jumps).
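That button-change penalty could be a simple reward-shaping term; the penalty value here is arbitrary:

```python
# Hypothetical reward shaping: each button whose state flips between frames
# subtracts a small penalty, so "sporadic" waiting behaviour carries a cost.
BUTTON_CHANGE_PENALTY = 0.01  # arbitrary value for illustration

def shaped_reward(env_reward, prev_buttons, buttons):
    """Environment reward minus a penalty per button-state change."""
    changes = sum(p != b for p, b in zip(prev_buttons, buttons))
    return env_reward - BUTTON_CHANGE_PENALTY * changes

# Idling between jumps costs nothing; releasing one button while
# pressing another flips two button states, so it costs 2 * 0.01.
print(shaped_reward(0.0, (0, 0, 0), (0, 0, 0)))  # 0.0
print(shaped_reward(0.0, (1, 0, 0), (0, 1, 0)))  # -0.02
```

Under this shaping, an agent that waits by toggling left/right every frame steadily bleeds reward, while one that genuinely idles does not.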

I should point out that I am not necessarily trying to create the perfect Super Mario Brothers game agent, so it could be that the trade-off of more stable state representations is worth the extra sporadic behaviors an agent could use to make up for the lack of a sense of timing. Still an interesting topic to explore.

Indeed. Hopefully experimenting with the event-based sensor will reveal whether repeated inputs are necessary, when I have the time.

Another alternative that occurred to me after considering the “doing unnecessary stuff” idea would be to add a motor action which shifts the perspective of the input (i.e. transforms the location input) without having any visible effect on the agent from a third-party perspective. The agent could use this anywhere it needed to “wait”, without actually performing sporadic movements when doing so would carry a cost. Implementing that in conjunction with a negative feedback on button state changes might improve the aesthetics of the agent in a waiting state.

Borrowing from @curt’s “sensor in a different room” concept from this thread, this could be further simplified by having the motor action shift some other SDR completely separate from its other sensory input. An agent in a waiting state, instead of “looking around”, would in effect be “thinking”.