AI: reinforcement learning from its own representation of space/time


(I’m purposely trying to use generic vocabulary where I can below, as I understand not everyone here is familiar with deep learning terms… I occasionally slip though. Please forgive me.)

Just saw this video where a reinforcement learning system using deep learning seems to have been able to:

  1. Internally create a representation of the game world that it experienced/observed/was exposed to.
  2. Generate predictions about what comes next in that world (common enough with recurrent neural networks for the past few years, though it has improved a lot in the past year).
  3. Then use its internal understanding of its experience to further train itself and improve.

Where this differs from previous approaches/experiments is that this system, instead of only being fed actual game data, was also allowed to feed on its own memory/generated understanding of the game world it had previously experienced, and then repeatedly generate prediction upon prediction over time. It predicted its own state over time in a simulation of its own making (hence Siraj calling it a “dream environment”: the simulated world was a figment of its own predictions).
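To make the “dream” idea concrete, here is a minimal sketch. It is not the architecture from the video or paper: it swaps in a toy lookup table of observed transitions, and the names `TinyWorldModel`, `observe`, `predict` and `dream` are made up for illustration. The point is only that, after learning from real steps, the model can be rolled forward on its own predictions with no game attached.

```python
from collections import Counter, defaultdict


class TinyWorldModel:
    """Toy stand-in for a learned world model: counts observed transitions."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        # Learn from real experience.
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        # Most frequently seen next state, or None if this pair was never seen.
        seen = self.counts.get((state, action))
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def dream(self, start, policy, steps):
        """Roll forward using only the model's own predictions."""
        trajectory = [start]
        state = start
        for _ in range(steps):
            nxt = self.predict(state, policy(state))
            if nxt is None:  # nothing in memory to draw on from here
                break
            trajectory.append(nxt)
            state = nxt
        return trajectory


# Real experience: a 1-D corridor with positions 0..4; action +1 moves right.
model = TinyWorldModel()
for pos in range(4):
    model.observe(pos, +1, pos + 1)

# "Dreaming": no environment involved, each step feeds on the previous prediction.
dreamed = model.dream(start=0, policy=lambda s: +1, steps=10)
print(dreamed)  # [0, 1, 2, 3, 4]
```

The dream stops at position 4 because the model never saw what happens after it, which is a small reminder that a dreamed world can only contain what was experienced or predicted.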

What I find interesting about the approach is that it seems like something that should be doable with HTM as well, with temporal memory and grid cells…

Possible challenges:
Encoders that could take/process screenshots?
—>Maybe incorporate some deep learning where the first few layers (which typically learn corners/edges, then higher level features) send their output to an SDR encoder, rather than using any other schemes to convert images to SDRs?
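One possible way to do that hand-off, sketched under assumptions: take the dense activations of an early layer and keep only the k strongest as active bits. The function name `features_to_sdr` and the numbers are invented for illustration, and a real SDR would be far larger and sparser (thousands of bits, a few percent active).

```python
def features_to_sdr(features, active_bits):
    """Return a binary list with 1s at the `active_bits` largest feature values."""
    if not 0 < active_bits <= len(features):
        raise ValueError("active_bits must be between 1 and len(features)")
    # Indices sorted by descending activation; ties go to the lower index.
    ranked = sorted(range(len(features)), key=lambda i: (-features[i], i))
    winners = set(ranked[:active_bits])
    return [1 if i in winners else 0 for i in range(len(features))]


# Stand-in for the output of an early convolutional layer (made-up numbers).
feature_vector = [0.1, 0.9, 0.0, 0.4, 0.7, 0.2, 0.05, 0.8]
sdr = features_to_sdr(feature_vector, active_bits=3)
print(sdr)  # [0, 1, 0, 0, 1, 0, 0, 1]
```

Whether the resulting bits carry the kind of semantic overlap HTM relies on would depend on the features themselves; this only fixes the sparsity.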

How to encode a “reward”?
—>Maybe have an encoder with a small bit space that receives feedback from the simulation/game about the game state, allowing the HTM system to associate different input states with different game states?
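A rough sketch of what such a small reward encoder could look like, assuming a scalar reward in a known range. The function `encode_reward`, its defaults (16 bits, 4 active, range -1 to 1) and the helper `overlap` are all hypothetical choices; the only idea borrowed from HTM practice is that similar values should share active bits.

```python
def encode_reward(reward, min_reward=-1.0, max_reward=1.0, size=16, width=4):
    """Encode a scalar reward as `width` contiguous active bits out of `size`."""
    # Clip so out-of-range rewards land in the extreme buckets.
    clipped = max(min_reward, min(max_reward, reward))
    fraction = (clipped - min_reward) / (max_reward - min_reward)
    start = round(fraction * (size - width))  # index of the first active bit
    return [1 if start <= i < start + width else 0 for i in range(size)]


def overlap(a, b):
    """Number of bits active in both encodings."""
    return sum(x & y for x, y in zip(a, b))


low = encode_reward(-1.0)
near_low = encode_reward(-0.8)
high = encode_reward(1.0)

# Similar rewards share bits, opposite rewards share none.
print(overlap(low, near_low), overlap(low, high))  # 3 0
```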

Number of passes over the initial input data before inducing “feedback” (feedback == output predictions of the HTM system fed back into itself)?
—> Researchers in the mentioned paper’s case allowed their convolutional neural network to first self-create a dense representation/compression of the world states. Unsure if HTM could do this, or if this representation is what might be fed into the HTM, which would essentially replace the memory unit in the referenced experiment.

Am I mistaken in my understanding of HTM’s abilities, or should this be doable? Current deep learning models for “World Models” seem to first require a ton of data in order to learn, again, via backpropagation. (It should be noted that, in order to perform well, DL systems generally require a lot of training examples…)


I’d say it is doable. I have tried replaying stored sequences internally based on the reward they yielded, so that salient events are stored more strongly. In other words, if something very important happens, solidify it by playing it again and again internally. This matters because rewards that surprise the agent should actually be rare for a well-developed agent, so you need some sort of one-shot learning, and internal replay allows exactly that.
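As a rough illustration of “replay more when the reward was surprising”, here is a small sketch. It is not the code I actually ran: the linear scaling, the `max_replays` cap and the function name `replay_counts` are assumptions made for the example.

```python
def replay_counts(episodes, expected_reward, max_replays=5):
    """Map each episode name to how many times it should be replayed.

    `episodes` is a list of (name, sequence, reward) tuples. Surprise is the
    absolute gap between the received and the expected reward.
    """
    surprises = {name: abs(reward - expected_reward)
                 for name, _, reward in episodes}
    largest = max(surprises.values(), default=0.0)
    if largest == 0.0:
        return {name: 0 for name in surprises}
    # Scale linearly: the most surprising episode gets `max_replays` replays.
    return {name: round(max_replays * s / largest)
            for name, s in surprises.items()}


episodes = [
    ("routine", ["A", "B", "C"], 0.0),
    ("mild", ["A", "B", "D"], 0.4),
    ("jackpot", ["A", "E", "F"], 2.0),
]
counts = replay_counts(episodes, expected_reward=0.0)
print(counts)  # {'routine': 0, 'mild': 1, 'jackpot': 5}
```

Expected events get no extra replay at all, while the rare surprising one is rehearsed several times, which is the one-shot flavour described above.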

One other implication of internal replay is discovering novel paths within sequences. In neuroscience this is referred to as hippocampal replay. Mice do this by internally replaying events in a different order, for example to build a Tolman (cognitive) map of the environment. This navigational mechanism can be extended to any task.

The main challenge with HTM was the functioning of the adaptation mechanisms. How do you keep feeding the layers with internal replays and still abide by the statistics of real events? How do you, for example, configure boosting, bumping or activation frequencies during these replays, where you cherry-pick some activations? During a replay, segments and synapses are newly created and even deleted. HTM synaptic permanences are strengthened statistically, and internal replay messes with them a lot. On the other hand, the only way to implement learning through any sort of internal replay is by messing with them. For example, competition mechanisms sometimes produce a very small instability that introduces a new column into an old representation. If internal replay happens exactly at that time, that instability is amplified greatly.

Here was the real problem. I assume the agent should not store the actual sensory inputs (can it, biologically?), so you can only replay with the stored previous neural activations (short-term episodic memory). Let’s say you have representations A->B->C->D occurring in a sequence. You start the internal replay and strengthen A->B. The segments of activation B are modified, so if you cut the replay here, the next time A occurs in reality, B may not be predicted exactly as before. But we are trying to replay the full sequence, so let’s continue. B->C happens internally, and now the same thing happens to the segments of C, but this time even your supposed input activation of B differs from what the current system would predict. At the end of a full replay you end up breaking the sequence a lot of the time: you modify every step of the sequence, which may prevent the whole sequence from occurring in the first place.

So I tried doing reversed replays in an extending fashion. For A->B->C->D, you start with C->D, then B->C->D, then A->B->C->D. This allowed the system to avoid breaking a sequence during replay, because every prior step of the sequence would be adapted to the new representation of its next step. However, you now have a very different A representation that the real input may not produce. At some point I decided to spend my energy on other parts of the system.
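The ordering of the reversed, extending replay can be sketched in a few lines. This only generates the schedule of what to replay; the actual learning step on each chunk is left out, and the function name `extending_reverse_replay` is made up for the example.

```python
def extending_reverse_replay(sequence):
    """Yield suffixes of `sequence`, shortest first, each at least two steps long."""
    for start in range(len(sequence) - 2, -1, -1):
        yield sequence[start:]


schedule = list(extending_reverse_replay(["A", "B", "C", "D"]))
for chunk in schedule:
    print(" -> ".join(chunk))
# C -> D
# B -> C -> D
# A -> B -> C -> D
```

Each pass re-learns the tail before prepending one earlier step, so every transition is trained against the already-updated representation of what follows it.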

TL;DR: It is doable and worked to some extent, but it requires many hand-crafted tweaks, and the system still ends up breaking some sequences. Rather than keeping semantic pointers to recent cortical activations to replay them later, the only solution seemed to be storing a carbon copy of the whole system, including the sensory inputs. However, this did not seem bioplausible, nor does it match how the hippocampus does it, judging from the research I have read on hippocampal replay.