Reward hacking in simple HTM agents (using OpenAI Gym)

I’ve spent some time trying to apply traditional reinforcement learning algorithms in HTM (and debugging PyEtaler in the meantime). Anyway, I found something interesting. In short, a simple Q-learning-ish agent based on HTM is still prone to the age-old problem of reward hacking, even with HTM’s biological components.

I built an agent using a single Spatial Pooler, which acts like a Q table. The SP learns the relation between its environment, motor commands, and rewards. There’s no TM and no recurrent signal in this simple agent. The agent is then sent into OpenAI Gym’s FrozenLake-v0 environment with a modified reward to incentivize HTM, since long-term reward tracking is not yet possible.

After a few tries, the HTM agent learned to hop around in the same spot, accumulating small rewards over time (since the reward discourages dying or staying at the exact same spot, but rewards being alive). But the agent never explores, even with a high boosting strength.
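A minimal sketch of the kind of shaped reward described above (the exact values used in the original experiment are unknown, so these numbers are illustrative) shows why hopping in place is the winning strategy:

```python
# Hypothetical reward shaping: dying is punished, reaching the goal is
# rewarded, and merely surviving a step earns a small bonus. Standing on
# the exact same tile is mildly discouraged, so the agent hops instead.
def shaped_reward(done, reached_goal, moved):
    if reached_goal:
        return 1.0           # solved the episode
    if done:
        return -1.0          # fell into a hole
    reward = 0.01            # small "stay alive" bonus each step
    if not moved:
        reward -= 0.005      # discourage standing on the exact same tile
    return reward

# Hopping between two safe tiles collects the survival bonus forever,
# which is exactly the reward-hacking behavior observed.
total = sum(shaped_reward(False, False, True) for _ in range(100))
```

Under a reward like this, 100 steps of safe hopping are worth as much as actually solving the episode, so the agent has no pressure to explore.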


Can these kinds of problems be fixed using TBT? Or do other mechanisms have to be in place?

Source code:


I’m not an expert on RL algorithms, but maybe this is a difficult behavior to eliminate without long-term reward tracking. Given simple rules like “don’t stand still” and “stay alive”, it seems like pacing around in a safe location is the only logical conclusion if the agent is not able to consider more than what is right in front of it (not sure if that is the case for the algorithm you’re using, though).

Adding temporal memory might help, since it would allow an agent to model more complex states (high order vs low order). Not sure this helps with the long-term reward tracking, though. Forming something like “episodes” that can aggregate lots of small rewards and punishments is probably the way to handle that (temporal pooling may be a simple/naive way of implementing that).

Another thing I’ve been thinking about for a while is that, compared to traditional RL strategies, biological agents have another detail which probably drives more complex behavior – competing needs. In the real world there is never a simple global “reward signal” that an agent can focus entirely on maximizing. Staying alive by running in place may be a great strategy, until you start to dehydrate and have to venture out into potential danger to find water.


In the subcortex there are multiple nodes, each evaluating current needs and asserting a goal state based on which one screams the loudest. For example, thirst and hunger both drive action, but you might ignore food if you are really thirsty. After the thirst is satisfied, the hunger wins and you turn to finding food.
Maslow describes a pyramid of needs, and as you satisfy the lower needs, the higher-level goals are able to be activated. In the post on processing in the old brain, I build up to a global goal state, driven by inputs from local sub-processors.


Would it explore more if it took a random action 90% of the time?

When I’ve played around with RL, some algorithms will start with 90% random actions and 10% “do what the model predicts”, then slowly, over the course of the training session, scale down the random percentage and scale up the predicted-action percentage. This might help avoid overfitting to a local reward early on.
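The annealing scheme above can be sketched in a few lines. This is a generic epsilon-greedy helper, not taken from any particular library; the 0.9/0.1 endpoints match the percentages mentioned:

```python
import random

def epsilon_greedy_action(model_action, n_actions, epsilon):
    """Take a random action with probability epsilon, else the model's pick."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return model_action

def anneal(step, total_steps, eps_start=0.9, eps_end=0.1):
    """Linearly anneal epsilon from eps_start down to eps_end over training."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Early in training nearly every action is random; by the end the model’s own predictions dominate, which gives the agent a chance to discover better rewards before it locks in on a local one.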


I wrote communications software that worked on noisy channels, and I would suggest something not too rigid, with a behavior of its own.
In my case it varied in a way that probed and correlated with the problem(s) it was solving.
The net effect (intent) was a variable speed dependent upon the connection quality: extreme fault tolerance.

To clarify: upon X successive errors, the setting would step down until it hit the minimum. Successful sends would gradually speed it back up.
Successive failures would lead to other settings being changed, and possibly restarts.
So if 10% of the time you do a random action, its range or details should cover the problem space.
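The step-down/speed-up scheme described above can be sketched roughly like this. All the constants (rate range, error limit) are illustrative, not taken from the original software:

```python
class AdaptiveRate:
    """After N successive errors, step the rate down toward a minimum;
    each successful send nudges it back up toward the maximum."""
    def __init__(self, rate=8, min_rate=1, max_rate=8, error_limit=3):
        self.rate = rate
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.error_limit = error_limit
        self.errors = 0

    def on_send(self, ok):
        if ok:
            self.errors = 0
            self.rate = min(self.rate + 1, self.max_rate)
        else:
            self.errors += 1
            if self.errors >= self.error_limit:
                # Too many successive failures: back off one step.
                self.rate = max(self.rate - 1, self.min_rate)
                self.errors = 0
```

The point is that the behavior adapts to the channel rather than being fixed, which is the analogy being drawn to a fixed 10% exploration rate.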

I was trying not to force a 90% exploration rate - it’s not biological, and I hoped boosting would do the trick. But 90% exploration does work.

I didn’t add a TM to the agent because the world state is static and nothing is hidden from the agent, so there’s no need for memory. Hmm… how could HTM perform reward tracking? I’ve been trying, but it seems very difficult with the standard SP/TM and local learning rules. (And tracking rewards as a real number is also tricky.)

That’s a very good point!

The standard algorithms couldn’t, but a temporal pooling algorithm possibly could. An “episode” could be implemented as an “object” of sorts, where some of its features can include accumulated rewards and punishments. Actions, if stored in temporal memory, could possibly also be unfolded from it using a temporal unfolding algorithm. This is all just theory, though. I’ve been talking about it for a while, but I really need to implement it before I can say whether it is a very good theory or not.
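A purely illustrative sketch of the “episode as an object” idea above (all names are hypothetical, since this has not been implemented): a temporal pooler would form a stable representation over a sequence, and the episode aggregates the small rewards and punishments seen along the way, with actions stored for later unfolding:

```python
class Episode:
    """Hypothetical 'episode' object: one feature is the accumulated
    reward, and the stored actions could later be 'unfolded' in order."""
    def __init__(self):
        self.states = []        # SDRs (or state ids) seen during the episode
        self.actions = []       # actions taken, for later unfolding
        self.net_reward = 0.0   # accumulated reward, a feature of the episode

    def step(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.net_reward += reward
```

The interesting part would be learning a stable SDR for the whole episode so that its net reward can be associated with the starting conditions; this sketch only shows the bookkeeping side.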


I think we need to break up our concept of “columns” and “cells” just a bit, moving only slightly away from classical HTM. If we have a small 2-layer pyramid of shallow pools, I think we can create a memory-enabled state machine that is biological-ish.

First layer would be connected directly to the input space, like we’re all familiar with. It would have both SP and TM turned on (maybe boosting as well… play with it). Sitting on top of that would be multiple mini pools, also with shallow columns. Each pool represents one possible state change: one pool for health/energy increases, one for decreases, one for distance from target, etc. Their input space is the first pool. Their learning rules would only strengthen connections when their desired/tracked state changes in the desired manner (find an apple, hunger goes down, remember which lower columns were active to achieve that). The pyramid’s goal is to learn which input patterns lead to which changes of state. The TM of each state pool would be constantly trying to guess how to best increase that particular state.

So we have our pyramid. Next we need our goal tracker/IO modulator… akin to the thalamus… I’ll just call it that (though it isn’t perfect). This is where we set the intended priorities of the system, such as “eat, move left, sleep, etc”. It also has a very small and shallow pool, SP/TM, which receives and outputs an encoding of which state pool it allowed to activate on the previous timestep, its desired goal, current stats levels, etc. (all of which is also concatenated to the initial pool’s input space). It enables upper pools in the pyramid to fire downward, activating output from columns in our first pool.

Finally, we have our IO block. When our thalamus triggers a certain pool (or pools) to fire, those pools activate the lower pool’s columns to light up, activating their connections into the OUTPUT space (which is separate, but otherwise a clone of the input space in its dimensions and connections to base systems).

The key part of this is that the encoders for our encoding space know how to translate an encoding back to an output value. In essence, an encoder, given an encoding as input, should produce that same(ish) value as an output. As multiple columns fire and light up their connections in the IO space, those connections might need to cross a certain threshold count before being considered for de-encoding to an output, in order to eliminate some noise. In my mind, this same(ish) encoding/de-encoding is the hardest part.
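The “same(ish)” round trip can be illustrated with a toy scalar encoder whose `decode` recovers approximately the value it encoded. This is a generic sketch (parameters and class name are made up, not an existing HTM library API):

```python
class InvertibleScalarEncoder:
    """Toy scalar encoder with an approximate inverse: decode(encode(x)) ~ x."""
    def __init__(self, minval=0.0, maxval=1.0, size=100, w=5):
        self.minval, self.maxval = minval, maxval
        self.size, self.w = size, w   # total bits, active bits

    def encode(self, value):
        # Map the value to a run of w active bits inside the encoding space.
        frac = (value - self.minval) / (self.maxval - self.minval)
        start = round(frac * (self.size - self.w))
        return set(range(start, start + self.w))

    def decode(self, bits):
        # Map the active run back to a value: same(ish), not exact.
        if not bits:
            return None
        frac = min(bits) / (self.size - self.w)
        return self.minval + frac * (self.maxval - self.minval)
```

The reconstruction error is bounded by the bucket width, which is why the output is only “same(ish)”; the noise-thresholding step described above would happen before `decode` is called.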

I imagine something hooked up to a system like this would freak out for a bit, like having a seizure, but would eventually level out and modulate itself towards whatever goal was set up in the thalamus block.

I’d code this, but I’m trying to get some stuff off the ground on my side. Bug me if my idea above doesn’t make sense or isn’t clear enough.


Interesting architecture. Would you mind if I try to implement it in Etaler? I’m currently rearchitecting Etaler to support large systems.

I assume something like a Spatial Pooler but with multiple cells per column? How would this learn delayed rewards - like pressing a button that causes the room to light up 20 seconds later?

What is this pool learning? Might a small pool not be capable of learning that amount of data?

Interesting - if the output is trying to predict the input of the next time step, does that mean the input from the sensory organs is also predicted?

Yes, please! I put my thoughts here for anyone/everyone to try. I claim no ownership, and give them to the community as a whole. If it does turn out to be the greatest thing ever and somebody tries to patent it, I’ll chase them down and leave a bag of feces on their front porch, though.

They would learn trends and general directions rather than full-on delayed rewards like you mention… yet since it also has TM, it may eventually pick up successful “paths” through the problem space. This idea is definitely rough.

I’m drawing inspiration from the brain, where no single piece of the brain has the full picture of everything. Our thalamus is somewhat like the neocortex in parts of itself, but if we think about what information it’s getting, it’s getting:

  1. Input information from various lower systems of sensory input
  2. Output from the neocortex
  3. Output from other parts of itself that manage other types of input/output --> it connects internally with different thalamic nodes managing audio, visual, and somatosensory signals
  4. IO from other parts of the ancient brain…

To me and my potentially uninformed view, the thalamus looks like HQ of the semi-conscious mind, the observer of most IO, and, when necessary, the director of communications and work projects.

Somehow (Hebbian learning?) it manages to learn how to modulate/coordinate these different signals depending on different needs, such as shutting off background noise in a room so that you can focus on the person talking in front of you, or letting “us” know we’re hungry and need to start planning how to fix that. In essence, the thalamus MUST be learning which IO goes to which area, as well as basic relationships between the input/output passing through it. So setting it up as a SP/TM object, of unknown dimensions, seems like the logical first step, or at least one that’s familiar/known.

It’s also likely, as in our brain, that various different “pyramids” (macro columns, really) would all be connecting through the thalamus, thus another reason to keep pools and connections fewer rather than denser.

As for what sort of depth we give it in this type of architecture, we’d probably have to experiment a bit, but when in doubt, let the constraints of biology be our guide.

Also, just a random thought while writing the above… inputs from the state-change pools into this thalamus might also be in the form of a summed surprise/anomaly score encoded as a scalar (or its inverse, to show whether the outcome was predicted successfully or not).

Yes, in both the initial input pool and the state-change pools above it. Whether or not those predictions are used to produce output would be decided by the modulator (our thalamus) in this architecture. These, in conjunction with our thalamus, learn to guide the whole system as to which output(s) should be activated to correct a given state in our system (hunger, pain, fatigue, etc.).

Longer term (think “WestWorld” or Asimov here), that thalamical area is where we’d insert our “Three Rules” :smiley: .

I also think some component of spiking NNs could help on the macro level as well, where perhaps potential outputs from the cortical pyramids would be accumulated over multiple timesteps… for each state-change category, +2 for every correctly predictive state-change pool (as an inverse anomaly score), -1 for every incorrectly predictive pool.
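The +2/-1 accumulation above can be sketched directly (the weights are the ones given in the text; the function and data layout are hypothetical):

```python
# Over several timesteps, each state-change category tallies +2 for a
# correctly predictive pool and -1 for an incorrectly predictive one.
def accumulate_votes(history, correct_weight=2, incorrect_weight=-1):
    """history: list of dicts mapping category -> was_prediction_correct."""
    totals = {}
    for step in history:
        for category, correct in step.items():
            delta = correct_weight if correct else incorrect_weight
            totals[category] = totals.get(category, 0) + delta
    return totals
```

A pool that predicts correctly most of the time accumulates a clearly positive score, which is the spiking-style evidence accumulation being suggested.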

Hopefully I didn’t make things more confusing.


Asking a SP to learn a trend is kinda out of scope, I guess?

Also, if the output of the entire pyramid is just the input (of the next time step), wouldn’t it be the same as just passing the input to the motors?

This reminded me of a glaring problem of HTM that has bugged me for a while: how can signals be combined in a meaningful way? Simply concatenating SDRs together and then sending the result into an SP won’t cut it, because SP topology stops signals from being merged and learned efficiently. Stacking multiple SPs also doesn’t look like a viable solution, due to runaway Hebbian learning. And there seem to be no other ways to integrate signals from multiple sources. Deep learning uses gradients to guide the integration, but HTM’s local learning rules prevent it.

Hmm, could you explain how this might work? I don’t see a path to actually implement this (given the issue described above). I think I’m thinking in too much of a computer-science direction. Could you offer some inspiration?

Asking a SP to learn a trend is kinda out of scope, I guess?

This is why we have TM.

As I understand the theory, this is the job of the “output layer” where long-distance lateral voting occurs. I’ve not come up with a good implementation of this yet myself, though.


I think I’ll just need to break this down and draw it out to illustrate what I’m thinking. I’ll try plopping that here by this time tomorrow. But you’re right in that we can’t go too “deep” with pool stacking, and our brain seems to reflect that as well.

Here’s a visual draft of my proposed model.

Input flow:

  1. Modulator (thalamus) receives raw input values from sensor/motor systems and encodes them.
  2. This is then concatenated with the current state representation (encoded), and with which state pools were back-activated on the previous time-step.
  3. First level HTM node processes input (producing both SP/TM data).
  4. The SDR representation of the first level node is fed to the state-change pools, concatenated with the current state encoding (since we see that the thalamus connects to multiple layers in cortical columns, this seems acceptable/plausible… play with this idea).
  5. Anomaly score is calculated on state-change pools --> high score represents “surprise”, low score represents an accurate prediction.
  6. Send the inverse of the anomaly scores (concatenated from all state pools), so that correctly predictive state pools show up most strongly to the modulator’s HTM node.
  7. Modulator HTM node receives encoding of I.A.S’s, concatenated with which intended goal was being pursued (can be a simple binary encoding of which goal was active or not for this time step)… this allows the modulator’s HTM node to learn the relationship between which state pool was correctly predictive for a given active goal.

Output flow:

  1. (Depending on logic) for goal which is actively being pursued, activate associated state pools.
  2. State pools receive their activation signal, triggering predictive columns to fire down their distal connections to the First Level node.
  3. Connected first level columns activate their distal connections into the output space, generating an output encoding (can optionally have some thresholding on those outputs, so that low-scoring connections don’t fire into the output encoding space)
  4. Modulator receives output encoding, and has option to intervene based on any desired logic, or not.
  5. Encoded output is decoded to raw form, and sent back out to the world.

Here’s an image of the logic flow, with black being the input pass and red being the output pass.

You can imagine that there might be other parts, such as a node for thresholding the output encodings (so that weak outputs might not make it out the gate):

There is room for some flexibility here (such as what info to provide to which level, feedback, etc.), but I think this covers the initial idea I had for an HTM-based state machine for agent-based learning.

I need some time to digest it… Hmm… interesting…


@MaxLee I finally understand most of it, but there are still a few points I didn’t get.

Pardon, I’m not familiar with neuroscience - what is a back-activation? Is it predicting the previous state from the current state?

So we encode the anomaly scores using a scalar encoder, then concatenate them? I guess you meant to encourage the representation of nodes that have high anomaly scores in the modulator?

SP/TM only provides a unidirectional flow of computation. How will it send signals backwards?


“Back-activation” is a term I think I made up to describe activating the columns and their distal connections to produce an “output”. This is related to your third question, which I’ll answer next.

Our traditional use of SP/TM has been in a forward-pass manner: data is encoded to bits, and the minicolumns check their overlap with the encoded bits to find their activation score; the chosen minicolumns then strengthen the overlapped connections, which we’ve been referring to as “distal connections”. My idea is to do this in reverse: have the minicolumns activate their strongly connected bits in a clone of the input space. We would activate the predicted columns, as determined by TM. The bits in the “output encoding” space might be a little noisy, so it might be useful to have an activation score for each bit, so that “weak” bits (which only one column might represent randomly as noise) don’t make it into the final output encoding.
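The reverse pass with per-bit activation scores can be sketched as a simple voting scheme. The connection table and threshold here are illustrative, not an existing HTM implementation:

```python
from collections import Counter

def back_activate(predictive_columns, strong_connections, bit_threshold=2):
    """Run the SP's mapping in reverse: each predictive minicolumn lights up
    the bits it is strongly connected to, in a clone of the input space.
    Bits supported by fewer than `bit_threshold` columns are dropped as noise.
    strong_connections maps column id -> set of input-space bit indices."""
    votes = Counter()
    for col in predictive_columns:
        votes.update(strong_connections.get(col, ()))
    # Only well-supported bits survive into the final output encoding.
    return {bit for bit, n in votes.items() if n >= bit_threshold}
```

A bit that only one column touches (random noise) never reaches the threshold, while bits that many predictive columns agree on form a clean output encoding ready for de-encoding.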

So if we activate the “state pool(s)”, they would look at which minicolumns are currently predictive. The idea is that, given a current situation represented by the First Level IO node, the state pool(s), which remember those patterns that cause the desired change, would know which SDR in the First Level IO would advance their goal. So the pool would take those predictive minicolumns and tell them to reach down their distal connections (into the first SP/TM node). Those columns would then be activated, which would produce our output encoding.

In this manner, the SP/TM combination, instead of only acting as a memory of input patterns (what we’re already familiar with), will also act as an output system.

Correct. The intent is for the modulator to associate a desired goal with the state pool(s). So if we use an inverted anomaly score (where correctly predictive pools are given the higher score), we can encourage the modulator to make this association between state, goal, and which pools are potentially capable of advancing it.
