TLDR lots of AI research is geared more towards “guaranteed convergence” aka “make sure the model solves the task” than towards sample efficiency.

In some cases the (much) more sample or compute efficient solutions can be found if the requirement of guaranteed convergence is relaxed.

And I would make the case that the brittle yet efficient solution would be the correct path to follow.

I did lots of tests with ValueMap agents for CartPole environment RL problem, aiming at speed both in terms of computational overhead, simplicity and sample efficiency.

After literally thousands of times solving it, it appears there are two important distinctive “ways” to solve it:

- faster but brittle
- slower and reliably.

For context, OpenAI’s consider the problem **solved** when an agent achieves an average reward of 475 (out of max 500) points in 100 consecutive runs.

Formulated this way, no agent can solve the problem in less than 100 runs.

That benchmark is a bit confusing since it doesn’t count how many episodes the agent failed to reach the 500 steps mark to finish an episode.

e.g. a trial in which the problem is solved in 130 runs vs 115 could mean twice as many failed runs, even it appears it is only 14% “better”

Since their average threshold is 475, In theory an agent can “solve” the problem with 100% failure rate if in *every* last 100 trials it reached a reward less than 500, but higher than 475 *on average*

So a more rigorous benchmark for sample efficiency would be to consider how many failed trials (= less than 500 points per episode) the agent had to suffer before solving the OpenAI’s specified task.

The “robust” variants I tested could reliably solve the cartpole environment between 102 and 300 episodes and an average of 123. The failure rate being 26-27 failed episodes out of these.

The brittle & efficient agents can solve the same problem with 5 or less failures in 50% of tests, with a failure rate of 2 in 5% of tests.

Yes in some significant amount of cases the cartpole problem can be solved with only 2 failures.

Which means the agent goes:

“woops I failed, let’s fix this… woops, I failed again… aaa, that’s it!”.

It is also important to mention is the two variants above do not differ in the learning algorithm itself, but in both SDR size (and implicitly ValueMap size) and how the observation parameters are encoded in a SDR. It’s not only the resolution (SDR size and sparsity) but also how the encoding itself is done.

Just a notice, what I deem as “slow” above it is significantly faster - in both sample efficiency and computing speed) - than all deep learning agents I could find references about, which hints to the possibility that there are ingredients in DL that makes it intrinsically sample inefficient.

The cost of efficiency above comes with some brittleness which means that in under 1% of tests the agent simply could not solve the environment even in 1000 trials. (1000 was the limit I set before giving up).