Brittleness vs sample efficiency

TLDR lots of AI research is geared more towards “guaranteed convergence” aka “make sure the model solves the task” than towards sample efficiency.

In some cases the (much) more sample or compute efficient solutions can be found if the requirement of guaranteed convergence is relaxed.

And I would make the case that the brittle yet efficient solution would be the correct path to follow.

I did lots of tests with ValueMap agents for CartPole environment RL problem, aiming at speed both in terms of computational overhead, simplicity and sample efficiency.

After literally thousands of times solving it, it appears there are two important distinctive “ways” to solve it:

  1. faster but brittle
  2. slower and reliably.

For context, OpenAI’s consider the problem solved when an agent achieves an average reward of 475 (out of max 500) points in 100 consecutive runs.
Formulated this way, no agent can solve the problem in less than 100 runs.

That benchmark is a bit confusing since it doesn’t count how many episodes the agent failed to reach the 500 steps mark to finish an episode.
e.g. a trial in which the problem is solved in 130 runs vs 115 could mean twice as many failed runs, even it appears it is only 14% “better”

Since their average threshold is 475, In theory an agent can “solve” the problem with 100% failure rate if in every last 100 trials it reached a reward less than 500, but higher than 475 on average

So a more rigorous benchmark for sample efficiency would be to consider how many failed trials (= less than 500 points per episode) the agent had to suffer before solving the OpenAI’s specified task.

The “robust” variants I tested could reliably solve the cartpole environment between 102 and 300 episodes and an average of 123. The failure rate being 26-27 failed episodes out of these.

The brittle & efficient agents can solve the same problem with 5 or less failures in 50% of tests, with a failure rate of 2 in 5% of tests.

Yes in some significant amount of cases the cartpole problem can be solved with only 2 failures.
Which means the agent goes:
“woops I failed, let’s fix this… woops, I failed again… aaa, that’s it!”.

It is also important to mention is the two variants above do not differ in the learning algorithm itself, but in both SDR size (and implicitly ValueMap size) and how the observation parameters are encoded in a SDR. It’s not only the resolution (SDR size and sparsity) but also how the encoding itself is done.

Just a notice, what I deem as “slow” above it is significantly faster - in both sample efficiency and computing speed) - than all deep learning agents I could find references about, which hints to the possibility that there are ingredients in DL that makes it intrinsically sample inefficient.

The cost of efficiency above comes with some brittleness which means that in under 1% of tests the agent simply could not solve the environment even in 1000 trials. (1000 was the limit I set before giving up).


And here are arguments on why we should favor sample efficiency even when it involves a certain amount of brittleness.

  1. Evolutionary consideration - if out of 1000 offsprings of cockroach A, 100 are able to learn very fast how to escape predation, while all 1000 babies of cockroach B takes a few more trials, then A might have a better chance of having successful inheritors, even if 900 of their kids get straight into suicide mode.

  2. Numbers: 100 brittle brains massed inside the same skull could use voting to overcome brittleness while preserving sample efficiency.

  3. Best solutions to most problems tends to also be among the simplest. At least in the majority of cases.

Whoever reaches first to the lowest hanging fruits has a better head start. There-s a survival competition out there.


What do you mean by sample efficiency? Does this mean the fewest number of training runs? Reduced failure rate? “Smart” sampling of the training space?


Sorry, in this case - yes, how many times the cartpole drops before the agent figures out how not to drop it for 500 consecutive steps, again and again till the environment is “solved”.

According to OpenAI’s “rules” the CartPole-v1 is considered solved when the average reward over past 100 episodes exceeds 475. Maximum reward per episode being 500

e.g. an arbitrary sample from an arbitrary paper

A training step consists of training the
network with one batch.

Here-s last of my sample out of 100 runs:

Ep.   1:  20 steps  
Ep.   2:   9 steps  
Ep.   3:  13 steps  
Ep.   4:  69 steps  
Ep.   5: 184 steps  
Ep.   6: 156 steps  
Ep.   7:  90 steps  
Ep.   8: 210 steps  
Ep.   9:   9 steps   
Ep. 103: 500 steps (*)   
Environment solved in 103 episodes, 47760 steps, 9 failures

(*) consecutive 500 timesteps “successes” overlap to make output nicer.

1 Like

The photo I linked above is a bit confusing what they call “Training step” is not an environment time step but a whole attempt to balance the cart till it drops or it stays up for 500 consecutive time steps.

Here are some of my stats, over 100 training sessions, all successful in the sense a new blank agent is trained till it solves the environment, for 100 times:

[ins] In [32]: r_rounds
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101,
       101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102,
       102, 102, 102, 102, 102, 103, 103, 104, 104, 104, 107, 107, 107,
       108, 109, 109, 110, 111, 112, 118, 128, 133])

The longest one needed 133 trials, majority of trials solved it in the minimum number of 100

And here-s the sorted number of failures for each training session (== how many times the cartpole dropped before the session was solved)

[ins] In [33]: r_fails
array([ 2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,
        5,  5,  5,  5,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,  9,  9,
        9, 10, 12, 12, 12, 13, 13, 14, 16, 16, 18, 18, 23, 26, 33])

[ins] In [34]: r_fails.mean()
Out[34]: 6.5

On average it dropped 7 times before learning how to keep it upright.

PS the result above is not “brittle” because I didn’t reinstantiated the compounding encoder, only the agent.
Brittleness means that agent fails to solve the environment even after 1000 episodes, it seems that it happens when an unfortunate random state of the compounding encoder overlaps with an unfortunate sequence of environment’s initial conditions.
Then it messes it up.


Sample efficiency can also be viewed as number of datapoints needed to solve the problem.
In cartpole case, this means the number of environment observations and with this encoder sometimes is lower than 30

Here-s the updated code - with both “robust” and “brittle” solutions. GitHub - Blimpyway/CartPoleChallenge: Balancing the cartpole. Fast

I just perceived that your code use no neurons, no dendrites, no spikes… only an SDR and some computations

Do you know some code that uses neurons + dendrites to solve gym environments? (CartPole, LunarLander…)

I found this example from OgmaNeo, but it does not use neurons either

Nope. All ANN I know of, HTM included, are some computing based on more or less accurate biological assumptions.

Even my code can be seen as a one hidden layer ANN with every node on the hidden layer performing AND on every pair of bits of the input SDR , and a second trainable layer (hidden->output one) to learn a “better choice”.

All learners have the purpose to seek, emphasize/remember and exploit correlations between various input data points.


This almost sounds like a job for a highly-modified NEAT algorithm! LOL


I haven’t got a chance to try out this idea on practical (even toy) MachineLearning problems (my problems at hand haven’t reached such a stage for it), but I think it’s straight forward:

Just germinate multiple learning models, keep and use ones those trustworthy.

Traditional ensemble learning does this half way, models are statically arranged, they don’t compete for rights-to-speak or survival, and usually only differ in parameters (i.e. all models of same formulation).

Novel gradients of AI in this idea is about how “trustworthy” is implemented, it can be rather complex for appropriate weight an individual’s judgement get adopted, and even conditions to eliminate a range of models dynamically.

Diverse population based models may appear cheating from an AI framework’s perspective, though a typical ANN with deep layers would more or less foster such competitions among emergent patterns, by providing fixed number of nodes & weighting-edges. I would like to add principled designs in the models of learner, or maybe that’s “meta” models coordinating how simpler models work together.

First things to try may be using learning models with entropy measurements in facing new input patterns, like the least entropy the most trustworthy.

I’ve never gotten started, unfortunately.


The varying results in this particular case are not caused by the model itself (both brittle-fast-learner / reliable-slower-learner only have different input encodings.

What a genetic optimizer could do in this case is to explore various input encodings.

Regarding ensemble learning - here a couple observations:

  1. in certain cases (using the same input encoder) multiple ValueMap evaluators, trained separately,on different data points do not need to be evaluated individually and averaged as in normal ensembles. They all can be merged in a single one by simple overlap (addition) of their parameters. This is an unlikely option in the case of normal NN where each weight function/meaning is influenced by the specific datapoints used in training.

  2. An option to expand parameter space (number of evaluators) with little penalty on performance is to use SDR based routing in a preliminary layer.
    This simply means e.g. for 1000 evaluators, encode each input as 30/1000 SDR and use only the ensemble of corresponding 30 evaluators (out of 1000) for learning/inference for each data point. Not sure if this is clearer but it’s the same idea.


Reminds me of CapsNet which emphasizes routing (among other as well concerns), and a quick search I see an even greater improvement done:

Efficient-CapsNet: capsule network with self-attention routing | Scientific Reports.

we prove that the proposed architecture is still able to achieve state-of-the-art results on three different datasets with only 2% of the original CapsNet parameters.

The 2% rate sounds comparable to SDR at work, at least it means proper routing can save that much from dumb redundant resource consumption.

1 Like