Brittleness vs sample efficiency

TL;DR: a lot of AI research is geared more towards “guaranteed convergence” (aka “make sure the model solves the task”) than towards sample efficiency.

In some cases, much more sample- or compute-efficient solutions can be found if the requirement of guaranteed convergence is relaxed.

And I would make the case that the brittle yet efficient solution is the correct path to follow.


I ran lots of tests with ValueMap agents on the CartPole RL environment, aiming for low computational overhead, simplicity, and sample efficiency.

After solving it literally thousands of times, it appears there are two distinct “ways” to solve it:

  1. faster but brittle
  2. slower but reliable.

For context, OpenAI considers the problem solved when an agent achieves an average reward of 475 (out of a maximum of 500) points over 100 consecutive episodes.
Formulated this way, no agent can solve the problem in fewer than 100 episodes.

That benchmark is a bit misleading, since it doesn’t count how many episodes the agent failed to reach the 500-step mark.
E.g. solving the problem in 130 episodes vs. 115 could mean twice as many failed episodes, even though it appears to be only ~13% worse.

Since the average threshold is 475, in theory an agent could “solve” the problem with a 100% failure rate, if in each of the last 100 episodes it scored less than 500 but the average stayed above 475.

So a more rigorous benchmark for sample efficiency would be to count how many failed episodes (= fewer than 500 points) the agent had to suffer before solving OpenAI’s specified task.
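
To make both criteria concrete, here is a minimal sketch (the helper names are my own, not code from the repo) that checks OpenAI’s solved condition and counts failed episodes, given a plain list of per-episode rewards:

```python
import numpy as np

def solved_at(episode_rewards, window=100, threshold=475.0):
    """Return the episode count at which CartPole-v1 counts as solved
    (mean reward over the last `window` episodes >= `threshold`), else None."""
    r = np.asarray(episode_rewards, dtype=float)
    for end in range(window, len(r) + 1):
        if r[end - window:end].mean() >= threshold:
            return end
    return None

def failures_before_solved(episode_rewards, max_reward=500.0):
    """Count episodes that ended early (reward < max_reward) up to the solve point."""
    end = solved_at(episode_rewards)
    if end is None:
        return None
    return sum(1 for r in episode_rewards[:end] if r < max_reward)
```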

The “robust” variants I tested could reliably solve the CartPole environment in between 102 and 300 episodes, with an average of 123. Out of these, typically 26-27 episodes were failures.

The brittle & efficient agents can solve the same problem with 5 or fewer failures in 50% of tests, and with only 2 failures in 5% of tests.

Yes, in a significant number of cases the CartPole problem can be solved with only 2 failures. Which means the agent goes:
“whoops, I failed, let’s fix this… whoops, I failed again… aha, that’s it!”

It is also important to mention that the two variants above do not differ in the learning algorithm itself, but in the SDR size (and implicitly the ValueMap size) and in how the observation parameters are encoded into an SDR. It’s not only the resolution (SDR size and sparsity) that matters but also how the encoding itself is done.
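
To make “encoding observation parameters into an SDR” concrete, here is a minimal bucket-style encoder sketch. It is not my compounding encoder; the ranges, sizes and scheme are made up for illustration, but it shows how CartPole’s 4 floats can be turned into a sparse binary vector:

```python
import numpy as np

def encode_observation(obs, bins_per_var=16, bits_per_var=4,
                       lows=(-2.4, -3.0, -0.21, -3.0),
                       highs=(2.4, 3.0, 0.21, 3.0)):
    """Encode CartPole's 4 floats as one SDR: each variable lights up a small
    contiguous run of bits inside its own section of the output vector."""
    sdr = np.zeros(4 * bins_per_var, dtype=np.uint8)
    for i, (x, lo, hi) in enumerate(zip(obs, lows, highs)):
        frac = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
        start = int(frac * (bins_per_var - bits_per_var))
        sdr[i * bins_per_var + start : i * bins_per_var + start + bits_per_var] = 1
    return sdr
```

Changing bins_per_var and bits_per_var changes the resolution; changing the scheme itself (e.g. overlapping buckets, random projections, combining pairs of variables) is what I mean by “how the encoding itself is done”.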

Just a note: what I call “slow” above is still significantly faster (in both sample efficiency and computing speed) than all deep learning agents I could find references for, which hints at the possibility that there are ingredients in DL that make it intrinsically sample inefficient.


The efficiency above comes at the cost of some brittleness, which means that in under 1% of tests the agent simply could not solve the environment even in 1000 episodes (1000 was the limit I set before giving up).


And here are some arguments for why we should favor sample efficiency even when it involves a certain amount of brittleness.

  1. Evolutionary consideration: if out of 1000 offspring of cockroach A, 100 are able to learn very fast how to escape predation, while all 1000 babies of cockroach B take a few more trials, then A might have a better chance of having successful descendants, even if 900 of its kids go straight into suicide mode.

  2. Numbers: 100 brittle brains massed inside the same skull could use voting to overcome brittleness while preserving sample efficiency (a minimal voting sketch follows below).

  3. The best solutions to most problems tend to also be among the simplest, at least in the majority of cases.

Whoever reaches the lowest-hanging fruit first gets a head start. There’s a survival competition out there.
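
For point 2, a minimal voting sketch; the `.act(obs)` interface is assumed for illustration, it is not the API of my repo:

```python
import numpy as np

def vote_action(agents, observation):
    """Ensemble of brittle agents: each proposes an action, the majority wins."""
    votes = [agent.act(observation) for agent in agents]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```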


What do you mean by sample efficiency? Does this mean the fewest number of training runs? Reduced failure rate? “Smart” sampling of the training space?


Sorry, in this case, yes: how many times the cartpole drops before the agent figures out how not to drop it for 500 consecutive steps, again and again, until the environment is “solved”.

According to OpenAI’s “rules”, CartPole-v1 is considered solved when the average reward over the past 100 episodes exceeds 475, the maximum reward per episode being 500.

e.g. an arbitrary sample from an arbitrary paper:

> A training step consists of training the network with one batch.

Here’s the last of my sample of 100 runs:

Ep.   1:  20 steps  
Ep.   2:   9 steps  
Ep.   3:  13 steps  
Ep.   4:  69 steps  
Ep.   5: 184 steps  
Ep.   6: 156 steps  
Ep.   7:  90 steps  
Ep.   8: 210 steps  
Ep.   9:   9 steps   
Ep. 103: 500 steps (*)   
Environment solved in 103 episodes, 47760 steps, 9 failures

(*) consecutive 500-timestep “successes” are overlapped on one line to make the output nicer.
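
For reference, a hypothetical training loop that produces that kind of summary. The agent’s `.act()`/`.learn()` interface is assumed for illustration, it is not the actual API from my repo:

```python
import gymnasium as gym  # or `import gym` for older installs

def run_session(agent, max_episodes=1000, max_steps=500):
    """Train `agent` on CartPole-v1 until solved, tracking steps and failures."""
    env = gym.make("CartPole-v1")
    rewards, total_steps, failures = [], 0, 0
    for episode in range(1, max_episodes + 1):
        obs, _ = env.reset()
        ep_reward = 0
        while True:
            action = agent.act(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            agent.learn(obs, reward, terminated)
            ep_reward += reward
            total_steps += 1
            if terminated or truncated:
                break
        rewards.append(ep_reward)
        failures += ep_reward < max_steps  # pole dropped before 500 steps
        if len(rewards) >= 100 and sum(rewards[-100:]) / 100 >= 475:
            print(f"Environment solved in {episode} episodes, "
                  f"{total_steps} steps, {failures} failures")
            break
    return rewards
```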


The image I linked above is a bit confusing: what they call a “training step” is not an environment time step but a whole attempt to balance the cart until it drops or stays up for 500 consecutive time steps.

Here are some of my stats over 100 training sessions, all successful in the sense that each time a fresh blank agent is trained until it solves the environment:

[ins] In [32]: r_rounds
Out[32]: 
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101,
       101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102,
       102, 102, 102, 102, 102, 103, 103, 104, 104, 104, 107, 107, 107,
       108, 109, 109, 110, 111, 112, 118, 128, 133])

The longest session needed 133 episodes; the majority of sessions solved it in the minimum of 100 episodes.

And here’s the sorted number of failures for each training session (== how many times the cartpole dropped before the environment was solved):

[ins] In [33]: r_fails
Out[33]: 
array([ 2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,
        5,  5,  5,  5,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,  9,  9,
        9, 10, 12, 12, 12, 13, 13, 14, 16, 16, 18, 18, 23, 26, 33])

[ins] In [34]: r_fails.mean()
Out[34]: 6.5

On average it dropped about 7 times (mean 6.5) before learning how to keep it upright.

PS: the result above is not “brittle” because I didn’t re-instantiate the compounding encoder, only the agent.
Brittleness means the agent fails to solve the environment even after 1000 episodes; it seems to happen when an unfortunate random state of the compounding encoder overlaps with an unfortunate sequence of the environment’s initial conditions.
Then it messes things up.


Sample efficiency can also be viewed as the number of data points needed to solve the problem.
In CartPole’s case this means the number of environment observations, which with this encoder is sometimes lower than 30.

Here’s the updated code, with both “robust” and “brittle” solutions: GitHub - Blimpyway/CartPoleChallenge: Balancing the cartpole. Fast

I just noticed that your code uses no neurons, no dendrites, no spikes… only an SDR and some computations.

Do you know of some code that uses neurons + dendrites to solve gym environments (CartPole, LunarLander…)?

I found this example from OgmaNeo, but it does not use neurons either

Nope. All ANNs I know of, HTM included, are computations based on more or less accurate biological assumptions.

Even my code can be seen as a one-hidden-layer ANN, with every node in the hidden layer performing AND on a pair of bits of the input SDR, and a second, trainable layer (the hidden->output one) learning a “better choice”.
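
A minimal sketch of that reading, with a made-up delta-rule update standing in for the actual ValueMap learning step:

```python
import numpy as np
from itertools import combinations

def pairwise_and_layer(sdr_bits):
    """'Hidden layer': one unit per pair of input SDR bit positions,
    active only when both bits of its pair are 1 (an AND of the pair)."""
    n = len(sdr_bits)
    return np.array([sdr_bits[i] & sdr_bits[j]
                     for i, j in combinations(range(n), 2)], dtype=np.uint8)

def action_values(hidden, weights):
    """Hidden->output layer: each action's value is the sum of the weights
    of the currently active pair-units.  weights: (num_pairs, num_actions)."""
    return hidden @ weights

def delta_update(weights, hidden, action, target, lr=0.1):
    """Hypothetical update: nudge the active pair-units' weights for the
    chosen action toward the target value."""
    active = hidden.astype(bool)
    weights[active, action] += lr * (target - weights[active, action])
```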

All learners share the same purpose: to seek, emphasize/remember, and exploit correlations between various input data points.
