Sample efficiency via autocomplete

Beware: I haven’t tested this yet. I don’t know if it works.

For clarity I’ll describe the technique on MNIST. Let’s say we want to improve the classification scores of an ML model (any model) given a very limited sample size - like only 100 training digits.

Here’s an experiment someone did with various classifiers. E.g. trained on RMNIST/10 (which is 10 training samples for each digit), most of the usual simple classifiers reach ~74% accuracy - which is low, but quite high considering how few data points informed the model.

The algorithm I propose works like this:

  1. Pick one prototype from the reduced training set.
  2. Split it into an arbitrary number of small sub-samples. E.g. if an image of the digit “2” has 120 active pixels, pick a random 20 of them for each sub-sample. Let’s say we pick 100 such random sub-samples of that one training digit, named ss_00 to ss_99.
  3. Train a sequential model, like a Temporal Memory (TM), to predict this arbitrary, cyclic sequence:
    ss_00 → ss_01 → … → ss_98 → ss_99 → ss_00
  4. Do the same with all (e.g. 100) prototypes, which means we train the TM with 100 looping sequences - one for each prototype.
  5. Of course we also have one of the above classifiers trained with only 100 digits, which quickly overfits - that’s the cheapest part of the algorithm.
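Steps 1-4 can be sketched like this - a minimal version where a prototype is just its set of active pixel indices, and the cyclic sequence is materialised as (current, next) training pairs for the sequence predictor (the function name and parameters are my own, not from any library):

```python
import random

def make_cyclic_subsamples(active_pixels, n_samples=100, k=20, seed=0):
    """Split one prototype (its set of active pixels) into a cyclic
    sequence of random k-pixel sub-samples: ss_00 -> ss_01 -> ... -> ss_00."""
    rng = random.Random(seed)
    pixels = sorted(active_pixels)
    subs = [frozenset(rng.sample(pixels, k)) for _ in range(n_samples)]
    # training pairs for the sequence predictor: (current, next), wrapping around
    pairs = [(subs[i], subs[(i + 1) % n_samples]) for i in range(n_samples)]
    return subs, pairs

# a fake "digit" prototype with 120 active pixels (indices are arbitrary)
proto = set(range(120))
subs, pairs = make_cyclic_subsamples(proto)
```

In a real setup you would repeat this per prototype and feed each pair list to the TM as one looping sequence.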

Now we end up with a TM that, when fed any arbitrary 20 pixels and asked to predict the next 20 pixels, then the next, and so on, can safely be assumed to eventually converge towards one of the sequences it was trained with. So at inference time we run the TM for a number of cycles (let’s say 200), take the last 100 inference steps, reconstruct an image from the most frequently predicted pixels, and feed that image (which supposedly converged towards one of the prototypes) to our vanilla classifier (from pt. 5 above).
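Here is a sketch of that inference loop, with a tiny lookup table standing in for the trained TM (the `predict_next` interface and all parameter names are my assumptions, not an actual TM API):

```python
from collections import Counter

def run_inference(predict_next, seed_pixels, n_steps=200, keep_last=100,
                  n_active=120):
    """Iterate the sequence predictor from a seed sub-sample, then
    reconstruct an image from the pixels most frequently predicted
    during the last `keep_last` steps."""
    current = frozenset(seed_pixels)
    votes = Counter()
    for step in range(n_steps):
        current = predict_next(current)
        if step >= n_steps - keep_last:
            votes.update(current)
    # keep the n_active most frequently voted pixels as the reconstruction
    return {p for p, _ in votes.most_common(n_active)}

# toy stand-in for the TM: a memorised 3-state cycle over tiny "sub-samples"
cycle = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 0})]
nxt = {cycle[i]: cycle[(i + 1) % 3] for i in range(3)}
top = run_inference(nxt.__getitem__, cycle[0], n_steps=6, keep_last=3,
                    n_active=3)
```

A real TM would also map unseen sub-samples onto trained ones; the lookup table only illustrates the run-then-vote reconstruction.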

So in order to guess an unknown query digit image we do the same:

  • Pick a random sub-sample of 20 pixels from the image. Feed it to the TM. Classify the result.
  • Rinse and repeat a few times - different sub-samples from the query digit might converge towards different prototypes at each run.

But the assumption here is that most of the query trials would converge towards a few prototypes that share significant similarities with the query digit.
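The repeated-trials step amounts to a majority vote over the classifier’s outputs. A sketch, with all three callbacks stubbed out (their names and signatures are hypothetical):

```python
from collections import Counter

def classify_by_trials(query_pixels, sample_fn, converge_fn, classifier,
                       n_trials=10):
    """Run several random sub-sample trials on one query digit and take a
    majority vote over the classifier's answers."""
    labels = []
    for t in range(n_trials):
        seed = sample_fn(query_pixels, t)   # random 20-pixel sub-sample
        image = converge_fn(seed)           # TM convergence (stubbed here)
        labels.append(classifier(image))
    vote, _ = Counter(labels).most_common(1)[0]
    return vote

# stubs: 7 of 10 trials converge to a prototype classified as "2", 3 as "3"
label = classify_by_trials(None,
                           sample_fn=lambda q, t: t,
                           converge_fn=lambda s: s,
                           classifier=lambda img: 2 if img < 7 else 3)
```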


Ok, the above is a rough outline. A variation would be that instead of having one large-ish random cycle for each prototype image, we train the sequence predictor on several smaller cycles (e.g. 20 cycles of 20 samples each) derived from the same prototype.
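That variation is a small change to the sequence construction - many short loops per prototype instead of one long one (again, names and parameters are made up for illustration):

```python
import random

def make_small_cycles(active_pixels, n_cycles=20, cycle_len=20, k=20, seed=0):
    """Derive several short cyclic sub-sample sequences from one prototype,
    each expressed as (current, next) training pairs that wrap around."""
    rng = random.Random(seed)
    pixels = sorted(active_pixels)
    cycles = []
    for _ in range(n_cycles):
        subs = [frozenset(rng.sample(pixels, k)) for _ in range(cycle_len)]
        pairs = [(subs[i], subs[(i + 1) % cycle_len]) for i in range(cycle_len)]
        cycles.append(pairs)
    return cycles

cycles = make_small_cycles(set(range(120)))
```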

The takeaway here is that we can expand a very limited dataset into an arbitrarily large dataset of self-predictive loops, each converging towards a single prototype. Adding a bit of noise (e.g. 5 random pixels besides the 20) to every input sample at every step might also help robustness.
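The noise injection is a one-liner on top of the sub-sampling - something along these lines (a sketch; the function is hypothetical):

```python
import random

def noisy_subsample(subsample, all_pixels, n_noise=5, seed=0):
    """Add a few random off-prototype pixels to a sub-sample at training
    time, so the learned transitions tolerate noisy inputs."""
    rng = random.Random(seed)
    candidates = sorted(set(all_pixels) - set(subsample))
    noise = rng.sample(candidates, min(n_noise, len(candidates)))
    return frozenset(subsample) | frozenset(noise)

ss = frozenset(range(20))                 # a 20-pixel sub-sample
noisy = noisy_subsample(ss, range(120))   # 20 + 5 noise pixels
```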


tbh, I believe that sample efficiency comes from the ability to grasp the “spatial” structure of the input stream and untangle the invariances in it.

that’s what data augmentation tries to do, and I think biological networks have some weird trickery to perform it too.

for example, imagine your input field is a 10x10 grid of pixels and your goal is to classify 1px-thick horizontal and vertical lines that can appear anywhere in the field.

having one single sample gives you no information about the other samples unless you already know what the spatial arrangement of the pixels is and know that shifting all pixels over to the left preserves all semantic content.
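That shift invariance is easy to make concrete on the 10x10 line example (a toy sketch; the classifier just checks for a fully-on row or column):

```python
def shift_left(grid):
    """Shift a 10x10 binary grid one column to the left, wrapping around."""
    return [row[1:] + row[:1] for row in grid]

def classify_line(grid):
    """'h' if some full row is on, 'v' if some full column is on."""
    if any(all(row) for row in grid):
        return 'h'
    if any(all(grid[r][c] for r in range(10)) for c in range(10)):
        return 'v'
    return None

# a vertical line at column 5 stays a vertical line after a shift
v = [[1 if c == 5 else 0 for c in range(10)] for r in range(10)]
```

The point is that `classify_line(v)` and `classify_line(shift_left(v))` agree, but a learner only discovers that for free if it already knows the grid topology.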


Yes, it’s part of it - the article I linked shows a simple CNN outperforming classical ML methods and an MLP.

However, I picked MNIST only as an example, not intending to limit this method to visual recognition problems.

One way to make it more spatially aware (for the MNIST case at least) would be to use small patches instead of random pixels.
A K-means (or other clustering algorithm) can easily “segment” an MNIST digit into small adjacent patches, e.g. 10 patches per image.
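A minimal version of that segmentation step, clustering a digit’s active pixel coordinates with a tiny hand-rolled Lloyd’s k-means (sklearn’s `KMeans` would do the same job; this sketch just avoids the dependency):

```python
import random

def kmeans_patches(active_coords, k=10, iters=20, seed=0):
    """Cluster a digit's active (x, y) pixel coordinates into k roughly
    adjacent patches using plain Lloyd's k-means iterations."""
    rng = random.Random(seed)
    centers = rng.sample(active_coords, k)
    for _ in range(iters):
        patches = [[] for _ in range(k)]
        for (x, y) in active_coords:
            # assign each pixel to its nearest center (squared distance)
            i = min(range(k), key=lambda j: (x - centers[j][0]) ** 2
                                            + (y - centers[j][1]) ** 2)
            patches[i].append((x, y))
        # move each center to its patch's centroid (keep it if patch is empty)
        centers = [(sum(p[0] for p in pts) / len(pts),
                    sum(p[1] for p in pts) / len(pts)) if pts else centers[j]
                   for j, pts in enumerate(patches)]
    return patches

# toy "digit": a 6x6 blob of active pixels split into 4 patches
coords = [(x, y) for x in range(6) for y in range(6)]
patches = kmeans_patches(coords, k=4)
```

Each patch could then be SDR-encoded (position + shape) and fed to the sequence predictor as one step of the per-image sequence.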

If each patch’s shape and position could be encoded with a space-aware algorithm (like this one), then we can train the sequential model with a series of SDR-encoded patches belonging to the same image.

In general any “thing” can be defined by the set of its “parts”.
I just assume here the thing can be simply described as a long “random walk” between its parts.

PS (a few more thoughts)

  • there are maybe cleverer ways to navigate the “thing” than random walking between its parts. I don’t know - maybe our saccades have a few tricks for picking the next movement of the eye/fovea.
  • one quality of the algorithm above is that it can be trivially parallelised - different threads can run their own walks on the same query image using the same sequence predictor.
  • there might be a general principle here: living intelligences convert any seemingly static problem into a dynamic, sequential one. Maybe we always navigate our inner representations of the world; both the fact that our awareness can’t really stop on a thing/thought and the fact that transformers are pretty general suggest that “sequentiation” is a powerful strategy.