The surprising effectiveness of a Fly Hash encoder

I sketched an SDR encoder for arbitrary vectors, inspired by the fly hash concept,
with the slight difference that the output SDR is “trained” so that its bits fire with uniform probability on the initial training data set.
I will explain later what that “learning” is.
I used it to encode MNIST digits in SDRs and then tested the resulting SDRs with:

  1. simple classification with SDRClassifier as in the htm.core example
  2. a makeshift SDR Associative Memory (pretty much what mraptor requests here) which I’ll discuss in another topic.

First how FHEncoder works:

There is an input data vector (of floats) which we want to project into an SDR of a certain SIZE (e.g. 2000) with a number of ON bits, further referred to as LENGTH (e.g. 100).
Sparsity is simply LENGTH / SIZE.
There is a parameter called “spread” which is explained below.

Let’s see how it encodes flattened MNIST digits (784-long vectors):

  1. each input point (MNIST pixel) is randomly projected to a fixed number of output points (e.g. 200 out of 2000) with weight 1. All other (2000 − 200 = 1800) output points have weight 0 attached to that input pixel. This value of 200 is the “spread” parameter.
  2. the output points scoring the top LENGTH sums are set to 1 in the encoded SDR; all the rest are set to 0. So it’s a K-Winners-Take-All with K equal to the LENGTH parameter.
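The two steps above can be sketched roughly as follows. This is a minimal sketch of the idea, not the author’s actual code; the names and the use of NumPy are my own assumptions:

```python
import numpy as np

SIZE, LENGTH, SPREAD = 2000, 100, 200   # output size, ON bits, fan-out per input
N_INPUT = 784                           # flattened MNIST digit

rng = np.random.default_rng(42)

# Step 1: fixed random projection. Each input pixel connects with weight 1
# to SPREAD randomly chosen output points; all other weights stay 0.
proj = np.zeros((N_INPUT, SIZE), dtype=np.float32)
for i in range(N_INPUT):
    proj[i, rng.choice(SIZE, size=SPREAD, replace=False)] = 1.0

def fly_hash(x):
    """Encode one float vector as a binary SDR via K-Winners-Take-All."""
    sums = x @ proj                        # summed input for each output point
    winners = np.argsort(sums)[-LENGTH:]   # indices of the top-LENGTH sums
    sdr = np.zeros(SIZE, dtype=np.uint8)
    sdr[winners] = 1                       # K-WTA: winners ON, the rest OFF
    return sdr

digit = rng.random(N_INPUT).astype(np.float32)  # stand-in for a MNIST digit
sdr = fly_hash(digit)
print(sdr.sum(), "/", SIZE)  # 100 / 2000, i.e. 5% sparsity
```

Note the sparsity of the result is fixed by construction: exactly LENGTH of the SIZE bits are ON for every input.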

The above summarizes pretty much the Fly Hash algorithm without learning.
The learning step adds a further 2000 adjustable “output weights” to the results of step 1 above, which act as a kind of “multiplicative bias”. The algorithm increases or decreases the weight attached to each output point in such a way that its output value is ≈1 in 2% of the cases and less in the rest.

So the so-called “learning” for every single output point works as follows:

  • from (e.g.) 10000 training MNIST digits, select its 2% highest output values.
  • calculate the mean of those 2% best results.
  • set the multiplicative weight to the inverse (1/mean) of that value.

So what the “learning” does is adjust the outputs before K-WTA in a manner that each output has a 2% chance to “fire” on a similar dataset as the training one.
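That calibration step can be sketched like this. This is my own reconstruction from the description above; `sums` stands for the pre-K-WTA output sums of step 1 computed over the whole training set:

```python
import numpy as np

def fit_output_weights(sums, top_fraction=0.02):
    """One multiplicative weight per output point: 1 / mean of its top 2% sums.

    sums: (n_samples, SIZE) matrix of pre-K-WTA sums from step 1.
    After weighting, each output point's value reaches ~1 in roughly
    the top 2% of training samples and stays below 1 in the rest.
    """
    n = max(1, int(round(sums.shape[0] * top_fraction)))
    top = np.sort(sums, axis=0)[-n:, :]   # each column's n highest values
    means = top.mean(axis=0)
    means[means == 0] = 1.0               # guard: leave dead outputs untouched
    return 1.0 / means

# Usage: weighted = train_sums * fit_output_weights(train_sums), then K-WTA.
```

Because every weight is positive, the weighting never changes the ranking of samples within one output point; it only rebalances output points against each other before the K-WTA selection.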

That’s why I used quotes around “learning”: it doesn’t actually learn anything. It can be regarded more as a statistical balancing operation which tries to avoid some output neurons firing too often (or all the time) while others seldom fire or never fire at all.

The logic behind this output weight adjustment was simply the idea that a neuron that fires always or never has very little value from a learning perspective.

When I combined the above FHEncoder with an SDR Classifier, I noticed the results on classifying MNIST digits were in the 95% accuracy range, only 0.1-0.4% behind the results when using a trained SP.

In order to provide “equal grounds” for the classifier, the FHEncoder-generated SDRs were adjusted to match the ones output by the Spatial Pooler in:

  • SDR size of 6240
  • sparsity of ~7.7% (*)

(*) Note: I know 7.7% is an odd value. The untrained SP’s initial sparsity is the “canonical” 5%, but after training I found it increased to ~7.7%. I don’t know why this is so or what it means.
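For reference, matching those numbers with the SIZE/LENGTH parameters described earlier would work out to roughly:

```python
SIZE = 6240                       # match the Spatial Pooler's SDR size
SPARSITY = 0.077                  # match the trained SP's observed sparsity
LENGTH = round(SIZE * SPARSITY)   # number of ON bits per SDR
print(LENGTH)  # 480
```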

I managed to put them in a github repository.
The relevant files are:

  • - the encoder itself.
  • - a slightly modified example. The changes are explained in the beginning of the .py file.
  • - the FHEncoder test with same train/test data and matching SDR size / sparsity.
  • mnist_data.npz - a local fast loader for MNIST digits in compressed numpy format. The sklearn online loader (originally used by the htm.core example) was quite slow in my corner of the internet.

sounds like the SP … the difference: the SP connects to ~85% of the inputs and uses boosting to fully use all cells

the keys in my implementation are k-WTA at 2%, boosting, and weights that change their values on an S-curve, not linearly

how does it differ from SP ?

At first glance it seems closer to your Spatial Mapper than to the SP as described by Matt here

Here is what seems to me to differentiate the FHEncoder from the SP:

  • the number of initial connections with the input space is much lower (sparser, if you like) for every output point. In this particular test each output sums the values of ~25 input points. That’s >25x fewer connections. I know these are adjustable in the SP but I doubt it was ever intended to go that sparse.
  • the individual random connections between input and output points never change (there is no learning)
  • there is no individual weight or permanence for every connection.
  • there’s an individual weight for every output point, and it is adjusted not in relationship with other output points but with its own … history. The top 2% I speak about refers to the history of that same single output cell.
  • the number of outputs that actually fire on every evaluation is the same across the whole batch.

So, in summary, it is modeled after the fruit fly’s olfactory circuit rather than the cortical structure assumptions used in HTM.

Now that you mentioned it, I increased the input “spread” from 200 to 900 output points and the accuracy got even closer:
95.65% for the SDR Classifier trained on FHEncoder output vs 95.72% when trained on the Spatial Pooler’s output.
That’s 7 digits more out of the 10000 test digits.

Beware, I updated the github repo, because the example script had the SP learning commented out.

The SDR Classifier trained on the Spatial Pooler with learning set to False yielded a >92.61% match, which is pretty good for what is basically an untrained random projection.

PS: And I don’t know how much of the ~3% accuracy improvement of the trained vs untrained SP comes from the simple fact that the Spatial Pooler produces an SDR 50% denser after training - that’s 50% more data points for the SDRClassifier to crunch on.

PS2: Making the FHEncoder output sparser (from 7.7% to 5% ON bits) reduces classifier accuracy by 0.35%. So the answer to the above question is: only a small fraction of the improvement can be attributed to the SP’s higher output density.


For the digits that ended up in the incorrectly recognised bucket, how much of an overlap was there between approaches? Did the encoding change the characteristics of the accuracy?

If you mean whether they fail at the same digits, I have not checked.
From what I’ve seen with other simple MNIST classifiers, I suspect so. They are mostly influenced by figure pixel overlap, e.g. a skewed figure of “2” overlaps better with a few “5”s than with most other “2”s.
They don’t develop a human’s high-level model of “how digits should be drawn”, in terms of hand motions, which guides our judgement.

Yes, it was to see if there were any significant outliers that meant something different to the rest of the model.

It was more from the perspective of the level of influence a particular encoding method has, indirectly, on the performance of the rest of the model. I realise that this type of modelling is effectively injecting a representation part-way up a hierarchy, which would optimally have more detail below this level.

I see what you mean about the overlap, so is this just a case of the SDR being too shallow (missing a layer or two of hierarchy) to ever be able to get to 100%?