Handling an extremely unbalanced dataset using SDR + Nearest Neighbor


TL;DR. SDR + Nearest Neighbor is amazing at classifying objects when one class is very underrepresented.

I have been messing with SpatialPooler and SDRClassifier this time around. The task is to detect whether a patient will develop bacteraemia, based on 20 different features from a quick blood test (instead of waiting a week for a culture).

I’ll skip the boring part of processing the data and encoding it into SDRs, and start with the data itself. The data is a CSV file of 17000 entries, where each line consists of blood test results and whether the patient has bacteraemia. Only 217 of the entries are bacteraemia cases, so the class I want to detect is very underrepresented.

One day I tried to perform the classification using HTM and Nearest Neighbor and got 61% recall. Here are my findings.
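The nearest-neighbor step over SDRs can be sketched in plain NumPy: classify a query SDR by the label of the stored SDR with the highest bit overlap. This is only a minimal illustration of the idea, not the actual (NDA’d) pipeline, and it omits the SP encoding step:

```python
import numpy as np

def overlap(a, b):
    """Number of bits active in both binary SDRs."""
    return int(np.sum(a & b))

def nn_classify(query, train_sdrs, train_labels):
    """Assign the label of the training SDR with the highest overlap."""
    overlaps = [overlap(query, s) for s in train_sdrs]
    return train_labels[int(np.argmax(overlaps))]

# Toy example: 16-bit SDRs, two classes.
train_sdrs = [
    np.array([1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0], dtype=np.uint8),
    np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1], dtype=np.uint8),
]
train_labels = ["negative", "positive"]

query = np.array([1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1], dtype=np.uint8)
print(nn_classify(query, train_sdrs, train_labels))  # overlap 2 vs 1 -> "negative"
```

In practice the stored vectors would be the SP’s output columns for each training record, but the overlap-based decision rule is the same.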

  1. Spatial Pooling learns really fast.
    It only takes the SP one epoch over the training data to become effective. Further epochs do not improve the detection rate.

  2. Nearest Neighbor is super effective in such cases.
    I have tried both SDRClassifier and Nearest Neighbor on this task. Somehow SDRClassifier performs poorly due to the underrepresentation, while Nearest Neighbor is not affected as much.

  3. Turn on boosting when training, even if the data is not temporally coherent.
    This was a surprise for me, but some tests show that boosting helps the SP learn even when the data is not temporally coherent. I guess the sheer effect of more neurons activating is powerful enough.

  4. Turn off boosting when inferring.
    This should be obvious. The input data stream is not temporally coherent, so there is really no reason to boost after training the model. And boosting introduces noise anyway…
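For points 3 and 4, the usual exponential boost-factor rule makes the on/off switch concrete. This is a sketch assuming htm.core-style boosting (boost factor = exp(boostStrength × (targetDensity − activeDutyCycle)), so boostStrength = 0 disables it); the actual SP updates these internally:

```python
import math

def boost_factors(active_duty_cycles, target_density, boost_strength):
    """Under-active columns get a factor > 1, over-active columns < 1.
    With boost_strength = 0 every factor is 1.0 (boosting off)."""
    return [math.exp(boost_strength * (target_density - d))
            for d in active_duty_cycles]

duty = [0.00, 0.02, 0.10]  # fraction of recent steps each column was active
train = boost_factors(duty, target_density=0.02, boost_strength=3.0)
infer = boost_factors(duty, target_density=0.02, boost_strength=0.0)

print(train)  # silent column boosted above 1, busy column pushed below 1
print(infer)  # all 1.0 -- boosting off during inference
```

Setting the SP’s boost strength to a positive value for the training pass and to zero afterwards reproduces the recipe in points 3 and 4.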

The model ended up with 63% recall and an AUC of 0.61! Amazing! I have tried other ML algorithms and most of them simply fail. Linear Regression and Bayesian Inference are doomed from the beginning. Neural Networks overfit every time, even with regularization and data re-sampling. Random Forest can reach 65% accuracy. And one of my colleagues applied some crazy statistical techniques and ended up at 70% recall.
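To see why recall (rather than raw accuracy) is the metric that matters here, consider a degenerate classifier that always predicts “no bacteraemia” on this 217-in-17000 split (toy numbers taken from the dataset description above):

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were detected: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 17000 samples, 217 positives, as in the dataset above.
y_true = [1] * 217 + [0] * (17000 - 217)
always_negative = [0] * 17000

print(round(accuracy(y_true, always_negative), 3))  # 0.987 -- looks great
print(recall(y_true, always_negative))              # 0.0 -- useless detector
```

A 98.7%-accurate model can still catch zero positive cases, which is why the comparisons above are in terms of recall and AUC.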

And here is the screenshot of the 0.61 AUC.

Unfortunately, the dataset is under NDA this time, so I can’t provide the dataset and, by extension, the code. I’m only allowed to share the results and the process.

(Also, just a note for Numenta: this work is not done in, nor deployed to, a commercial setting. I simply grabbed the dataset and ran HTM on it in my spare time.)


And one of my colleagues applied some crazy statistical techniques and ended up at 70% recall.

Just out of curiosity, can you remember what technique he used? I’m not a big statistics expert, but it’s always interesting to see what it’s all capable of.


I remember he said that he calculated the variance and distribution of each feature, then redistributed and resampled them. I’ll ask him after New Year.


This is really interesting work, Marty, thank you for sharing it with all of us! @subutai, you might be interested in this.