Handling extremely unbalanced dataset using SDR + Nearest Neighbor

marty1885 · December 23, 2018, 2:02pm

TL;DR. SDR + Nearest Neighbor is amazing at classifying objects when one class is very underrepresented.

I have been messing with SpatialPooler and SDRClassifier again. This time the task is to detect whether a patient will develop bacteraemia, based on 20 different features from a quick blood test (instead of waiting a week for a culture).

I’ll skip the boring part of processing the data and encoding it into SDRs, and start with the data itself. The data is a CSV file of 17000 entries, where each line consists of blood test results and whether the patient has bacteraemia. Only 217 of the entries are bacteraemia cases (about 1.3%), so the class I want to detect is very underrepresented.

One day I tried to perform the classification using HTM and Nearest Neighbor and got 61% recall. Here are my findings.

SpatialPooling learns really fast.
It takes the SP only 1 epoch over the training data to become effective. Further epochs do not improve the detection rate.
Nearest Neighbor is super effective in such cases
I have tried both SDRClassifier and Nearest Neighbor for this task. Somehow SDRClassifier performs poorly due to the underrepresentation, while Nearest Neighbor is not affected as much.
Turn on boosting when training even if the data is not temporally coherent
This was a surprise to me. But some tests show that boosting helps the SP learn even if the data is not temporally coherent. I guess the sheer effect of more neurons activating is powerful enough.
Turn off boosting when inferencing
This should be obvious. The input data stream is not temporally coherent, so there is really no reason to boost after training the model. And boosting introduces noise anyway…
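
The post does not include code, so the following is only a minimal sketch of what Nearest Neighbor classification over SDRs can look like. It assumes each SDR is stored as the set of its active bit indices and that similarity is the overlap count. The function names and the toy data are hypothetical and do not come from the original experiment.

```python
from typing import List, Set, Tuple


def overlap(a: Set[int], b: Set[int]) -> int:
    """Number of active bits shared by two SDRs."""
    return len(a & b)


def nearest_neighbor(query: Set[int], memory: List[Tuple[Set[int], int]]) -> int:
    """Return the label of the stored SDR with the largest overlap with `query`."""
    _, best_label = max(memory, key=lambda item: overlap(query, item[0]))
    return best_label


# Toy memory (made-up SDRs): several negatives (label 0), one positive (label 1).
memory = [
    ({1, 2, 3, 4}, 0),
    ({2, 3, 4, 5}, 0),
    ({1, 3, 4, 6}, 0),
    ({10, 11, 12, 13}, 1),
]

print(nearest_neighbor({10, 11, 12, 20}, memory))  # shares 3 bits with the positive: 1
print(nearest_neighbor({1, 2, 3, 9}, memory))      # shares 3 bits with a negative: 0
```

In this sketch a single stored positive is enough to claim a nearby query, no matter how many negatives are stored, which may be part of why this kind of classifier is less sensitive to class imbalance.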

The model ended up with a 63% recall and an AUC of 0.61! Amazing! I have tried other ML algorithms and most of them simply fail. Linear Regression and Bayesian Inference are doomed from the beginning. Neural Networks overfit every time, even with regularization and data re-sampling. Random Forest can be 65% accurate. And one of my colleagues applied some crazy statistical techniques and ended up at 70% recall.
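
As a rough illustration of why recall and AUC are the useful numbers at this class balance, here is a small sketch using the 217-in-17000 split from the post. The helper functions are hypothetical, written from the standard definitions, and are not the code used to produce the results above.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of true positives that were detected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Same class balance as the post: 217 positives out of 17000 entries.
y_true = [1] * 217 + [0] * (17000 - 217)
always_negative = [0] * 17000  # a "classifier" that never predicts bacteraemia

print(round(accuracy(y_true, always_negative), 4))  # 0.9872
print(recall(y_true, always_negative))              # 0.0

# Tiny made-up example for AUC: 3 of the 4 positive/negative pairs are ranked correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))      # 0.75
```

A model that never predicts the rare class looks almost perfect on accuracy while detecting nothing, so accuracy alone says little here.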

And here is the screenshot of the 0.61 AUC.

Unfortunately, the dataset is under NDA this time, so I can’t provide the dataset or, by extension, the code. I’m only allowed to share the results and the process.

(Also, just a note for Numenta: this work was not done or deployed in a commercial setting. I simply grabbed the dataset and ran HTM on it in my spare time.)

Tachion · December 28, 2018, 12:55pm

And one of my colleagues applied some crazy statistical techniques and ended up at 70% recall.

Just out of curiosity, do you remember what technique he used? I’m not a big statistics expert, but it’s always interesting to see what it’s all capable of.

marty1885 · December 28, 2018, 3:11pm

I remember he said that he calculated the variance and distribution of each feature, then redistributed and resampled them. I’ll ask him after New Year.
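
The description above is too brief to reconstruct the actual method. Purely as one possible reading, and assuming (this is not stated in the thread) that each feature is treated as an independent normal distribution, oversampling the rare class could be sketched like this. The function name and the sample values are made up for illustration.

```python
import random
import statistics
from typing import List


def oversample_gaussian(rows: List[List[float]], n_new: int, seed: int = 0) -> List[List[float]]:
    """Draw n_new synthetic rows, sampling each feature independently from a
    normal distribution fitted to that feature's mean and standard deviation.
    ASSUMPTION: independent Gaussian features; the thread does not confirm this."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(n_new)]


# Made-up minority-class rows with 2 features each.
minority = [[5.1, 120.0], [4.8, 131.0], [5.5, 118.0]]
synthetic = oversample_gaussian(minority, n_new=10)
print(len(synthetic), len(synthetic[0]))  # 10 2
```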

rhyolight · December 31, 2018, 3:41pm

This is really interesting work, Marty, thank you for sharing it with all of us! @subutai, you might be interested in this.
