Is there any real world public dataset of anomaly detection example?

Hello everyone!
After some serious searching online I could not find any real-world example of using HTM for anomaly detection.
By “real world dataset” I mean a standard, public, well tested, annotated dataset - for example this.
By “example” I mean a .py or notebook that shows how to take the data, preprocess it, configure the SP and TM, train, and get at least decent results.
I am aware that there is the example in the git repo, but it seems that running the same pipeline and configs on real-world datasets just doesn’t work properly.
I would be extremely appreciative if someone could share a link to such an example, or even better a .py/JN of one.
BTW I am using HTM.Core since I want Python 3.


Hey @Yoni_Cohen,

Welcome! Here’s NAB for the win:

Hi @sheiser1 thank you for your answer!

I am aware of NAB, but I’m trying to find an example of using HTM for anomaly detection on datasets outside of NAB.
The reason I’m asking is that I tried to use HTM for anomaly detection on several public datasets (for example Dodgers) and I could not get any reasonable results.
I’m using the HTM.Core implementation and I tried both the raw tm.anomaly and the anomaly score, neither of which worked well (the prediction ability, however, is pretty great). I used the TM and SP parameters from the hotgym example.
This makes me think that I might be missing the basic approach to using HTM as an out-of-the-box model for AD, so I wanted an example of how to take a simple dataset and use HTM for AD.
Thank you again

Hey @Yoni_Cohen ,

I bet the encoding parameters are the problem.

In my experience there are 2 main causes of bad performance:

  1. Encoders not suited to the data

  2. Sampling frequency too high
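
To illustrate the second cause, here’s a minimal sketch (plain NumPy, not htm.core) of reducing the sample rate by block-averaging, so a repeating pattern spans fewer time steps and the TM can learn it from less data. The block size of 100 is just an illustrative choice:

```python
import numpy as np

def downsample(values, factor):
    """Reduce the sample rate by averaging non-overlapping blocks of `factor` samples."""
    n = (len(values) // factor) * factor  # drop the ragged tail
    return values[:n].reshape(-1, factor).mean(axis=1)

# A slow sine pattern sampled far too densely: 10,000 steps per cycle.
raw = np.sin(np.linspace(0, 2 * np.pi, 10_000))

# Averaging blocks of 100 keeps the shape but shortens the sequence 100x,
# so the TM sees the pattern repeat much sooner.
coarse = downsample(raw, 100)
print(len(coarse))  # 100 steps per cycle instead of 10,000
```
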

As for a worked example, a most reasonable request, tho I’m not aware of one for htm.core.

I’d suggest:

  1. Check for bad encoders – Plot histograms of all input feature values to inform the encoding parameters.

  2. Check for bad sample rate – Plot out your data over time, to see how many time steps it takes for the patterns to repeat. The longer, noisier & more complex the sequences are, the more data it will take for the TM to learn them well. If they are long, try reducing the sample rate somehow to make the patterns show themselves faster.

  3. Re-run and check anomaly scores over time – Use a Scalar encoder (rather than the RDSE). The RDSE is more flexible, but to me Scalar is more intuitive to control since you can set the min & max. Try setting the min & max based on the feature distributions from step 1; they should be such that most of the feature values fall between them.
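
Steps 1 and 3 can be sketched like so (plain NumPy, no htm.core calls): histogram the feature, then pick the encoder min & max from low/high percentiles so that outliers don’t stretch the range, and clip inputs into that range before encoding. The 1st/99th percentile cutoffs are an illustrative assumption, not an htm.core default:

```python
import numpy as np

def choose_encoder_range(values, lo_pct=1, hi_pct=99):
    """Pick Scalar-encoder min & max from the feature distribution,
    ignoring the extreme tails so outliers don't stretch the range."""
    lo, hi = np.percentile(values, [lo_pct, hi_pct])
    return float(lo), float(hi)

# Mostly values around 0-100, plus a couple of wild outliers.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.uniform(0, 100, 5000), [10_000.0, -5_000.0]])

# Step 1: histogram the feature to eyeball its distribution.
counts, edges = np.histogram(feature, bins=20)

# Step 3: set min & max so most values fall between them.
minval, maxval = choose_encoder_range(feature)

# Clip inputs into the encoder's range before feeding them to the encoder.
clipped = np.clip(feature, minval, maxval)
print(minval, maxval)
```

The same percentile trick works per-feature if you’re encoding several inputs; whatever encoder you end up with, the point is that its range should come from the data’s actual distribution, not from the hotgym defaults.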

Hope this helps some!