I am trying to run the HTM anomaly detection algorithm from the NuPIC library in Python. I have datasets spanning 22 days that consist of telemetry information about devices. I found an HTM implementation for the hotgym dataset online, but I am unable to decide which parameters to tweak in order to fit my data.
Additionally, is the Anomaly Score the same as the Anomaly Likelihood? If not, is there an example (sample implementation) of anomaly likelihood that I could refer to?
Params I am running anomaly detection with:
'anomalyCacheRecords': None,
'autoDetectThreshold': None,
'autoDetectWaitRecords': 2184
Usually the anomaly score refers to the raw metric, as reported by the temporal-memory code.
The anomaly likelihood function takes the raw anomaly scores and does some statistics on them to determine where the anomalies actually are. It fits a normal distribution to the anomaly scores and measures how far into the upper tail of that distribution each new anomaly score falls.
One great thing about HTM is that there is usually no need to tweak hyperparams! I would leave those alone for now, because in my experience the relevant tweaking is around:
Which feature(s) to include in the model
What the encoder param values should be (min/max for the Scalar encoder & resolution for the RDSE)
What the anomaly likelihood window sizes should be.
The anomaly likelihood comes from comparing a smaller recent window of anomaly scores to a larger window spanning further back. The more the distribution of recent anomaly scores deviates from the larger distribution, the higher the anomaly likelihood, because the system’s predictability appears to be changing.
So higher anomaly likelihoods signify recent changes in predictability, whether getting more or less predictable. The anomaly score however is just a measure of how predictable one time step was (0 being perfectly predictable and 1 perfectly unpredictable).
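To make the score vs. likelihood distinction concrete, here is a minimal sketch of the idea described above: fit a normal distribution to a long window of raw scores, then ask how far into its upper tail the mean of a short recent window falls. This is not the NuPIC or htm.core code. The class name `LikelihoodSketch`, the window sizes, and the `min_sigma` floor are all assumptions made for illustration, and this version is one-sided (it only reacts when recent scores are higher than usual).

```python
import math
from collections import deque
from statistics import mean, pstdev


def normal_tail(x, mu, sigma):
    """P(X >= x) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


class LikelihoodSketch:
    """Hypothetical helper for illustration; not the library implementation."""

    def __init__(self, long_window=400, short_window=10, min_sigma=1e-3):
        self.history = deque(maxlen=long_window)   # long window of raw scores
        self.recent = deque(maxlen=short_window)   # short, recent window
        self.min_sigma = min_sigma                 # avoids division by zero

    def update(self, raw_score):
        """Add one raw anomaly score (0..1) and return a likelihood in 0..1."""
        self.history.append(raw_score)
        self.recent.append(raw_score)
        # Too little data to estimate a distribution: stay neutral.
        if len(self.history) < 2 * self.recent.maxlen:
            return 0.5
        mu = mean(self.history)
        sigma = max(pstdev(self.history), self.min_sigma)
        # Small tail probability means the recent window is unusually high.
        return 1.0 - normal_tail(mean(self.recent), mu, sigma)
```

Feeding it a long run of low raw scores keeps the output near 0.5, and a burst of high raw scores pushes it toward 1. A single noisy time step barely moves it, which is the practical reason to threshold on the likelihood rather than on the raw score.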
If you could show your current implementation and maybe a small snippet of data that would help too.
I re-implemented the anomaly likelihood class for htm-core because the original nupic implementation was a mess. My new version is much shorter and easier to comprehend.
Anyway, it’s on the htm.core GitHub repo, on a branch named anomaly_likelihood_rewrite
I kind of abandoned this work though…
The original nupic implementation contained a number of workarounds (hacks) to compensate for known issues with htm / nupic. The score on NAB went down when I tried this re-implementation, and I think it’s because I removed those hacks.
If you were using either nupic’s or htm.core’s anomaly likelihood code, then try the file I just posted.
You will want to use an automatic parameter search.
This will only help if your program fundamentally works. It won’t fix bugs.
Doing a parameter search manually, with pen and paper, can be highly informative for learning about what the parameters do, but time consuming.
Therefore, you will need a program to search automatically in order to find the best parameters.
Something to be aware of: there is an element of randomness in HTMs. So when you measure their performance, you won’t get one single score; rather, there is a distribution of scores that you’re sampling from.
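A rough sketch of what an automatic search that accounts for this randomness could look like: a plain random search where every candidate is scored over several seeds, so you compare means (and see the spread) instead of single runs. Everything here is an assumption for illustration. `run_model` is a hypothetical stand-in that you would replace with code that trains your HTM model with the given seed and returns a score, and the `resolution` parameter and its range are made up.

```python
import random
from statistics import mean, stdev


def run_model(params, seed):
    """Hypothetical stand-in: train a model with `params` and `seed`, return a score.

    Here it is a toy function that peaks at resolution == 0.5, plus seed noise.
    """
    rng = random.Random(seed)
    noise = rng.gauss(0.0, 0.01)
    return -(params["resolution"] - 0.5) ** 2 + noise


def score_distribution(params, seeds):
    """Run the same parameters with several seeds; return (mean, stdev) of scores."""
    scores = [run_model(params, s) for s in seeds]
    return mean(scores), stdev(scores)


def random_search(space, trials=100, seeds=range(5), search_seed=0):
    """Sample parameters uniformly from `space` and keep the best mean score."""
    rng = random.Random(search_seed)
    best = None
    for _ in range(trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        avg, spread = score_distribution(params, seeds)
        if best is None or avg > best[1]:
            best = (params, avg, spread)
    return best


best_params, best_mean, best_spread = random_search({"resolution": (0.0, 1.0)})
```

Using the same seeds for every candidate keeps the comparison fair. If two candidates differ by less than the spread, the difference is probably noise and not a real improvement.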
There are ways to measure the quality of an SDR, so you can check whether each piece of the system is working. If you are using htm.core, then look for the class htm.Metrics
The Metrics class will measure an SDR and print out the following table of info about the SDR:
The Activation Frequency is measured for each bit of the SDR. Use this to see if some bits are stuck off (active-freq == 0) or stuck on (active-freq == 1)
The entropy is the binary entropy of the activation frequencies, normalized into the range [0, 1], where 1 is the maximum possible value.
A higher entropy means that the bits of the SDR are being utilized more equally.
The overlap is the fraction of 1’s which stay the same between consecutive assignments to the SDR.
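If it helps to see what those three statistics mean, here is a small pure-Python sketch that computes them from a list of SDRs, each given as a set of active bit indices. This is not the htm.core Metrics class. The function names are hypothetical, and two details are assumptions on my part: entropy is normalized by the binary entropy of the mean activation frequency, and overlap is divided by the larger of the two active-bit counts.

```python
import math


def activation_frequency(sdrs, size):
    """Fraction of time steps in which each of the `size` bits was active."""
    counts = [0] * size
    for active in sdrs:
        for bit in active:
            counts[bit] += 1
    return [c / len(sdrs) for c in counts]


def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def normalized_entropy(freqs):
    """Mean per-bit entropy divided by the best case for the same mean frequency."""
    best_case = binary_entropy(sum(freqs) / len(freqs))
    if best_case == 0.0:
        return 0.0
    return sum(binary_entropy(p) for p in freqs) / len(freqs) / best_case


def overlaps(sdrs):
    """Fraction of active bits shared by each pair of consecutive SDRs."""
    result = []
    for prev, cur in zip(sdrs, sdrs[1:]):
        denom = max(len(prev), len(cur))
        result.append(len(prev & cur) / denom if denom else 0.0)
    return result


sdrs = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]
freqs = activation_frequency(sdrs, size=4)
stuck_off = [i for i, f in enumerate(freqs) if f == 0.0]
stuck_on = [i for i, f in enumerate(freqs) if f == 1.0]
```

In this toy input every bit is active half the time, so nothing is stuck and the normalized entropy is at its maximum.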
That model params file shows a Scalar encoder with min/max = 0/100. I would definitely recommend checking your data against those – like a histogram of the data w/vertical lines at 0 & 100.
By default I set the min/max using percentiles of a data sample (maybe 1st/99th or 5th/95th depending on the distribution). You want the bulk of your data comfortably between the min & max, so hopefully the distribution is normal-ish or uniform-ish (to make that easy).
The Scalar encoder sees all values above the max as the max, and all below the min as the min. So if your data falls outside the bounds you’re losing tons of information on the way from raw data to SDR.
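A quick way to check this on your own data, sketched with the standard library only: pick the bounds from percentiles of a sample, then count how much of the data would be clipped by a given min/max. The helper names are hypothetical, and `clip` only imitates the saturation behaviour described above; it is not the encoder itself.

```python
from statistics import quantiles


def percentile_bounds(sample, lower_pct=1, upper_pct=99):
    """Return (min, max) candidates taken from percentiles of the sample."""
    cuts = quantiles(sample, n=100)  # 99 cut points; cuts[k - 1] is the k-th percentile
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def clip(value, minval, maxval):
    """What a saturating encoder effectively sees for an out-of-range value."""
    return max(minval, min(maxval, value))


def fraction_clipped(sample, minval, maxval):
    """Share of the sample that falls outside [minval, maxval]."""
    outside = sum(1 for v in sample if v < minval or v > maxval)
    return outside / len(sample)


sample = list(range(1000))              # placeholder data; use your telemetry values
lo, hi = percentile_bounds(sample)      # 1st / 99th percentile
lost = fraction_clipped(sample, lo, hi)
lost_with_defaults = fraction_clipped(sample, 0, 100)  # the 0/100 from the params file
```

If `lost_with_defaults` comes out large for your data, the 0/100 bounds are throwing away most of the signal before the model ever sees it.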
There is some description of the TM parameters here at the top where Temporal Memory class is defined:
How much do the timestamp params impact the performance of the model?
i.e., the hotgym example specifies weekdays and weekends in the model_params. I was wondering if that impacts the efficiency drastically?
For the hotgym example the timestamps are important. The hotgym has events which happen at the same time every day/week and I think that some of the anomalies are when the gym opens early/late. Without supplying the timestamps it would be very difficult to detect such anomalies.
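For a feel of what the timestamp contributes, here is a small sketch that derives the kind of features such a model could be given alongside the value: time of day and a weekend flag. The function and field names are made up for illustration and are not the hotgym model_params keys. Whether these help on device telemetry depends on whether your data actually has daily or weekly patterns.

```python
from datetime import datetime


def time_features(ts):
    """Derive simple calendar features from a timestamp."""
    return {
        "time_of_day": ts.hour + ts.minute / 60.0,  # hours since midnight
        "day_of_week": ts.weekday(),                # Monday == 0 ... Sunday == 6
        "is_weekend": ts.weekday() >= 5,            # Saturday or Sunday
    }


saturday_morning = time_features(datetime(2010, 7, 3, 6, 30))
monday_morning = time_features(datetime(2010, 7, 5, 9, 0))
```

With these, a spike at 06:30 on a Saturday and the same spike at 09:00 on a Monday become different inputs, which is what lets the model learn that one is routine and the other is not.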