I think that I’ve found a way to optimize and score a complete corpus (58 files of anomaly scores) in about 12 seconds on a single core, rather than the 10s of minutes with multiple cores I was getting with the built-in hill-climbing algorithm.
The basic algorithm is:
- score every anomaly score as it if were labeled as a positive prediction
- sort the whole data set by
- starting with an out-of-bounds threshold (e.g. 1.10), iterate through the sorted list, updating scores, and looking for changes in anomaly score.
This results in a list of scores by threshold. From there, it’s trivial to just find the max score and associated threshold.
My code is still a work-in-progress, but you can see it here. It does not yet write out results the same way NAB does, but it can at least generate all of the necessary data elements.
Longer, graphical explanation!
Start with one of the benchmark data sets (
exchange-2_cpm_results.csv, in this case).
This is the input time series. Purple is the probationary period. The two anomaly windows are shown in orange and green. The fundamental idea of anomaly detection is that a detector streams in this data and emits a series of anomaly predictions (below).
Each vertical bar here is an anomaly score. Basically, the closer to
1.0, the more strongly the detector believes something anomalous has happened. We pick a threshold value (about 0.95 in this picture – see the horizontal black bar). Any anomaly score over that threshold is considered to mean ‘anomaly detected’.
Thus, an ideal anomaly detector would emit a score above the threshold early in each anomaly window and also never surpass the threshold outside of an anomaly window.
(The values in purple are in the ‘probationary period’, where no scoring occurs, to give the detector time to adjust to the value stream).
So, based on the image above, it looks like we have 2 blue bars over the threshold (false positives), 1 orange bar (true positive for the first window), and then 2 or 3 green bars (true positives for the second window).
To know how well a detector is doing, the NAB folks devised a nifty scoring mechanism. Basically:
- anomaly predictions closer to the start of an anomaly window score the highest
- scores drop over time the further into an anomaly window a prediction is located
- only the best (earliest) score for a given window counts.
- predictions far from any anomaly window detract the most
- predictions close to the right edge of an anomaly window detract a bit less than scores further away
- no scoring during the probationary period.
These score are further weighted by a ‘profile’ which shifts the preference for true positives vs. false positives vs. false negatives. (from here on, I use the ‘standard profile’ from NAB)
Visually, this scoring system looks something like:
(ignore the render artifacts at the bottom right)
So, each point in time has both an anomaly score and a weighted NAB score. Now, we take all of the data points and sort them by anomaly score rather than by timestamp.
As you can see, there are a handful of records with anomaly scores over 0.75. (I’ve already filtered out all the records with anomaly scores of 0 as well as all of the probationary rows).
If we slide the threshold down towards 0.0, you can see how more and more records will turn from inactive to active anomaly predictions. Let’s zoom in to the first 20 records:
So, given our current threshold, only 6 records count as ‘active’ anomaly predictions at the moment:
- 2 blue bars, which represent ‘out of window’ false positives
- 1 orange bar, which represents the only true positive for the orange window
- 3 green bars, which are the true positives for the green window
If we keep the same bars, but show their computed ‘weighted NAB scores’, we see:
In order to totally score this file at this threshold, all we need are these 6 bars:
- each blue bar drops the score by 0.11
- the red bar, as the only bar in the orange window, adds its score to the total
- the first green bar, as the highest scoring bar, adds its score to the total
The smaller two green bars do not contribute to the total score. Similarly, if there was a window with no active predictions, that window would subtract the ‘false negative weight’ from the total score.
Now, because the records are in anomaly order (descending), we can just iterate over the remaining records, detecting when we drop down to a new threshold level. The math works out such that it is simple to keep the counts of true positive, true negatives, etc, up to date as we perform this sweep.
We end up with a total score for every distinct threshold in the file. If we start our threshold sweep above the maximum allowed, we can also check whether it would be better to ignore this detector entirely.
Here is the total NAB score by threshold value for this one file. At the far right, we see that a threshold of 1.10 results in a score of
-(2 * false_negative_weight), because both windows count as undetected at this threshold. From there, we jump to our maximum possible score, which then nosedives as the threshold diminishes. (I clipped the image at
-8, otherwise the long drop in score ended up dominating the image).
A few notes
- The detector used here was NAB’s default
numentadetector with anomaly likelihood turned off
- The method generalizes from one file to the whole corpus easily.
- Using the NAB-generated
numenta_standard_scores.csv, I compared the various values (TP, FP, etc) to those generated by my script. Numbers match identically at this point.
- I did notice that NAB seemed to pick less-than-optimal thresholds when approaching the limit of 1.0. As such, I believe that the method I’ve described does find optimal thresholds more accurately than NAB. However, this may not matter too much when likelihood is turned on, as anomaly scores tend to be further from 1.0 at that point.
So, with all of that said: did I miss anything? Any questions or things you want to me explain better?