NAB: faster optimization and scoring


#1

Hey folks! I’m still plugging away at improvements to my HTM implementation. In the process, I’ve been taking a look at NAB! I’ve had a lot of fun digging into the code, particularly the scorer.

Short version

I think that I’ve found a way to optimize and score a complete corpus (58 files of anomaly scores) in about 12 seconds on a single core, rather than the 10s of minutes with multiple cores I was getting with the built-in hill-climbing algorithm.

The basic algorithm is:

  • score every anomaly score as it if were labeled as a positive prediction
  • sort the whole data set by anomaly_score descending
  • starting with an out-of-bounds threshold (e.g. 1.10), iterate through the sorted list, updating scores, and looking for changes in anomaly score.

This results in a list of scores by threshold. From there, it’s trivial to just find the max score and associated threshold.

My code is still a work-in-progress, but you can see it here. It does not yet write out results the same way NAB does, but it can at least generate all of the necessary data elements.


Longer, graphical explanation!

Start with one of the benchmark data sets (exchange-2_cpm_results.csv, in this case).

This is the input time series. Purple is the probationary period. The two anomaly windows are shown in orange and green. The fundamental idea of anomaly detection is that a detector streams in this data and emits a series of anomaly predictions (below).

Each vertical bar here is an anomaly score. Basically, the closer to 1.0, the more strongly the detector believes something anomalous has happened. We pick a threshold value (about 0.95 in this picture – see the horizontal black bar). Any anomaly score over that threshold is considered to mean ‘anomaly detected’.

Thus, an ideal anomaly detector would emit a score above the threshold early in each anomaly window and also never surpass the threshold outside of an anomaly window.

(The values in purple are in the ‘probationary period’, where no scoring occurs, to give the detector time to adjust to the value stream).

So, based on the image above, it looks like we have 2 blue bars over the threshold (false positives), 1 orange bar (true positive for the first window), and then 2 or 3 green bars (true positives for the second window).


To know how well a detector is doing, the NAB folks devised a nifty scoring mechanism. Basically:

  • anomaly predictions closer to the start of an anomaly window score the highest
  • scores drop over time the further into an anomaly window a prediction is located
  • only the best (earliest) score for a given window counts.
  • predictions far from any anomaly window detract the most
  • predictions close to the right edge of an anomaly window detract a bit less than scores further away
  • no scoring during the probationary period.

These score are further weighted by a ‘profile’ which shifts the preference for true positives vs. false positives vs. false negatives. (from here on, I use the ‘standard profile’ from NAB)

Visually, this scoring system looks something like:
(ignore the render artifacts at the bottom right)

So, each point in time has both an anomaly score and a weighted NAB score. Now, we take all of the data points and sort them by anomaly score rather than by timestamp.


As you can see, there are a handful of records with anomaly scores over 0.75. (I’ve already filtered out all the records with anomaly scores of 0 as well as all of the probationary rows).

If we slide the threshold down towards 0.0, you can see how more and more records will turn from inactive to active anomaly predictions. Let’s zoom in to the first 20 records:

So, given our current threshold, only 6 records count as ‘active’ anomaly predictions at the moment:

  • 2 blue bars, which represent ‘out of window’ false positives
  • 1 orange bar, which represents the only true positive for the orange window
  • 3 green bars, which are the true positives for the green window

If we keep the same bars, but show their computed ‘weighted NAB scores’, we see:

In order to totally score this file at this threshold, all we need are these 6 bars:

  • each blue bar drops the score by 0.11
  • the red bar, as the only bar in the orange window, adds its score to the total
  • the first green bar, as the highest scoring bar, adds its score to the total

The smaller two green bars do not contribute to the total score. Similarly, if there was a window with no active predictions, that window would subtract the ‘false negative weight’ from the total score.

Now, because the records are in anomaly order (descending), we can just iterate over the remaining records, detecting when we drop down to a new threshold level. The math works out such that it is simple to keep the counts of true positive, true negatives, etc, up to date as we perform this sweep.

We end up with a total score for every distinct threshold in the file. If we start our threshold sweep above the maximum allowed, we can also check whether it would be better to ignore this detector entirely.


Here is the total NAB score by threshold value for this one file. At the far right, we see that a threshold of 1.10 results in a score of -(2 * false_negative_weight), because both windows count as undetected at this threshold. From there, we jump to our maximum possible score, which then nosedives as the threshold diminishes. (I clipped the image at -8, otherwise the long drop in score ended up dominating the image).


A few notes

  • The detector used here was NAB’s default numenta detector with anomaly likelihood turned off
  • The method generalizes from one file to the whole corpus easily.
  • Using the NAB-generated numenta_standard_scores.csv, I compared the various values (TP, FP, etc) to those generated by my script. Numbers match identically at this point.
  • I did notice that NAB seemed to pick less-than-optimal thresholds when approaching the limit of 1.0. As such, I believe that the method I’ve described does find optimal thresholds more accurately than NAB. However, this may not matter too much when likelihood is turned on, as anomaly scores tend to be further from 1.0 at that point.

So, with all of that said: did I miss anything? Any questions or things you want to me explain better?


#2

Quick update with some benchmarking numbers:

| Detector | Method | Score | Threshold | Total seconds |
|----------|--------|-------|-----------|---------------|
| numenta  | NAB    | 10.77 | 0.951     | 3722.14       |
| numenta  | sweep  | 15.83 | 1.000     | 4.84          |
| zhtm     | NAB    | 25.35 | 0.090     | 3662.13       |
| zhtm     | sweep  | 25.35 | 0.093     | 4.26          |
  • detector: either the NAB numenta detector with likelihood turned off or my custom zhtm detector
  • method: NAB is default NAB optimizer + scorer, sweep is my alternate method
  • score: the unnormalized score
  • threshold: the optimized threshold
  • total seconds: total time spent both optimizing and scoring the entire NAB corpus

ZHTM - score vs. threshold

This is the score vs. threshold plot for my ZHTM anomaly detector. The optimized thresholds for both NAB and my sweep method are shown as vertical lines. Note that despite the thresholds being slightly different, they are both in the same threshold ‘window’, and thus produce the same score.


Numenta (no likelihood) - score vs. threshold

This is the score vs. threshold plot for the Numenta detector (with likelihood turned off). Again, both the NAB threshold and the sweep threshold are displayed. In this case, the max score appears to be at threshold 1.0. However, the NAB detector, with it’s “approach but never reach” hill-climbing algorithm, doesn’t quite reach the max threshold.


Conclusion

Overall, I’ve learned a ton about both the temporal memory algorithm and the NAB framework, and had a blast doing it. I’m super grateful to the Numenta folks for releasing their code, so if you’re reading this, thank you!

My end-of-the-year is a little hectic, but I’m planning to get back to my own development at the start of the new year. I’ve got a major refactor to merge, documentation to update, code to cleanup, etc.


#3

This is interesting work, thank you for trying to improve NAB! How hard would it be to create a pull request with the required changes? How much code change is involved to update NAB?


#4

I’d be happy to try and make a pull request. I think the primary targets for code change would be getScore() and optimizeThreshold(), with most of the work on the optimizer side.

I don’t think a PR will be too hard, to be honest. It’ll take me a little time to make sure I’m following the existing style, as well as to validate or replace any affected tests, but that’s to be expected, I think. I’ll see what I can put together in the next week or two.


#5

@Balladeer I’m intrested in running your experments. I have successfully setup the enviroment and ran the tests in the ZHTM repository. Yet I don’t find a way to run the hot-gym example. How do I run it?


#6

Thanks for the interest! The code is bit of a mess; I’ll take a look at that bit of code again as soon as I can. I’ll be back with more comprehensive instructions, probably tomorrow evening (EST).


#7

I’ve added some documentation on running the hot gym visualization, you can find it here. In short:

  • get the example CSV from nupic
  • run the bokeh server: bokeh serve --show examples/hot_gym.py

I’ve also included a snapshot of what you should see when the bokeh server is running the visualization correctly.

And just fyi, the code for running the NAB scoring in this topic is currently in a separate branch; I’ll try to get that merged to the develop branch in the next few days.


#8

For those following along, I’ve created PR 328 to incorporate this algorithm for use in optimizing and scoring detectors.


#9

@Balladeer well done! That will be a pleasure to work with, never having to wait for twiddle to climb hills again will make us all move along much faster I bet :slight_smile: I wish this PR had been submitted and merged to master about 30 days ago, I think it would have saved me many hours which would not have been spent listening to cooling fans whirring.

However I am sure those that come hereafter will have little appreciation for hours of time that this single PR will have saved them, unfortunately. So hopefully these small token of kudos here are suffice. Seeing as I am currently listening to the twiddle soundtrack at this very moment, like a white noise earworm, this is very relevant to me at the moment :slight_smile:


#10

Hey! Glad to hear this work could save you some time :slight_smile:! My initial goal with exploring NAB was just to understand it better; that I (hopefully) came across a way to improve it was a pleasant surprise.


#11

Yes thanks a stack! @Balladeer I used your optimizer, scorer and sweeper from https://github.com/numenta/NAB/pull/328 and now my testing runs on my new detector have gone from 33 mins to 7 mins :slight_smile: And no more twiddle soundtrack :slight_smile: And I too have compared scores and yes I agree they are same and the current optimizer and scorer. Great effort.