Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress

Well - this is certainly alarming!

Numenta is one of the systems considered in this paper.

Time series anomaly detection has been a perennially important topic in data science, with papers dating back to the 1950s. However, in recent years there has been an explosion of interest in this topic, much of it driven by the success of deep learning in other domains and for other time series tasks. Most of these papers test on one or more of a handful of popular benchmark datasets, created by Yahoo, Numenta, NASA, etc. In this work we make a surprising claim. The majority of the individual exemplars in these datasets suffer from one or more of four flaws. Because of these four flaws, we believe that many published comparisons of anomaly detection algorithms may be unreliable, and more importantly, much of the apparent progress in recent years may be illusionary. In addition to demonstrating these claims, with this paper we introduce the UCR Time Series Anomaly Datasets. We believe that this resource will perform a similar role as the UCR Time Series Classification Archive, by providing the community with a benchmark that allows meaningful comparisons between approaches and a meaningful gauge of overall progress.


I would not worry too much about this article. Although the authors do have some valid criticism of existing anomaly benchmarks, they also have some arguments that fail to impress.

All of the benchmark datasets appear to have mislabeled data, both false positives and false negatives. Of course, it seems presumptuous of us to make that claim …

There is an additional issue […] many of the anomalies appear towards the end of the test datasets. […] It is easy to see why this could be true. Many real-world systems are run-to-failure, so in many cases, there is no data to the right of the last anomaly. However, it is also easy to see why this could be a problem […] A naïve algorithm that simply labels the last point as an anomaly has an excellent chance of being correct.
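The naïve baseline described in this excerpt is easy to make concrete. The sketch below (my own illustration, not code from the paper) shows a "detector" that simply assigns its highest anomaly score to the last point of the test series; on run-to-failure data where the anomaly sits at the end, it is often scored as correct.

```python
import numpy as np

def naive_last_point_detector(test_series):
    """Naive baseline: give the final point the highest anomaly score.

    If real anomalies cluster near the end of the test data (as in
    run-to-failure recordings), this trivial rule has an excellent
    chance of being scored as a correct detection.
    """
    scores = np.zeros(len(test_series))
    scores[-1] = 1.0  # flag only the last index
    return scores

# Hypothetical example: a smooth signal with an anomalous final value.
series = np.concatenate([np.sin(np.linspace(0, 20, 200)), [5.0]])
scores = naive_last_point_detector(series)
print(int(np.argmax(scores)))  # predicted anomaly location: the last index
```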

The authors then introduce their own dataset which fixes these flaws.

I watched the video (posted below), talked with the author (over email), looked at the taxicab dataset, and I think I was too harsh in my initial assessment of this paper.

He argues that anomaly detection is so ill-posed that it is difficult to objectively measure. And then shows some real examples of this issue.


Video talk from the author:

It is good to have people worrying about this aspect of benchmarking. That said, the one-line-of-code test does not seem valid in the context of Numenta’s benchmark (I don’t know about other benchmarks), because the same one-line test should be run on all 58 different time series, not on a single one. Any one time series may have anomalies detectable with a simple line of code, but I very much doubt that a single line would work well across the entire benchmark of 58 time series; that is the real difficulty of the benchmark.
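The distinction above can be made concrete: score the *same* one-line rule on every series in a benchmark, rather than tuning a rule per series. The sketch below is my own illustration (the benchmark format and the particular one-line rule, largest absolute first difference, are assumptions for the example, not taken from NAB or the paper).

```python
import numpy as np

def one_line_detector(ts):
    # One example of a "one line" rule: flag the point with the largest
    # absolute jump from its predecessor.
    return int(np.argmax(np.abs(np.diff(ts)))) + 1

def evaluate_across_benchmark(benchmark, window=50):
    """Run the SAME one-line rule on every series in the benchmark.

    benchmark: list of (series, true_anomaly_index) pairs. A prediction
    counts as a hit if it falls within +/- window of the true index.
    A rule that happens to work on one series must still be checked
    against all of them.
    """
    hits = 0
    for series, true_idx in benchmark:
        pred = one_line_detector(series)
        if abs(pred - true_idx) <= window:
            hits += 1
    return hits / len(benchmark)

# Hypothetical synthetic benchmark: noise with one injected spike each.
rng = np.random.default_rng(0)
benchmark = []
for k in range(10):
    s = rng.normal(0.0, 1.0, 500)
    idx = 100 + 30 * k
    s[idx] += 15.0  # large spike at a known location
    benchmark.append((s, idx))

print(evaluate_across_benchmark(benchmark, window=5))
```

On easy synthetic spikes a diff-based one-liner scores well everywhere; the commenter's point is that real benchmark series are diverse enough that no single such rule should survive all 58 of them.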

The claim that an algorithm that guesses anomalies toward the end will do very well is perhaps true in the Numenta benchmark, but the anomalies are scored within windows, so simply guessing toward the end is unlikely to work well. Perhaps a detection algorithm could be biased to favor anomalies toward the end.

@dmac did the author share more detailed thoughts on the performance of Numenta’s algorithm and has he run Numenta’s benchmark with other detectors?

The benchmark suite the author provides seems set up to work well for deep learning - no need for online learning, perhaps. It seems Numenta could/should integrate those time series into their benchmark suite - but maybe there is little interest in that at Numenta now?

It looks like they finally published their datasets and provided some interesting supplemental material to read along with it. I don’t know why it was so hard to find, but here is the link to an 85MB zipped download.

This includes the presentation slides to the video shown above.

I will attest from personal experience that measuring TSAD performance is a really hard problem, since the space of possible anomalous signals is unbounded. The best you can do is build a model of normal signals and reliably flag when the signal is no longer “normal”.

If you already know what anomalous signals look like, then it’s a more straightforward classification problem with plenty of data to train on. In most cases, however, your anomalous signals will never have been seen before, and no data exists to train on them before they are encountered in the real world. So the best you can hope for is to create a sufficiently versatile benchmarking dataset that covers all the possible anomalous characteristics you can imagine.
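The "model only the normal signal" approach mentioned above can be sketched very simply. This is my own minimal illustration (a rolling-mean z-score rule, not anything from the paper or NAB): it learns nothing about anomalies, only about recent normal behavior, and flags points that deviate too far from it.

```python
import numpy as np

def residual_zscore_detector(ts, window=25, threshold=4.0):
    """Flag points that deviate from a rolling model of 'normal'.

    For each point, compare it to the mean and std of the preceding
    `window` points; flag it when the z-score exceeds `threshold`.
    No anomalous training examples are needed - only a notion of
    what normal recently looked like.
    """
    flags = np.zeros(len(ts), dtype=bool)
    for i in range(window, len(ts)):
        ref = ts[i - window:i]
        mu, sigma = ref.mean(), ref.std() + 1e-8  # avoid divide-by-zero
        flags[i] = abs(ts[i] - mu) / sigma > threshold
    return flags

# Hypothetical example: a sine wave with one injected spike.
ts = np.sin(np.linspace(0, 20, 400))
ts[300] += 5.0
flags = residual_zscore_detector(ts)
print(np.flatnonzero(flags))
```

The catch, as the post says, is that a rule like this only covers deviations it can represent; an unbounded space of possible anomalies means any fixed model of "normal" will miss some of them.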

It’s really hard to wrap your mind around this problem, and I’m impressed with the authors’ work on tackling it.