Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress

Well - this is certainly alarming!

Numenta is one of the systems considered in this paper.

Time series anomaly detection has been a perennially important topic in data science, with papers dating back to the 1950s. However, in recent years there has been an explosion of interest in this topic, much of it driven by the success of deep learning in other domains and for other time series tasks. Most of these papers test on one or more of a handful of popular benchmark datasets, created by Yahoo, Numenta, NASA, etc. In this work we make a surprising claim. The majority of the individual exemplars in these datasets suffer from one or more of four flaws. Because of these four flaws, we believe that many published comparisons of anomaly detection algorithms may be unreliable, and more importantly, much of the apparent progress in recent years may be illusionary. In addition to demonstrating these claims, with this paper we introduce the UCR Time Series Anomaly Datasets. We believe that this resource will perform a similar role as the UCR Time Series Classification Archive, by providing the community with a benchmark that allows meaningful comparisons between approaches and a meaningful gauge of overall progress.


I would not worry too much about this article. Although the authors do have some valid criticisms of existing anomaly benchmarks, they also make some arguments that fail to impress.

All of the benchmark datasets appear to have mislabeled data, both false positives and false negatives. Of course, it seems presumptuous of us to make that claim …

There is an additional issue […] many of the anomalies appear towards the end of the test datasets. […] It is easy to see why this could be true. Many real-world systems are run-to-failure, so in many cases, there is no data to the right of the last anomaly. However, it is also easy to see why this could be a problem […] A naïve algorithm that simply labels the last point as an anomaly has an excellent chance of being correct.
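The quoted excerpt's point is easy to verify concretely. Below is a small, hypothetical sketch (names and toy data are mine, not from the paper) of the naive baseline it describes: a "detector" that always flags the last point of each test series as the anomaly. On a run-to-failure-style collection where every labeled anomaly sits at the end of its series, this trivial rule scores perfectly:

```python
import numpy as np


def naive_last_point_detector(series: np.ndarray) -> int:
    """Always flag the final point of the series as the anomaly."""
    return len(series) - 1


# Toy "run-to-failure" test set: each series has its labeled
# anomaly at the very end, mimicking the bias described above.
rng = np.random.default_rng(0)
test_set = []
for _ in range(100):
    n = int(rng.integers(200, 400))
    series = rng.normal(size=n)
    anomaly_idx = n - 1  # anomaly placed at the last point
    test_set.append((series, anomaly_idx))

hits = sum(naive_last_point_detector(s) == idx for s, idx in test_set)
print(f"naive detector accuracy: {hits / len(test_set):.2f}")
# prints "naive detector accuracy: 1.00"
```

The detector never looks at the data at all, which is exactly why anomalies clustered at the end of test sets can inflate reported scores.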

The authors then introduce their own dataset, which they argue fixes these flaws.

I watched the video (posted below), talked with the author (over email), looked at the taxicab dataset, and I think I was too harsh in my initial assessment of this paper.

He argues that anomaly detection is so ill-posed that it is difficult to measure objectively, and then shows some real examples of this issue.


Video talk from the author: