Overconstraining Threshold?

From the NAB whitepaper: “We constrain the detectors to use a single detection threshold for the entire dataset.” I am not sure this is a reasonable approach for a community benchmark, because some algorithms generate anomaly scores that are not naturally in the range 0 to 1. For a given time series, the detector's anomaly score should indicate where the anomalies are. If the detector gives relative scores for a given time series, then the optimizer can find the right threshold for each time series. Since all the time series are run independently, asking the detector to produce anomaly scores that are optimal against a single fixed threshold across the whole corpus seems strange. If we really want a single threshold, shouldn't we provide a single training series (i.e. merge all the time series into one)?
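To make the suggestion concrete, here is a minimal sketch of what I mean by a per-series threshold search. It does not use NAB's actual optimizer or scoring code; `score_with_threshold` is a hypothetical stand-in for a scoring function like NAB's cost-matrix-based score, and the candidate grid over [0, 1] is just an assumption for illustration.

```python
import numpy as np

def best_threshold_per_series(scores_by_series, labels_by_series,
                              score_with_threshold, candidates=None):
    """Pick the best threshold independently for each time series.

    Returns {series_name: (best_threshold, best_score)}.
    `score_with_threshold(scores, labels, t)` is a hypothetical scoring
    function; higher is better.
    """
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)  # coarse grid, assumed range
    results = {}
    for name, scores in scores_by_series.items():
        labels = labels_by_series[name]
        # Exhaustively evaluate each candidate threshold on this series only.
        results[name] = max(
            ((t, score_with_threshold(scores, labels, t)) for t in candidates),
            key=lambda pair: pair[1],
        )
    return results
```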

Did Numenta consider using a threshold per time series? I wonder whether it would significantly change the ranking of the different algorithms.

At the end of the day, a good anomaly detector is useful in a given context, i.e. for a class of time series (like the different data folders in NAB). So we should score per time series (or per time-series class) rather than adding the artificial constraint of a global threshold, because in the real world a detector will be “tuned” for its context (a human will help decide how sensitive the detection of anomalies should be).
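For the per-class variant, a sketch could look like the following: share one threshold within each NAB data folder (e.g. realAWSCloudwatch, artificialWithAnomaly) instead of one threshold for the whole corpus. Again, `score_with_threshold` is the same hypothetical scoring function as above, not NAB's real code.

```python
from collections import defaultdict
import numpy as np

def best_threshold_per_class(scores_by_series, labels_by_series,
                             class_of_series, score_with_threshold,
                             candidates=None):
    """Pick one threshold per class (e.g. per NAB data folder).

    `class_of_series` maps series name -> class/folder name.
    Returns {class_name: best_threshold}.
    """
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)  # assumed candidate grid
    # Group series names by their class/folder.
    series_by_class = defaultdict(list)
    for name in scores_by_series:
        series_by_class[class_of_series[name]].append(name)
    best = {}
    for cls, names in series_by_class.items():
        # Total score across all series in this class for a given threshold.
        def total(t):
            return sum(score_with_threshold(scores_by_series[n],
                                            labels_by_series[n], t)
                       for n in names)
        best[cls] = max(candidates, key=total)
    return best
```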