Anomaly detection for rare irregular recurrent events

Does anomaly detection work for rare but recurrent events? The example I am looking at is the number of people going through the turnstiles of the NYC subway system. As expected, the rate peaks in the morning and is higher on weekdays than on weekends. On federal holidays the rate matches the weekend pattern instead of the weekday pattern. While there are multiple days like this throughout the year, they occur without any regularity, and I don’t tell the model when they will occur. Of note, there are a couple of these days in the initial 500-sample learning period; however, in that initial period they only fall on Mondays.

I would expect the model to detect an anomaly on Thanksgiving, since there is a weekend pattern on a weekday and the model has never seen that occur on a Thursday before, but this is not the case. Additionally, even though the model has seen some Mondays that follow a weekend pattern, since there is no regularity to when these occur, should I expect the model to flag them as anomalies? I know previous posts have discussed turning off learning when there are regular anomalies that you want the model to detect but not incorporate as part of the normal pattern. These anomalies, however, are irregular and rare, so I would not expect them to be incorporated into the pattern, in a manner similar to what is seen in mismatch negativity for a rare oddball stimulus.

I am using the learned swarming parameters from GitHub and NuPIC 0.6.0.

Here is a plot of the data showing the time around Thanksgiving:

And here is the data:

https://gist.github.com/mseinstein/5ad01dde38fc6c82e220d73c28d545d0


You need more than 500 records. It looks like your data is at a 4-hour interval, so 500 records will only express the weekly pattern about 12 times (6 samples per day × 7 days = 42 samples per week, and 500 / 42 ≈ 12).

Do you have more data? Or even finer-grained data like hourly?

The 500 records I was referring to is the default probationaryPeriod in NuPIC, during which the anomaly_likelihood is pinned at 0.5. After looking at the code, the actual value is 388 (a learning period of 288 records plus 100 estimation samples).

In total there are 1780 samples, spanning June 2016 to the end of March 2017. Unfortunately, the data is only sampled every 4 hours.

Can you give me a link or something? Or paste your model params somewhere? This should be working so there may be a few tweaks to make.

I am using a modified version of the model_params found here. I switched the variable names, put in the appropriate min and max values, and changed the code to work with NuPIC 0.6, i.e.:
'model': 'HTMPrediction' --> 'model': 'CLA'
tmEnable --> tpEnable
tmParams --> tpParams
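
For concreteness, the same renames expressed in code (just a sketch; it assumes MODEL_PARAMS is the dict loaded from the model_params file):

MODEL_PARAMS['model'] = 'CLA'  # nupic 0.6 expects 'CLA' rather than 'HTMPrediction'
mp = MODEL_PARAMS['modelParams']
mp['tpEnable'] = mp.pop('tmEnable')  # tmEnable --> tpEnable
mp['tpParams'] = mp.pop('tmParams')  # tmParams --> tpParams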

Original file:

My Modified version:

Try using getScalarMetricWithTimeOfDayAnomalyParams() to get your model params instead. This should work in 0.6.0. (Someday I’ll update the example.)
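
For anyone following along, here is a minimal sketch of that approach (the min/max bounds below are placeholders, not values from this dataset):

import datetime

from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.modelfactory import ModelFactory

# Build anomaly-tuned model params for a scalar metric with a
# time-of-day encoder. metricData is only used to infer the bounds
# when minVal/maxVal are omitted.
params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],
    minVal=0.0,        # placeholder: use the min of your turnstile counts
    maxVal=40000.0)    # placeholder: use the max of your turnstile counts

model = ModelFactory.create(modelConfig=params['modelConfig'])
model.enableInference({'predictedField': 'c1'})

# The generated encoders expect rows keyed as c0 (timestamp) and c1 (value).
result = model.run({'c0': datetime.datetime(2016, 11, 24, 8, 0), 'c1': 12345.0})
print result.inferences['anomalyScore']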

I was able to incorporate getScalarMetricWithTimeOfDayAnomalyParams() into the code. The anomaly detection does a better job, I think, and does detect an anomaly for the first federal holiday in the data set. However, it still does not flag Thanksgiving or Christmas as anomalies, although to be fair the data becomes a lot more erratic around that time of year.

Here is a plot of the same time period as plotted above:

Here is the link to the corresponding gist dataset:

In addition to my original question about anomaly detection, a more general question: why is there a high anomaly likelihood (and oftentimes a high anomaly score as well) when the prediction seems to match the data perfectly? Here is a screenshot of the same dataset showing anomaly detection as the highlighted area (I don’t know how to do highlighting in plot.ly). For most of the highlighted area, the prediction perfectly overlaps the actual data. I understand that the anomaly values are calculated from the active and predicted columns rather than from the actual and predicted values, but shouldn’t there still be some correlation between the two?

Just a note for anyone searching through the forums in the future: using getScalarMetricWithTimeOfDayAnomalyParams() turns off prediction by default. If you want to enable prediction, or to have it work with the Hot Gym anomaly example, you need to add

params['modelConfig']['modelParams']['clEnable'] = True

before you create the model in

model = ModelFactory.create(modelConfig=params["modelConfig"])
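
Putting it together (a sketch; c0/c1 are the field names the generated params use, and timestamp/value stand in for a real data row):

params['modelConfig']['modelParams']['clEnable'] = True  # re-enable the classifier
model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableInference({'predictedField': 'c1'})

result = model.run({'c0': timestamp, 'c1': value})
prediction = result.inferences['multiStepBestPredictions'][1]  # 1-step-ahead value
anomaly_score = result.inferences['anomalyScore']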

From our docs:

A TemporalAnomaly model calculates the anomaly score based on the correctness of the previous prediction. This is calculated as the percentage of active spatial pooler columns that were incorrectly predicted by the temporal memory.

This means that even though the model is making decent predictions, it is also making a lot of extra predictions that are not coming true; there are several sequences it thinks the current input could be part of. You could add a condition that includes actual prediction error to filter these out.
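
For reference, the raw score described in that quote boils down to something like the following (a paraphrase of NuPIC’s computeRawAnomalyScore using plain sets, not a drop-in replacement):

def raw_anomaly_score(active_columns, prev_predicted_columns):
    # Fraction of currently active SP columns that the temporal memory
    # did NOT predict at the previous time step.
    active = set(active_columns)
    predicted = set(prev_predicted_columns)
    if not active:
        return 0.0
    return len(active - predicted) / float(len(active))

So the model can output an accurate best prediction while still scoring high here: the score depends only on whether the active columns were among the previously predicted ones, not on how close the decoded prediction was to the actual value.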