Has anyone tested NuPIC anomaly detection with the Yahoo Webscope data set? I ask because I ran Numenta HTM against a few of the Yahoo Webscope files using a reconfigured version of the hotgym script and a corresponding model file, and the results don't match up with the tagged anomalies in the webscope/ydata-labeled-time-series-anomalies-v1_0/A4Benchmark/ files.
I suspect there are some model params to dial in on Numenta HTM for the Yahoo data set, so I'd like to know if someone else has already done this.
I used some of the Yahoo Webscope data in my thesis. My thesis is not publicly available at the moment because the work is pending publication. In short, I found that the Numenta anomaly detection algorithm was successful on a lot of the data in the Yahoo Webscope database. There are datasets in that collection, however, that the Numenta algorithm is not suited for. More concretely, datasets in which patterns are not repeated are a poor fit for HTM networks (under a standard scalar encoding scheme). A handful of datasets in the Yahoo Webscope collection, for example, feature a signal that is constantly increasing (or decreasing) for the entire snippet being analyzed. Constantly increasing to new values each timestep means new SDRs each timestep, which results in high prediction error throughout the snippet, because no repeatable pattern is captured by a standard scalar encoding scheme. HTM networks operate by predicting repeated patterns of neuron excitation; if you are not encoding any repeatable patterns, HTM networks will be useless.
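To make that concrete, here is a toy sketch (plain Python, not the actual NuPIC ScalarEncoder; the bucket encoder below is just an illustration) of why a monotonically increasing signal gives the temporal memory nothing repeatable to learn:

```python
# Toy illustration (not the real NuPIC ScalarEncoder): a simple bucket-based
# scalar encoder. For a monotonically increasing signal, nearly every timestep
# lands in a bucket (and hence an SDR) that has never been seen before, so
# there is no repeated pattern to learn and prediction error stays high.

def encode(value, min_val=0.0, max_val=1000.0, n_buckets=100, w=5):
    """Map a scalar to a set of w active bit indices (a crude SDR)."""
    value = max(min_val, min(value, max_val))
    bucket = int((value - min_val) / (max_val - min_val) * (n_buckets - 1))
    return frozenset(range(bucket, bucket + w))

signal = [20.0 * t for t in range(50)]   # constantly increasing signal
seen = set()
novel_steps = 0
for x in signal:
    sdr = encode(x)
    if sdr not in seen:                  # this exact SDR was never encoded before
        novel_steps += 1
    seen.add(sdr)

print("steps: %d, novel SDRs: %d" % (len(signal), novel_steps))
# Almost every step produces a brand-new SDR, so an HTM trained on this
# snippet has no repeated pattern to latch onto.
```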
That being said, when repeated patterns of neuron excitation were being encoded, the Numenta algorithm was very useful. There were still some failures and I used those failures to demonstrate the benefits of my thesis contributions. I’ve attached one of the figures from my thesis that reported on a dataset from the Yahoo Webscope database.
@rhyolight Thanks for getting back to me so quickly! Yep, temporal data. I cloned/configured hotgym to process the Yahoo Webscope webscope/ydata-labeled-time-series-anomalies-v1_0/A4Benchmark/ files.
@bkutt This is an EXCELLENT analysis and summary. Thanks a bunch for taking the time to share your experiences and insight–much appreciated! Also, congrats on your PhD!
@bkutt - that sounds great. Look forward to reading your paper when it comes out!
Are you able to run your improved algorithm on NAB as well? Would be cool to see the results.
One issue we had with the Yahoo dataset is that it is not openly accessible, and we are not allowed to see it. They can only send to “a faculty member, research employee or student from an accredited university” and not to “Commercial entities or to Research institutions not affiliated with a research university”. Go figure!!
@hokiegeek2: I appreciate the sentiment! It was an MS thesis.
@subutai: I did test on NAB. The highest obtained scores for each individual profile were:
| Profile  | Score |
|----------|-------|
| Standard | 72.66 |
| Low FN   | 77.46 |
| Low FP   | 67.32 |
A paired t-test across many random initializations of the underlying HTM model showed a statistically significant increase in performance across all three profiles, with the most significant increase in the low FP category. All the details will be released if my paper is published! There are a couple of possible venues in the works / that I have in mind.
The results look great. Look forward to seeing it published!
Incidentally, have you thought about posting to arXiv? That lets you establish an early date for your work, distribute the paper, and still retain the ability to publish in a peer-reviewed journal (which can take a while).
@subutai Thanks again for your thoughts on this (just returning to this now). Question: do you have any guidance on the following params specified in the NumentaDetector class:
The params self.inputMin and self.inputMax (set in the base class constructor) represent the min and max values you expect to see in your data, and are set in advance. They are passed in as the minVal and maxVal parameters to getScalarMetricWithTimeOfDayAnomalyParams and used to initialize the encoders (see the initialize() method on line 113).
The confusingly named self.minVal and self.maxVal variables are used and updated internally within the class to detect spatial anomalies - you don’t need to set those.
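For reference, here is roughly how those values flow into the model setup. This is only a sketch modeled on the NAB Numenta detector; the exact import paths and keyword arguments can differ between NuPIC versions, so treat it as an outline rather than copy-paste code:

```python
# Sketch of how inputMin/inputMax feed the encoder setup in an NAB-style
# Numenta detector. Import paths may differ between NuPIC versions
# (e.g. modelfactory vs. model_factory).
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.modelfactory import ModelFactory

inputMin, inputMax = 0.0, 500.0   # known min/max of your data, set in advance

# Pad the range slightly so values near the edges still encode well.
rangePadding = abs(inputMax - inputMin) * 0.2

params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],                      # dummy data; min/max are given explicitly
    minVal=inputMin - rangePadding,
    maxVal=inputMax + rangePadding,
    minResolution=0.001)

model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableInference({"predictedField": "value"})
```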
probationaryPeriod is the number of initial data records that are used to kickstart the learning - anomalies detected during this time are ignored. This is set to 15% of each datafile in NAB. We usually like to see at least 500-1000 records before the HTM system can start outputting good anomalies.
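And a minimal sketch of the probationary-period idea (the function and variable names here are just placeholders, not the actual NAB code): the model still sees and learns from every record, but anomaly scores from the first 15% of the file are suppressed.

```python
# Minimal sketch of probationary-period handling: compute 15% of the file
# length and suppress anomaly reporting for those initial records while the
# model is still warming up. Names here are illustrative, not NAB's.
import math

def run_detector(records, detect_fn, probation_percent=0.15):
    probationary_period = int(math.floor(probation_percent * len(records)))
    results = []
    for i, record in enumerate(records):
        anomaly_score = detect_fn(record)   # the model learns from every record
        if i < probationary_period:
            anomaly_score = 0.0             # ignore anomalies during warm-up
        results.append(anomaly_score)
    return results
```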