Has anyone tested NuPIC anomaly detection with the Yahoo Webscope data set? I ask because I ran Numenta HTM against a few of the Yahoo Webscope files using a reconfigured version of the hotgym script and a corresponding model file, and the results don't match up with the tagged anomalies in the webscope/ydata-labeled-time-series-anomalies-v1_0/A4Benchmark/ files.
I suspect there are some model params to dial in on Numenta HTM for the Yahoo data set, so I'd like to know if someone else has already done this.
I used some of the Yahoo Webscope data in my thesis. My thesis is not publicly available at the moment because the work is pending publication. In short, I found that the Numenta anomaly detection algorithm was successful on a lot of the data in the Yahoo Webscope database. There are datasets in that collection, however, that the Numenta algorithm is not suited for. More concretely, datasets in which patterns are not repeated are a poor fit for HTM networks (under a standard scalar encoding scheme). A handful of datasets in the Yahoo Webscope collection, for example, feature a signal that is constantly increasing (or decreasing) for the entire snippet being analyzed. Constantly increasing to new values each timestep means new SDRs each timestep, which results in high prediction error throughout the snippet, because no repeatable pattern is captured by a standard scalar encoding scheme. HTM networks operate by predicting repeated patterns of neuron excitation; if you are not encoding any repeatable patterns, HTM networks will be useless.
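To make that concrete, here is a toy sketch (plain Python, not the actual NuPIC ScalarEncoder; the bucket encoder below is just an illustration) of why a monotonically increasing signal gives the temporal memory nothing repeatable to learn:

```python
# Toy illustration (not the real NuPIC ScalarEncoder): a simple bucket-based
# scalar encoder. For a monotonically increasing signal, nearly every timestep
# lands in a bucket (and hence an SDR) that has never been seen before, so
# there is no repeated pattern to learn and prediction error stays high.

def encode(value, min_val=0.0, max_val=1000.0, n_buckets=100, w=5):
    """Map a scalar to a set of w active bit indices (a crude SDR)."""
    value = max(min_val, min(value, max_val))
    bucket = int((value - min_val) / (max_val - min_val) * (n_buckets - 1))
    return frozenset(range(bucket, bucket + w))

signal = [20.0 * t for t in range(50)]   # constantly increasing signal
seen = set()
novel_steps = 0
for x in signal:
    sdr = encode(x)
    if sdr not in seen:                  # this exact SDR was never encoded before
        novel_steps += 1
    seen.add(sdr)

print("steps: %d, novel SDRs: %d" % (len(signal), novel_steps))
# Almost every step produces a brand-new SDR, so an HTM trained on this
# snippet has no repeated pattern to latch onto.
```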
That being said, when repeated patterns of neuron excitation were being encoded, the Numenta algorithm was very useful. There were still some failures and I used those failures to demonstrate the benefits of my thesis contributions. I’ve attached one of the figures from my thesis that reported on a dataset from the Yahoo Webscope database.
@rhyolight Thanks for getting back to me so quickly! Yep, temporal data. I cloned/configured hotgym to process the Yahoo Webscope webscope/ydata-labeled-time-series-anomalies-v1_0/A4Benchmark/ files.
@bkutt This is an EXCELLENT analysis and summary. Thanks a bunch for taking the time to share your experiences and insight–much appreciated! Also, congrats on your PhD!
@bkutt - that sounds great. Look forward to reading your paper when it comes out!
Are you able to run your improved algorithm on NAB as well? Would be cool to see the results.
One issue we had with the Yahoo dataset is that it is not openly accessible, and we are not allowed to see it. They can only send to “a faculty member, research employee or student from an accredited university” and not to “Commercial entities or to Research institutions not affiliated with a research university”. Go figure!!
@hokiegeek2: I appreciate the sentiment! It was an MS thesis.
@subutai: I did test on NAB. The highest obtained scores for each individual profile were:
| Profile  | Score |
|----------|-------|
| Standard | 72.66 |
| Low FN   | 77.46 |
| Low FP   | 67.32 |
A paired t-test across many random initializations of the underlying HTM model showed a statistically significant increase in performance across all three profiles, with the most significant increase in the low FP category. All the details will be released if my paper is published! There are a couple of possible venues in the works / that I have in mind.
The results look great. Look forward to seeing it published!
Incidentally, have you thought about posting to arXiv? That lets you establish an early date for your work, distribute the paper, and still retain the ability to publish in a peer-reviewed journal (which can take a while).
@subutai Thanks again for your thoughts on this (just returning to this now). Question: do you have any guidance on the following params specified in the NumentaDetector class:
The params self.inputMin and self.inputMax (set in the base class constructor) represent the min and max values you expect to see in your data, and are set in advance. They are passed in as the minVal and maxVal parameters to getScalarMetricWithTimeOfDayAnomalyParams and used to initialize the encoders (see the initialize() method on line 113).
The confusingly named self.minVal and self.maxVal variables are used and updated internally within the class to detect spatial anomalies - you don’t need to set those.
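For reference, here is roughly how those values flow into the model setup. This is only a sketch modeled on the NAB Numenta detector; the exact import paths and keyword arguments can differ between NuPIC versions, so treat it as an outline rather than copy-paste code:

```python
# Sketch of how inputMin/inputMax feed the encoder setup in an NAB-style
# Numenta detector. Import paths may differ between NuPIC versions
# (e.g. modelfactory vs. model_factory).
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.modelfactory import ModelFactory

inputMin, inputMax = 0.0, 500.0   # known min/max of your data, set in advance

# Pad the range slightly so values near the edges still encode well.
rangePadding = abs(inputMax - inputMin) * 0.2

params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],                      # dummy data; min/max are given explicitly
    minVal=inputMin - rangePadding,
    maxVal=inputMax + rangePadding,
    minResolution=0.001)

model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableInference({"predictedField": "value"})
```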
probationaryPeriod is the number of initial data records that are used to kickstart the learning - anomalies detected during this time are ignored. This is set to 15% of each datafile in NAB. We usually like to see at least 500-1000 records before the HTM system can start outputting good anomalies.
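And a minimal sketch of the probationary-period idea (the function and variable names here are just placeholders, not the actual NAB code): the model still sees and learns from every record, but anomaly scores from the first 15% of the file are suppressed.

```python
# Minimal sketch of probationary-period handling: compute 15% of the file
# length and suppress anomaly reporting for those initial records while the
# model is still warming up. Names here are illustrative, not NAB's.
import math

def run_detector(records, detect_fn, probation_percent=0.15):
    probationary_period = int(math.floor(probation_percent * len(records)))
    results = []
    for i, record in enumerate(records):
        anomaly_score = detect_fn(record)   # the model learns from every record
        if i < probationary_period:
            anomaly_score = 0.0             # ignore anomalies during warm-up
        results.append(anomaly_score)
    return results
```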