I have a simple prototype of anomaly detection running in Java. It’s getting 1 Hz timestamped data with measurements that typically sit around 30-40 (ms in this case).
When the data first starts flowing the anomaly scores are (rightfully) all over the map. After 400-500 data points they’ve settled down to a pretty even 0.0.
I’ve seen a few bursts of numbers in the 500+ (even 1000+) range, but these don’t seem to cause even a ripple in the anomaly score. If I were doing a simple rolling mean these values would stand out, but they don’t seem ‘interesting’ to HTM.
In other cases I’ve seen values like 110 (even 75) cause a score of 0.025 to appear.
My question(s):
Is there some sensitivity value that I need to work with?
What constitutes a high-value anomaly?
My next goal is to start saving this data to a permanent store so I can graph it and get a better sense of what’s going on. (Speaking of which: is there a param somewhere in the Java .onNext(Inference) that contains the latest values passed into the network? I haven’t found it yet. It would help greatly with storing/graphing.)
Edit: after crossing the 1,100-value mark, almost every anomaly score is a 1.0. There are a few 0.9s and 0.975s, but the data itself is still averaging in the 30-50 ms range.
Hey Phil, you mentioned in another thread that you were encoding “time of day” and “day of week” semantics into your date encoding. If you are sending data in at 1 Hz, that data rate may be too high-resolution to surface daily or weekly anomalies. This leads me to some questions about your data.
What does your data represent? Are there normally occurring daily or weekly patterns in your data that you can discern yourself when you plot it out?
Data is transmission rates for IP traffic between nodes. It’s measured in ms, but there will be patterns at the day and week scale. There will also likely be ‘interesting’ things at much smaller scales: dropouts, increased times, and so on.
One HTM model may not be able to identify both short-term patterns and long-term patterns. I might suggest you focus first on dealing with the daily / weekly patterns and aggregate your data to something more like 1 point every 10 minutes.
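Just to make that concrete, a minimal aggregation sketch (plain Java, nothing HTM.Java-specific; the class and field names here are purely illustrative) could look like:

import java.util.*;

// Bucket 1 Hz readings into 10-minute means before feeding them to the model.
public class TenMinuteAggregator {
    private static final long BUCKET_MS = 10 * 60 * 1000L;

    // bucket start (epoch millis) -> raw ms measurements falling into that bucket
    private final Map<Long, List<Double>> buckets = new TreeMap<>();

    public void add(long epochMillis, double value) {
        long bucketStart = (epochMillis / BUCKET_MS) * BUCKET_MS;
        buckets.computeIfAbsent(bucketStart, k -> new ArrayList<>()).add(value);
    }

    // One averaged point per 10-minute bucket, in time order.
    public Map<Long, Double> aggregate() {
        Map<Long, Double> out = new TreeMap<>();
        buckets.forEach((start, values) -> out.put(start,
                values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0)));
        return out;
    }
}

That gets you about 144 points per day, which is a much better fit for learning daily/weekly sequences.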
The larger scale project is to identify one or more techniques for anomaly detection. The resultant process needs to work at (roughly) the 10 second scale.
What should I change in my model to focus on the shorter time?
Why? I’m just curious where this constraint comes from.
I don’t think you’ll need to change your model, but you’ll want to remove the datetime encoder from the mix. Would it be possible to see a graph of your data, where patterns are recognizable?
Remove the column of data entirely? If not that, what should I use for the first column encoder?
I copied the params from here to see if I could make any headway. After it ran for about 1000 cycles, I think NuPIC gave up. I’m up to >50K data points now and it has reported 0 for everything since.
The 1-10 second scale comes from the requirements for the project I’ve been tasked with. My company is looking for a machine learning library to use for anomaly detection in network traffic.
The lack of discernible patterns may be the root of the problems I’m seeing. Anomalies could take the form of:
If you’re looking at second granularity in this case, I doubt that date semantics like time of day or day of week are going to make any difference, so you should try without them.
Does this data contain the likelihood score or the raw anomaly score? The likelihood score could drop to 0 if the data is inherently unpredictable. But you’ll probably see the anomaly score bouncing around a lot.
Anomaly-likelihood is not automatically configured with this first version of HTM.Java’s Network API “Anomaly Detection Feature”. For now, there is simply the anomaly score that is available. However, the NAPI includes a way to insert your own “nodes” (custom nodes) into the processing chain, which could handle Anomaly-Likelihood calculations, if you wanted to write your own node for that.
In fact, inserting “custom” nodes is really easy in HTM.Java! Check out this test, for an example.
The “Inference” object also has a method called Inference.getCustomObject() which can be used to retrieve an object inserted into the Inference by the custom “node” (Func or “function” object), see below (as excerpted from the above-mentioned test):
// Here's the "Func1" object (the node added below via .add(addedFunc)).
// It takes in a ManualInput (the mutable subclass of Inference) and returns a ManualInput.
// "I" is the Inference/ManualInput; I.customObject(...) stores an object on the Inference
// and returns the ManualInput itself. This example simply stores a String.
Func1<ManualInput, ManualInput> addedFunc = I -> {
    // Do your Anomaly-Likelihood work here?
    return I.customObject("Interposed: " + Arrays.toString(I.getSDR()));
};

Network n = Network.create("Generic Test", p)
    .add(Network.createRegion("R1")
        .add(Network.createLayer("L1", p)
            .add(addedFunc)            // <--- Just insert your function
            .add(new SpatialPooler())));
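On the receiving end you just subscribe to the network and pull the custom object back off the Inference in onNext(). Roughly like this (a sketch assuming the usual Network.observe() / RxJava Subscriber<Inference> pattern):

// Sketch: read the custom object (and the raw anomaly score) as results stream out.
n.observe().subscribe(new rx.Subscriber<Inference>() {
    @Override public void onCompleted() { System.out.println("stream finished"); }
    @Override public void onError(Throwable e) { e.printStackTrace(); }
    @Override public void onNext(Inference inf) {
        Object custom = inf.getCustomObject();     // whatever your Func1 stashed
        double score = inf.getAnomalyScore();      // raw anomaly score from the NAPI
        System.out.println(custom + " -> anomaly score = " + score);
    }
});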
Anomaly-Likelihood has extensive tests you can refer to (see here).
You will have to talk to the Numenta engineers to learn how to use the Anomaly-Likelihood feature (how to track and feed data back in). But all you would do is instantiate your Anomaly-Likelihood object and use it in the above function.
Easy peasy?
(Hint: try inserting your experimental anomaly code directly into the above-mentioned test; that way you can experiment with it and run it.) You’ll also need to add a TemporalMemory(), because the above test doesn’t include one.
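Put together, the skeleton would look roughly like this; the Anomaly-Likelihood call itself is left as a placeholder comment because its exact signature is what you’d need to confirm with the Numenta engineers:

// Skeletal sketch only; the AnomalyLikelihood call shown in the comment is
// hypothetical, not the confirmed API.
Func1<ManualInput, ManualInput> likelihoodFunc = I -> {
    double rawScore = I.getAnomalyScore();
    // double prob = anomalyLikelihood.anomalyProbability(latestValue, rawScore, timestamp); // hypothetical
    // return I.customObject(prob);
    return I.customObject(rawScore); // placeholder until the likelihood call is wired in
};

Network n = Network.create("Likelihood Test", p)
    .add(Network.createRegion("R1")
        .add(Network.createLayer("L1", p)
            .add(likelihoodFunc)
            .add(new TemporalMemory())   // <--- not in the original test; add it
            .add(new SpatialPooler())));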
From what I can gather, it appears that the likelihood score is a rolling-window Gaussian. For the purposes of my POC(s) I can mostly eyeball it.
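Something like this plain-Java sketch is what I have in mind for the eyeballing; it just approximates the idea (sliding window of raw scores, Gaussian fit, tail probability of the latest score) and is not the actual NuPIC/HTM.Java implementation:

import java.util.ArrayDeque;
import java.util.Deque;

// Rough approximation of a rolling-window Gaussian likelihood over raw anomaly scores.
public class RollingGaussianLikelihood {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    public RollingGaussianLikelihood(int windowSize) { this.windowSize = windowSize; }

    // Returns a value in [0,1]; near 1.0 means the latest score is unusually high
    // relative to the recent window.
    public double update(double rawAnomalyScore) {
        window.addLast(rawAnomalyScore);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() < 2) return 0.5;

        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = window.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum() / (window.size() - 1);
        double std = Math.max(Math.sqrt(var), 1e-6);

        // Gaussian CDF of the latest score under the window's distribution.
        double z = (rawAnomalyScore - mean) / (std * Math.sqrt(2.0));
        return 0.5 * (1.0 + erf(z));
    }

    // Abramowitz & Stegun 7.1.26 approximation of the error function.
    private static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }
}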
I’m thinking that my anomaly detection params are either too sensitive or just plain wrong. This is a sample of 1,000 data points. It seems like it’s pinning the detection value pretty regularly.
We typically threshold the likelihood at four nines (possible anomaly) or five nines (definite anomaly). We also have an additional step that computes the log likelihood, which converts 0.99999 to 0.5 and 0.9999 to 0.4. The log likelihood is suitable for plotting, while the straight likelihood is not (as you demonstrated). See here for converting the log likelihood:
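For reference, the conversion boils down to something like this (a sketch mirroring the formula in the NuPIC source as far as I can tell; the linked code is authoritative):

// Maps "number of nines" in the likelihood roughly onto tenths:
// 0.9999 -> ~0.4, 0.99999 -> ~0.5, etc. Much friendlier to plot.
public static double computeLogLikelihood(double likelihood) {
    // Math.log(1.0 - 0.9999999999) == log(1e-10) ≈ -23.026
    return Math.log(1.0000000001 - likelihood) / Math.log(1.0 - 0.9999999999);
}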