I came here after reading the paper “Unsupervised real-time anomaly detection for streaming data”. I am working on a dataset which has date and time, GPS locations, speed, and acceleration, along with device pitch, yaw, and roll values. So I am assuming I will need more than one encoder to find the anomalies (whether caused intentionally or by malfunction). I am not sure, as I am totally new to the HTM community and still trying to understand the concepts. Any comments or suggestions would be of great help.
As far as I know, there is a single model for each metric. I haven’t come across any multivariate approach so far.
You can pass multivariate streams into NuPIC, yes. What I’d do is go into the nupic folder you downloaded and go to the directory: /examples/opf/clients/hotgym/anomaly/one_gym. Within one_gym you’ll find several files, one of which is a run.py file. In this file, within the ‘runIoThroughNupic’ function (on lines 110-114), you’ll see the variables ‘timestamp’ and ‘consumption’ being defined and passed into the model’s run call. The values are pulled out of each row of the CSV file containing your data.
With a CSV file containing more variables like yours (often called ‘fields’ around here), you’ll just have to add more lines such as ‘speed = float(row[2])’, ‘pitch = float(row[3])’ (using whichever column indices match your file), and then add entries for them to the dict that’s passed into model.run() on lines 113-114.
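To make that concrete, here’s a minimal sketch of what the modified row-handling in run.py might look like. The field names (speed, pitch, etc.), the column order, and the date format are assumptions based on the dataset described above; the NuPIC-specific model call is shown only as a comment so the row-to-dict logic stands on its own:

```python
import csv
import datetime
from io import StringIO

DATE_FORMAT = "%m/%d/%y %H:%M"  # same format string the one_gym example uses

def row_to_model_input(row):
    """Convert one CSV row into the dict passed to model.run().

    Column order here is hypothetical: timestamp, speed, pitch, yaw, roll.
    Adjust the indices to match your own file.
    """
    return {
        "timestamp": datetime.datetime.strptime(row[0], DATE_FORMAT),
        "speed": float(row[1]),
        "pitch": float(row[2]),
        "yaw": float(row[3]),
        "roll": float(row[4]),
    }

# Example: parse a fake two-row CSV and build an input dict per row.
data = StringIO("07/02/10 00:00,21.2,0.5,-1.3,0.02\n"
                "07/02/10 00:15,20.7,0.6,-1.1,0.01\n")
inputs = [row_to_model_input(row) for row in csv.reader(data)]
# Each dict would then be fed to the model: result = model.run(inputs[i])
```

The key point is that every field name in this dict has to match a field defined in the model params file, which is the next step.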
The other piece of it is adding these fields into the associated ‘params’ file (named ‘rec_center_hourly_model_params.py’), which is located in the ‘model_params’ folder within one_gym. You’ll see that the file is composed of a bunch of nested dictionaries, and all you’ll have to do is add entries for your fields to the ‘encoders’ dict (as is being done for ‘kw_energy_consumption’, ‘timestamp_timeofday’ and ‘timestamp_weekend’ on lines 30, 39 and 44 respectively).
It looks like all of your values besides date/time and GPS location would just be scalar values, so those entries would look just like the one for ‘kw_energy_consumption’. You could copy that one several times, rename it for each new field, and enter the appropriate min and max values. I have to inject one question of my own for everyone here: aren’t the ‘n=29’ and ‘w=21’ values too close together? I had better results with ‘n=300’ and ‘w=31’ myself.
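For illustration, the added scalar-field entries in the ‘encoders’ dict might look like the sketch below. The field names and the min/max ranges are placeholders for your own data, and the n/w values reflect the wider settings mentioned above rather than the example file’s defaults:

```python
# Hypothetical additions to the 'encoders' dict in
# rec_center_hourly_model_params.py. Field names, min/max ranges, and
# n/w settings are all assumptions -- tune them to your data.
encoders = {
    "speed": {
        "fieldname": "speed",
        "name": "speed",
        "type": "ScalarEncoder",
        "minval": 0.0,      # slowest expected speed
        "maxval": 160.0,    # fastest expected speed
        "clipInput": True,  # clamp out-of-range values instead of erroring
        "n": 300,           # total bits in the encoding
        "w": 31,            # active bits per encoded value (must be odd)
    },
    "pitch": {
        "fieldname": "pitch",
        "name": "pitch",
        "type": "ScalarEncoder",
        "minval": -90.0,
        "maxval": 90.0,
        "clipInput": True,
        "n": 300,
        "w": 31,
    },
}
```

Each dict key has to match the field name used in the input dict passed to model.run().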
Anyway, as for the GPS data, I know that there is a compatible encoder for this (i.e. the ‘geospatial_coordinate’ encoder). I’ve never actually implemented it or seen a params file that uses it, so if anyone could point us to an example, please do!
Finally, you’ll have to go into the run.py file and change the ‘predicted_field’ within the ‘createModel’ function on line 57 to one of the fields included in your input. I’d also note that the anomaly score and anomaly likelihood values tell you how anomalous the entire input was at a given time, and since the input is a concatenation of all input fields, there’s no separating which fields are acting anomalously and which aren’t. If you want to see which fields are acting predictably and which anomalously, you’d have to run a separate model for each one.
There’s a general rule of thumb to minimize the number of input fields to a single model (on the order of 5 or fewer), as too many can add noise and sort of clog up the spatial pooler. You can try all of them of course; I just think you’ll get your best results narrowing it down to 2-4 of the least noisy fields.
Hope this helps.
Thank you so much for the details. I really appreciate it. Currently, I am trying to minimize the dataset inputs and to build a correlation map using PCA, and then pass that correlation map into NuPIC. I will definitely get back on this. Thank you again.
Awesome. Very excited to explore the HTM world.
That sounds very cool, and I hope this description makes things smoother once you have the correlation map. I’d definitely be curious to see any results once you’re ready!
In short - your best bet is to design an encoder that encodes all of that information into one binary vector representation. Make sure your binary representation is large enough to encode each piece of the data to the desired granularity.
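As a toy illustration of that idea (not a real NuPIC encoder, just the concatenation principle), each field could be encoded as its own bit array and the arrays joined into one binary vector. The field names, value ranges, and sizes here are made up:

```python
def encode_scalar(value, minval, maxval, n, w):
    """Simple scalar encoder: a block of w active bits whose position
    within n total bits reflects where value falls in [minval, maxval]."""
    value = max(minval, min(maxval, value))  # clip out-of-range input
    # Leftmost position of the active block, scaled so w bits always fit.
    start = int(round((value - minval) / (maxval - minval) * (n - w)))
    bits = [0] * n
    for i in range(start, start + w):
        bits[i] = 1
    return bits

# Hypothetical fields: encode each one, then concatenate into one vector.
fields = {
    "speed": (55.0, 0.0, 160.0),   # (value, minval, maxval)
    "pitch": (10.0, -90.0, 90.0),
}
encoding = []
for name, (value, lo, hi) in sorted(fields.items()):
    encoding.extend(encode_scalar(value, lo, hi, n=100, w=11))

# The combined vector is what the spatial pooler would see as input.
```

Making n large enough per field is what gives each piece of data the granularity mentioned above; overlapping active bits between nearby values is what carries the semantics.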
Talking toward future HTM theory, I think the problem of multiple fields will be better solved once temporal pooling is applied to temporal memory functions. Each field (or a smaller subset of them) would have its own TM layer, and then all the TM layers would feed into a single TP layer with long-distance distal connections, biasing the TM layers below via apical feedback.
The series of inputs for one feature by itself might predict several possible “objects” being observed (not sure the right term here for pooling of TM input). But in combination with the series of inputs from other features, the list of possibilities can be narrowed down. This is similar to the SMI experiments that are being run with multiple “fingers” sensing different areas on a single object and narrowing down the possibilities very quickly through apical feedback.
@rhyolight, regarding “Unsupervised real-time anomaly detection for streaming data”: can we apply it to multiple fields, like more than 100? Is that possible?
In theory, yes. But in practice in NuPIC, fewer than 10. You might want to look into encoding strategies so that you encode only the meaningful semantics of that much data.
He could theoretically run a small swarm even with that many variables, right?
A small swarm returns completely worthless results. It really should be called a “debug” swarm, because it just kicks the tires of the swarm configuration and ensures that it can run properly, but it does not permute over all the variables it needs to find good model params.
And a medium or large swarm would take a really long time with that many fields right?
Yes. A really long time. It has to create and run NuPIC models that include lots of different combinations of encoder settings for each field. Adding a field is probably close to an exponential increase in swarm time.
I am gonna try 100 columns now. Could you please suggest something on how to change encoding strategies? Any example or reference you could provide would help.
Thanks for the valuable reply.
100 columns is still way too many to swarm over. I suggest less than 10 scalar fields. Encoding strategies will depend heavily on the type of data and what it represents. Here are some resources:
thanks for the great input!
In the supplementary material S4 of the paper “Unsupervised real-time anomaly detection for streaming data” you propose a simple method to combine the prediction errors of individual models into one anomaly likelihood. In the text, you mentioned that this was implemented somewhere. Could you point me to a repo? Or is it somewhere within nupic?
From the discussion here another question arose: assuming I have fewer than 10 attributes, is it better to combine the encodings and use one model, or would you rather use the method mentioned in the paper and combine the prediction errors of the individual models? And if I have a lot more than 10 attributes, let’s say 100, do you think that combining the errors into one anomaly score would still yield at least “okay” results?
The more attributes you have the more dubious it is to use a single model. This is because you’re cramming all that dimensionality into one SP, which gives each column a heavier representational load.
10 attributes may still be ok for a single model, but I think multi-model is generally better practice. Besides preventing representational overload, multi-model offers a finer look into the system. It lets you monitor each attribute and look for simultaneous anomaly spikes among sub-groups. This can offer insight into which attributes are influenced by the same underlying forces, causing them to become unpredictable at the same times.
For >>10 you should go with multi-model for sure. I’d recommend using the anomaly likelihood values for each attribute instead of the raw anomaly scores though. With very noisy attributes you’ll get high anomaly scores all the time, which can skew your total anomaly score. The anomaly likelihood will essentially normalize for this.
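A minimal sketch of that multi-model aggregation, with the per-field likelihood values assumed to come from NuPIC’s anomaly likelihood helper (stubbed here as plain numbers). Taking the mean is just one simple combination choice; the paper’s supplementary material describes its own method, which this does not reproduce:

```python
def combined_anomaly(likelihoods):
    """Combine per-field anomaly likelihoods into one score.

    Uses a plain mean as an illustrative choice -- a noisy field with a
    raw anomaly score that is always high would dominate here if you fed
    in raw scores, which is why the (already normalized) likelihoods are
    the better input.
    """
    return sum(likelihoods.values()) / len(likelihoods)

# Hypothetical per-field likelihoods at one timestep: one model (speed)
# reports a strong anomaly while the others stay calm.
likelihoods = {"speed": 0.99, "pitch": 0.20, "yaw": 0.15, "roll": 0.10}
score = combined_anomaly(likelihoods)
```

Note how the single spiking field barely moves the mean; that is exactly the weakness the threshold approach below is meant to cover.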
Also, along with the total anomaly approach (combining all scores into 1), you could look for times when more than a certain proportion of the attributes have high anomaly likelihood (say 15%). It could be the case that the total anomaly value isn’t that high because most attributes are acting predictably, while a small subset are acting anomalously. The threshold approach should catch this sub-group while the total anomaly approach might miss it.
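The threshold idea could be sketched like this; the 0.9 likelihood cutoff and the 15% proportion are the assumed parameters from the discussion above, not fixed values:

```python
def flagged_fraction(likelihoods, cutoff=0.9):
    """Return the fraction of fields whose anomaly likelihood exceeds
    the cutoff, along with the names of those fields."""
    flagged = [name for name, v in likelihoods.items() if v > cutoff]
    return len(flagged) / len(likelihoods), flagged

# Hypothetical per-field likelihoods: two of ten fields spike while the
# rest stay predictable, so a mean-based total score would look tame.
likelihoods = {"f%d" % i: 0.1 for i in range(8)}
likelihoods.update({"speed": 0.97, "pitch": 0.95})

fraction, flagged = flagged_fraction(likelihoods)
if fraction >= 0.15:  # more than 15% of fields anomalous at once
    print("sub-group anomaly:", sorted(flagged))
```

This also tells you *which* attributes spiked together, which is the insight into shared underlying forces mentioned earlier.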
Thanks for the detailed answer, it brings a lot of clarity!
Do you know if anyone has already implemented the multivariate anomaly detection as described in the paper? I don’t want to repeat work already done but rather build upon it.