I came here after reading the paper “Unsupervised real-time anomaly detection for streaming data”. I am working on a dataset which has date and time, GPS locations, speed, and acceleration, along with device pitch, yaw, and roll values. So I am assuming I will need more than one encoder to find the anomalies (whether caused intentionally or by malfunction). I am not sure, as I am totally new to the HTM community and still trying to understand the concepts. Any comments or suggestions would be of great help.
As far as I know, there is a single model for each metric. I haven’t come across any multivariate approach so far.
You can pass multivariate streams into NuPIC, yes. What I’d do is go into the nupic folder you downloaded and go to the directory: /examples/opf/clients/hotgym/anomaly/one_gym. Within one_gym you’ll find several files, one of which is a run.py file. In this file, within the ‘runIoThroughNupic’ function (on lines 110-114), you’ll see the variables ‘timestamp’ and ‘consumption’ being defined and passed into the model’s run call. The values are pulled out of each row of the CSV file containing your data.
With a CSV file containing more variables like yours (often called ‘fields’ around here), you’ll just have to add more lines such as ‘speed = float(row[2])’, ‘pitch = float(row[3])’ (using whichever column indices match your file), and then add entries for them to the dict that’s passed into model.run() on lines 113-114.
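To make that concrete, here’s a minimal sketch of what the modified row-handling in run.py might look like. The field names (speed, pitch, etc.), the column order, and the date format are assumptions based on the dataset described above; the NuPIC-specific model call is shown only as a comment so the row-to-dict logic stands on its own:

```python
import csv
import datetime
from io import StringIO

DATE_FORMAT = "%m/%d/%y %H:%M"  # same format string the one_gym example uses

def row_to_model_input(row):
    """Convert one CSV row into the dict passed to model.run().

    Column order here is hypothetical: timestamp, speed, pitch, yaw, roll.
    Adjust the indices to match your own file.
    """
    return {
        "timestamp": datetime.datetime.strptime(row[0], DATE_FORMAT),
        "speed": float(row[1]),
        "pitch": float(row[2]),
        "yaw": float(row[3]),
        "roll": float(row[4]),
    }

# Example: parse a fake two-row CSV and build an input dict per row.
data = StringIO("07/02/10 00:00,21.2,0.5,-1.3,0.02\n"
                "07/02/10 00:15,20.7,0.6,-1.1,0.01\n")
inputs = [row_to_model_input(row) for row in csv.reader(data)]
# Each dict would then be fed to the model: result = model.run(inputs[i])
```

The key point is that every field name in this dict has to match a field defined in the model params file, which is the next step.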
The other piece of it is adding these fields into the associated ‘params’ file (named ‘rec_center_hourly_model_params.py’), which is located in the ‘model_params’ folder within one_gym. You’ll see that the file is composed of a bunch of nested dictionaries, and all you’ll have to do is add entries for your fields to the ‘encoders’ dict (as is being done for ‘kw_energy_consumption’, ‘timestamp_timeofday’ and ‘timestamp_weekend’ on lines 30, 39 and 44 respectively).
It looks like all of your values besides date/time and GPS location would just be scalar values, so those entries would look just like the one for ‘kw_energy_consumption’. You could copy that one several times, rename it for each new field, and enter the appropriate min and max values. I have to inject one question of my own for everyone here: aren’t the ‘n=29’ and ‘w=21’ values too close together? I had better results with ‘n=300’ and ‘w=31’ myself.
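For illustration, the added scalar-field entries in the ‘encoders’ dict might look like the sketch below. The field names and the min/max ranges are placeholders for your own data, and the n/w values reflect the wider settings mentioned above rather than the example file’s defaults:

```python
# Hypothetical additions to the 'encoders' dict in
# rec_center_hourly_model_params.py. Field names, min/max ranges, and
# n/w settings are all assumptions -- tune them to your data.
encoders = {
    "speed": {
        "fieldname": "speed",
        "name": "speed",
        "type": "ScalarEncoder",
        "minval": 0.0,      # slowest expected speed
        "maxval": 160.0,    # fastest expected speed
        "clipInput": True,  # clamp out-of-range values instead of erroring
        "n": 300,           # total bits in the encoding
        "w": 31,            # active bits per encoded value (must be odd)
    },
    "pitch": {
        "fieldname": "pitch",
        "name": "pitch",
        "type": "ScalarEncoder",
        "minval": -90.0,
        "maxval": 90.0,
        "clipInput": True,
        "n": 300,
        "w": 31,
    },
}
```

Each dict key has to match the field name used in the input dict passed to model.run().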
Anyway, as for the GPS data, I know that there is a compatible encoder for this (i.e. the ‘geospatial_coordinate’ encoder). I’ve never actually implemented it or seen a params file that uses it, so if anyone could point us to an example, please do!
Finally, you’ll have to go into the run.py file and change the ‘predicted_field’ within the ‘createModel’ function on line 57 to one of the fields included in your input. I’d also note that the anomaly score and anomaly likelihood values tell you how anomalous the entire input was at a given time, and since the input is a concatenation of all input fields, there’s no separating which fields are acting anomalously and which aren’t. If you want to see which fields are acting predictably and which anomalously, you’d have to run a separate model for each one.
There’s a general rule of thumb to minimize the number of input fields to a single model (on the order of 5 or fewer), as too many can add noise and sort of clog up the spatial pooler. You can try all of them of course; I just think you’ll get your best results narrowing it down to 2-4 of the least noisy fields.
Hope this helps.
Thank you so much for the details. I really appreciate it. Currently, I am trying to minimize the dataset inputs and to build a correlation map using PCA, and then pass that correlation map into NuPIC. I will definitely get back on this. Thank you again.
Awesome. Very excited to explore the HTM world.
That sounds very cool, and I hope this description makes things smoother once you have the correlation map. I’d definitely be curious to see any results once you’re ready!
In short - your best bet is to design an encoder that encodes all of that information into one binary vector representation. Make sure your binary representation is large enough to encode each piece of the data to the desired granularity.
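As a toy illustration of that idea (not a real NuPIC encoder, just the concatenation principle), each field could be encoded as its own bit array and the arrays joined into one binary vector. The field names, value ranges, and sizes here are made up:

```python
def encode_scalar(value, minval, maxval, n, w):
    """Simple scalar encoder: a block of w active bits whose position
    within n total bits reflects where value falls in [minval, maxval]."""
    value = max(minval, min(maxval, value))  # clip out-of-range input
    # Leftmost position of the active block, scaled so w bits always fit.
    start = int(round((value - minval) / (maxval - minval) * (n - w)))
    bits = [0] * n
    for i in range(start, start + w):
        bits[i] = 1
    return bits

# Hypothetical fields: encode each one, then concatenate into one vector.
fields = {
    "speed": (55.0, 0.0, 160.0),   # (value, minval, maxval)
    "pitch": (10.0, -90.0, 90.0),
}
encoding = []
for name, (value, lo, hi) in sorted(fields.items()):
    encoding.extend(encode_scalar(value, lo, hi, n=100, w=11))

# The combined vector is what the spatial pooler would see as input.
```

Making n large enough per field is what gives each piece of data the granularity mentioned above; overlapping active bits between nearby values is what carries the semantics.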
Talking toward future HTM theory, I think the problem of multiple fields will be better solved once temporal pooling is applied to temporal memory functions. Each field (or a smaller subset of them) would have its own TM layer, and then all the TM layers would feed into a single TP layer with long-distance distal connections, biasing the TM layers below via apical feedback.
The series of inputs for one feature by itself might predict several possible “objects” being observed (not sure the right term here for pooling of TM input). But in combination with the series of inputs from other features, the list of possibilities can be narrowed down. This is similar to the SMI experiments that are being run with multiple “fingers” sensing different areas on a single object and narrowing down the possibilities very quickly through apical feedback.
@rhyolight, regarding “Unsupervised real-time anomaly detection for streaming data”: can we apply it to multiple fields, like more than 100? Is that possible?
In theory, yes. But in practice in NuPIC, fewer than 10. You might want to look into encoding strategies so that you encode only the meaningful semantics of that much data.
He could theoretically run a small swarm even with that many variables, right?
A small swarm returns completely worthless results. It really should be called a “debug” swarm, because it just kicks the tires of the swarm configuration and ensures that it can run properly, but it does not permute over all the variables it needs to find good model params.
And a medium or large swarm would take a really long time with that many fields right?
Yes. A really long time. It has to create and run NuPIC models that include lots of different combinations of encoder settings for each field. Adding a field is probably close to an exponential increase in swarm time.
I am gonna try 100 columns now. Could you please suggest something on how to change encoding strategies? Any example or reference you could provide would help.
Thanks for the valuable reply.
100 columns is still way too many to swarm over. I suggest less than 10 scalar fields. Encoding strategies will depend heavily on the type of data and what it represents. Here are some resources:
thanks for the great input!
In the supplementary material S4 of the paper “Unsupervised real-time anomaly detection for streaming data” you propose a simple method to combine the prediction errors of individual models into one anomaly likelihood. In the text, you mentioned that this was implemented somewhere. Could you point me to a repo? Or is it somewhere within nupic?
From the discussion here another question arose: assuming I have fewer than 10 attributes, is it better to combine the encodings and use one model, or would you rather use the method mentioned in the paper and combine the prediction errors of the individual models? And if I have a lot more than 10 attributes, let’s say 100, do you think that combining the errors into one anomaly score would still yield at least “okay” results?
The more attributes you have the more dubious it is to use a single model. This is because you’re cramming all that dimensionality into one SP, which gives each column a heavier representational load.
10 attributes may still be ok for a single model, but I think multi-model is generally better practice. Besides preventing representational overload, multi-model offers a finer look into the system. It lets you monitor each attribute and look for simultaneous anomaly spikes among sub-groups. This can offer insight into which attributes are influenced by the same underlying forces, causing them to become unpredictable at the same times.
For >>10 you should go with multi-model for sure. I’d recommend using the anomaly likelihood values for each attribute instead of the raw anomaly scores though. With very noisy attributes you’ll get high anomaly scores all the time, which can skew your total anomaly score. The anomaly likelihood will essentially normalize for this.
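A minimal sketch of that multi-model aggregation, with the per-field likelihood values assumed to come from NuPIC’s anomaly likelihood helper (stubbed here as plain numbers). Taking the mean is just one simple combination choice; the paper’s supplementary material describes its own method, which this does not reproduce:

```python
def combined_anomaly(likelihoods):
    """Combine per-field anomaly likelihoods into one score.

    Uses a plain mean as an illustrative choice -- a noisy field with a
    raw anomaly score that is always high would dominate here if you fed
    in raw scores, which is why the (already normalized) likelihoods are
    the better input.
    """
    return sum(likelihoods.values()) / len(likelihoods)

# Hypothetical per-field likelihoods at one timestep: one model (speed)
# reports a strong anomaly while the others stay calm.
likelihoods = {"speed": 0.99, "pitch": 0.20, "yaw": 0.15, "roll": 0.10}
score = combined_anomaly(likelihoods)
```

Note how the single spiking field barely moves the mean; that is exactly the weakness the threshold approach below is meant to cover.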
Also, along with the total anomaly approach (combining all scores into 1), you could look for times when more than a certain proportion of the attributes have high anomaly likelihood (say 15%). It could be the case that the total anomaly value isn’t that high because most attributes are acting predictably, while a small subset are acting anomalously. The threshold approach should catch this sub-group while the total anomaly approach might miss it.
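The threshold idea could be sketched like this; the 0.9 likelihood cutoff and the 15% proportion are the assumed parameters from the discussion above, not fixed values:

```python
def flagged_fraction(likelihoods, cutoff=0.9):
    """Return the fraction of fields whose anomaly likelihood exceeds
    the cutoff, along with the names of those fields."""
    flagged = [name for name, v in likelihoods.items() if v > cutoff]
    return len(flagged) / len(likelihoods), flagged

# Hypothetical per-field likelihoods: two of ten fields spike while the
# rest stay predictable, so a mean-based total score would look tame.
likelihoods = {"f%d" % i: 0.1 for i in range(8)}
likelihoods.update({"speed": 0.97, "pitch": 0.95})

fraction, flagged = flagged_fraction(likelihoods)
if fraction >= 0.15:  # more than 15% of fields anomalous at once
    print("sub-group anomaly:", sorted(flagged))
```

This also tells you *which* attributes spiked together, which is the insight into shared underlying forces mentioned earlier.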
Thanks for the detailed answer, it brings a lot of clarity!
Do you know if anyone has already implemented the multivariate anomaly detection as described in the paper? I don’t want to repeat work already done but rather build upon it.