How to approach anomaly detection in htm.core with multivariate data

SHA2 · June 24, 2021, 3:21pm

Hi all,
I’m new to HTM and would like to ask for your advice on anomaly detection in multivariate data.
The data looks like the following:

timestamp, category, value

03/04/21 07:10, CAT1, 5
03/04/21 07:10, CAT2, 2
03/04/21 07:10, CAT3, 3
03/04/21 07:10, CAT4, 7
03/04/21 07:10, CAT5, 1
03/04/21 07:15, CAT1, 9
03/04/21 07:15, CAT2, 3
03/04/21 07:15, CAT3, 4
03/04/21 07:15, CAT4, 2
03/04/21 07:15, CAT5, 6
03/04/21 07:20, CAT1, 2
03/04/21 07:20, CAT2, 21
03/04/21 07:20, CAT3, 44
03/04/21 07:20, CAT4, 4
03/04/21 07:20, CAT5, 12

I played with the hotgym example using htm.core (Python 3.7) and applied it to anomaly detection for a single category (using only the timestamp and value columns of a chosen category).
Now I’m having difficulty understanding how to apply that example to multiple categories.

I have a few days of data at 1-minute intervals for over 1000 categories, which amounts to over 5 million records in the above format. The number of categories is not fixed, as new categories may appear as time progresses.

My use case is that I have to find anomalies in each category and also detect whether any of the categories is deviating from the rest.

Can anyone guide me (preferably with a sample or settings) on handling multivariate data in htm.core and on how I should structure my data?

Thanks for any guidance.

sheiser1 · June 24, 2021, 7:00pm

Hey @SHA2, welcome!

To find anomalies within each category, I’d make a separate model for each category. I’d look for more anomalous activity in each model at the times when the anomaly likelihood peaks. The anomaly likelihood uses a sliding window of recent anomaly scores, so it may not work for categories that have much less data than others. In these cases I’d use raw anomaly scores. One nice benefit of this approach is you could remove the CAT column from each model’s data, since each model covers a single CAT.
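As a minimal sketch of the per-category layout, the long-format rows can be grouped into one time-ordered series per category, with the CAT column becoming the dictionary key. The function name `split_by_category` is hypothetical, the rows are a subset of the sample in the question, and the sketch assumes the input is already sorted by time. Each resulting series would then feed its own model.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; the format is "timestamp, category, value".
RAW = """\
03/04/21 07:10, CAT1, 5
03/04/21 07:10, CAT2, 2
03/04/21 07:15, CAT1, 9
03/04/21 07:15, CAT2, 3
03/04/21 07:20, CAT1, 2
03/04/21 07:20, CAT2, 21
"""

def split_by_category(lines):
    """Group long-format rows into one (timestamp, value) list per category.

    Input order is preserved, so the rows must already be sorted by time.
    """
    series = defaultdict(list)  # a new category gets its own list on first sight
    for timestamp, category, value in csv.reader(lines, skipinitialspace=True):
        series[category].append((timestamp, float(value)))
    return dict(series)

per_category = split_by_category(io.StringIO(RAW))
for category, rows in per_category.items():
    print(category, [value for _, value in rows])
```

Because the dictionary grows on demand, a category that first appears later in the stream simply gets a new entry, which is one way to cope with a category count that is not fixed.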

To find categories that deviate from the rest, you could make 1 master model (using, say, the first 70% of all categories’ data), and then feed in the other 30% of each category’s data separately, tracking the anomaly scores for each. I’d compare these anomaly score distributions across categories and look for any stand-outs.

Those categories which are more abnormal relative to all data should have higher anomaly scores overall. As long as the data is in sequential order you shouldn’t need the CAT column for this model either.

You may also consider dropping the timestamp. For a time feature to help, it should repeat within the data, so if your data only covers a few days it won’t help to encode the year, month or day of week. However, the time of day alone could help, since that does repeat over the span of a few days.
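A small sketch of reducing a timestamp to its time of day, which is the only part that repeats over a few days of data. The helper name `minutes_since_midnight` is hypothetical, and the day/month/year reading of the sample timestamps is an assumption, since the sample does not say which comes first.

```python
from datetime import datetime

# Assumption: timestamps are day/month/year; swap %d and %m if the data is month-first.
TIMESTAMP_FORMAT = "%d/%m/%y %H:%M"

def minutes_since_midnight(timestamp):
    """Return the time of day as minutes since midnight, discarding the date."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.hour * 60 + parsed.minute

print(minutes_since_midnight("03/04/21 07:10"))  # 430
```

The date part does not affect the result, so the day/month ambiguity only matters if the date is used elsewhere.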

Best of luck and hope this helps!

SHA2 · June 25, 2021, 10:05pm

Thanks @sheiser1 for your advice.

How should I structure my input data to create a master model?

sheiser1 · June 25, 2021, 10:36pm

I think you can just have 1 column of data sorted in time order, with the raw values. It’ll contain data from all categories. To avoid data leakage you should split the data for each category, so that some data from each category is held aside from training.

Then you can feed these test data sets for each category to the model, using their 1 column of raw values, and tracking the distribution of anomaly scores for each.
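The split described above could look roughly like this. It is a sketch under assumptions: the helper name `chronological_split` is hypothetical, the first three values per category come from the sample in the question and the rest are illustrative placeholders, and the integer index stands in for the timestamp.

```python
def chronological_split(rows, train_percent=70):
    """Split one category's time-ordered rows into (train, test) without shuffling."""
    cut = len(rows) * train_percent // 100  # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]

# (time_step, value) rows per category; values after the third are made up.
per_category = {
    "CAT1": list(enumerate([5, 9, 2, 4, 6, 3, 8, 5, 7, 4])),
    "CAT2": list(enumerate([2, 3, 21, 2, 4, 3, 2, 5, 3, 2])),
}

train_sets, test_sets = {}, {}
for category, rows in per_category.items():
    train_sets[category], test_sets[category] = chronological_split(rows)

# Master training input: every category's training rows merged in time order.
# sorted() is stable, so rows sharing a time step keep their category order.
merged = sorted(
    (row for rows in train_sets.values() for row in rows),
    key=lambda row: row[0],
)
master_train_values = [value for _, value in merged]  # the single column of raw values

print(master_train_values[:4])
print({category: [v for _, v in rows] for category, rows in test_sets.items()})
```

Splitting each category by position rather than at random keeps the held-out 30% strictly later in time than anything the master model has learned from.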

SHA2 · June 26, 2021, 4:55pm

Thanks @sheiser1

Here is my understanding.

1. Create a model for each category to find anomalies in each category, with input data as single-column time-ordered values.
2. At every time step, compare the anomaly scores across all categories by calculating their mean, variance and/or standard deviation, to find which one is deviating from the rest.
3. Create another model for just the time-ordered anomaly distribution, with the mean, standard deviation and/or variance as columns, to find anomalies in the anomalies.

Am I missing anything?

Appreciate the help.

sheiser1 · June 26, 2021, 7:03pm

Right, a model per category will tell you where the anomalies are within each category.

Yes, if I understand right. In order to compare the presence of anomalies across the different categories, you can train a ‘master’ model with, say, the first 70% of the data from each category. Then feed the remaining 30% of each category through the model with learning off (so it’s not still learning from these test sets). From that you’ll have a distribution of anomaly scores for each category against the master model.

You could also make this comparison through their anomaly score distributions from the learning process. The categories whose anomaly scores settle down faster & stay lower overall are more predictable.
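One way to look for stand-outs among the per-category score distributions is to summarize each category by its mean anomaly score and flag the categories whose mean sits well above the others. This is only a sketch: the scores are illustrative placeholders rather than real model output, and both the z-score rule and the 1.5 threshold are assumptions, not something htm.core provides.

```python
from statistics import mean, pstdev

# Illustrative anomaly scores per category against the master model (placeholders).
scores = {
    "CAT1": [0.10, 0.05, 0.00, 0.05],
    "CAT2": [0.00, 0.10, 0.05, 0.05],
    "CAT3": [0.05, 0.05, 0.10, 0.00],
    "CAT4": [0.60, 0.80, 0.70, 0.90],
}

# Summarize each category's distribution by its mean anomaly score.
category_means = {category: mean(values) for category, values in scores.items()}

# Compare each category's mean against the spread of means across all categories.
overall_mean = mean(category_means.values())
overall_std = pstdev(category_means.values())

Z_THRESHOLD = 1.5  # assumed cut-off; tune it for the number of categories
standouts = [
    category
    for category, value in category_means.items()
    if overall_std > 0 and (value - overall_mean) / overall_std > Z_THRESHOLD
]

print(sorted(category_means, key=category_means.get, reverse=True))
print(standouts)
```

With over 1000 categories a percentile or median-based rule may be more robust than a z-score, since a few very anomalous categories inflate the standard deviation.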

As for the extra model on the anomaly statistics, I guess I’m not sure how this is different from the ‘master’ model concept just mentioned.

SHA2 · June 26, 2021, 8:25pm

Both points 2 & 3 in my above post belong to the same master model.
In point 2, I’m trying to detect deviations among the anomalies of all categories at a single instant of time, and in point 3, I’m trying to detect deviations among the anomalies of all categories compared with the history.
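The two points might be sketched as follows, assuming a per-category anomaly score is already available at each time step. The scores are illustrative placeholders and `summarize_step` is a hypothetical helper. For point 2, each time step is reduced to a mean and standard deviation across categories, plus the category farthest from that mean. For point 3, the sequence of (mean, std) rows forms the time-ordered history that a further model could consume.

```python
from statistics import mean, pstdev

# Illustrative anomaly scores per time step and category (placeholders, not real output).
scores_by_time = {
    "07:10": {"CAT1": 0.1, "CAT2": 0.1, "CAT3": 0.2},
    "07:15": {"CAT1": 0.2, "CAT2": 0.1, "CAT3": 0.1},
    "07:20": {"CAT1": 0.1, "CAT2": 0.2, "CAT3": 0.9},
}

def summarize_step(step_scores):
    """Point 2: summarize one time step across all categories."""
    values = list(step_scores.values())
    center = mean(values)
    # The category whose score lies farthest from the cross-category mean.
    farthest = max(step_scores, key=lambda cat: abs(step_scores[cat] - center))
    return {"mean": center, "std": pstdev(values), "farthest": farthest}

summary = {step: summarize_step(s) for step, s in scores_by_time.items()}

# Point 3: the time-ordered (mean, std) rows are the input for the second-level model.
history = [(row["mean"], row["std"]) for row in summary.values()]

for step, row in summary.items():
    print(step, row["farthest"], round(row["mean"], 3), round(row["std"], 3))
```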
