How to approach anomaly detection in htm.core with multivariate data

Hi all,
I’m new to HTM and would like to ask for your advice on anomaly detection in multivariate data.
The data looks like the following:

timestamp, category, value


03/04/21 07:10, CAT1, 5
03/04/21 07:10, CAT2, 2
03/04/21 07:10, CAT3, 3
03/04/21 07:10, CAT4, 7
03/04/21 07:10, CAT5, 1
03/04/21 07:15, CAT1, 9
03/04/21 07:15, CAT2, 3
03/04/21 07:15, CAT3, 4
03/04/21 07:15, CAT4, 2
03/04/21 07:15, CAT5, 6
03/04/21 07:20, CAT1, 2
03/04/21 07:20, CAT2, 21
03/04/21 07:20, CAT3, 44
03/04/21 07:20, CAT4, 4
03/04/21 07:20, CAT5, 12

I experimented with the hotgym example in htm.core (Python 3.7) and applied it to anomaly detection for a single category (using only the timestamp and value columns of a chosen category).
Now I’m having difficulty understanding how to apply that example to multiple categories.

I have a few days of data at 1-minute intervals for over 1000 categories, which amounts to over 5 million records in the above format. The number of categories is not fixed, as new categories may appear over time.

My use case is to find anomalies within each category, and also to detect whether any category is deviating from the rest.

Can anyone guide me (preferably with a sample or settings) on handling multivariate data in htm.core, and on how I should structure my data?

Thanks for any guidance.

Hey @SHA2, welcome!

To find anomalies within each category, I’d make a separate model for each category. Within each model I’d look for anomalous activity at the peaks of the anomaly likelihood. The anomaly likelihood uses a sliding window of recent anomaly scores, so it may not work well for categories that have much less data than others. In those cases I’d use the raw anomaly scores. One nice benefit of this approach is that you can remove the CAT column from each model’s data, since each model covers a single CAT.
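
A minimal sketch of the data restructuring this implies, using only the Python standard library and a few rows copied from the question. The name `group_by_category` is hypothetical (not part of htm.core), and the sketch assumes the rows are already in time order. Each resulting series could then be fed to its own model.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the question; in practice this would be a file.
RAW = """timestamp, category, value
03/04/21 07:10, CAT1, 5
03/04/21 07:10, CAT2, 2
03/04/21 07:15, CAT1, 9
03/04/21 07:15, CAT2, 3
03/04/21 07:20, CAT1, 2
03/04/21 07:20, CAT2, 21
"""

def group_by_category(lines):
    """Return {category: [(timestamp, value), ...]}, keeping input order."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the header row
    series = defaultdict(list)
    for row in reader:
        if not row:  # ignore blank lines
            continue
        timestamp, category, value = row
        # A category seen for the first time simply gets a new list,
        # so categories that appear later need no special handling.
        series[category].append((timestamp, float(value)))
    return dict(series)

per_category = group_by_category(io.StringIO(RAW))
for category, rows in per_category.items():
    print(category, rows)
```

Because the category is now the dictionary key, it no longer needs to be part of each model’s input.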

To find categories that deviate from the rest, you could make one master model (trained on, say, the first 70% of every category’s data), and then feed in the remaining 30% of each category’s data separately, tracking the anomaly scores for each. I’d then compare these anomaly score distributions across categories and look for any stand-outs.

Those categories which are more abnormal relative to all data should have higher anomaly scores overall. As long as the data is in sequential order you shouldn’t need the CAT column for this model either.

You may also consider dropping the timestamp. For a time feature to help, it should be periodic within the data, so if your data only covers a few days it won’t help to encode the year, month or day of week. However, the time of day alone could help, since that does repeat over the span of a few days.
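
As a rough sketch of keeping only the time of day, the helper below (a hypothetical name, `minute_of_day`) parses just the `HH:MM` part of a timestamp shaped like the ones in the question and ignores the date entirely. The resulting number could then be passed to whichever encoder the model uses.

```python
from datetime import datetime

def minute_of_day(timestamp):
    """Return minutes since midnight for a 'date HH:MM' timestamp string."""
    # Only the last whitespace-separated token (the time) is parsed, so the
    # ambiguous date part (day/month vs. month/day) is never interpreted.
    time_part = timestamp.strip().split()[-1]
    parsed = datetime.strptime(time_part, "%H:%M")
    return parsed.hour * 60 + parsed.minute

for ts in ["03/04/21 07:10", "03/04/21 07:15", "03/04/21 07:20"]:
    print(ts, minute_of_day(ts))
```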

Best of luck and hope this helps!

Thanks @sheiser1 for your advice.

How should I structure my input data to create a master model?

I think you can just have 1 column of data sorted in time order, with the raw values. It’ll contain data from all categories. To avoid data leakage you should split the data for each category, so that some data from each category is held out from training.

Then you can feed these test data sets for each category to the model, using their 1 column of raw values, and tracking the distribution of anomaly scores for each.
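
A minimal sketch of the per-category hold-out split described above, assuming each category’s values are already in time order. The function names and the 0.7 default are illustrative only, and the example values are made up.

```python
def chronological_split(values, train_fraction=0.7):
    """Split a time-ordered list into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # round() avoids off-by-one cuts caused by floating-point products.
    cut = round(len(values) * train_fraction)
    return values[:cut], values[cut:]

def split_all(per_category, train_fraction=0.7):
    """Apply the same chronological split to every category."""
    train, test = {}, {}
    for category, values in per_category.items():
        train[category], test[category] = chronological_split(values, train_fraction)
    return train, test

# Made-up values, purely to show the shape of the result.
example = {"CAT1": list(range(10)), "CAT2": list(range(100, 120))}
train, test = split_all(example)
print(len(train["CAT1"]), len(test["CAT1"]))
print(len(train["CAT2"]), len(test["CAT2"]))
```

Splitting by position rather than at random keeps the test portion strictly later in time than the training portion for every category.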

Thanks @sheiser1

Here is my understanding:

  1. Create a model for each category to find anomalies within that category, with a single column of time-ordered values as input.
  2. At every time step, compare the anomaly scores across all categories by calculating their mean, variance and/or standard deviation, to find which category is deviating from the rest.
  3. Create another model whose input is the time-ordered anomaly statistics (mean, standard deviation and/or variance as columns), to find anomalies in the anomalies.
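
One possible reading of point 2, sketched with the standard library only. The anomaly scores below are made up for illustration, and `cross_category_outliers` and its z-score threshold are hypothetical choices, not something htm.core provides.

```python
from statistics import mean, pstdev

def cross_category_outliers(scores_at_t, z_threshold=2.0):
    """Return (mean, std, outliers) for one time step's anomaly scores.

    scores_at_t maps category -> anomaly score at a single time step.
    """
    values = list(scores_at_t.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All categories agree, so nothing deviates from the rest.
        return mu, sigma, []
    outliers = [
        category
        for category, score in scores_at_t.items()
        if abs(score - mu) / sigma > z_threshold
    ]
    return mu, sigma, outliers

# Hypothetical scores for a single time step.
hypothetical_scores = {
    "CAT1": 0.05, "CAT2": 0.05, "CAT3": 0.05,
    "CAT4": 0.05, "CAT5": 0.05, "CAT6": 0.95,
}
mu, sigma, outliers = cross_category_outliers(hypothetical_scores)
print(mu, sigma, outliers)
```

The per-step mean and standard deviation returned here are the kind of columns point 3 would feed into a further model.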

Am I missing anything?

Appreciate the help.

On point 1: right, this will tell you where the anomalies are within each category.

On point 2: yes, if I understand right. In order to compare the presence of anomalies across the different categories, you can train a ‘master’ model with the first, say, 70% of all data from each category. Then feed the remaining 30% of each category through the model with learning off (so it’s not still learning from these test sets). From that you’ll have a distribution of anomaly scores for each category against the master model.

You could also make this comparison through their anomaly score distributions from the learning process. The categories whose anomaly scores settle down faster and stay lower overall are more predictable.
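
A rough sketch of comparing per-category anomaly score distributions, assuming the scores have already been collected per category. The scores below are made up, and `summarize` and `rank_by_mean` are hypothetical helper names.

```python
from statistics import mean, median

def summarize(scores_by_category):
    """Compute simple summary statistics of anomaly scores per category."""
    return {
        category: {
            "mean": mean(scores),
            "median": median(scores),
            "max": max(scores),
        }
        for category, scores in scores_by_category.items()
    }

def rank_by_mean(summary):
    """Order categories from highest to lowest mean anomaly score."""
    return sorted(summary, key=lambda c: summary[c]["mean"], reverse=True)

# Hypothetical anomaly scores for three categories.
hypothetical = {
    "CAT1": [0.10, 0.00, 0.05, 0.00],
    "CAT2": [0.60, 0.40, 0.70, 0.50],
    "CAT3": [0.20, 0.10, 0.00, 0.10],
}
summary = summarize(hypothetical)
ranking = rank_by_mean(summary)
print(ranking)
```

Categories at the top of the ranking are the candidates for being stand-outs. Other statistics, such as a high percentile, may suit skewed score distributions better than the mean.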

On point 3: I guess I’m not sure how this is different from the ‘master’ model concept just mentioned.

Both points 2 & 3 in my above post belong to the same master model.
In point 2, I’m trying to detect deviations among the anomalies of all categories at a single point in time, and in point 3, I’m trying to detect deviations among the anomalies of all categories compared with their history.
