Anomaly Detection algorithm takes more than 20 hours

anomaly-detection
question

#1

Hi guys,

I am new to the NuPIC world and would much appreciate your help. I have a data set of just under 1M rows with 4 columns:
senddatehour, channelid, countryid, volume

I am running an anomaly detection algorithm just like the one from the hot gym example. However, the algorithm has been running for the last 20 hours and is still printing anomaly scores to the console.

My question is: is there a way to speed up the process? Maybe by using multiple cores or something like that?

Thanks in advance,
Roko


#2

Wow… wait… 1M rows is a lot. Typically we (at least I) deal with something less than 2,000 rows. Maybe try to use fewer rows?

You can also try nupic.core instead of NuPIC. That should give you some performance boost.


#3

How many rows has it processed after 20 hours?

What are you planning on doing with the results? Printing to the console takes compute time. Are you sure that is not slowing down the process? Try caching the results in memory and dumping them to a file whenever the cache gets too big. I just want to make sure that NuPIC is the bottleneck in your compute loop.
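Something along these lines would do it. This is only a sketch: it assumes a hot-gym-style loop, so `reader`, `model`, the field names, the flush interval, and the output file name are all placeholders you would adapt to your own script.

```python
import csv

FLUSH_EVERY = 50000   # dump to disk every N results so memory stays bounded
results = []          # accumulated (timestamp, volume, anomalyScore) rows

def flush(rows, path="anomaly_scores.csv"):
    """Append the buffered rows to a CSV file and clear the buffer."""
    with open(path, "a") as outfile:
        csv.writer(outfile).writerows(rows)
    del rows[:]

for row in reader:    # `reader` is your existing csv.DictReader over the input
    result = model.run({
        "senddatehour": row["senddatehour"],
        "volume": float(row["volume"]),
    })
    score = result.inferences["anomalyScore"]
    results.append((row["senddatehour"], row["volume"], score))
    if len(results) >= FLUSH_EVERY:
        flush(results)

flush(results)        # write whatever is left at the end
```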


#4

Hi Matt,

thanks for your reply. In the meantime, the algorithm finished.

I am aware that printing to the console takes compute time, but I was kind of hoping there might be other workarounds that could speed up the process. Caching seems like a good idea… I will definitely try it on the next run.

And to answer your question, the goal of this analysis is to put together a proof of concept (POC) for my team leader. We plan to use NuPIC for anomaly detection in production, but I fear he won't be too happy with this execution time.

Kind regards,
Roko


#5

I have 2 months of data aggregated at an hourly level, so around 1,400 timestamps per combination of client and country. Currently I monitor 4 clients that send messages to around 200 countries across the globe.

Do you think maybe fewer timestamps would be OK?

Reducing the number of clients and countries is not an option, unfortunately.

Thanks for your reply, much appreciated.


#6

Typically, on a standard MacBook Pro, a generic scalar anomaly detection model computes at about 20 ms per cycle. If you are getting drastically different compute times, you should investigate peripheral processes before trying to tune your NuPIC model.
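A crude way to check where the time is going is to time the NuPIC call and the surrounding I/O separately. Again just a sketch: `reader` and `model` are assumed to be whatever you already have in your hot-gym-style loop.

```python
import time

nupic_seconds = 0.0
other_seconds = 0.0

for row in reader:
    start = time.time()
    result = model.run({                 # the actual NuPIC call
        "senddatehour": row["senddatehour"],
        "volume": float(row["volume"]),
    })
    nupic_seconds += time.time() - start

    start = time.time()
    print(result.inferences["anomalyScore"])   # the peripheral work (console I/O)
    other_seconds += time.time() - start

print("NuPIC: %.1f s, everything else: %.1f s" % (nupic_seconds, other_seconds))
```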

No, I would stick to hourly.


#8

Thanks for your replies. I will definitely pay attention to your suggestions and try to bring the execution time down as low as possible.

Expect to hear from me soon, because I will most likely dive deep into this subject. As a psychologist, I feel this story has solid grounds, and I am eager to explore it further.

Have a nice day folks!


#9

One more thing, please… Is it possible to somehow store the results of this model so that I don't have to run the algorithm from scratch again when I get new data?

How can I make this real-time without restarting the process all over again? In other words, how can I evaluate only the current hour (once we get it to production), so that anomalies are flagged as soon as possible?

Thanks once again


#10

You can serialize the algorithm instances and reload them later.
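For example, with the OPF model factory it can look roughly like this. The checkpoint directory is just a placeholder, and the import path may differ slightly between NuPIC versions.

```python
# On older NuPIC versions the module is nupic.frameworks.opf.modelfactory.
from nupic.frameworks.opf.model_factory import ModelFactory

CHECKPOINT_DIR = "/tmp/volume_anomaly_model"   # must be an absolute path

# After the initial run over the historical data, save the model state.
model.save(CHECKPOINT_DIR)

# Later, when a new hour of data arrives, reload the model and feed it
# only the new record instead of replaying the whole history.
model = ModelFactory.loadFromCheckpoint(CHECKPOINT_DIR)
result = model.run({
    "senddatehour": new_row["senddatehour"],
    "volume": float(new_row["volume"]),
})
print(result.inferences["anomalyScore"])
```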


#11

Excellent, much appreciated!


#12

Depending on the nature of the anomalies, it may be worth grouping some of these client/country combinations into separate streams, fed into separate NuPIC models. Something that is anomalous for one client may not be anomalous for another (for instance).

With a lot of data coming from different sources in one big stream, I imagine it would take a lot longer for regular patterns to emerge, for the anomaly scores to settle down, and thus for the detected anomalies to become meaningful.
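As a rough illustration, routing records into one model per stream could look like the sketch below, where `create_model()` stands in for however you build a single hot-gym-style scalar anomaly model and `reader` is your existing input loop.

```python
models = {}   # one NuPIC model per (channelid, countryid) stream

def get_model(channel_id, country_id):
    """Create a model for this stream the first time we see it, then reuse it."""
    key = (channel_id, country_id)
    if key not in models:
        models[key] = create_model()   # assumed helper, hot-gym-style setup
    return models[key]

for row in reader:
    stream_model = get_model(row["channelid"], row["countryid"])
    result = stream_model.run({
        "senddatehour": row["senddatehour"],
        "volume": float(row["volume"]),
    })
    # Each stream now accumulates its own anomaly score history.
```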


#13

True, it would be really helpful to see a few sample rows of input data @Roko_Gudic.


#14

That makes sense. I am currently looking at the results from the model, and I fear that the S flag in the third row of the header didn't do what I expected it to do. I set the channel and country columns as sequences.

More specifically, there seem to be anomalies that are undetected this way.


#15

senddatehour      channelid  countryid  volume
14.5.2018 21:00   42344      100        2380.0
14.5.2018 22:00   42344      100        1372.0
14.5.2018 23:00   42344      100        761.0
15.5.2018 0:00    42344      100        410.0
15.5.2018 1:00    42344      100        229.0
15.5.2018 2:00    42344      100        204.0
15.5.2018 3:00    42344      100        285.0

senddatehour      channelid  countryid  volume
1.6.2018 1:00     85834      107        7.0
1.6.2018 2:00     85834      107        7.0
1.6.2018 3:00     85834      107        18.0
1.6.2018 4:00     85834      107        14.0
1.6.2018 5:00     85834      107        26.0
1.6.2018 6:00     85834      107        28.0
1.6.2018 7:00     85834      107        19.0
1.6.2018 8:00     85834      107        30.0
1.6.2018 9:00     85834      107        25.0
1.6.2018 10:00    85834      107        25.0
1.6.2018 11:00    85834      107        23.0
1.6.2018 12:00    85834      107        27.0
1.6.2018 13:00    85834      107        27.0
1.6.2018 14:00    85834      107        35.0
1.6.2018 15:00    85834      107        30.0
1.6.2018 16:00    85834      107        47.0
1.6.2018 17:00    85834      107        49.0
1.6.2018 18:00    85834      107        38.0
1.6.2018 19:00    85834      107        29.0
1.6.2018 20:00    85834      107        24.0

#16

With many large datasets, it is important to aggregate the data before sending it through NuPIC. This is what we did with the Taxi dataset. It will likely give much better results and will also run much faster. You could try importing 50K rows into HTM Studio; it may suggest some aggregation parameters that you can use.
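For example, with pandas you could experiment with different aggregation intervals per stream before feeding the data to NuPIC. The file names and the 4-hour interval below are only placeholders; HTM Studio may suggest something different.

```python
import pandas as pd

df = pd.read_csv("volumes.csv", parse_dates=["senddatehour"], dayfirst=True)

# Sum the volume into fixed-width time buckets per channel/country stream.
aggregated = (
    df.set_index("senddatehour")
      .groupby(["channelid", "countryid"])["volume"]
      .resample("4H")
      .sum()
      .reset_index()
)
aggregated.to_csv("volumes_aggregated.csv", index=False)
```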

The hot gym examples don't have the best anomaly detection scheme or parameters. I recommend using the NAB example source for doing anomaly detection instead.


#17

Yes, I agree. Looking at the data, it seems it should be separated out by country and/or client.


#18

Seeing this data helps a lot.

Yes, I can see creating 1,400 models, each with two months of hourly data, but that is only about 1,500 data points per model. I am not sure that is enough to get good results.

Can you aggregate the data any finer? Maybe at 15-minute intervals? If you can get finer data, we can see finer-resolution patterns, which could find hourly changes in the data.

I would do this: choose one country and client out of your 1,400 possible models and get that data stream into a CSV, either hourly or at 15-minute intervals. Install HTM Studio and analyze it there. If you see useful anomalies in HTM Studio, you have validated that HTM will help you, and you can decide whether it is worth building something that can handle 1,400 live running models doing anomaly detection.
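A quick way to pull one such stream into its own CSV for HTM Studio is sketched below; the file names and IDs are placeholders taken from the sample rows above.

```python
import pandas as pd

df = pd.read_csv("volumes.csv", parse_dates=["senddatehour"], dayfirst=True)

# Pick a single client/country combination and write it out as its own stream.
one_stream = df[(df["channelid"] == 42344) & (df["countryid"] == 100)]
one_stream = one_stream.sort_values("senddatehour")

# HTM Studio only needs a timestamp column and a scalar value column.
one_stream[["senddatehour", "volume"]].to_csv(
    "channel42344_country100.csv", index=False)
```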


#19

@subutai, when you say aggregating… do you mean aggregating over, say, two hours, four hours, or by some other means?

I will definitely look at the example you posted and make use of the other suggestions already provided.

Much much appreciated help :slight_smile:


#20

HTM Studio will help you find the right aggregation interval for whatever data you give it. I do suggest trying something around 15-minute intervals.


#21

Sorry, I missed the post where you said the data is already aggregated. @rhyolight's recent detailed post is the best advice. Once HTM Studio is working well for you, use the NAB source code as your example.