Anomaly Detection algorithm takes more than 20 hours

anomaly-detection
question

#22

@rhyolight and @subutai will do! HTM Studio is already installed and I’ll get on it right now.

Thanks once again!


#23

BTW, how many data points would be suitable per client/country combination?


#24

We usually like to see a few thousand before we’re comfortable with the model’s output.


#25

So potentially 4 months of hourly data would be OK? 4 * 30 * 24 = 2880. Or would I need even more?


#26

It is honestly hard to say without seeing the data plotted out. If you cannot see any patterns with your human brain when the data is plotted, you might try changing the data bounds or aggregation settings. I always try to plot my data to ensure it looks like a decent candidate for anomaly detection. If the data is obviously just noise, there is not much HTM can do.
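
For example, a quick eyeball check might look like this (just a sketch; the file and column names are placeholders for your own export):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export: one row per hour per client/country stream.
df = pd.read_csv("client_country_hourly.csv", parse_dates=["timestamp"])

# Plot the raw series and look for daily/weekly structure by eye.
df.plot(x="timestamp", y="value", figsize=(12, 4), legend=False)
plt.title("Is there visible structure, or is it just noise?")
plt.show()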


#27

This topic encouraged me to write up some general advice in this situation:


#28

Will post a couple of screenshots from HTM Studio so you can see them…they have obvious daily and weekly patterns, so it’s not noise (well, some of them are…but mostly they are cyclic).

As for the encouragement, I am glad I made a small contribution in that way :slight_smile:


#29

So here are just four of the time series…this is just for one client, and obviously just 4 countries. You can see that they are not noisy.

Note that there are around 1500 data points per time series.


#30

Looks good! Analyze it!


#31


#32

The question now is… what do you think? Are these anomalies valuable? They will only get better over time.


#33

I think this is feasible, and it makes sense…I would definitely go ahead and try to implement it. I can aggregate the data at whatever level, since I have access to the database containing each event.

I would just like to know the most effective way to do it :slight_smile:


#34

Well, if the only way you can do it is by processing all 1400 potential streams at the same time, you are going to need to do some serious engineering work to scale up to that level.


#35

What are my options? I suspect there are options a, b, c and such…

I am not running away from the task…so I would try whatever


#36

Well, we have some open source code in https://github.com/numenta/numenta-apps for scaling, but I’m not sure whether it would apply to your situation or not. Its caching method is to serialize models to / from disk between data calls, because it expects data calls to come infrequently (no less than a 15-minute interval). At this interval, a really simple thing is to just save the models between data compute cycles. That frees up memory at the expense of disk space (quite a bit of disk space).

There is an “htm engine” in the code linked above that does this swapping type of caching. But honestly I would start from scratch and write something that makes sense for your infrastructure.
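
For illustration, here is a minimal sketch of that swap-to-disk pattern using NuPIC’s OPF ModelFactory (the checkpoint directory, stream IDs, and field name are my assumptions for the example, not anything taken from numenta-apps):

import os

from nupic.frameworks.opf.model_factory import ModelFactory

CHECKPOINT_ROOT = "/tmp/htm_checkpoints"  # hypothetical location
if not os.path.isdir(CHECKPOINT_ROOT):
  os.makedirs(CHECKPOINT_ROOT)

def get_model(stream_id, model_params):
  # Load this stream's model from disk if we have seen it before,
  # otherwise create a fresh one.
  checkpoint_dir = os.path.join(CHECKPOINT_ROOT, stream_id)
  if os.path.isdir(checkpoint_dir):
    return ModelFactory.loadFromCheckpoint(checkpoint_dir)
  model = ModelFactory.create(model_params)
  model.enableInference({"predictedField": "value"})
  return model

def process_record(stream_id, model_params, record):
  # Run one record through the model, then serialize it back to disk
  # so memory is freed between compute cycles (at the cost of disk space).
  model = get_model(stream_id, model_params)
  result = model.run(record)
  model.save(os.path.join(CHECKPOINT_ROOT, stream_id))
  return result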


#37

Will definitely try it out.

Thanks for the n-th time, sir


#38

Watch this, even though the sample app doesn’t work anymore (I don’t think it does):


#39

Hi guys, it’s me again…I aggregated my data at a 15-minute interval and plugged it into HTM Studio, but I have some questions…
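
For reference, one way to do that kind of aggregation with pandas (the file and column names here are simplified placeholders):

import pandas as pd

# Hypothetical raw export: one row per event, with a timestamp column.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Count events per 15-minute bucket for each client/country stream.
counts = (events
          .set_index("timestamp")
          .groupby(["client", "country"])
          .resample("15min")
          .size())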

How is it possible that this big hole in the middle is not seen as an anomaly?

Furthermore…

The algorithm here recognized that the sudden rise in the level of this signal is anomalous, but how come it failed to detect this very obvious drop, marked in yellow?

A couple more confusing results below:

Here the algorithm missed the huge spike and a small one

and big dips here


#40

Real gaps in data are a problem. You have to make a decision about how to deal with them. Should you jump ahead in time and break the interval? Or use 0 or a flatline value for the duration? Or just use the last seen value for the duration? I think we usually choose to use the last seen value, which basically means “no new information”.
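
A minimal sketch of that last-seen-value approach (the fixed 15-minute interval is an assumption to match your aggregation):

import datetime

def fill_gaps(records, interval=datetime.timedelta(minutes=15)):
  # records: iterable of (timestamp, value) pairs at a nominally fixed
  # interval. Repeat the last seen value across any gaps, which is the
  # "no new information" choice described above.
  last = None
  for timestamp, value in records:
    if last is not None:
      t = last[0] + interval
      while t < timestamp:
        yield t, last[1]
        t += interval
    yield timestamp, value
    last = (timestamp, value)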

I’m honestly not sure what HTM Studio does with missing data like this.

Probably because it saw it before:

In the rest of your examples, I’m not sure why it is not picking up on the obvious spikes, but you can pretty easily add thresholds to find the outliers.

And for your last example, I would only say that it has seen levels of activity similar to what it sees within those dips. I think it is honestly not surprising enough to cause an anomaly. Again, this threshold is something you can change if you build a solution with NuPIC. I think HTM Studio uses an anomaly likelihood threshold of 0.99999, but another threshold might be better with your data. Try four 9s (0.9999) instead for a lower threshold.
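
If you build your own solution, the thresholding itself is just a comparison against the anomaly likelihood. A minimal sketch using NuPIC’s AnomalyLikelihood helper (the threshold value is the four-9s suggestion above, not a recommended default):

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

THRESHOLD = 0.9999  # four 9s, lower than HTM Studio's presumed five

likelihood_helper = AnomalyLikelihood()

def is_anomalous(value, raw_anomaly_score, timestamp):
  # Convert the raw anomaly score into a likelihood and flag it when it
  # crosses the threshold.
  likelihood = likelihood_helper.anomalyProbability(
      value, raw_anomaly_score, timestamp)
  return likelihood >= THRESHOLD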


#41

Hi Matt,

thanks again for the feedback. I know it may sound a bit dumb, but where exactly in the code do I change this threshold? Sorry if it’s too trivial…I’m just learning at this point.

Is it this bit of code?

def _filterLikelihoods(likelihoods,
                       redThreshold=0.99999, yellowThreshold=0.999):
  """
  Filter the list of raw (pre-filtered) likelihoods so that we only preserve
  sharp increases in likelihood. 'likelihoods' can be a numpy array of floats or
  a list of floats.

  :returns: A new list of floats likelihoods containing the filtered values.
  """
  redThreshold    = 1.0 - redThreshold
  yellowThreshold = 1.0 - yellowThreshold
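
If this is the _filterLikelihoods helper from NuPIC’s anomaly_likelihood.py, one option is to pass a lower threshold at the call site instead of editing the default. A hedged sketch, with placeholder input values:

from nupic.algorithms.anomaly_likelihood import _filterLikelihoods

# Placeholder input: raw likelihood values as computed upstream; judging
# from the code above, values near 0.0 fall in the "red" zone.
raw_likelihoods = [0.3, 0.25, 0.000001, 0.2]

filtered = _filterLikelihoods(raw_likelihoods,
                              redThreshold=0.9999,   # four 9s instead of five
                              yellowThreshold=0.999)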