Minimum size of timeseries data for anomaly detection (noob here)


#1

Hi, community!

Today is my second day learning about HTM. I went through the HTM School series (thank you, Matt!) and played with the hot gym tutorial.
I got a little bit impatient (too excited) and decided to ask my questions here (I did search this forum, though).

  1. What is the minimum amount of streaming data needed? I have a year’s worth of data aggregated by week, and I wonder: are 52 data points enough for anomaly detection?
  2. How do I combine multiple metrics? I have this kind of structure: week1: { “qty”: 12, “kg”: 14, “…”: “…” etc }, week2: { “qty”: 14, “kg”: 13, “…”: “…” etc }. I would be happy to get any pointers.
  3. How do I tweak model parameters for this kind of setup?

Can’t wait to get my hands dirty with real-world data!!!

Thank you!


#2

That’s not much data, but it depends. If you have a monthly pattern, you could see it emerge in 52 points (4 points per pattern).

First, ask yourself what you’re trying to accomplish. What is the question you want to answer? Do you want to predict something or identify anomalies?


#3

I’m working on a postmortem analysis of churned telco subscribers, and I only have weekly data right now. I’ll let you know about my results.
What I’m trying to achieve is to find out what kind of anomaly might contribute to churn. Maybe we did something wrong, like having a massive outage of the SMS service in some region, etc.

I would be happy to monitor all of our subscribers in real time (a couple of million subscribers) and react to anomalies right away. I find this bottom-up approach more reasonable: when we look at the batch level, we overlook local problems, but if we see the same kind of anomaly among a group of subscribers, we can say that something went wrong in that segment.
Sounds scary, but hey, only traffic activity would be monitored (number of on-net and outbound minutes, data traffic in megabytes, etc.).
Is it feasible at all (monitoring 2 million subscribers)?


#4

Can you get more data in smaller increments? Our hot gym example is every 15 minutes, which is about 100 data points a day (20K a year). That is good enough to learn daily and weekly patterns reasonably well. You might find anomalies in some subscriber data at this level.
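For example, if you have raw call or data-session records, a pandas sketch like the one below could roll them up into 15-minute buckets (the file and column names are just placeholders for whatever you actually have):

```python
import pandas as pd

# Hypothetical raw usage records: one row per call or data session,
# with a timestamp and the metric of interest (names are made up).
df = pd.read_csv("usage_events.csv", parse_dates=["timestamp"])

# Aggregate to 15-minute buckets, roughly the hot gym granularity
# (~100 points per day instead of 1 point per week).
agg = (df.set_index("timestamp")
         .resample("15min")["data_mb"]
         .sum()
         .reset_index())

agg.to_csv("usage_15min.csv", index=False)
```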


#5

@jumasheff,

What you’re proposing is a data science exercise where you build a predictive model for customer churn. Typically this starts with a number of input variables, and the anomaly scores from your SMS service uptime could be one of them.

So if you look at a diagram like this one, HTM can help you in step 2 with the feature engineering part, and step 1 is where @rhyolight was suggesting you start :stuck_out_tongue:.

Once you’ve completed steps 3 and 4 (using something like a logistic regression), you may end up establishing a causal relationship between the anomaly score and customer churn. Remember that there could be other variables at play as well; for example, service outages may only cause a particular cohort to churn (e.g. those on a heavy-use plan, or those in a particular age bracket). This is why you need a model rather than trying to go directly from the anomaly score to the result.
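As a very rough illustration of steps 3 and 4, a scikit-learn logistic regression over a table of per-subscriber anomaly scores might look like this (all file, column, and feature names here are made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical training table: one row per subscriber per week, with
# per-metric HTM anomaly scores as features and a churn label.
data = pd.read_csv("churn_training.csv")
feature_cols = ["sms_anomaly", "voice_anomaly", "data_anomaly", "plan_type"]  # invented names

X = pd.get_dummies(data[feature_cols])  # one-hot encode categoricals such as plan_type
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))           # accuracy on held-out rows
print(dict(zip(X.columns, clf.coef_[0])))  # which features carry the weight
```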

You’ll need to do some feature engineering on your churn data too, as customers may not leave immediately (e.g. they might start researching other providers over a week or two).
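One simple way to handle that lag (a hedged pandas sketch; the tables and column names are hypothetical) is to label the weeks leading up to the churn event as positive, rather than only the week the subscriber actually left:

```python
import pandas as pd

# Hypothetical tables: weekly per-subscriber snapshots, plus one churn
# event row (subscriber_id, churn_week) per churned subscriber.
weekly = pd.read_csv("weekly_snapshots.csv", parse_dates=["week"])
churn = pd.read_csv("churn_events.csv", parse_dates=["churn_week"])

df = weekly.merge(churn, on="subscriber_id", how="left")

# Mark the N weeks *before* the churn event as positive, so the model
# learns the lead-up behaviour rather than the week the line went dead.
LEAD_WEEKS = 2
weeks_until_churn = (df["churn_week"] - df["week"]).dt.days / 7
df["about_to_churn"] = ((weeks_until_churn >= 0) & (weeks_until_churn <= LEAD_WEEKS)).astype(int)
```

Subscribers with no churn event get NaN for weeks_until_churn and therefore a 0 label.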

Step 5 is where your HTM network goes into production, feeding anomaly scores into your model to generate churn predictions.

Good luck!


#6

Thanks! I’ll experiment with the granularity you proposed.

Thank you for your hints! Actually, I was playing with scikit-learn (inspired by this notebook), and the Gradient Boosting Classifier has been the best performer for me so far. I’m in the steps 2, 3, 4 loop. But what really worries me about that approach is that the data we feed to the algorithms is just a snapshot and doesn’t reflect dynamics. A subscriber who is about to leave doesn’t disappear all of a sudden; their traffic will show some anomalies (gradual declines). This may be the wrong place to ask, but how do I reflect dynamics (day-to-day or week-to-week changes) in a training dataset? Could you give me a couple of hints on this?

Anyway, we’ll definitely use NuPIC for anomaly detection (for a subscriber-base health check, for example).


#7

I see, thanks for providing the context.

What I’m trying to achieve is to find out what kind of anomaly might contribute to churn. Maybe we did something wrong, like having a massive outage of the SMS service in some region, etc.

It sounds from this like you have some system metrics (outbound minutes, data traffic) but not direct outage metrics (which seems unusual for a service provider!). If so, HTM could potentially be used to do feature engineering on your system metrics, to help build a dataset that represents potential outages.
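As a rough sketch of what that could look like, following the hot gym anomaly tutorial (the metric name and the MODEL_PARAMS module are placeholders, the params would need an anomaly-detection inference type such as TemporalAnomaly, and the import path can differ between NuPIC versions):

```python
import csv
from datetime import datetime

from nupic.frameworks.opf.model_factory import ModelFactory  # path varies across NuPIC versions

from model_params import MODEL_PARAMS  # hypothetical params module, as in the hot gym tutorial

model = ModelFactory.create(MODEL_PARAMS)
model.enableInference({"predictedField": "sms_delivered"})  # made-up metric name

with open("sms_delivered_15min.csv") as f:
    for row in csv.DictReader(f):
        result = model.run({
            "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S"),
            "sms_delivered": float(row["sms_delivered"]),
        })
        # With a TemporalAnomaly model, each record gets an anomaly score in [0, 1].
        print(row["timestamp"], result.inferences["anomalyScore"])
```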

A subscriber who is about to leave doesn’t disappear all of a sudden; their traffic will show some anomalies (gradual declines).

This sounds like you’re looking at unusual usage patterns rather than outages. Again, you could run HTM on each individual metric that relates to a customer’s use of your service and generate an anomaly score to feed into your churn model.

But what really worries me about that approach is that the data we feed to the algorithms is just a snapshot and doesn’t reflect dynamics.

There will naturally be some level of aggregation over a batch window, as second-by-second or minute-by-minute anomaly activity would be highly unlikely to have meaning in this churn model. But regardless of granularity, each new anomaly score still incorporates the historical data, since the HTM network has been built on it.
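To picture that aggregation, here is a hedged pandas sketch (column names invented) that rolls per-record anomaly scores up to the weekly granularity of the churn model, keeping both the average level and the worst spike inside each window:

```python
import pandas as pd

# Hypothetical per-record anomaly scores produced by the HTM models
# (columns: subscriber_id, timestamp, anomaly_score).
scores = pd.read_csv("anomaly_scores.csv", parse_dates=["timestamp"])

# Aggregate to one row per subscriber per week: the mean anomaly level
# plus the single worst spike inside the window.
weekly = (scores
          .groupby(["subscriber_id", pd.Grouper(key="timestamp", freq="W")])["anomaly_score"]
          .agg(["mean", "max"])
          .reset_index())
```

Those weekly mean/max columns are then the kind of features the churn model would consume.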