Swarm - How many data points? When to rerun?

Hey everybody,

I already searched for hours in this forum, but couldn’t find anything about my question :neutral_face: :roll_eyes: (but I read tons of interesting posts :heart_eyes:).

So my first question is: how do I decide how many data points to use when running a swarm? I know that the swarm is only there to find the parameters for the model.
In one post I read that I should take around 3000.
The poster asked why exactly that number, but the question remained unanswered… --> see last post at:

Maybe someone can help me with this :blush:
And also: does it matter whether I take the last points, the first ones, or some random ones in between?

My second question is the following one:
If I have a temporary anomaly (so the statistics, and maybe also the min and max values of the data, change), do I have to run a new swarm?

Thanks a lot in advance :blush:

Before spending a lot of time with your questions, I want to make sure you’ve read the Swarming Guide in our API docs. This section in particular.

Thanks I’ll do that :blush:


I’d say if you’re doing anomaly detection forget about swarming! Swarming searches for a set of model hyper-parameters that optimizes the model’s prediction accuracy on one chosen ‘predicted’ field – not on anomaly detection performance.

I’d highly recommend looking into this function:


imported from: nupic.frameworks.opf.common_models.cluster_params

This function will return a set of hyper-parameters like swarming does, though for a uni-variate model. For multivariate anomaly detection my approach is to track multiple models (one for each variable) and declare system anomalies when many of the independent models show anomalies simultaneously.
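The multivariate approach described above can be sketched in plain Python. This is a hypothetical illustration of the voting idea (run one model per variable, flag a system anomaly when many fire at once); the threshold values are my own assumptions, not NuPIC defaults:

```python
def system_anomaly(per_model_scores, score_threshold=0.9, vote_fraction=0.5):
    """Return True if at least `vote_fraction` of the per-variable
    anomaly scores exceed `score_threshold`.

    `per_model_scores` holds one anomaly score (0.0-1.0) per
    independent uni-variate model at the current time step.
    """
    votes = sum(1 for s in per_model_scores if s >= score_threshold)
    return votes >= vote_fraction * len(per_model_scores)

# Example: 3 of 4 independent models report high anomaly scores,
# so the system-level anomaly fires; with only 1 of 4 it does not.
print(system_anomaly([0.95, 0.92, 0.97, 0.10]))  # True
print(system_anomaly([0.95, 0.10, 0.20, 0.10]))  # False
```

In practice you'd likely smooth each model's raw anomaly scores (e.g. with anomaly likelihood) before voting, but the aggregation step itself is this simple.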

Hello :blush:,

Sure… when I am doing anomaly detection, that’s another thing.
But say the statistics and the values change and I want to do PREDICTION…
Do I have to rerun the swarm or not? :roll_eyes:

I read the two papers twice… but I can’t find an answer to my question.

I would really appreciate an answer to that.
Thanks a lot in advance :blush:

The swarming process generates many candidate models, each with a different combination of hyperparameter values, and evaluates those models over the course of N time steps. So the larger N is, the more thorough an evaluation each candidate model gets. Given this, it’d be theoretically ideal to make N as large as possible so each candidate model has maximal time to prove its worth.

The major problem with large N is that it can make the whole process take a REALLY long time, since each candidate model needs to be initialized, hit with N data points, and evaluated over N MAPE (mean absolute percentage error) values, I believe. This can be mitigated though by setting the swarm size to ‘small’ or ‘medium’ at most, which caps the number of candidate models by limiting the set of different hyperparameter combos to try.
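For reference, the MAPE metric mentioned above can be sketched as follows. This is a minimal illustration of the general formula, not NuPIC's exact implementation:

```python
def mape(actual, predicted):
    """Mean absolute percentage error over paired sequences.

    Assumes no actual value is zero (each error term divides by it).
    Lower is better; swarming picks the candidate with the lowest error.
    """
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Per-point errors are 10%, 10%, and 0%, so the mean is about 6.67%.
print(mape([100, 200, 400], [110, 180, 400]))
```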

The other major caveat in my mind about swarming is that it evaluates each model by one specific criterion: how well did this model forecast the value of metric X1? Swarming finds model configs with the sole goal of minimizing forecasting MAPE for X1, so there are no guarantees for X2 or any other metric, nor for anomaly detection performance, since that’s a different objective.

If the statistics of the data have changed so much that the model hyperparameters are no longer valid, yes that’d theoretically call for re-running the swarm. But this should hopefully be quite unlikely barring some kind of tectonic shift in the data, where instead of X1 ranging mostly from 0 to 1 it is now 100 to 1000 or something.

Having a larger N in your swarm should make this scenario less likely, since the chosen candidate model is vetted on more data, though if your N is 3000 and the tectonic shift happened only after those 3000 points, I suppose the issue would persist.
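One rough way to spot the kind of "tectonic shift" described above is to compare the value range of a recent window against the range the model was swarmed on. This is my own heuristic sketch, not anything built into NuPIC, and the `factor` threshold is an arbitrary assumption:

```python
def range_shifted(train_values, recent_values, factor=10.0):
    """Flag a drastic shift if the recent window's span is more than
    `factor` times larger (or smaller) than the training window's span.
    """
    train_span = max(train_values) - min(train_values)
    recent_span = max(recent_values) - min(recent_values)
    if train_span == 0 or recent_span == 0:
        # One window is constant; shifted only if the other is not.
        return train_span != recent_span
    ratio = recent_span / train_span
    return ratio > factor or ratio < 1.0 / factor

# Data that used to range 0-1 now ranges 100-1000: flag a shift.
print(range_shifted([0.0, 0.5, 1.0], [100.0, 500.0, 1000.0]))  # True
# Same rough range as training: no shift.
print(range_shifted([0.0, 0.5, 1.0], [0.1, 0.6, 0.9]))         # False
```

A check like this could serve as the "criteria to trigger a new swarm" mentioned below, though as noted, the continuous learning nature of HTM should make it rarely necessary.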

One other thing that running the swarm periodically would do is limit the ability for continuous learning, since each new swarm run would yield a newborn model that must train from scratch. You’d also need to introduce some criteria to trigger a new swarm, having declared the current model obsolete. Barring that tectonic-shift scenario, the continuous learning nature of HTM should make this unnecessary!

A very simple (and very possibly wrong) answer is 1000 data points. But I say “very possibly wrong” because it depends on the data. More specifically, it depends on the temporal patterns, the time scales at which they express themselves, and the actual period of data availability. The questions “what time interval is between data points?” and “are the intervals always the same?” are important to answer now. You may need to take some data samples and plot them at different intervals to find these answers.
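Those two questions can be answered directly from your timestamps before any plotting. A small sketch (the timestamps here are illustrative):

```python
from datetime import datetime

def interval_summary(timestamps):
    """Return (most_common_interval_in_seconds, is_regular) for a
    list of datetime objects. `is_regular` is True only if every
    consecutive gap equals the most common one.
    """
    ts = sorted(timestamps)
    deltas = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    common = max(set(deltas), key=deltas.count)
    return common, all(d == common for d in deltas)

# Four points taken every 5 minutes: a regular 300-second interval.
stamps = [datetime(2019, 1, 1, 0, m) for m in (0, 5, 10, 15)]
print(interval_summary(stamps))  # (300.0, True)
```

If `is_regular` comes back False, you'd want to know why (gaps, bursts, mixed sources) before deciding how many points make a representative swarm set.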

The anomaly models that @sheiser1 suggested will not work well for prediction. You will need to run a swarm to get a model that is optimized to perform prediction on a specific field of your data. If you want both anomaly detection and prediction, you might find that the canned anomaly model will perform better anomaly detection than the model tuned for field prediction.



Thank you very much for your replies. You really helped me a lot @rhyolight @sheiser1 :heart_eyes::star_struck:
I think I get it now. :partying_face::partying_face::partying_face:


Another tip: if a parameter returned by the swarm doesn’t make sense to you, bring it up. Swarms will never find the best parameters; they are just a tool to help you manually tune the network for your data. It is always best to look at the parameters and understand what they mean if you can.

ok cool, nice hint :blush: