Anomaly Model Detection


#1

Hi,
I created an anomaly detection model without swarming. While testing it on a real data stream, I noticed a few things and have some questions:

  1. The model's anomaly likelihood becomes constant once it has learned a certain pattern. It stops changing after reading about 1,000 points from my stream.

  2. I feed around 29 features to the model. Is there any way to find optimal values of n and w for all of my features?

  3. How can I improve my model? Please help.

  4. Is it fine to provide this many features as input to a NuPIC anomaly detection model?

  5. Does my anomaly detection model depend purely on the predicted field, or does it take all 29 columns into consideration when computing the anomaly likelihood?


#2

Let’s start here. To optimize, we need to understand these features better. What do they represent and how are they correlated? What are their original data types, and how are you encoding them?


#3

Hi rhyo,

My main objective is to detect server anomalies in real time, so the major parameters are the load on the server (per second, per minute, and per hour) and memory parameters such as disk read count and disk write count. The real stream gives me 29 features of server data.
While going through one video, I learned that correlation between features does not actually affect anomaly detection: the model takes all the features into consideration and learns the pattern accordingly.


#4

That's why I included all the features I got from the stream.
Most of the features are either integer or float values, so I encoded them with the adaptive scalar encoder. Some features are categorical in nature, so for those I used the SDRCategory encoder.


#5

A couple quick recommendations:

  • Try the Random Distributed Scalar Encoder (RDSE). My understanding is that it is usually favored over the Adaptive Scalar Encoder in practice.

  • Try one model for each feature, or at most five features per model. Too many features can muddy the waters and make the predictive signal harder to find. Having multiple features correlated with each other usually doesn’t help, so if you can identify which features behave similarly you can drop those that are redundant.
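
One rough way to spot redundant features is to compute pairwise correlations offline and keep only one feature from each highly correlated group. The sketch below is just an illustration in plain Python, not part of NuPIC: the column names, the sample values, and the 0.95 threshold are all made up, and Pearson correlation only catches linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column gives no usable correlation
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated
    (|r| >= threshold) with a feature that was already kept."""
    kept = []
    for name in sorted(columns):
        if all(abs(pearson(columns[name], columns[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical sample of three server metrics (values are invented).
columns = {
    "load_1min": [0.5, 0.9, 1.4, 2.0, 1.1, 0.7],
    "load_5min": [0.6, 1.0, 1.5, 2.1, 1.2, 0.8],  # tracks load_1min closely
    "disk_reads": [120, 80, 300, 95, 410, 60],
}

print(drop_redundant(columns))
```

Here `load_5min` moves in lockstep with `load_1min`, so only one of the two survives. Which one you keep from a correlated group is a judgment call; this sketch simply keeps the first in alphabetical order.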


#6

OK, I will try it out and let you know my findings.


#7

Here is one way to try to find out what input fields are correlated to one important field. It might help, but it will take some programming.

A swarm is going to try to optimize model parameters for prediction of one input field. So in order to swarm you have to pick one input field as your predictedField. If you’re just doing anomaly detection, it may be hard to figure out which field that is, but I would choose the one with the most obvious patterns (least noisy). The swarm will return model params that include encoders only for the fields it found to affect prediction accuracy. In many cases, the only field worth encoding is the predictedField, meaning processing the rest is wasted (you mentioned this above). But hopefully it will also encode other fields, which would indicate that those are the good ones to feed into NuPIC.

So my advice is to reduce the number of input fields; 29 is too many. Analyze your data a bit to reduce it to a few important input fields.
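
To make that concrete, here is a minimal sketch of what a swarm description and the follow-up check could look like. Treat it as an assumption-laden outline rather than a tested recipe: the field names, the CSV path, and the example result are invented, the helper `fields_kept_by_swarm` is hypothetical, and the dict layout follows my recollection of the NuPIC hot gym examples, so please check it against the version you have installed.

```python
# Hypothetical swarm description. Field names and the file path are
# placeholders; the layout is modeled on the NuPIC hot gym examples.
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {"fieldName": "load_1min", "fieldType": "float"},
        {"fieldName": "disk_reads", "fieldType": "int"},
    ],
    "streamDef": {
        "info": "server_metrics",
        "version": 1,
        "streams": [
            {
                "info": "server metrics",
                "source": "file://server_metrics.csv",
                "columns": ["*"],
            }
        ],
    },
    "inferenceType": "TemporalAnomaly",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": "load_1min",  # the least noisy field, as a guess
    },
    "iterationCount": -1,
    "swarmSize": "medium",
}


def fields_kept_by_swarm(model_params):
    """Return the input fields that still have an encoder.

    Assumes fields the swarm dropped show up as None in the
    'encoders' dict of the returned model params.
    """
    encoders = model_params["modelParams"]["sensorParams"]["encoders"]
    return sorted(set(
        enc["fieldname"] for enc in encoders.values() if enc is not None
    ))


# Invented stand-in for a swarm result (not real swarm output).
example_result = {
    "modelParams": {
        "sensorParams": {
            "encoders": {
                "load_1min": {"fieldname": "load_1min",
                              "type": "ScalarEncoder"},
                "disk_reads": None,
                "timestamp_timeOfDay": {"fieldname": "timestamp",
                                        "type": "DateEncoder"},
            }
        }
    }
}

print(fields_kept_by_swarm(example_result))
```

In this made-up result, `disk_reads` has no encoder, which would suggest it did not help predict `load_1min` and is a candidate to drop.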


#8

Hi @rhyolight and @sheiser1, thank you. I was able to identify a predicted field by reducing the features from 29 to 5, and RDSE really is better at producing output in real time.
I would like to know how to provide multiple predicted fields when building an anomaly detection model.
Please help me with that.


#9

I always hate pointing this out but… the code to extract multiple predictions has not been written yet. See Predicting Multiple Output Values.


#10

To pass in multiple fields you need to set up the model params file for it. I’d recommend looking at this example:

It’s basically a big nested dict structure, where ‘modelParams’ contains ‘sensorParams’, which contains ‘encoders’. Within ‘encoders’ there are sub-dicts for each field, in this case ‘kw_energy_consumption’, ‘timestamp_dayOfWeek’, ‘timestamp_timeOfDay’ and ‘timestamp_weekend’.

Each of these fields is encoded separately, then they’re all combined into one which is input to the Spatial Pooler & TM. Each different data type obviously has its own set of encoding parameters, and once you fit them accordingly the multi-encoder will automatically combine them into one model.
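
For a server-metrics stream, the same nesting might look roughly like the fragment below. This is only a sketch under assumptions: the field names and every parameter value (resolution, n, w, timeOfDay) are placeholders I made up, not tuned settings, and a real model params file needs more sections than the encoders shown here.

```python
# Illustrative fragment of a model params dict. Only the encoders
# section is shown; all field names and values are placeholders.
MODEL_PARAMS_FRAGMENT = {
    "modelParams": {
        "sensorParams": {
            "encoders": {
                # Numeric field encoded with the RDSE.
                "load_1min": {
                    "fieldname": "load_1min",
                    "name": "load_1min",
                    "type": "RandomDistributedScalarEncoder",
                    "resolution": 0.05,
                },
                # Categorical field.
                "server_role": {
                    "fieldname": "server_role",
                    "name": "server_role",
                    "type": "SDRCategoryEncoder",
                    "n": 121,
                    "w": 21,
                },
                # Time-of-day component derived from the timestamp.
                "timestamp_timeOfDay": {
                    "fieldname": "timestamp",
                    "name": "timestamp_timeOfDay",
                    "type": "DateEncoder",
                    "timeOfDay": (21, 9.5),
                },
            }
        }
    }
}


def encoder_types(model_params):
    """Map each encoder name to its encoder type."""
    encoders = model_params["modelParams"]["sensorParams"]["encoders"]
    return dict((name, enc["type"]) for name, enc in encoders.items())


for name, enc_type in sorted(encoder_types(MODEL_PARAMS_FRAGMENT).items()):
    print("%s -> %s" % (name, enc_type))
```

Each entry under ‘encoders’ is configured independently, which is why adding a field is mostly a matter of adding one more sub-dict with parameters suited to its data type.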

You can get a sense of why it’s good not to have too many input fields: the more fields there are, the more dimensions the one model has to represent. Along with your own dimensionality reduction approach, it may be worth trying swarming too.

– Oops, seems I’ve answered the wrong question :sweat_smile:


#11

Hi @rhyolight, sorry to bother you. I had already gone through all these articles; I just wanted to confirm.