Strange anomaly likelihood plot

I am using code from here:

I made my changes to run anomaly detection on my dataset:

  1. The plot looks strange.
    The values in the output csv seem more legitimate, though.
    The output csv even contains values as high as 0.9, but that is clearly not visible in the graph.

  2. I generated modelParams using getScalarMetricWithTimeOfDayAnomalyParams()
    But I am not sure how to select “n” and “w”?

  3. To generate the model parameters I need to pass the min and max values of the feature I want to track in the csv, and I have to write those same min and max values into modelParams again. But what if future data coming from the data-stream has values higher or lower than that range? Will the anomaly detection continue to handle them without breaking?



Solved part (1). There were a lot of duplicates in my training data. Once I fixed that, everything looks fine. I used pandas.DataFrame.drop_duplicates to address this.
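For anyone hitting the same issue, a minimal sketch of the dedup step (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical timestamped dataset with an exact duplicate row.
df = pd.DataFrame({
    "timestamp": ["2015-01-01 00:00", "2015-01-01 00:00", "2015-01-01 01:00"],
    "value": [10.0, 10.0, 12.5],
})

# Drop exact duplicate rows before feeding the data to the model.
deduped = df.drop_duplicates()
print(len(deduped))  # 2 rows remain
```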

Still waiting for answers to parts (2) and (3).

I’ll explain these things in a short video, using a visualization I used for the Scalar Encoder episode of HTM School.

In addition to increasing the n value, you might also try lowering the w value (but keep it an odd number).
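To build intuition for how n and w interact, here is a toy scalar encoder sketch. This is not the NuPIC ScalarEncoder itself; the formula and names are illustrative, but it shows why a larger n gives finer resolution while w controls how much nearby values overlap:

```python
def encode_scalar(value, minval, maxval, n=100, w=21):
    """Toy scalar encoder: w contiguous active bits out of n total bits.
    Larger n -> more distinct encodings (finer resolution);
    w -> width of the active block, hence overlap between nearby values
    (keep it odd, as suggested above)."""
    buckets = n - w + 1                      # number of distinct encodings
    frac = (value - minval) / (maxval - minval)
    i = int(round(frac * (buckets - 1)))     # start index of the active block
    bits = [0] * n
    for j in range(i, i + w):
        bits[j] = 1
    return bits

a = encode_scalar(10.0, 0, 100)
b = encode_scalar(12.0, 0, 100)
# Nearby values share most of their active bits.
overlap = sum(x & y for x, y in zip(a, b))
print(overlap)  # 20 of the 21 active bits overlap
```

Increasing n moves the start indices of nearby values further apart (less overlap per unit of input distance); lowering w shrinks the shared block directly.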


How could I decide the best values of n and w automatically to encode a completely new scalar data stream?

Does the new scalar data stream have a new min/max? You could use the min/max to identify a resolution for the RandomDistributedScalarEncoder, which is also explained in HTM School. Here is an example of how we use min/max to get a resolution:
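A minimal sketch of that min/max-to-resolution computation (the numBuckets default of 130 and the 0.001 floor are assumptions based on common NuPIC anomaly-params defaults; check your own params):

```python
def compute_resolution(minval, maxval, num_buckets=130):
    """Derive an RDSE resolution from the data range.
    Each bucket covers (max - min) / num_buckets of the input range,
    with a small floor so resolution is never zero."""
    return max(0.001, (maxval - minval) / num_buckets)

print(compute_resolution(0.0, 130.0))  # 1.0
```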

There is no [“value”] field in modelParams generated using getScalarMetricWithTimeOfDayAnomalyParams().

No, the value is the name of the encoder in this case. It is referring to:

Oh, OK. How do I compute n and w from this “resolution”? Also, how do we determine the value of numBuckets given above?

The RDSE docs say this:

The only required parameter is resolution, which determines the resolution of input values.

numBuckets is used to compute the resolution in the example above because it is a little easier to reason about. @scott might have more to say here.

Now I don’t need to worry about n and w. The only parameter is resolution, which greatly affects the number of anomalies detected on my dataset. Could you suggest reliable ways to determine “resolution”?

Any suggestions on this?

The resolution depends on your data. How many contiguous input values do you want each bucket to hold? That is your resolution.

If I set resolution as the minimum distance between any 2 data samples, will that be fine?
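One way to probe that candidate resolution, sketched here on made-up samples. Note that a very small resolution means almost every value lands in its own bucket, giving little overlap between nearby encodings:

```python
# Hypothetical data stream samples.
samples = [10.0, 10.5, 12.0, 12.1, 15.0]

# Smallest gap between distinct sorted samples = candidate resolution.
ordered = sorted(samples)
gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
min_gap = min(gaps)
print(round(min_gap, 2))  # 0.1
```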

You probably want at least some overlap in the encodings for nearby values. Recommended reading: Encoding Data for HTM Systems.
