Why am I seeing a lot of false positives?

I am trying anomaly detection (just like hotgym) with my sample data, and I seem to be seeing a lot of false positives. I attached a slice of my data below.

During swarming I used TemporalMultiStep as the inferenceType, though TemporalAnomaly was the one I was after. I don’t know if it matters, just pointing it out. During swarming, the model input minval/maxval was set to [0.0, 3.0]. I realize this range is very large, though there may be just one spurious data point in the tail end of this range. The data points themselves typically move in increments of 0.01.

Are there any tweaks I should make? I feel like I am missing something, because the delta between the actual and predicted values is small, yet the anomaly likelihood is high.

If you want a TemporalAnomaly model, you probably don’t need to swarm. You can use these params:

Of course, you will need to update the encoder parameters to match your input data, but those are generally good values for the DateEncoder and RandomDistributedScalarEncoder. They are the ones we use for all our anomaly detection products, and they work well on most scalar data.
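For reference, here is a minimal sketch of what the encoder section of such model params tends to look like. The field names (`c0`, `c1`) and the numeric values below are illustrative assumptions, not a copy of the linked file; adapt them to your own columns and data range:

```python
# Illustrative sketch of the OPF "encoders" section for anomaly detection.
# Field names ("c0" = timestamp, "c1" = value) and all numeric values are
# assumptions for illustration; adjust them to your own data.
encoders = {
    "c0_timeOfDay": {
        "fieldname": "c0",
        "name": "c0",
        "type": "DateEncoder",
        # (width, radius) pair commonly used for the timeOfDay component
        "timeOfDay": (21, 9.49),
    },
    "c1": {
        "fieldname": "c1",
        "name": "c1",
        "type": "RandomDistributedScalarEncoder",
        "seed": 42,
        # resolution must be derived from your data's min/max and a desired
        # bucket count -- the RDSE does not accept numBuckets directly
        "resolution": 0.001,
    },
}
```

The key point is that the scalar field uses a `resolution`, not a `numBuckets`, which is exactly the stumbling block that comes up later in this thread.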

In addition to Matt’s suggestion (which is highly recommended) I would point out the following:

It usually takes several hundred data points before the model is predicting well and you have enough points to compute anomaly likelihood accurately. From your graph it looks like you only have a few data points?

It is better if min/maxval are closer to your actual range.

Anomaly likelihood returns the probability that the current sample does not belong to the model so far. We don’t consider a sample anomalous unless the probability is really close to 1, usually >= 0.99999. When plotting, this is really hard to see. It is more convenient to plot the log likelihood using computeLogLikelihood(), in which case values >= 0.5 would be anomalous.
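For intuition, the log-likelihood transform is essentially a rescaled log of the tail probability. A self-contained sketch (mirroring, to the best of my knowledge, the normalization NuPIC’s anomaly_likelihood module uses; treat the exact constants as an approximation):

```python
import math

def compute_log_likelihood(likelihood):
    """Map an anomaly likelihood in [0, 1] onto a log scale where
    values >= 0.5 correspond to likelihood >= ~0.99999.

    Sketch of the transform in nupic's anomaly_likelihood module;
    the tiny epsilon avoids log(0) when likelihood == 1.0.
    """
    return math.log(1.0000000001 - likelihood) / -23.02585084720009

# A likelihood of 0.99999 lands right around the 0.5 "anomalous" threshold
print(compute_log_likelihood(0.99999))   # ~0.5
# An unremarkable likelihood stays near 0
print(compute_log_likelihood(0.5))       # ~0.03
```

This is why the log scale is easier to eyeball: the difference between 0.999 and 0.99999 is invisible on a linear plot but clearly separated on the log scale.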

The best example code for doing anomaly detection is in NAB:

Thanks @rhyolight and @subutai. A few clarifications:

  • The chart was a slice after a few hundred data points (somewhere after ~1K points or so)
  • The min/max I mentioned is indeed the real range. What I meant was that one of the data points was a volatile event.

I gave Matt’s pre-baked parameters a try. However, I ran into an issue during my anomaly detection run.


```
Creating model from intc...
Importing model params from model_params.intc_model_params

Error in constructing RandomDistributedScalarEncoder encoder. Possibly missing some required constructor parameters. Parameters that were provided are: {u'seed': 42, u'name': 'range', u'numBuckets': 130.0}

Traceback (most recent call last):
  File "run.py", line 153, in <module>
    runModel(GYM_NAME, plot=plot)
  File "run.py", line 141, in runModel
    model = createModel(getModelParamsFromName(gymName))
  File "run.py", line 58, in createModel
    model = ModelFactory.create(modelParams)
  File "/usr/local/lib/python2.7/dist-packages/nupic/frameworks/opf/modelfactory.py", line 80, in create
    return modelClass(**modelConfig['modelParams'])
  File "/usr/local/lib/python2.7/dist-packages/nupic/frameworks/opf/clamodel.py", line 213, in __init__
    clParams, anomalyParams)
  File "/usr/local/lib/python2.7/dist-packages/nupic/frameworks/opf/clamodel.py", line 1108, in __createCLANetwork
    encoder = MultiEncoder(enabledEncoders)
  File "/usr/local/lib/python2.7/dist-packages/nupic/encoders/multi.py", line 75, in __init__
    self.addMultipleEncoders(encoderDescriptions)
  File "/usr/local/lib/python2.7/dist-packages/nupic/encoders/multi.py", line 162, in addMultipleEncoders
    self.addEncoder(fieldName, eval(encoderName)(**fieldParams))
TypeError: __init__() got an unexpected keyword argument 'numBuckets'
```

I used the hotgym sample and modified it. Maybe there is some version mismatch, and hence this field isn’t recognized?

I am also going to try the NAB benchmark. It may be the easiest to just try my data point and run it. Will report back on what I find. Thanks!

Oh yes, there is one more cryptic step that needs to be done. The RDSE doesn’t actually take a numBuckets parameter; it takes a resolution. This is calculated on the fly based on the min/max of the input data and the numBuckets. You can see this here:

You will have to do something similar when creating the model. For details on RDSE parameters, see HTM School Episode 5: Scalar Encoding.
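The conversion itself is small. Here is a sketch of the numBuckets-to-resolution step (the 0.001 floor matches the value used elsewhere in this thread; the `"value"` field name is just an illustrative assumption):

```python
def rdse_resolution(min_val, max_val, num_buckets, min_resolution=0.001):
    """Derive an RDSE resolution from the data range and a desired bucket
    count, with a floor so the resolution never collapses toward zero."""
    return max(min_resolution, (max_val - min_val) / float(num_buckets))

# The swarm range from the original post: [0.0, 3.0] with 130 buckets
print(rdse_resolution(0.0, 3.0, 130))  # ~0.023

# The encoder params then carry "resolution" instead of "numBuckets":
value_encoder_params = {"seed": 42, "name": "value", "numBuckets": 130.0}
num_buckets = value_encoder_params.pop("numBuckets")
value_encoder_params["resolution"] = rdse_resolution(0.0, 3.0, num_buckets)
```

In other words, pop `numBuckets` out of the params dict and replace it with the computed `resolution` before handing the dict to the model factory.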

I tried the run_anomaly.py code as-is from the nupic.workshop link above. It failed with a type mismatch, because the getMinMax code returns strings. I modified it as below and got it working with the nyc_taxi data.

```python
def getMinMax(dataFrame):
  return float(dataFrame.min().values[1]), float(dataFrame.max().values[1])
```

I plugged in my own data with 1000 data points, and it showed all 0’s as predicted values. My values are small, in the [0.05, 0.09] range, but that shouldn’t matter, right? Are there any more tweaks I need to do? I tried changing numBuckets, but that didn’t help. I could try running more data, but I doubt that’s the issue here. The data is at

https://s3.amazonaws.com/datadump_nupic/nupic_data.csv

Strangely, I couldn’t attach the CSV in a response, as it would only accept images.

@rhyolight, is there any reason not to use the function getScalarMetricWithTimeOfDayAnomalyParams from https://github.com/numenta/nupic/blob/master/src/nupic/frameworks/opf/common_models/cluster_params.py? This function also sets numBuckets and resolution. At least, it may be used as an example.

That’s a great idea, I forgot about that functionality. @voiceclonr please try @vkruglikov’s suggestion above.

@rhyolight - I assume you are recommending that in the context of integrating with the nupic.workshop/part-1-scalar-input code. I integrated it, but ran into issues because there seems to be a conflict between the column names this code expects ([“timestamp”, “value”]) and what some other code downstream expects ([“c0”, “c1”]). I will need to dig more. But FWIW, if I simply scale up all my data points (make them 10x), the run_anomaly.py code seems to at least give some non-zero anomaly likelihood values.

Seems like that’s because you’re getting the wrong resolution for the RDSE. This is kind of a hodgepodge of examples mixed together, isn’t it :stuck_out_tongue:?
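To make the resolution issue concrete: with the 0.001 resolution floor, a [0.05, 0.09] range only spans about 40 distinct encoder buckets instead of the intended 130, while scaling the data 10x pushes the computed resolution above the floor. A quick check (the 130-bucket count and 0.001 floor are assumptions carried over from the params discussed above):

```python
def effective_buckets(min_val, max_val, num_buckets=130, floor=0.001):
    """How many distinct encoder buckets the data range actually spans
    once the resolution floor kicks in."""
    resolution = max(floor, (max_val - min_val) / float(num_buckets))
    return (max_val - min_val) / resolution

# Original data in [0.05, 0.09]: the floor dominates, so only ~40 buckets
print(effective_buckets(0.05, 0.09))   # ~40
# Scaled 10x to [0.5, 0.9]: resolution clears the floor, full ~130 buckets
print(effective_buckets(0.5, 0.9))     # ~130
```

That would be one plausible explanation for why the 10x scaling changed the behavior: the encoder simply gets a finer-grained view of the scaled data.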

To clarify, the method @vkruglikov mentioned will provide you with the entire model params dict that you’ll use to create the model, so you don’t have to do the stuff I was showing you before.

Actually, the workshop examples are a lot easier to run (for noobs and first-timers like me), so I would say NYC taxis trump hot gym any day :slight_smile: I just need to get it working correctly. The other option was to run NAB with my custom data - but I don’t know how easy it is to make it run with new data. Any thoughts on that?

@rhyolight - I integrated @vkruglikov’s recommendation and modified the rest of the pieces to have columns defined as [c0, c1]. The model creation code in run_anomaly.py now looks like this:

```python
def createAnomalyDetectionModel(dataFrame):
  with open(MODEL_PARAMS_PATH, "r") as dataIn:
    modelParams = json.loads(dataIn.read())
  minInput, maxInput = getMinMax(dataFrame)

  # RDSE - resolution calculation
  valueEncoderParams = \
    modelParams["modelParams"]["sensorParams"]["encoders"]["value"]
  numBuckets = float(valueEncoderParams.pop("numBuckets"))
  resolution = max(0.001, (maxInput - minInput) / numBuckets)
  valueEncoderParams["resolution"] = resolution

  # Convert to an anomaly detection model
  modelParams["modelParams"]["inferenceType"] = "TemporalAnomaly"
  # Overwrite previous settings
  params = getScalarMetricWithTimeOfDayAnomalyParams(
      metricData=dataFrame['c1'],
      tmImplementation="cpp",
      minVal=minInput,
      maxVal=maxInput)

  print(json.dumps(params, sort_keys=True, indent=4))
```

Even this gives just 0’s as predicted values. It looks like something more is needed.

I don’t think it will be easy to modify NAB to run on your data - I would avoid that path. The NAB numenta_detector code is the best example code to use for your own code. I think someone is working on an anomaly detection example within NuPIC that includes all these best practices in a simple script.

Note that for speed reasons the anomaly detection code typically won’t use the classifier and therefore won’t generate predictions.

Thanks for the tip @subutai. I came to the same conclusion after a couple of days. I would love to see the simplified wrapper code; I am only interested in investigating anomalies for now.