Should I 'swarm' some sample data?

I'm getting some very mixed results looking for anomalies using HTM.Java.

Here’s a chart after a fair bit of massaging:

The top chart is the data and the bottom is likelihood plotted as:

-100*(ln(1.0000000001-likelihood)/ln(1-0.9999999999))
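As a sanity check, that expression transcribed directly into a small Python helper (the tiny offsets in the constants just keep `log()` away from zero when likelihood hits exactly 1.0; note the output runs from roughly 0 at likelihood 0 down to -100 at likelihood 1):

```python
import math

def scale_likelihood(likelihood):
    """Log-scale an anomaly likelihood so values crammed near 1.0
    are spread out for plotting (the formula quoted above)."""
    return -100.0 * (math.log(1.0000000001 - likelihood)
                     / math.log(1 - 0.9999999999))

# Higher likelihoods map to more negative values:
# scale_likelihood(0.5)    -> about -3
# scale_likelihood(0.9999) -> about -40
# scale_likelihood(1.0)    -> -100
```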

I'm wondering if it's worthwhile to run the data through Python swarming to choose better parameters.

The results aren't bad, but it seems like a great deal of effort to produce them.

By all means, it couldn’t hurt. What parameters are you currently using?

Also, is this data labelled? Can you tell when actual anomalies occurred?

Can you tell when actual anomalies occurred?

Those spikes that have risen above the background noise. For this sample, anything above 600–800 is a good candidate.

You can use an anomaly likelihood threshold to dial up/down the number of anomalies you’re getting. It looks like all the anomalies it found were over 600 like you said, so it is missing some of them. Lower the threshold until you get something more like what you want.

I don’t understand why you’re doing this. We usually just set a threshold like 0.99999 and say any anomaly likelihood value over that threshold should be considered an anomaly, then adjust as needed.
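In other words, something like this (a hypothetical sketch; the threshold value is the only tuning knob):

```python
# Flag records whose anomaly likelihood exceeds a fixed threshold,
# then raise or lower the threshold to get more or fewer anomalies.
LIKELIHOOD_THRESHOLD = 0.99999

def flag_anomalies(likelihoods, threshold=LIKELIHOOD_THRESHOLD):
    """Return the indices of records considered anomalous."""
    return [i for i, lik in enumerate(likelihoods) if lik > threshold]

likelihoods = [0.21, 0.999991, 0.87, 0.999999, 0.50]
print(flag_anomalies(likelihoods))  # -> [1, 3]
```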

I used that likelihood calculation to try to weed out some of the noise.

The anomaly likelihood calculation itself is what is weeding out the noise, from the anomaly score (not shown).
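For intuition, the likelihood computation roughly models the recent distribution of raw anomaly scores and asks how improbable the latest score is under that distribution. A simplified pure-Python sketch of the idea (not the actual NuPIC implementation, which also uses short-term averaging and other details):

```python
import math

def gaussian_tail(x, mean, std):
    """P(X >= x) for a normal distribution, via the error function."""
    if std == 0:
        return 1.0 if x <= mean else 0.0
    z = (x - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

def anomaly_likelihood(history, current_score):
    """Simplified sketch: how unlikely is current_score given the
    distribution of recent raw anomaly scores?"""
    mean = sum(history) / len(history)
    var = sum((s - mean) ** 2 for s in history) / len(history)
    return 1.0 - gaussian_tail(current_score, mean, math.sqrt(var))
```

A raw anomaly score far above the recent mean yields a likelihood near 1.0, while scores typical of the recent window yield around 0.5, which is why thresholding the likelihood suppresses background noise that the raw score alone would flag.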

Here are the suggested params:

MODEL_PARAMS = {'aggregationInfo': {'days': 0,
                 'fields': [(u'stamp', 'first'), (u'result', 'sum')],
                 'hours': 0,
                 'microseconds': 0,
                 'milliseconds': 0,
                 'minutes': 0,
                 'months': 0,
                 'seconds': 1,
                 'weeks': 0,
                 'years': 0},
 'model': 'CLA',
 'modelParams': {'anomalyParams': {u'anomalyCacheRecords': None,
                               u'autoDetectThreshold': None,
                               u'autoDetectWaitRecords': None},
             'clParams': {'alpha': 0.003336578312138331,
                          'regionName': 'SDRClassifierRegion',
                          'steps': '1',
                          'verbosity': 0},
             'inferenceType': 'TemporalMultiStep',
             'sensorParams': {'encoders': {'_classifierInput': {'classifierOnly': True,
                                                                'clipInput': True,
                                                                'fieldname': 'result',
                                                                'n': 33,
                                                                'name': '_classifierInput',
                                                                'type': 'AdaptiveScalarEncoder',
                                                                'w': 21},
                                           u'result': None,
                                           u'stamp_dayOfWeek': {'dayOfWeek': (21,
                                                                              1.1076204575827224),
                                                                'fieldname': 'stamp',
                                                                'name': 'stamp',
                                                                'type': 'DateEncoder'},
                                           u'stamp_timeOfDay': None,
                                           u'stamp_weekend': None},
                              'sensorAutoReset': None,
                              'verbosity': 0},
             'spEnable': True,
             'spParams': {'boostStrength': 0.0,
                          'columnCount': 2048,
                          'globalInhibition': 1,
                          'inputWidth': 0,
                          'numActiveColumnsPerInhArea': 40,
                          'potentialPct': 0.8,
                          'seed': 1956,
                          'spVerbosity': 0,
                          'spatialImp': 'cpp',
                          'synPermActiveInc': 0.05,
                          'synPermConnected': 0.1,
                          'synPermInactiveDec': 0.1},
             'tpEnable': True,
             'tpParams': {'activationThreshold': 12,
                          'cellsPerColumn': 32,
                          'columnCount': 2048,
                          'globalDecay': 0.0,
                          'initialPerm': 0.21,
                          'inputWidth': 2048,
                          'maxAge': 0,
                          'maxSegmentsPerCell': 128,
                          'maxSynapsesPerSegment': 32,
                          'minThreshold': 9,
                          'newSynapseCount': 20,
                          'outputType': 'normal',
                          'pamLength': 1,
                          'permanenceDec': 0.1,
                          'permanenceInc': 0.1,
                          'seed': 1960,
                          'temporalImp': 'cpp',
                          'verbosity': 0},
             'trainSPNetOnlyIfRequested': False},
 'predictAheadTime': None,
 'version': 1}

Some of these have obvious analogs in the Java params but many do not. Working on copying these over now.

Something strange in those params. In sensorParams.encoders it looks like it is only encoding “day of week” and not even the result value (which I assume is the predicted field?). I don’t think the swarm was set up properly. How are you doing it? Do you have a swarm definition to share?

{
  "includedFields": [
    {
      "fieldName": "stamp",
      "fieldType": "datetime"
    },
    {
      "fieldName": "result",
      "fieldType": "int"
    }
  ],
  "streamDef": {
    "info": "test",
    "version": 1,
    "streams": [
      {
        "info": "radar_anom_sample.csv",
        "source": "file://data/radar_anom_sample.csv",
        "columns": ["*"]
      }
    ],
    "aggregation": {
      "hours": 0,
      "microseconds": 0,
      "seconds": 1,
      "fields": [
        ["result", "sum"],
        ["stamp", "first"]
      ],
      "weeks": 0,
      "months": 0,
      "minutes": 0,
      "days": 0,
      "milliseconds": 0,
      "years": 0
    }
  },
  "inferenceType": "MultiStep",
  "inferenceArgs": {
    "predictionSteps": [1],
    "predictedField": "result"
  },
  "iterationCount": -1,
  "swarmSize": "medium"
}

Give result a max and min value like this (but integers):
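Presumably something like this in the swarm description’s `includedFields` entry for `result` (the 0 and 5000 bounds here are placeholders for this data set):

```json
{
  "fieldName": "result",
  "fieldType": "int",
  "minValue": 0,
  "maxValue": 5000
}
```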

I’m not sure if you’re trying to use the aggregation functionality, but I would not suggest it. I find it much easier to deal with data manipulations outside of NuPIC. So in that case, you can remove the entire aggregation section.

k. I’ll update and rerun.

MODEL_PARAMS = {'aggregationInfo': {'days': 0,
                 'fields': [],
                 'hours': 0,
                 'microseconds': 0,
                 'milliseconds': 0,
                 'minutes': 0,
                 'months': 0,
                 'seconds': 0,
                 'weeks': 0,
                 'years': 0},
 'model': 'CLA',
 'modelParams': {'anomalyParams': {u'anomalyCacheRecords': None,
                               u'autoDetectThreshold': None,
                               u'autoDetectWaitRecords': None},
             'clParams': {'alpha': 0.0041745905756698015,
                          'regionName': 'SDRClassifierRegion',
                          'steps': '1',
                          'verbosity': 0},
             'inferenceType': 'NontemporalMultiStep',
             'sensorParams': {'encoders': {'_classifierInput': {'classifierOnly': True,
                                                                'clipInput': True,
                                                                'fieldname': 'result',
                                                                'maxval': 5000,
                                                                'minval': 0,
                                                                'n': 36,
                                                                'name': '_classifierInput',
                                                                'type': 'ScalarEncoder',
                                                                'w': 21},
                                           u'result': None,
                                           u'stamp_dayOfWeek': None,
                                           u'stamp_timeOfDay': None,
                                           u'stamp_weekend': {'fieldname': 'stamp',
                                                              'name': 'stamp',
                                                              'type': 'DateEncoder',
                                                              'weekend': (21,
                                                                          1)}},
                              'sensorAutoReset': None,
                              'verbosity': 0},
             'spEnable': True,
             'spParams': {'boostStrength': 0.0,
                          'columnCount': 2048,
                          'globalInhibition': 1,
                          'inputWidth': 0,
                          'numActiveColumnsPerInhArea': 40,
                          'potentialPct': 0.8,
                          'seed': 1956,
                          'spVerbosity': 0,
                          'spatialImp': 'cpp',
                          'synPermActiveInc': 0.05,
                          'synPermConnected': 0.1,
                          'synPermInactiveDec': 0.08290353055187707},
             'tpEnable': True,
             'tpParams': {'activationThreshold': 13,
                          'cellsPerColumn': 32,
                          'columnCount': 2048,
                          'globalDecay': 0.0,
                          'initialPerm': 0.21,
                          'inputWidth': 2048,
                          'maxAge': 0,
                          'maxSegmentsPerCell': 128,
                          'maxSynapsesPerSegment': 32,
                          'minThreshold': 10,
                          'newSynapseCount': 20,
                          'outputType': 'normal',
                          'pamLength': 2,
                          'permanenceDec': 0.1,
                          'permanenceInc': 0.1,
                          'seed': 1960,
                          'temporalImp': 'cpp',
                          'verbosity': 0},
             'trainSPNetOnlyIfRequested': False},
 'predictAheadTime': None,
 'version': 1}

Interesting. The first model params you got back indicated that the value of result did not contribute to its own prediction, and that “day of week” did contribute. This tells me that there may not be discernible patterns in the data. Can a human look at this data and see patterns over different time periods?

Anyway, I wanted to make sure your params were right, and when you ran it the second time, same thing (although oddly, this time it was “weekend/weekday” that contributed most to the prediction).

My guess is that prediction confidence is really bad throughout because it’s not seeing any repeating sequences in the data.

I was wondering if the randomness / noise would be an impediment to anomaly detection. While there may be patterns at different levels there are no guarantees.

If you are just interested in anomaly detection on a single scalar value, I would use the parameters returned by getScalarMetricWithTimeOfDayAnomalyParams, which would be something like this:

{
  "inferenceArgs":{
    "predictionSteps":[
      1
    ],
    "predictedField":"c1",
    "inputPredictedField":"auto"
  },
  "modelConfig":{
    "aggregationInfo":{
      "seconds":0,
      "fields":[

      ],
      "months":0,
      "days":0,
      "years":0,
      "hours":0,
      "microseconds":0,
      "weeks":0,
      "minutes":0,
      "milliseconds":0
    },
    "model":"HTMPrediction",
    "version":1,
    "predictAheadTime":null,
    "modelParams":{
      "sensorParams":{
        "sensorAutoReset":null,
        "encoders":{
          "c0_dayOfWeek":null,
          "c0_timeOfDay":{
            "fieldname":"c0",
            "timeOfDay":[
              21,
              9.49
            ],
            "type":"DateEncoder",
            "name":"c0"
          },
          "c1":{
            "name":"c1",
            "resolution":0.7692307692307693,
            "seed":42,
            "fieldname":"c1",
            "type":"RandomDistributedScalarEncoder"
          },
          "c0_weekend":null
        },
        "verbosity":0
      },
      "anomalyParams":{
        "anomalyCacheRecords":null,
        "autoDetectThreshold":null,
        "autoDetectWaitRecords":5030
      },
      "spParams":{
        "columnCount":2048,
        "synPermInactiveDec":0.0005,
        "spatialImp":"cpp",
        "inputWidth":0,
        "spVerbosity":0,
        "synPermConnected":0.2,
        "synPermActiveInc":0.003,
        "potentialPct":0.8,
        "numActiveColumnsPerInhArea":40,
        "boostStrength":0.0,
        "globalInhibition":1,
        "seed":1956
      },
      "trainSPNetOnlyIfRequested":false,
      "clParams":{
        "alpha":0.035828933612158,
        "verbosity":0,
        "steps":"1",
        "regionName":"SDRClassifierRegion"
      },
      "tmParams":{
        "columnCount":2048,
        "activationThreshold":13,
        "pamLength":3,
        "cellsPerColumn":32,
        "permanenceDec":0.1,
        "minThreshold":10,
        "inputWidth":2048,
        "maxSynapsesPerSegment":32,
        "outputType":"normal",
        "initialPerm":0.21,
        "globalDecay":0.0,
        "maxAge":0,
        "newSynapseCount":20,
        "maxSegmentsPerCell":128,
        "permanenceInc":0.1,
        "temporalImp":"cpp",
        "seed":1960,
        "verbosity":0
      },
      "tmEnable":true,
      "clEnable":false,
      "spEnable":true,
      "inferenceType":"TemporalAnomaly"
    }
  }
}

These are tuned for anomaly detection on scalar data streams, specifically for min=0, max=5000 (I just ran this function). You’ll need to replace the c0, c1 values with your field names. I hope that helps?

I'm not clear on how to use these params. Still looking for the ‘aha’ moment with regard to HTM / NuPIC.
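For what it’s worth, the usual pattern in Python (a sketch, assuming NuPIC 1.0 module paths; this obviously requires a working NuPIC install, and `stamp`/`result` stand in for the `c0`/`c1` field names above):

```python
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.model_factory import ModelFactory
from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

# Generate anomaly-tuned params for a scalar stream with known bounds.
params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0], minVal=0, maxVal=5000)

# Build the model from the returned config and enable inference.
model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableInference(params["inferenceArgs"])

likelihood_helper = AnomalyLikelihood()

for timestamp, value in records:  # your (datetime, int) stream
    result = model.run({"c0": timestamp, "c1": value})
    raw_score = result.inferences["anomalyScore"]
    likelihood = likelihood_helper.anomalyProbability(
        value, raw_score, timestamp)
```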

I tried running the params that swarming produced (with your mods). It didn’t output anything in the anomaly column and the predicted values were essentially random.

Is it possible that my data is just too noisy for HTM?

One big question I have that you have not answered is this: Can a human detect the anomalies in this data? If you say that any values above 600 are essentially anomalous, wouldn’t a simple threshold work? Also, since you are working in HTM.Java and there is no anomaly likelihood functionality (isn’t that right, @cogmission?), that may be the missing piece.

Can a human detect the anomalies in this data?

Yes. Definitely. Getting away from ‘eyes only’ is one of the goals of this project.

The 600 value is just for this data set. We have millions (another big issue with any proposed solution).

and there is no anomaly likelihood functionality

I've done my best to replicate the likelihood functionality from Python in my code.
