About getting anomalyProbability

Hi,

Could anyone please help me with getting the anomaly likelihood?

1: I have an input data stream with just a timestamp and a field that contains categorical values as strings (I flagged the field with ‘C’).
likelihood = anomalyLikelihood.anomalyProbability(actualValue, anomalyScore, timestamp)
In this code, I pass a string-typed value as actualValue, and it raises an error like the following:

Traceback (most recent call last):
  File "D:/PY_WORKSPACE/nupic_inhouse/pretest/run_anomalyDetection.py", line 159, in <module>
    runAnomalyDetection(scalar=False)
  File "D:/PY_WORKSPACE/nupic_inhouse/pretest/run_anomalyDetection.py", line 137, in runAnomalyDetection
    likelihood = anomalyLikelihood.anomalyProbability(actualValue, anomalyScore, timestamp)
  File "C:\Python27\lib\site-packages\nupic\algorithms\anomaly_likelihood.py", line 317, in anomalyProbability
    skipRecords=numSkipRecords)
  File "C:\Python27\lib\site-packages\nupic\algorithms\anomaly_likelihood.py", line 473, in estimateAnomalyLikelihoods
    performLowerBoundCheck=False)
  File "C:\Python27\lib\site-packages\nupic\algorithms\anomaly_likelihood.py", line 689, in estimateNormal
    "mean": numpy.mean(sampleData),
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2942, in mean
    out=out, **kwargs)
  File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 65, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
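
For what it’s worth, the failure reproduces with numpy alone: a string array gets a “flexible” (fixed-width string) dtype, which numpy.mean cannot reduce. A minimal sketch (the category strings are placeholders):

import numpy

values = numpy.array(["catA", "catB", "catA"])
print values.dtype        # |S4 -- a "flexible" fixed-width string dtype
print numpy.mean(values)  # TypeError: cannot perform reduce with flexible type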

But with a different data stream, containing a timestamp and a field of float-typed values, there are no errors.

Is there any way to avoid this error with a categorical field?

2: What should I pass as actualValue when I run anomaly detection on multi-field data (I want the anomaly of the combination of two values)? See the sketch after the sample data below.
For example, the data looks like this:

c0,c1,c2
datetime,float,float
T,,
2017-01-01 1:00,0,0
2017-01-01 2:00,0.114,0.114
2017-01-01 3:00,0.226,0.228
2017-01-01 4:00,0.336,0.343
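
For context, this is roughly how I would feed one of these records to the model. This is a sketch only: `model` and `anomalyLikelihood` are assumed to be set up as elsewhere in this thread, and the field names follow the CSV header above:

from datetime import datetime

record = {"c0": datetime(2017, 1, 1, 2, 0), "c1": 0.114, "c2": 0.114}
result = model.run(record)  # model: an OPF model, as created further down
anomalyScore = result.inferences["anomalyScore"]
# The open question: which value should be passed as actualValue here?
likelihood = anomalyLikelihood.anomalyProbability(record["c1"], anomalyScore,
                                                  record["c0"])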

@scott @subutai I have never done anomaly likelihood with string categories. Can NuPIC handle that?

Yes, it should work, as long as the correct encoders are used (category encoder?)

It’s good to hear from you and rhyolight.

I used the category encoder by specifying MODEL_PARAMS as follows (nupic 0.5.7):

import category_model_params

from nupic.frameworks.opf.modelfactory import ModelFactory


def createModel():
  return ModelFactory.create(category_model_params.MODEL_PARAMS)

The contents of category_model_params:

MODEL_PARAMS = {'aggregationInfo': {'days': 0,
                     'fields': [],
                     'hours': 0,
                     'microseconds': 0,
                     'milliseconds': 0,
                     'minutes': 0,
                     'months': 0,
                     'seconds': 0,
                     'weeks': 0,
                     'years': 0},
 'model': 'CLA',
 'modelParams': {'anomalyParams': {u'anomalyCacheRecords': None,
                                   u'autoDetectThreshold': None,
                                   u'autoDetectWaitRecords': None},
                 'clParams': {'alpha': 0.09946475054821349,
                              'regionName': 'SDRClassifierRegion',
                              'steps': '1',
                              'verbosity': 0},
                  # 'inferenceType': 'NontemporalMultiStep',
                 'inferenceType': 'TemporalAnomaly',
                 # 'inferenceType': 'TemporalMultiStep',
                 'sensorParams': {'encoders': {
                                               '_classifierInput': {'classifierOnly': True,
                                                                    'fieldname': 'cat',
                                                                    'n': 521,
                                                                    'name': '_classifierInput',
                                                                    'type': 'SDRCategoryEncoder',
                                                                    'forced': True,
                                                                    'w': 21},
                                               u'cat': {'fieldname': 'cat',
                                                        'name': 'cat',
                                                        'w': 21,
                                                        'n': 521,
                                                        'forced': True,
                                                        'type': 'SDRCategoryEncoder'},
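
For reference, the encoder can be exercised on its own with the same n/w as above; a minimal sketch (the category string is a placeholder):

from nupic.encoders.sdrcategory import SDRCategoryEncoder

encoder = SDRCategoryEncoder(n=521, w=21, forced=True)
sdr = encoder.encode("some_category")  # numpy array of 521 bits
print sdr.sum()  # 21 active bits (w), for any input string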

I have also tried CategoryEncoder instead of SDRCategoryEncoder.

Could you recommend any modifications?
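
To be concrete, this is roughly how I drive the model and the likelihood helper. A sketch, where the field names ("timestamp", "cat") and the records iterable are placeholders:

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

model = createModel()
model.enableInference({"predictedField": "cat"})
anomalyLikelihood = AnomalyLikelihood()

for timestamp, category in records:  # records: (datetime, str) pairs
  result = model.run({"timestamp": timestamp, "cat": category})
  anomalyScore = result.inferences["anomalyScore"]
  # Passing the string category as actualValue triggers the traceback above:
  likelihood = anomalyLikelihood.anomalyProbability(category, anomalyScore,
                                                    timestamp)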

Answering my own question:
I analyzed the code (anomaly_likelihood.py) and found a block that inspects the variance of the actual values, with a comment like the following:

# HACK ALERT! The CLA model currently does not handle constant metric values
# very well (time of day encoder changes sometimes lead to unstable SDR's
# even though the metric is constant). Until this is resolved, we explicitly
# detect and handle completely flat metric values by reporting them as not
# anomalous.
s = [r[1] for r in aggRecordList]  # the raw actual values from the stream
metricValues = numpy.array(s)
print metricValues  # (my debug print)
metricDistribution = estimateNormal(metricValues[skipRecords:],
                                    performLowerBoundCheck=False)
if metricDistribution["variance"] < 1.5e-5:
  distributionParams = nullDistribution(verbosity=verbosity)

I guess this part handles cases where the numeric input data shows very little variation.
However, it does not seem to handle categorical input data properly.

Assuming that my categorical input data is sufficiently non-static, I removed the block above, and now it runs fine.
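
Rather than deleting the block outright, I suppose a guard like the following would also work. This is only a hypothetical sketch, not the actual fix:

# Hypothetical guard: run the flat-metric check only for numeric dtypes;
# string arrays have a "flexible" dtype and make numpy.mean raise
# "cannot perform reduce with flexible type".
if metricValues.dtype.kind in ("i", "u", "f"):
  metricDistribution = estimateNormal(metricValues[skipRecords:],
                                      performLowerBoundCheck=False)
  if metricDistribution["variance"] < 1.5e-5:
    distributionParams = nullDistribution(verbosity=verbosity)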

Please let me know if I misunderstood.

@scott Can you read @oreore’s comment above and provide your viewpoint? Is this a bug?

What do you mean by this? Does it throw an exception or do you see bad anomaly likelihood values?

@scott what concerns me is his comment that when he removed the “hack” code in the snippet he pasted, it ran as he expected.

@rhyolight - I don’t understand. Of course the code will run just fine without the “hack”; it just might get poor performance on values that stay constant for a while when there are other fields (like a timestamp) that are changing.

But why did he remove it? What was the problem?

Hi,
What I meant was the error attached in the first post of this thread.
I believe the metricValues array contains the “categorical values”, i.e. the actual string values read in from the stream.
I suspect that calling estimateNormal on that array is what raises the error.

Oh I see now. Yes, we should only do the “hack” for numeric values.

I have a PR to address the issue here: