An obvious type of anomaly was not detected by HTM

Hello, today I used the function getScalarMetricWithTimeOfDayAnomalyParams and then set clEnable=True via params['modelConfig']['modelParams']['clEnable'] = True, in order to get the prediction value result.inferences["multiStepBestPredictions"][1] (is this the proper way to get this value?).
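Concretely, my setup is roughly like the sketch below (import paths may differ slightly between NuPIC versions, and the column names 'c0'/'c1', the min/max values, and the sample record are just placeholders for my own data):

import datetime

from nupic.frameworks.opf.model_factory import ModelFactory
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)

# Build the standard anomaly model params and enable the classifier so that
# multiStepBestPredictions gets populated alongside the anomaly score.
params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],      # dummy data, only used to derive the encoder resolution
    minVal=0.0,
    maxVal=100.0)
params['modelConfig']['modelParams']['clEnable'] = True

model = ModelFactory.create(modelConfig=params['modelConfig'])
model.enableInference(params['inferenceArgs'])

# One record of the per-row loop: feed a timestamp and a metric value.
timestamp = datetime.datetime(2017, 1, 1, 0, 0)
value = 42.0
result = model.run({'c0': timestamp, 'c1': value})

prediction = result.inferences['multiStepBestPredictions'][1]
anomaly_score = result.inferences['anomalyScore']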

I then ran an anomaly detection program; however, the result I got confused me a lot:

In the picture above, the upper subplot shows the raw data (blue line) and the predicted data (red line); the bottom subplot shows the anomaly score (green line).

What confuses me are the data enclosed by the green circles: there are huge differences between the raw data (blue) and the predicted data (red), yet the anomaly scores are very small (indicating normal data). I wonder why. (By the way, why do the predicted data (red line) jump suddenly at these places?)

The image above worries me seriously. The data in the green circle were modified by setting them to about 15000, which are without doubt anomalies (and never appeared before). Nevertheless, the predicted data are quite similar to the raw data (even at the beginning of the pattern, which shouldn't happen); what's more, the anomaly scores (green line below) are quite small.
Please help me. Thanks.


I also have quite a few examples where HTM works fine, but it fails all too often on extremely straightforward synthetic data.

Please try using the AnomalyLikelihood post-process and flag anomalies over 0.9999 (or more 9s?). You should get much better results.
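Something along these lines (just a sketch; the records list is a stand-in for whatever per-row loop you already have, and 0.9999 is the threshold I'd start with):

import datetime

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

likelihood_helper = AnomalyLikelihood()
THRESHOLD = 0.9999  # or more 9s

# Each record is (timestamp, metric value, raw anomaly score from the model).
records = [(datetime.datetime(2017, 1, 1, 0, 0), 42.0, 0.3)]

for timestamp, value, raw_score in records:
    likelihood = likelihood_helper.anomalyProbability(value, raw_score, timestamp)
    if likelihood > THRESHOLD:
        print("anomaly at %s (likelihood %.5f)" % (timestamp, likelihood))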

Hi Taylor, thanks for your timely reply. OK, I will try the likelihood. :+1::hugs: Thanks sincerely.

Hello, I followed your advice and used the anomaly likelihood in place of the anomaly score, but the result doesn't seem to be better:


In the picture above, in the upper subplot the red line is the predicted data and the blue line is the raw data; in the lower subplot, the green line represents the anomaly likelihood values.

From the picture, we can see that the anomaly likelihood values often reach about 0.95, but this is fine if we set the threshold to 0.99 (or more 9s).
However, the points whose likelihood values are above this threshold puzzle me:

In the green circle, I can understand why it detected an anomaly; however, in the orange one, I cannot follow the reason. In addition, the red one should contain anomalies as mentioned above, yet they are still ignored (the likelihood values are about 0.95 < 0.99).


I read the paper published in 2017, Unsupervised real-time anomaly detection for streaming data; a moving average (over a short period and a long period) is one element of the core idea. Reading the source code, I find that the default value for the short period (the sliding window) is 10 (right?):

def _anomalyScoreMovingAverage(anomalyScores,
                               windowSize=10,
                               verbosity=0,
                              ):

Therefore, would it help to change this parameter?
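For example, if I read the source correctly, the module-level helper estimateAnomalyLikelihoods in nupic.algorithms.anomaly_likelihood seems to expose this window as averagingWindow, so an experiment might look roughly like this (the records are dummies and the window value 20 is just an example):

import datetime

from nupic.algorithms import anomaly_likelihood

# Dummy records: (timestamp, metric value, raw anomaly score).
records = [(datetime.datetime(2017, 1, 1) + datetime.timedelta(minutes=5 * i),
            float(i % 24), 0.1)
           for i in range(500)]

# Try a wider short-period window than the default of 10.
likelihoods, avg_records, dist_params = anomaly_likelihood.estimateAnomalyLikelihoods(
    records, averagingWindow=20)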

The main window (the long period) is set to 8000 in the paper, but I didn't find this value in the source code.


  def __init__(self,
               claLearningPeriod=None,
               learningPeriod=288,
               estimationSamples=100,
               historicWindowSize=8640,
               reestimationPeriod=100):

The parameters in this constructor suggest that the anomaly likelihood values only begin to change after about record 388 (learningPeriod + estimationSamples = 288 + 100; right? This is in line with the following image):

Would it help to modify these parameters?
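If so, I suppose I could construct the helper explicitly with different values, roughly like this (the parameter names come from the constructor above; the values are only for illustration):

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

# Shorten the learning period and the historic window to see whether the
# likelihood becomes responsive earlier than record ~388.
likelihood_helper = AnomalyLikelihood(
    learningPeriod=144,
    estimationSamples=100,
    historicWindowSize=4000,
    reestimationPeriod=100)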


@scott Any chance you could take a look at @Pegasus’s results above?

Yeah, it has confused me a lot, which makes me worry whether it could be applied to real-time, real data… :disappointed_relieved:

The anomaly scores (raw, not likelihood) look wrong to me. Those should jump up to 1.0 fairly regularly when completely unexpected inputs occur. So it doesn’t seem right that they are always fairly low unless the model has seen this data previously. I believe you are getting predictions correctly and I think they look reasonable. How are you getting the anomaly scores?

Hello Scott, thanks for your response. It is really great to receive your message in the morning (in China). Yes, I absolutely agree with you that the raw score should jump to 1.0 when the model comes across a data pattern it has never seen before (and I assure you that this pattern does not appear earlier at all).
Therefore, it is the unexpected behaviour of the anomaly scores (raw and likelihood) that seriously confuses me.

(By the way, why does the model lose its prediction ability in the green circle in the image below? The red line represents the predicted data and the blue line the raw data.)

As for how I got the anomaly score (the green line in the lower subplot of the image):
In the model, I use the function
params = getScalarMetricWithTimeOfDayAnomalyParams(),
and then
params['modelConfig']['modelParams']['clEnable'] = True (in order to get prediction = result.inferences["multiStepBestPredictions"][1], where result = model.run({'c0': timestamp, 'c1': value})).
Therefore, the anomaly score is:
anomalyscore = result.inferences["anomalyScore"]
(Was I right?)

So given this context, how should I modify the model to get a more precise result?

Thanks, I have been stuck here for several days. I would sincerely appreciate it if you could pull me out of this annoying mire.
:pray::pray::pray:


Hi Taylor, thanks for inviting Scott to this topic; I have at least got some ideas about it now. Thanks again sincerely for what you have done for me~ :smile: :love_you_gesture:

As far as I can tell, you are correctly retrieving the anomaly score. But it is surprising that the anomaly score isn’t more sensitive to unpredictability in the data. I’m not really sure why the results aren’t better; your data certainly looks like a very good application. Below are some debugging ideas, focused on the anomaly scores, not the likelihood. Keep in mind that what we want to happen is for the anomaly scores to be fairly erratic but to get very low during the flatline period (because the data becomes very predictable). This should cause the anomaly likelihood to shoot up. But in your results, the anomaly scores are fairly low generally, and the decrease during flatlining is relatively minor.

One possibility is that the encoder parameters are too coarse, so different values in the data appear similar to the model. This seems somewhat unlikely because the predictions look pretty fine-grained, but you could test this by changing the minVal/maxVal parameters to getScalarMetricWithTimeOfDayAnomalyParams. You can try a bigger or smaller range to see how that affects the anomaly scores.

If that doesn’t work, you can also try changing the TM learning rate. Lowering the rate should result in overall higher anomaly scores. For instance, you could try a slower learning rate as follows:
params['modelConfig']['modelParams']['tmParams']['permanenceInc'] = 0.05
params['modelConfig']['modelParams']['tmParams']['permanenceDec'] = 0.05

Or a higher learning rate (although I don’t think this will help - just for comparison):
params['modelConfig']['modelParams']['tmParams']['permanenceInc'] = 0.15
params['modelConfig']['modelParams']['tmParams']['permanenceDec'] = 0.05

The third thing you could try is the other temporal memory implementation. You can do this by specifying tmImplementation="tm_cpp" to getScalarMetricWithTimeOfDayAnomalyParams. This might break some things and I’d expect it to give worse results normally, but it is useful as a debugging exercise.
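Putting those ideas together, the experiments might look roughly like this (a sketch only; metricData=[0] is a dummy and the min/max values are placeholders for your actual data range):

from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)

# Experiment 1: change the encoder range via minVal/maxVal.
params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0], minVal=0.0, maxVal=20000.0)

# Experiment 2: slow down TM learning on top of the default params.
params['modelConfig']['modelParams']['tmParams']['permanenceInc'] = 0.05
params['modelConfig']['modelParams']['tmParams']['permanenceDec'] = 0.05

# Experiment 3: the alternative temporal memory implementation.
params_tm_cpp = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0], minVal=0.0, maxVal=20000.0, tmImplementation="tm_cpp")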


Hi Scott, thanks for your detailed and practical suggestions. Yesterday I tried several ways to solve this problem. For example, I changed tmImplementation to 'tm_cpp'; judging from the likelihood values it seems to help produce a better result, but it still fails to correctly detect the anomaly mentioned above. As for the anomaly score, the result perhaps became better, but not obviously so.
Yes, changing the range of the raw data could be a way to test the model.

In the params dict, I find the date encoder parameter timeOfDay: [21, 9.49]. If I understand correctly, a radius of 9.49 hours means the encoder only distinguishes times that are more than about 9.49 hours apart, but the data I use are often at hourly, minutely, or even secondly granularity. So I wonder whether it would help to change the radius to a smaller value, especially one smaller than my data's granularity, say below 1 minute?
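If I go that route, I imagine the change would look roughly like this (I see the entry under the 'c0_timeOfDay' encoder key in my generated params dict, but the exact path may differ; the radius of 1/60 hour, about one minute, is just an example):

from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)

params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0], minVal=0.0, maxVal=20000.0)

# The generated params contain a DateEncoder entry with timeOfDay (21, 9.49);
# shrink the radius so that nearby times get distinct encodings.
encoders = params['modelConfig']['modelParams']['sensorParams']['encoders']
encoders['c0_timeOfDay']['timeOfDay'] = (21, 1.0 / 60)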

And as the OPF API is somewhat rigid, I would like to turn to the Network API. Given its flexibility, would it produce a somewhat better result?

Tomorrow is another sunny and happy Saturday for you, have a good time, good man~ :grinning: