Is my data being predicted correctly?

Okay, I ran my data and it looks pretty nice.



I modified Setus’s template for multiple fields and put this in (along with the import of the anomaly_likelihood module):

anomalyLikelihood = anomaly_likelihood.AnomalyLikelihood()
likelihood = anomalyLikelihood.anomalyProbability(results, anomaly_score, 0)

It gives me a likelihood of 0.5 for the entire data set, so I assume I did something wrong when assigning these values. I tried other values, though, and it still gave me that likelihood, so I’m stuck.
I just remembered something interesting: when I was trying out the hotgym anomaly demo with my data, I would always get an output of 0.5 in the graph below. It looks like the same thing is happening here. In the hotgym demo, though, I didn’t really modify the code much; I just ran it on my data.


I can see from your code that you’re passing the wrong arguments. I’ve used the likelihood successfully, and although the first 400-ish values were indeed 0.5 (the helper needs some history before it can produce meaningful likelihoods), all the rest were other than 0.5.
Insert these lines into my template code in process_input.py at the corresponding line numbers:

31  | from nupic.algorithms import anomaly_likelihood
44  | anomaly_likelihood_helper = anomaly_likelihood.AnomalyLikelihood()
218 | outputRow = [row[0], row[predicted_field_row], "prediction", "anomaly score", "anomaly likelihood"]
251 | anomaly_likelihood_score = anomaly_likelihood_helper.anomalyProbability(original_value, anomaly_score, time_index)
254 | outputRow = [time_index, original_value, "%0.2f" % inference, anomaly_score, anomaly_likelihood_score]
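To make the argument order concrete, here is a minimal sketch of the per-row flow (the file name, column layout, and run_model call are hypothetical placeholders for your own code): you pass the raw metric value, the model’s anomaly score, and the row’s time index, not the results object and a constant.

import csv

from nupic.algorithms import anomaly_likelihood

# Create the helper once, outside the loop; it accumulates the history
# it needs to turn raw anomaly scores into likelihoods.
anomaly_likelihood_helper = anomaly_likelihood.AnomalyLikelihood()

with open("my_data.csv") as f:                       # hypothetical input file
    reader = csv.reader(f)
    next(reader)                                     # skip the header row
    for time_index, row in enumerate(reader):
        original_value = float(row[1])               # the raw metric value
        anomaly_score = run_model(row)               # hypothetical: your CLA model call
        likelihood = anomaly_likelihood_helper.anomalyProbability(
            original_value, anomaly_score, time_index)
        print("%s,%s,%s" % (original_value, anomaly_score, likelihood))

Because the helper is stateful, it is also expected that the first few hundred likelihood values come out as 0.5 while it builds up history.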

Oh, that’s the right way to do it, thank you!

I ran a large swarm over my data; it took 9+ hours to run, haha.

@Addonis: Unsolicited comment :slight_smile: We are in a similar boat (trying to get NuPIC running with optimal settings). I am considering just using NAB as a proxy with my own dataset. After all, it’s the key benchmark, so any optimal settings would make their way into it. Link here: https://github.com/numenta/NAB

@Setus, what’s line 218 for in your “Insert these lines” code block?

@vkruglikov Nothing more than printing out the header line (the first line) of the output file, describing which value is located in which column.
[row[0], row[predicted_field_row], "prediction", "anomaly score", "anomaly likelihood"] will stand for

your_time_index_name, your_predicted_metric_name, prediction, anomaly score, anomaly likelihood

so for example

date, EKG, prediction, anomaly score, anomaly likelihood
23.03.2016, 3045.6, 3030.3, 0.8, 0.4

The point of that line was to update the header from the previous header consisting of

your_time_index_name, your_predicted_metric_name, prediction, anomaly score

to

your_time_index_name, your_predicted_metric_name, prediction, anomaly score, anomaly likelihood
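In code, writing the extended header and one data row could look like this sketch (the output file name is hypothetical; the values are the ones from the example above):

import csv

with open("output.csv", "wb") as f:   # "wb" for Python 2's csv module
    writer = csv.writer(f)
    # The old four columns plus the new "anomaly likelihood" column:
    writer.writerow(["date", "EKG", "prediction", "anomaly score", "anomaly likelihood"])
    writer.writerow(["23.03.2016", 3045.6, 3030.3, 0.8, 0.4])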

For some reason I get an error saying columnCount and inputWidth need to be above zero when I try to use a RandomDistributedScalarEncoder in your template.
I just change

{'encoders': {u'Vals': {'clipInput': True,
                        'fieldname': 'Vals',
                        'maxval': 952,
                        'minval': 337,
                        'n': 222,
                        'name': 'Vals',
                        'type': 'ScalarEncoder',
                        'w': 211}},

to

{'encoders': {u'Vals': {'clipInput': True,
                        'classifierOnly': True,
                        'fieldname': 'Vals',
                        'name': '_classifierInput',
                        'resolution': 100,
                        'seed': 42,
                        'type': 'RandomDistributedScalarEncoder'}},

but I get those errors as I mentioned above. I didn’t get errors when I did this in the one_gym demo, though.
I guess it won’t work like this because one_gym uses different code around its encoder, right? I was also wondering if you have a GitHub repo with your code.

Hi, sorry for the late answer :slight_smile:
I’ve personally never experienced such an error before, so I don’t really know what the ‘columnCount and inputWidth above zero’ thing is about. However, I know that the “_classifierInput” encoder is different from the normal, non-classifier encoder (i.e., when name is _classifierInput and classifierOnly is true). I’ve noticed that the generated encoder parameters for the classifier input are always bigger than for normal encoders, and they seem to affect the results of the CLA quite profoundly (negatively, for me at least). I couldn’t find much information about it, so I decided to ask about it and other things in my post here, although no one has dared to answer so far :stuck_out_tongue:
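To illustrate the difference: a swarm-generated model_params file usually contains both a normal encoder (which feeds the spatial pooler) and a separate classifier-only entry, roughly like this (the values here are placeholders):

'encoders': {u'Vals': {'clipInput': True,
                       'fieldname': 'Vals',
                       'maxval': 952,
                       'minval': 337,
                       'n': 222,
                       'name': 'Vals',
                       'type': 'ScalarEncoder',
                       'w': 21},
             u'_classifierInput': {'classifierOnly': True,
                                   'clipInput': True,
                                   'fieldname': 'Vals',
                                   'maxval': 952,
                                   'minval': 337,
                                   'n': 287,
                                   'name': '_classifierInput',
                                   'type': 'ScalarEncoder',
                                   'w': 21}},

One guess about your error: since a classifierOnly encoder is fed only to the classifier and not to the spatial pooler, making your only encoder classifier-only could leave the network with zero input width, which might be where the ‘columnCount and inputWidth above zero’ complaint comes from.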

If I remember correctly, one_gym simply runs a swarm and uses the values from the resulting model_params file to run the CLA, which is exactly what my template code does, unless there is a difference in the input parameters that are sent to the swarm between my template and one_gym. No, I don’t have a GitHub repo of my code, because my code really is nothing more than the template, with ever-so-small variations here and there when I’m experimenting with different data sets. The NuPIC codebase is quite a behemoth, so I haven’t dared altering any code. All I’ve done is try my best to understand HTM and NuPIC, run it, understand why I get the results that I get, and use all that understanding to get the best possible results.
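For reference, the swarm-then-run flow in one_gym boils down to something like this sketch (the swarm config, field names, and worker count are placeholders, and the ModelFactory import path varies between NuPIC versions):

import datetime

from nupic.frameworks.opf.modelfactory import ModelFactory
from nupic.swarming import permutations_runner

# Run a swarm over the input data to generate model parameters...
model_params = permutations_runner.runWithConfig(
    SWARM_CONFIG,                        # placeholder: your swarm description dict
    {"maxWorkers": 4, "overwrite": True})

# ...then build a CLA model from them and feed the data through it.
model = ModelFactory.create(model_params)
model.enableInference({"predictedField": "Vals"})    # placeholder field name
result = model.run({"date": datetime.datetime(2016, 3, 23), "Vals": 3045.6})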

Remember, if you are looking for anomalies on scalar input data, you don’t need to swarm.

lol


I think there is a difference, since the hotgym demo uses parameters hard-coded for the energy values and the weekly dates, while your template is more general, since it can take any file/input.
Anyway, thanks for the answer.


So then swarming is only needed if you want to predict something, not to detect anomalies, right?

For the most part, yes. If you are doing anomaly detection on non-scalar input data, you’re going to have to experiment because we don’t have pre-established model parameters for that stuff.
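Concretely, for scalar metrics the pre-established anomaly parameters can be pulled in with getScalarMetricWithTimeOfDayAnomalyParams, along these lines (the min/max values are placeholders, and import paths may differ slightly between NuPIC versions):

import datetime

from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.modelfactory import ModelFactory

# Pre-established anomaly model parameters -- no swarming needed.
params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],           # only used to derive a resolution if none is given
    minVal=337, maxVal=952)   # placeholder min/max for your metric

model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableInference(params["inferenceArgs"])

# These canned parameters expect fields named "c0" (timestamp) and "c1" (value).
result = model.run({"c0": datetime.datetime(2016, 3, 23), "c1": 3045.6})
anomaly_score = result.inferences["anomalyScore"]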

Is there an encoder that encodes the difference between consecutive values in a data set? Or perhaps a percent change? It seems like encoding data as a diff or % change might preserve the pattern while making this type of data easier for HTM to predict.

@mellertson

Couldn’t this be done using a pre-processing step to output, say, Original | Delta | %Change in a CSV format, and then have a ScalarEncoder or RandomDistributedScalarEncoder process these three columns?
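Something like this pre-processing sketch could produce those three columns (file and column names are hypothetical):

import csv

with open("raw.csv") as fin, open("preprocessed.csv", "wb") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    next(reader)                                  # skip the input header
    writer.writerow(["date", "original", "delta", "pct_change"])
    prev = None
    for date, value in reader:
        value = float(value)
        delta = 0.0 if prev is None else value - prev
        pct_change = 0.0 if not prev else 100.0 * delta / prev
        writer.writerow([date, value, delta, pct_change])
        prev = value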


Seems like that would work. Good idea.

See the DeltaEncoder:
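In the model params it would just replace the encoder type, along these lines (the field name and sizes are placeholders; the DeltaEncoder sits on top of an adaptive scalar encoder and encodes the difference between the current and previous value):

'encoders': {u'Vals': {'fieldname': 'Vals',
                       'name': 'Vals',
                       'n': 222,
                       'w': 21,
                       'type': 'DeltaEncoder'}},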


That’s just what I was looking for. Thanks!

I guess I should take a more in-depth look through all of the encoders. I’m guessing there might be others I’m still unaware of that could be useful.

The forum search is pretty good, but this one is better because it searches GitHub issues and mailing list archives.

A post was split to a new topic: Model is so slow to track the new trend