Is my data being predicted correctly?

Don’t know unfortunately, you’ll have to ask others on this site.


I’m using the template myself and I had to update it to support multiple fields of data, which you can now also get here. Keep in mind that even if you feed multiple fields of data to the HTM, it can still only predict one metric. The way to use it now is as follows:

./process_input.py input_file_name output_file_name last_record date_type predicted_field_name other_included_field_name, etc…

You use it similarly to the previous file, except that you only need to specify how many line records the swarm should swarm over, or -1 to skip running the swarm first. Since it now supports multiple data fields, the data field you want to predict must be the first data field provided; after it you can add as many of the other data fields from your input file as you like.
Example in use:

./process_input.py input-file.csv output-file.csv 3000 EU metric-to-predict additional-metric1 additional-metric2

or

./process_input.py input-file.csv output-file.csv -1 OFF metric-to-predict additional-metric3


This new template will also output a prediction quality analysis report, which works as follows:
Say that HTM is fed a set of values with patterns that it learns. At a certain point in time, the fed values are [5, 8, 2, 9, 4, 6, 7] and the predicted values at those same time points are [6, 5, 2, 10, 5, 6, 8]. HTM will rarely predict perfectly, but some sets of model_params.py parameters lead to better prediction results than others. A quick way to analyze how close the predicted values are to the actual values (instead of plotting everything in a spreadsheet) is to look at both the absolute and the relative differences. First, the absolute difference between each fed value and its predicted value is calculated, and these differences are summed up at the end. There is an unfairness here, though: if the data consists of only small values (or only very large values), and HTM predicts values that are off but also small, the summed absolute difference will be comparatively smaller than the summed absolute difference for larger values. To compensate, the largest and smallest values in the predicted-metric set are found and the largest possible difference between them is calculated. Each absolute difference between fed and predicted value is then expressed as a percentage of that largest difference. Finally, a breakdown of those percentages is displayed, i.e. what percentage of predicted values were off by 0-0.99% of the largest difference, what percentage were off by 1-25% of it, and so on…
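In rough Python terms, the analysis looks something like this (a simplified sketch, not the exact code in the template; the bucket boundaries past 1-25% are just examples):

def prediction_quality_report(actual, predicted):
    # Absolute error at each time step, summed at the end.
    abs_diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    total_abs_diff = sum(abs_diffs)

    # Largest possible difference within the predicted-metric set, used to
    # normalize the errors so small-valued and large-valued data compare fairly.
    largest_diff = abs(max(predicted) - min(predicted))

    # Express each error as a percentage of that largest difference and bucket it.
    buckets = {'0-0.99%': 0, '1-25%': 0, '26-50%': 0, '>50%': 0}
    for diff in abs_diffs:
        pct = 100.0 * diff / largest_diff if largest_diff else 0.0
        if pct < 1:
            buckets['0-0.99%'] += 1
        elif pct <= 25:
            buckets['1-25%'] += 1
        elif pct <= 50:
            buckets['26-50%'] += 1
        else:
            buckets['>50%'] += 1

    total = float(len(abs_diffs))
    return total_abs_diff, {k: 100.0 * v / total for k, v in buckets.items()}

# With the values from the example above:
# prediction_quality_report([5, 8, 2, 9, 4, 6, 7], [6, 5, 2, 10, 5, 6, 8])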


Note that using dates (EU/US) instead of simple indexes (OFF) means the date is treated as an additional metric by HTM, as if you were doing this:

./process_input.py input-file.csv output-file.csv 3000 metric-to-predict date-EU

By the way, since you intend to use multiple fields of data, I strongly suggest you read this post here

Sorry for the stupid questions to come,

By “it can only predict one metric”, do you mean that it can only predict one type of data (e.g., float) when running?

The data field that is in the first column of the .csv? But it still predicts all of them, right? (I don’t understand how it works too well, as you can see).

I’ll try to read the wiki page on this again, later today.


I think I did something wrong.

$ ./process_input.py ecg_1.csv ecg_1_output.csv 20000 OFF I II III aVR aVL aVF V1 V2 V3 V4 V5 V6
Traceback (most recent call last):
  File "./process_input.py", line 351, in <module>
    configure_swarm_parameters(input_file_name, last_record, date_type, included_fields)
  File "./process_input.py", line 180, in configure_swarm_parameters
    current_field_value = float(stored_input_file[current_field_index][current_field_row])
ValueError: could not convert string to float: 

You can see how my data is organized in the picture. It’s pretty standard CSV format.

Say that you have an input file with multiple metrics inside of it, i.e. multiple data sets separated into columns such as date, metric1, metric2, metric3. HTM can only work on predicting one metric at a time, but it can use the other metrics as help if there are correlations to be found between them. So for example, if you were to give it an input file with only metric2 inside, it might reach a 67% correct prediction rate for metric2. But if you were to give it an input file with metric2, metric3 and metric4, it might find correlations between metric2 and metric3 but not between metric2 and metric4. In that case HTM would ignore metric4 and feed itself only with metric2 and metric3, and probably reach an increased correct prediction rate for metric2 (say 75%, for example) thanks to the correlation it found between metric2 and metric3.

First, did you download the latest template that I provided in my previous post from 1 day ago, as I have updated the code for that template to support multiple fields?

Secondly, your input file needs to follow a bare minimum of structure. The first column must be your ‘index of reference’ column, such as a date (running in chronological order) or simply an index (counting upwards). What I meant about the predicted column being first is that you must supply the name of that column first on your command line, after all of the other obligatory values. Here are two examples.

If I wanted to predict the metric TAM and include all of the other metrics as suggested additional correlation help, I would have to write

./process_input.py input-file.csv output-file.csv 3000 EU TAM FFM PRM RR UM

And the 2nd example

This time, if I wanted to predict PRM and only include the additional metrics TAM and UM I would have to write

./process_input.py input-file.csv output-file.csv 3000 OFF PRM TAM UM
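Both commands assume an input file laid out something like this (the column names match the examples; the values are just placeholders):

date,TAM,FFM,PRM,RR,UM
23.03.2016,3045.6,12.1,0.4,77,5.2
24.03.2016,3030.3,11.8,0.5,76,5.0
25.03.2016,2998.7,11.9,0.4,75,5.1

The first column is the reference index (a date here, or a plain running index when using OFF), and every metric named on the command line must exist as a column in the file.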


Note that the reason I say suggested additional metrics is that if you were, for example, to process my first example, a swarm would be run first, and the swarm would do 2 things:

  1. It would try to find the best parameters that optimally predict metric TAM
  2. It would try to find out if there are correlations between the other metrics (FFM, PRM, RR, UM) and TAM that help HTM in predicting TAM even better.
    2.1. If it does find a positive correlation between, say, RR and TAM, it will try to find good encoder parameters for RR and include those inside the model_params.py file. (In my experience, those added-metric encoder parameters are always bad and I always have to change them manually to use the RandomDistributedScalarEncoder, or even exclude them myself.)
    2.2. If it does not find a positive correlation between RR and TAM, it will “exclude” that metric from the model_params.py file by writing “None” near the value RR.

Then, once the swarm is finished and the model_params.py file has been generated with the additional metrics either included or excluded, a new HTM model is run with that model_params.py file. This time, depending on that config file, the HTM will include or exclude the additional metrics when trying to predict the metric-to-predict.
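To make that concrete, the encoders section of a generated model_params.py could end up looking something like the following hypothetical fragment (values invented for illustration). Here RR was included with its own encoder, while FFM, PRM and UM were excluded by being set to None:

'encoders': {u'TAM': {'clipInput': True,
                      'fieldname': 'TAM',
                      'maxval': 952,
                      'minval': 337,
                      'n': 222,
                      'name': 'TAM',
                      'type': 'ScalarEncoder',
                      'w': 21},
             u'RR': {'clipInput': True,
                     'fieldname': 'RR',
                     'maxval': 120,
                     'minval': 40,
                     'n': 100,
                     'name': 'RR',
                     'type': 'ScalarEncoder',
                     'w': 21},
             u'FFM': None,
             u'PRM': None,
             u'UM': None},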

Hmm, that makes sense. I guess what sucks is that I would then have to rerun it each time. Or maybe the code can somehow be changed to do more metrics… unless the process you described is a fundamental limitation of the system.


yes

okay, that’s a bit different from the way I did it last time then. When it was just one metric, I just had my .csv file have one column with just the data (no index column).

Correct, from what I understand that is indeed how HTM works, which means that you’ll indeed have to run it for every metric that you wish to predict.

First off, kudos to both of you for having this conversation while I was away. Thanks @Setus for helping @Addonis out!

Try running the anomaly likelihood instead of the raw anomaly score. This is not actually a part of the NuPIC model, but a post-process:

As shown in the code above, you create an instance of the anomaly likelihood class and it maintains state. You pass it the value, anomaly score, and timestamp for each point and it will return a more stable anomaly indication than the raw anomaly score.
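In rough terms, the usage pattern is (a minimal sketch of the same API, not the linked example verbatim):

from nupic.algorithms import anomaly_likelihood

# Create this once; the helper keeps internal state across calls.
likelihood_helper = anomaly_likelihood.AnomalyLikelihood()

# Then, for every record, pass the metric value, the raw anomaly score from
# the model, and the record's timestamp (or index):
likelihood = likelihood_helper.anomalyProbability(value, anomaly_score, timestamp)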

Here is a detailed explanation of the anomaly likelihood (with awful audio and video, sorry):

See Preliminary details about new theory work on sensory-motor inference.

Thank you @rhyolight I appreciate it :blush:
Since I got so much help from the community, I just had to give something back by helping others in need, especially for a topic that is so relevant and important for many future users/testers that will undoubtedly wonder the same things.


Yea, I’ve learned a ton from Setus.


So I can download utils.py to do this, or should I put that in a different Python file or something?

[quote="rhyolight, post:17, topic:735"]
Here is a detailed explanation of the anomaly likelihood
[/quote]
I’ve seen that video, but I’ll try to rewatch it if I have time.


Oh, I’ve read that post… I just didn’t think that what I was talking about here (creating cortical columns with more than one layer) is what Jeff Hawkins is working on now (I thought it was some kind of algorithm to better receive and interpret/understand sensory-motor input)… I’ll have to watch the recent video on that as well…

No, you can copy and paste those few lines of code (along with the import statement and initialization of an instance of the class earlier in the file) into your project. It’s just an example of how to use it. We don’t have a good document on it yet.

Okay, I ran my data and it looks pretty nice.



I modified Setus’s template for multiple fields and put this in (along with the import of the anomaly_likelihood module):

anomalyLikelihood = anomaly_likelihood.AnomalyLikelihood()
likelihood = anomalyLikelihood.anomalyProbability(results, anomaly_score, 0)

It gives me a likelihood of 0.5 throughout the entire data. So I assume I did something wrong in assigning these values. I tried other values though, and it still gave me that likelihood, so I’m stuck.
I just remembered something interesting. When I was trying out the hotgym anomaly demo with my data, I would always get an output of 0.5 in the graph below. Well, it looks like this is the same deal here. However, in the hotgym demo I didn’t really modify the code much, I just ran my data through it.


I can see from your code that you’re doing it wrong, as I’ve successfully used the likelihood, and although the first 400-ish values were indeed 0.5, the rest were not.
Insert these lines inside my template code in process_input.py at the corresponding line numbers:

31 | from nupic.algorithms import anomaly_likelihood
44 | anomaly_likelihood_helper = anomaly_likelihood.AnomalyLikelihood()
218 | outputRow = [row[0], row[predicted_field_row], "prediction", "anomaly score", "anomaly likelihood"]
251 | anomaly_likelihood_score = anomaly_likelihood_helper.anomalyProbability(original_value, anomaly_score, time_index)
254 | outputRow = [time_index, original_value, "%0.2f" % inference, anomaly_score, anomaly_likelihood_score]
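In execution order, those pieces fit together roughly like this (the same five lines, just shown with where they run; a sketch, not the full template):

from nupic.algorithms import anomaly_likelihood                      # once, at the top (line 31)

anomaly_likelihood_helper = anomaly_likelihood.AnomalyLikelihood()   # once, before the run loop (line 44)

# Once, as the header row of the output file (line 218)
outputRow = [row[0], row[predicted_field_row], "prediction", "anomaly score", "anomaly likelihood"]

# For every record, inside the processing loop (lines 251 and 254)
anomaly_likelihood_score = anomaly_likelihood_helper.anomalyProbability(original_value, anomaly_score, time_index)
outputRow = [time_index, original_value, "%0.2f" % inference, anomaly_score, anomaly_likelihood_score]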

Oh, that’s the right way to do it, thank you!

I ran a large swarm over my data, it took 9+ hours to run, haha.

@Addonis: Unsolicited comment :slight_smile: We are in a similar boat (trying to get NuPIC to run with the most optimal settings). I am considering just using NAB as a proxy with my own dataset. After all, it’s the key benchmark, so any optimal setting would make its way into it. Link here https://github.com/numenta/NAB

@Setus, what’s line 218 for in your “Insert these lines” code block?

@vkruglikov Nothing more than printing out the header line (first line) of the output file, describing which values are located in which column.
[row[0], row[predicted_field_row], "prediction", "anomaly score", "anomaly likelihood"] will stand for

your_time_index_name, your_predicted_metric_name, prediction, anomaly score, anomaly likelihood

so for example

date, EKG, prediction, anomaly score, anomaly likelihood
23.03.2016, 3045.6, 3030.3, 0.8, 0.4

The point of that line was to update the header from the previous header consisting of

your_time_index_name, your_predicted_metric_name, prediction, anomaly score

to

your_time_index_name, your_predicted_metric_name, prediction, anomaly score, anomaly likelihood

For some reason I get an error saying columnCount and inputWidth need to be above zero when I try to use a RandomDistributedScalarEncoder with your template.
I just changed

{'encoders': {u'Vals': {'clipInput': True,
                        'fieldname': 'Vals',
                        'maxval': 952,
                        'minval': 337,
                        'n': 222,
                        'name': 'Vals',
                        'type': 'ScalarEncoder',
                        'w': 211}},

to

{'encoders': {u'Vals': {'classifierOnly': True,
                        'clipInput': True,
                        'fieldname': 'Vals',
                        'name': '_classifierInput',
                        'resolution': 100,
                        'seed': 42,
                        'type': 'RandomDistributedScalarEncoder'}},
but I get those errors, as mentioned above. I didn’t get errors when I did this in the one_gym demo, though.
I guess it won’t work like this because one_gym uses different code with its encoder, right? I was wondering if you have a GitHub repo with your code.
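For reference, the kind of entry I was aiming for looks roughly like this (keeping ‘Vals’ as the field/encoder name; the resolution value here is just a guess for my data range):

{'encoders': {u'Vals': {'fieldname': 'Vals',
                        'name': 'Vals',
                        'resolution': 5.0,
                        'seed': 42,
                        'type': 'RandomDistributedScalarEncoder'}},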

Hi, sorry for the late answer :slight_smile:
I’ve personally never experienced such an error before, so I don’t really know what the ‘columnCount and inputWidth above zero’ thing is about. However, I know that the “_classifierInput” encoder is different from the normal, non-classifier-input encoder (i.e. when name is _classifierInput and classifierOnly is true). I’ve noticed that the generated encoder parameters for the classifier input are always bigger than for normal encoders and seem to affect the results of the CLA quite profoundly (negatively, in my case at least). I couldn’t find very much information about it, so I decided to ask about it and other things in my post here, although none have dared to answer so far :stuck_out_tongue:

If I remember correctly, one_gym simply runs a swarm and uses the values from the model_params file to run the CLA on, which is exactly what my template code does, unless there is a difference in the input parameters that are sent to the swarm between my template and one_gym. No, I don’t have a GitHub repo of my code, because my code is really nothing more than the template, with ever so small variations here and there when I’m experimenting with different data sets. The NuPIC codebase is quite a behemoth, so I haven’t dared altering any code really. All I’ve done is try my best to understand HTM and NuPIC, run it, understand why I get the results that I get, and use all that understanding to get the best possible results.

Remember, if you are looking for anomalies on scalar input data, you don’t need to swarm.

lol


I think there is a difference, since the hotgym demo uses parameters coded for its energy data and the weekly dates. But your template is more general, since it can take any file/input.
Anyway, thanks for the answer.


So then swarming is only needed if you want to predict something, not to detect anomalies, right?

For the most part, yes. If you are doing anomaly detection on non-scalar input data, you’re going to have to experiment because we don’t have pre-established model parameters for that stuff.