Is my data being predicted correctly?

I modified the hotgym prediction demo to swarm over EKG data, and I was wondering whether it is running correctly. I'm mainly concerned that the same thing happens with my modification as with the stock prediction model at the Hackathon (https://www.youtube.com/watch?v=KMnhMIv9DYc), where Jeff Hawkins says it didn't work because the best prediction was the actual data itself.


The first screenshot was taken in the middle of a run and the second screenshot was taken after it finished running. The swarm used was large.


I also tried running this data (small swarm) with anomaly detection, but no anomalies are detected, even at the start, where there clearly should be some.


Hi Addonis, I don’t work at Numenta, but I’ve been part of the community for a little while now, so I’ll try to help you out, based on what I understand as of today.

I've watched that video myself and looked at the code. It's fine really, although I've personally had problems running the plot in real time on my Ubuntu set-up, so I've moved away from it and prefer to write all of HTM's results to a CSV file and plot the data manually in LibreOffice or a similar program. To run HTM properly to its full capacity, you need to both fulfill a couple of steps and understand some things. Let's start with the "theory" behind running HTM.

  1. You need to create a configuration file to feed to the swarm. The config file has parameters that depend on properties of the input file itself (most importantly the name of the metric you wish to predict, its type, and its min-max values) and on some custom choices (how many data set lines to swarm over, where more is better but slower, and what your focus is: TemporalAnomaly or MultiStep). Once the swarm has done its thing, it will generate a file called model_params.py inside a folder called model_0 (plus a couple of other files you can ignore). Two important things to note about the generated model_params.py file: first, the swarm always sets maxBoost to 2.0, and that value must be changed to 1.0 for HTM to function properly; secondly, the swarm does not always produce good encoder parameter values. Read further down to find out how to troubleshoot and fix badly generated encoder parameter values. (A code sketch of all three steps follows this list.)

  2. Then, you need to feed the generated model_params.py file into an OPF model together with your input file, which will run HTM with those model parameters on your data.

  3. Lastly, you need to decide how to extract and display the results from the OPF model in an output file. The extraction part mostly concerns predictions. HTM learns values in real time but predicts the value of the next time step, so you must decide whether your predictions should always "lag" behind or be displayed alongside the current value. Example: you feed HTM these values in this order: 1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,… The first time HTM sees 1 through 10, it obviously knows nothing, but the second time around, do you want HTM to say it predicted 4 when 4 was fed to it (i.e., at time step 14 it had predicted the next value would be 4, and it was indeed fed 4), OR that it predicted 4 when 3 was fed (i.e., HTM was fed 3 at time step 13 and predicted the next value would be 4)? The difference between these two ways of extracting values is that predictions either always lag behind the input values or are printed simultaneously with them.
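Here is a minimal sketch of all three steps in Python, assuming NuPIC is installed and your CSV has a single metric column named metric1; the field name, file names, min-max values, and swarm options are all placeholders to adapt to your own data:

```python
import csv

from nupic.swarming import permutations_runner
from nupic.frameworks.opf.modelfactory import ModelFactory

# Step 1: the swarm configuration (every value here is an assumption
# to be adapted to your own input file).
SWARM_CONFIG = {
    "includedFields": [
        {"fieldName": "metric1", "fieldType": "float",
         "minValue": 0.7, "maxValue": 18.2},
    ],
    "streamDef": {
        "info": "metric1",
        "version": 1,
        "streams": [{
            "info": "input data",
            "source": "file://input-file.csv",
            "columns": ["*"],
        }],
    },
    "inferenceType": "TemporalMultiStep",  # or "TemporalAnomaly"
    "inferenceArgs": {"predictionSteps": [1], "predictedField": "metric1"},
    "iterationCount": 3000,  # how many data set lines to swarm over
    "swarmSize": "medium",
}

# Running the swarm returns the generated model parameters directly
# (the same content it writes out as model_0/model_params.py).
model_params = permutations_runner.runWithConfig(
    SWARM_CONFIG, {"maxWorkers": 4, "overwrite": True})

# Step 2: feed the generated parameters into an OPF model.
model = ModelFactory.create(model_params)
model.enableInference({"predictedField": "metric1"})

# Step 3: run the input through the model and extract predictions.
with open("input-file.csv") as f:
    reader = csv.reader(f)
    reader.next()  # skip the header row
    last_prediction = None
    for row in reader:
        value = float(row[0])
        result = model.run({"metric1": value})
        # "Lagging" display: the prediction made at step t is for step
        # t+1, so pair this step's value with the previous prediction.
        print "actual: %s, predicted one step ago: %s" % (value, last_prediction)
        last_prediction = result.inferences["multiStepBestPredictions"][1]
```

The lagging-versus-simultaneous choice from step 3 comes down to whether you print each prediction on the same row as the value that produced it or shifted one row down, as in the loop above.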

How to troubleshoot the swarm’s generated model_params.py file for bad encoder parameters.
Open the model_params.py file and look at the encoders section and see values such as
{'clipInput': True, 'fieldname': 'metric1', 'maxval': 18.2, 'minval': 0.7, 'n': 22, 'name': 'metric1', 'type': 'ScalarEncoder', 'w': 21}
As you can see, in this example n has a value of 22 while w has a value of 21. I'm not sure how much HTM theory you understand, but n should be much larger than that, generally between 300 and 400 and definitely not less than 100. If your generated model_params.py file has such values, I recommend using another encoder type called RandomDistributedScalarEncoder, by simply deleting all of the values above and inserting these instead:
{"name": "metric1", "fieldname": "metric1", "resolution": 0.1, "seed": 42, "type": "RandomDistributedScalarEncoder"}
I recommend watching the video about the RandomDistributedScalarEncoder to understand how it works, but here is what I understand about it: you don't need to give it min-max values, but you do need to give it a good resolution value, which depends on your data set. The encoder has 1000 buckets to put your values into, so depending on your chosen resolution, different values may fall into different buckets or the same one. Example: your data set values are [1,2,3,4,5,6,7,8,9,10] and the resolution is 1; each number then falls into its own bucket, i.e., bucket 0 holds value 1, bucket 1 holds value 2, and so on. If the resolution were 2, bucket 0 would hold both values 1 and 2, bucket 1 would hold values 3 and 4, and so on. However, if the resolution were 0.1, bucket 0 would hold value 1, buckets 1-9 would be "empty" (not really: values 1.1 through 1.9 would be stored accordingly in buckets 1-9), bucket 10 would hold value 2, buckets 11-19 would be "empty", bucket 20 would hold value 3, and so on. In my tests, I have found it preferable to have about 10 "empty" buckets between values of significant difference. So if your data set had values [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], you would need a resolution of 0.01.
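To make the bucket arithmetic concrete, here is a small sketch with NuPIC's RandomDistributedScalarEncoder (treat the absolute bucket numbers as illustrative, since the encoder picks its own internal offset; only the distances between buckets matter):

```python
from nupic.encoders.random_distributed_scalar import RandomDistributedScalarEncoder

# resolution=1.0: consecutive integers land in adjacent buckets.
enc = RandomDistributedScalarEncoder(resolution=1.0)
for value in [1, 2, 3]:
    print value, enc.getBucketIndices(value)  # indices one apart

# resolution=0.1: integers land ten buckets apart, leaving "empty"
# buckets in between for intermediate values like 1.1 ... 1.9.
enc = RandomDistributedScalarEncoder(resolution=0.1)
for value in [1, 1.5, 2]:
    print value, enc.getBucketIndices(value)  # indices ten apart
```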

My setup
I have created a template for easy all-in-one swarming, OPF running, and result outputting to a CSV file, which can be downloaded here from my Google Drive. The template consists of a couple of things: a main processing file called process_input.py, which does the all-in-one part, and a folder called model_0 with a practically empty model_params.py file, to which the model parameters will be written (plus a necessary __init__.py file). To run it all, simply add your data set input file inside the template folder, then in Ubuntu open your terminal, go to the template folder location, and write

./process_input.py input_file_name predicted_field_name last_record date_type output_file_name swarm

The predicted field name is obviously the name of the metric you wish to predict (in case you have more than 1 metric inside your input file). Last record says how many records (lines) you wish the swarm to run over; 3000 is preferable for decent results. Date type says whether you are using dates or simple numbers as your "index of reference" for your data set values: OFF if you are simply using numbers, US for American-style dates, and EU for European-style dates. Swarm is True or False, depending on whether you want to swarm over the input file first. The point is that if you have already swarmed over an input file, which filled in the model_params.py file, you may want to change some values inside model_params.py and immediately rerun HTM with that modified file without having to swarm all over again. Additionally, you shouldn't have to fill out the above-mentioned swarm configuration file, as my program automatically fills it out when reading an input file. So an example run might be

./process_input.py input-file.csv metric1 3000 OFF output-file.csv True
or
./process_input.py input-file.csv metric1 3000 OFF output-file.csv False

The file was originally based on Matt's examples, but I modified it heavily for my needs. It should hopefully be self-explanatory, but I have to admit that my Python skills are still quite basic.


Wow, your reply has a lot of information that has taught me many things I didn’t previously know. Thank you for that…:slight_smile:


Does that mean that I have to manually set it to 1.0?


I don't really understand what's happening. When I use a resolution of 0.1, my data lags behind and looks how I thought a prediction should look, but when I use a resolution of 100, my values don't lag behind and instead match my current values exactly. So either a resolution of 100 makes it better, or it actually breaks it. Could it be that I'm running out of buckets too quickly when I use a resolution of 100? I'm attaching screenshots from my large swarm runs (after they finished running) with the RandomDistributedScalarEncoder: the first picture is of the 0.1-resolution run and the second is of the 100-resolution run. I'm also attaching a picture of a run made from a swarm with just a ScalarEncoder; it looks pretty much exactly like the RandomDistributedScalarEncoder run at resolution 100. (By the way, my input is as follows: ~4000 EKG values, with ~20 zeroes at the end.)

RandomDistributedScalarEncoder resolution 0.1:


RandomDistributedScalarEncoder resolution 100:


(original) ScalarEncoder:


I tried it and it worked. I’ll try to do some stuff with it. :smiley:

You're welcome :smiley: just trying to give back to the community, after all the help I got, by helping others.

Well, you have 2 choices: yes, either set maxBoost to 1.0 manually in the model_params.py file, or do it programmatically as I do in my template, by retrieving the info from model_params.py and changing it inside the code before handing it over to the OPF. That way, you don't need to change things manually each time.
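A minimal sketch of the programmatic route, assuming the standard generated model_params.py layout (where maxBoost sits under spParams) and my template's model_0 folder:

```python
from nupic.frameworks.opf.modelfactory import ModelFactory

from model_0 import model_params  # the swarm-generated file

params = model_params.MODEL_PARAMS
# Override the swarm's default of 2.0 before handing the params to the OPF.
params["modelParams"]["spParams"]["maxBoost"] = 1.0

model = ModelFactory.create(params)
```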

Concerning your resolution problem: I can see that your EKG data ranges from about -4000 to +4000, which probably means that small differences of ±0.1 or even ±1 are practically irrelevant, I assume. If you were to use a resolution of 0.1, the encoder would create buckets that hold data like this: b1 = 0.1-0.1999, b2 = 0.2-0.2999, b3 = 0.3-0.3999, and so on for 1000 buckets (it might begin creating buckets at 4000.1-4000.1999 and beyond as well). So my point is that your actual values would almost always fall into the same few buckets, while most buckets would simply remain unused, so yes, that would explain why your results were bad with such a low resolution. With a resolution of 100, however, it becomes a lot better: your buckets would be something like b1 = 100-199, b2 = 200-299, and so on, which spans much more of your data, so that your data gets distributed much more evenly among the available buckets. Though I guess you would have to experiment a bit to find out which resolution gives you the best results.
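If you want a starting point instead of pure trial and error, one heuristic (my own rule of thumb, not something the swarm computes) is to spread the observed range of the data over some target number of buckets:

```python
def suggest_resolution(values, target_buckets=100):
    """Rough heuristic: divide the observed range of the data
    across roughly target_buckets RDSE buckets."""
    spread = max(values) - min(values)
    return float(spread) / target_buckets

# EKG data ranging from about -4000 to +4000:
print suggest_resolution([-4000, 4000])  # -> 80.0, near the 100 used above
```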

Glad to hear :smile:

As of yesterday, swarming now sets maxBoost to 1.0, so you do not have to do that if you don't want to.


Nice, thanks for the update :wink:

I'm getting big anomalies (100%) throughout the entire data set, which is odd, since the data has a lot of structure to it, I think. But maybe it's because I didn't use a large swarm and didn't swarm over too many numbers. Or maybe it's just normal to have anomalies here. (Picture below)


What kind of encoder does your template use?


By the way, it's funny how the HTM School video that Matt put up today was about ScalarEncoders and RandomDistributedScalarEncoders, since you explained them to me yesterday in your long comment. Although the video made me understand them even better. I wish there were some kind of online class about HTM theory. Maybe in the future…


Update:
After running another swarm, I think there is a clear difference between large and medium swarms and between how much data they swarm over.

Personally, I've seen so many anomaly spikes (i.e., anomaly ratings reaching 1.0, aka 100%) that I've become used to seeing them quite often. You're right, there is indeed a lot of structure to your pattern; however, I've come to understand that HTM cannot do inductive reasoning on patterns.
Example: say you had a data set such as [1, 2, 8, 9, 15, 14, 1, 2, 8, 9, 15, 14, …]. If further along the data your values suddenly changed but the overall pattern remained the same, such as […, 4, 5, 11, 12, 18, 17], then you would get anomaly ratings of about 1.0 at every one of those values. Additionally, HTM would not be able to correctly predict any of those values: given [4, 5, 11, 12, 18, ?], even though those values follow exactly the same pattern as before, HTM would not be able to predict 17. So I don't know what your exact EKG values are, but if there is some deviation in the actual values significant enough to place them in other (maybe previously unused) buckets, then HTM will 'ring the alarm' and give big anomaly ratings even though the pattern is always the same. With time, however, HTM will most likely learn more of the pattern across a variety of different values and thus not give such big anomaly ratings.

However, HTM has a limit when it comes to creating what I see as "sequential memory links". In the previous example, HTM learned the pattern 1, 2, 8, 9, 15, 14, which means that "links" were created between the values 1 and 2, 2 and 8, and so on. But what if you had a pattern such as [1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, …]? Here HTM would create links between the values 1 and 2, 2 and 1, 1 and 3, 3 and 1, and so on. At some point, no more new links could be created between 1 and X. The main factor influencing this ability seems to be the number of cells per column (according to my tests and understanding). The more cells per column (cpc), the more cells there are to represent a value, and thus the more links can be created between those cells and the cells of another column's value; so the more cpc, the more patterns can be learned between often-used values. By default, model_params.py sets the cpc to 32, and interestingly enough, after running my tests on the mentioned example [1, 2, 1, 3, 1, 4, …, 1, X], I found X to be 14, and I had to increase the cpc to 128 before I could get HTM to learn the pattern with X being 15 (I haven't tested its limits, though). My point is that depending on the nature of your data, you might have to play around a bit with cpc to see if that helps HTM learn (I know, you must be realizing there is quite a bit of fiddling involved here to get optimal results). A sketch of the change follows.
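The change itself is tiny, assuming the standard model_params.py layout (where the temporal memory settings live under tpParams):

```python
from model_0 import model_params  # the swarm-generated file

params = model_params.MODEL_PARAMS
# The default is 32; raising it (e.g. to 128) lets a frequently repeated
# value like the 1 above participate in more distinct sequence "links",
# at the cost of noticeably longer processing time.
params["modelParams"]["tpParams"]["cellsPerColumn"] = 128
```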

My template uses the default encoder set by the swarm once the model_params.py is generated, which means a simple ScalarEncoder. However, I usually edit the model_params.py file manually to use the RandomDistributedScalarEncoder if I get bad encoder results. [quote=“Addonis, post:7, topic:735”]
By the way, it’s funny how the HTM School video…
[/quote]

I agree :smile: lucky coincidence or fate?[quote=“Addonis, post:7, topic:735”]
I wish there some kind of an online class about HTM Theory. Maybe in the future…
[/quote]

Time will tell :wink:[quote=“Addonis, post:7, topic:735”]
After running another swarm I think there is a clear difference between large and medium swarms and how much data they swarm over.
[/quote]

Not surprised, I have never run a large swarm myself, as I’m afraid it would take much too long, but I guess the reward might be worth the wait in some cases.

I wonder if Jeff Hawkins is working on that. Have you heard anything?


So if I set my cpc to 2^30, what will happen? I'm tempted to try it…although I don't know if my computer can handle it. But maybe some other large number will suffice.
By the way, with the nupic.vision demo, I found that increasing the column dimension (the number of potential features) increased the success rate of the classification. Are we talking about the same kind of columns here? And is that the correct piece of code to change, or is that "cheating"? …At a column dimension of 350, it has a 99.5% success rate (with cmr_all.xml).

As of now, an HTM network operates with only 1 layer, i.e., 2048 columns with 32 cells in each column. However, our brains have many more such layers, which are arranged in regions, and the sum of those regions makes up our brain. If I remember correctly, Jeff and Co. are working on understanding how to get multiple layers and regions to communicate with each other. Once this is done, a true hierarchy will be implemented, and thus hierarchical processes may be performed, which should give rise to the possibility of higher-order thinking such as inductive reasoning. However, there's quite a step missing here, which I'm very interested in, namely the 'processing' part of it all. I'm fine with HTM 'learning' patterns (I see it more as remembering sequential changes than as learning/understanding), but I'm really looking forward to the processing part.

Well, that's quite a large number, but nothing unbearably huge. I haven't tried it myself, but I know that increasing the cpc lengthens HTM's processing time. Notice that doing so would give all 2048 columns 2^30 cells each, which I guess would take your computer forever to process, but maybe not; all I know is that it noticeably increased my processing time when I went from 32 cpc to 128, by maybe a minute or two. Additionally, I'm not sure there would be any real benefit to it, as increasing the number of buckets and the resolution can already reduce the number of sequential links created between somewhat similar values, plus the fact that our brains don't have that many cells per column.

Sorry, I haven’t even looked into nupic.vision, so I don’t know anything about it :sweat_smile:

Hmm, I wonder how far along they are in that research. Is there some place I can read about their progress?


I'm not sure if I have to modify the code to be able to do this, but if not, how do I specify that I want to run over data that has multiple fields (organized in separate columns of the .csv file)? I tried "python process_input.py ecg_1.csv I II III aVR aVL aVF V1 V2 V3 V4 V5 V6 20000 OFF ecg_1_output.csv True" but that's not right, so it didn't work.

Don’t know unfortunately, you’ll have to ask others on this site.


I'm using my template myself, and I had to update it to support multiple fields of data; you can now also get it here. Keep in mind that even if you feed multiple fields of data to HTM, it can still only predict 1 metric. The way to use it now is as follows

./process_input.py input_file_name output_file_name last_record date_type predicted_field_name other_included_field_name, etc…

You use it similarly to the previous file, except this time, instead of the swarm flag, you specify how many line records the swarm should swarm over, or -1 to not run the swarm first. Since it supports multiple data fields, the data field you want to predict must be the first data field provided; after it you can add as many other data fields as are inside your input file.
Example in use:

./process_input.py input-file.csv output-file.csv 3000 EU metric-to-predict additional-metric1 additional-metric2

or

./process_input.py input-file.csv output-file.csv -1 OFF metric-to-predict additional-metric3


This new template will also output a prediction quality analysis report, which works as follows:
Say that HTM is fed a set of values with some patterns that it learns. At a certain point in time, the fed values are [5, 8, 2, 9, 4, 6, 7] and the respective predicted values at those same time points are [6, 5, 2, 10, 5, 6, 8]. HTM will rarely predict perfectly, but some sets of model_params.py parameters lead to better prediction results than others. A quick way to analyze how good the predicted results are compared to the actual values (instead of plotting everything inside a spreadsheet document) is to analyze both the absolute and the relative difference. First, the absolute difference between all value pairs is calculated and summed up at the end. However, there is an unfairness here: if the data consisted of only small values (or only very large values), and HTM predicted values that were off but also small, the summed absolute difference would be comparatively smaller than the summed absolute difference for larger values. To make up for that, the largest and smallest values in the predicted-data-metric set are found and an absolute largest difference is calculated. Then all of the calculated absolute differences between fed and predicted values are expressed as a percentage of that absolute largest difference. Finally, a record of percentages is displayed, i.e., what percentage of predicted values were off by 0-0.99% of the absolute largest difference, what percentage were off by 1-25%, and so on…
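Here is a minimal sketch of that analysis as I understand it (the function name and the percentage bands are my own guesses at the template's behavior, not its exact code):

```python
def prediction_quality_report(actual, predicted):
    """Compare predictions against the actually fed values, relative
    to the largest difference possible within the predicted data."""
    abs_diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    print "summed absolute difference:", sum(abs_diffs)

    # Normalize by the largest conceivable difference so data sets of
    # small and large values are judged on the same scale.
    largest_diff = float(max(predicted) - min(predicted))
    bands = [(0, 1), (1, 25), (25, 50), (50, 100)]  # assumed bands
    for low, high in bands:
        count = sum(1 for d in abs_diffs
                    if low <= 100 * d / largest_diff < high)
        print "off by %d-%d%% of the largest difference: %.1f%% of values" % (
            low, high, 100.0 * count / len(abs_diffs))

# The example values from above:
prediction_quality_report([5, 8, 2, 9, 4, 6, 7], [6, 5, 2, 10, 5, 6, 8])
```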


Note that utilizing dates (EU/US) instead of simple indexes (OFF) means the date is treated as an additional metric by HTM, as if you were doing this

./process_input.py input-file.csv output-file.csv 3000 metric-to-predict date-EU

By the way, since you intend to use multiple fields of data, I strongly suggest you read this post here

Sorry for the stupid questions to come,

By "it can only predict one metric", do you mean that it can only predict one type of data (e.g., float) when running?

The data field that is in the first column of the .csv? But it still predicts all of them, right? (I don’t understand how it works too well, as you can see).

I’ll try to read the wiki page on this again, later today.


I think I did something wrong.

$ ./process_input.py ecg_1.csv ecg_1_output.csv 20000 OFF I II III aVR aVL aVF V1 V2 V3 V4 V5 V6
Traceback (most recent call last):
  File "./process_input.py", line 351, in <module>
    configure_swarm_parameters(input_file_name, last_record, date_type, included_fields)
  File "./process_input.py", line 180, in configure_swarm_parameters
    current_field_value = float(stored_input_file[current_field_index][current_field_row])
ValueError: could not convert string to float: 

You can see how my data is organized in the picture. It's pretty standard CSV format.

Say you have an input file with multiple metrics inside it, i.e., multiple data sets separated into columns such as date, metric1, metric2, metric3. HTM can only work on predicting 1 metric at a time, but it can use the other metrics as help if there are correlations to be found between the metrics. So, for example, if you were to give it an input file with only metric2 inside, it might have a 67% correct prediction rate for metric2. But if you were to give it an input file with metric2, metric3, and metric4, it might find correlations between metric2 and metric3 but not between metric2 and metric4. This means that HTM would ignore metric4 and feed itself only with metric2 and metric3, and probably get an increased correct prediction rate for metric2 (say 75%, for example) thanks to the correlation it found between metric2 and metric3.
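In swarm-configuration terms, offering helper metrics just means listing them in includedFields while naming a single predictedField; an abbreviated sketch (field names from the example above; streamDef and the other required keys are omitted here):

```python
# Abbreviated swarm config: predict metric2, offer metric3 and metric4
# as candidate helpers (the swarm decides which ones actually help and
# excludes the rest from the generated model_params.py).
SWARM_CONFIG = {
    "includedFields": [
        {"fieldName": "metric2", "fieldType": "float"},
        {"fieldName": "metric3", "fieldType": "float"},
        {"fieldName": "metric4", "fieldType": "float"},
    ],
    "inferenceType": "TemporalMultiStep",
    # Only one field can be predicted per model run:
    "inferenceArgs": {"predictionSteps": [1], "predictedField": "metric2"},
}
```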

First, did you download the latest template that I provided in my previous post from 1 day ago, as I have updated the code for that template to support multiple fields?

Secondly, your input file needs to follow a bare minimum of structure. The first column must be your 'index of reference' column, such as a date (running in chronological order) or simply an index (running chronologically). What I meant about the predicted column being first is that you must supply the name of that column first in your command line, after having input all of the other obligatory values. Here are 2 examples.

If I wanted to predict the metric TAM and include all of the other metrics as suggested additional correlation help, I would have to write

./process_input.py input-file.csv output-file.csv 3000 EU TAM FFM PRM RR UM

And the 2nd example

This time, if I wanted to predict PRM and only include the additional metrics TAM and UM I would have to write

./process_input.py input-file.csv output-file.csv 3000 OFF PRM TAM UM


Note that the reason I say suggested additional metrics is that if you were, for example, to process my first example, a swarm would be run first, and the swarm would do 2 things:

  1. It would try to find the best parameters that optimally predict metric TAM
  2. It would try to find out if there are correlations between the other metrics (FFM, PRM, RR, UM) and TAM that help HTM in predicting TAM even better.
    2.1. If it does find a positive correlation between, say, RR and TAM, it will try to find good encoder parameters for RR and include those inside the model_params.py file. (In my experience, those added-metric encoder parameters are always bad, and I always have to change them manually to use the RandomDistributedScalarEncoder, or even exclude them myself.)
    2.2. If it does not find a positive correlation between RR and TAM, it will “exclude” that metric from the model_params.py file by writing “None” near the value RR.

Then, once the swarm is finished, the model_params.py file is generated, and the additional metrics have been either included or excluded, a new HTM will be run with that model_params.py file; this time, the HTM will, depending on the config file, include or exclude the other additional metrics when trying to predict the metric-to-predict.

Hmm, that makes sense. I guess what sucks is that I would then have to rerun it each time. Or maybe the code can somehow be changed to handle more metrics…unless the process you described is a fundamental limitation of the system.


yes

okay, that’s a bit different from the way I did it last time then. When it was just one metric, I just had my .csv file have one column with just the data (no index column).

Correct, from what I understand, that is indeed how HTM works, which means that you'll indeed have to run it for every metric you wish to predict.

First off, kudos to both of you for having this conversation while I was away. Thanks @Setus for helping @Addonis out!

Try running the anomaly likelihood instead of the raw anomaly score. This is not actually a part of the NuPIC model, but a post-process:

https://github.com/numenta/nupic.workshop/blob/master/part-1-scalar-input/nupic_workshop/utils.py#L49-L52

As shown in the code above, you create an instance of the anomaly likelihood class and it maintains state. You pass it the value, anomaly score, and timestamp for each point and it will return a more stable anomaly indication than the raw anomaly score.
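In code, that looks roughly like this (AnomalyLikelihood is the real NuPIC class; the loop and the sample values are a sketch, and note the class needs a few hundred records before its output becomes meaningful):

```python
from datetime import datetime, timedelta

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

anomaly_likelihood = AnomalyLikelihood()  # maintains state across calls

timestamp = datetime(2016, 1, 1)
for value, raw_score in [(4000.0, 1.0), (3950.0, 0.2), (-100.0, 1.0)]:
    # raw_score would come from result.inferences["anomalyScore"]
    likelihood = anomaly_likelihood.anomalyProbability(value, raw_score, timestamp)
    if likelihood > 0.9999:  # a commonly used threshold; tune to taste
        print "anomaly at", timestamp
    timestamp += timedelta(seconds=1)
```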

Here is a detailed explanation of the anomaly likelihood (with awful audio and video, sorry):

See Preliminary details about new theory work on sensory-motor inference.

Thank you @rhyolight I appreciate it :blush:
Since I got so much help from the community, I just had to give something back by helping others, especially on a topic that is so relevant and important for the many future users/testers who will undoubtedly wonder the same things.


Yea, I’ve learned a ton from Setus.


So can I just download utils.py to do this, or should I put that in a different Python file or something?[quote=“rhyolight, post:17, topic:735”]
Here is a detailed explanation of the anomaly likelihood
[/quote]
I’ve seen that video, but I’ll try to rewatch it if I have time.


Oh, I've read that post… I just didn't think that what I was talking about here (creating more than one layer of cortical columns) is what Jeff Hawkins is working on now (I thought it was some kind of algorithm to better receive and interpret/understand sensory-motor input)…I'll have to watch the recent video on that as well…

No, you can copy and paste those few lines of code (along with the import statement and initialization of an instance of the class earlier in the file) into your project. It’s just an example of how to use it. We don’t have a good document on it yet.