Is my data being predicted correctly?

Hi Addonis, I don’t work at Numenta, but I’ve been part of the community for a little while now, so I’ll try to help you out, based on what I understand as of today.

I’ve watched that video myself and looked at the code. It’s fine, really, although I’ve personally had problems running the plot in real time on my Ubuntu set-up, so I’ve moved away from it and prefer to write all of HTM’s results out to a CSV file and plot the data manually in LibreOffice or some similar program. To run HTM properly, to its full capacity, you need to follow a few steps and understand a few things along the way. Let’s start with the “theory” behind running HTM.
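
If you go the same route, here is a minimal sketch of what that CSV dumping can look like (the column names and values below are just placeholders, not something my template prescribes):

```python
import csv

# Placeholder rows: (actual value, HTM's prediction for it).
rows = [(1.0, 0.9), (2.0, 2.1), (3.0, 2.8)]

with open("output-file.csv", "wb") as f:   # "wb" because NuPIC runs on Python 2
    writer = csv.writer(f)
    writer.writerow(["actual", "predicted"])
    writer.writerows(rows)

# The resulting file can then be opened and plotted in LibreOffice.
```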

  1. You need to create a configuration file which is fed to the swarm. The config file has parameters that depend both on properties of the input file itself (most importantly the name of the metric that you wish to predict, its type and its min/max values) and on some custom choices (how many data set lines to swarm over, the more the better but also slower, and what your focus is: TemporalAnomaly or MultiStep); a sketch of such a configuration appears right after this list. Once the swarm has done its thing, it will generate a file called model_params.py inside a folder called model_0 (and a couple of other files, but you can ignore them). There are two important things to note about the generated model_params.py file: first, the swarm will always set the value of maxBoost to 2.0, and that value must be changed to 1.0 for HTM to function properly; second, the swarm algorithm does not always give very good encoder parameter values. See further down for how to troubleshoot and fix badly generated encoder parameter values.

  2. Then, you need to feed that generated model_params.py file, together with your input file, into an OPF model, which will run HTM on your input data with those model parameters.

  3. Lastly, you need to decide how to extract your results from the OPF model and display them in an output file. The extraction part mostly concerns predictions. HTM learns values in real time but gives a prediction for the next time step, so you have to decide whether you want your predictions to always “lag” behind or to be displayed at the same time as the current value. Example: you feed HTM these values in this order: 1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,… The first time HTM sees 1 through 10 it obviously can’t know anything, but the second time around, do you want HTM to say that 4 was predicted when 4 was fed to it (i.e. the prediction of 4 is reported at time step 14, the same step at which 4 is actually fed in), OR that it predicted 4 when 3 was fed (i.e. HTM was fed 3 at time step 13 and predicted right then that the next value would be 4)? The difference between those two ways of extracting values is that the predictions will either always lag behind the input data set values or be printed simultaneously with them. A sketch of both the model run and this shifting follows this list.
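
To make step 1 concrete, here is a rough sketch of what such a swarm configuration typically looks like (a Python dict handed to NuPIC’s permutations_runner); the field names, min/max values, file path and record count below are made up for illustration:

```python
from nupic.swarming import permutations_runner

# Illustrative swarm description; all field names and values are placeholders.
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {"fieldName": "metric1", "fieldType": "float",
         "minValue": 0.7, "maxValue": 18.2},
    ],
    "streamDef": {
        "info": "metric1",
        "version": 1,
        "streams": [{
            "info": "input data",
            "source": "file://input-file.csv",
            "columns": ["*"],
            "last_record": 3000,   # how many data set lines to swarm over
        }],
    },
    "inferenceType": "TemporalAnomaly",   # or a MultiStep inference type
    "inferenceArgs": {"predictionSteps": [1], "predictedField": "metric1"},
    "iterationCount": -1,
    "swarmSize": "medium",
}

# Kick off the swarm; this returns the best model's parameters as a dict,
# which you can then write out as model_params.py inside the model_0 folder.
model_params = permutations_runner.runWithConfig(
    SWARM_DESCRIPTION, {"maxWorkers": 4, "overwrite": True})
```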

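And here is a sketch of steps 2 and 3 put together, assuming the swarm has already produced model_0/model_params.py and that the predicted metric is called metric1 (both names taken from my examples; everything else is placeholder):

```python
from nupic.frameworks.opf.modelfactory import ModelFactory

from model_0 import model_params   # the file generated by the swarm

# Step 2: build an OPF model from the swarmed parameters and tell it
# which field to predict.
model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({"predictedField": "metric1"})

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 3
previous_prediction = None

for step, value in enumerate(values, start=1):
    # If your model also expects a timestamp field, include it in this dict.
    result = model.run({"metric1": value})
    prediction = result.inferences["multiStepBestPredictions"][1]

    # Step 3, option A: print the prediction on the row where it was made,
    # i.e. next to the value it was made from.
    print("%d, actual=%s, predicted next=%s" % (step, value, prediction))

    # Step 3, option B: shift by one row, so each value is printed next to
    # the prediction that was made for it one step earlier.
    # print("%d, actual=%s, was predicted=%s" % (step, value, previous_prediction))
    previous_prediction = prediction
```
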
How to troubleshoot the swarm’s generated model_params.py file for bad encoder parameters.
Open the model_params.py file, look at the encoders section and check for values such as:
{'clipInput': True, 'fieldname': 'metric1', 'maxval': 18.2, 'minval': 0.7, 'n': 22, 'name': 'metric1', 'type': 'ScalarEncoder', 'w': 21}
As you can see in this example, n has a value of 22 while w has a value of 21. I’m not sure how much HTM theory you understand, but n should be much larger than that, generally between 300 and 400, and definitely not less than 100. If your generated model_params.py file has such values, I recommend using another encoder type called RandomDistributedScalarEncoder, by simply deleting all of the values above and inserting these instead:
{"name": "metric1", "fieldname": "metric1", "resolution": 0.1, "seed": 42, "type": "RandomDistributedScalarEncoder"}
I recommend looking at the video about the RandomDistributedScalarEncoder to understand how it works, but here is what I understand about it: you don’t need to give it min/max values, but you do need to give it a good resolution value, and a good resolution value depends on your data set values. The encoder has 1000 buckets to put your values into, so depending on your chosen resolution, different values may fall into different or the same buckets. Example: if your data set values are [1,2,3,4,5,6,7,8,9,10] and the resolution is 1, each number will fall into its own bucket, i.e. bucket 0 will hold value 1, bucket 1 holds value 2 and so on. If the resolution were 2, bucket 0 would hold both values 1 and 2, bucket 1 would hold values 3 and 4 and so on. However, if the resolution were 0.1, bucket 0 would hold value 1, then buckets 1-9 would be “empty” (not really empty: values 1.1 through 1.9 would be stored accordingly inside buckets 1-9) and bucket 10 would hold value 2, then buckets 11-19 would be “empty” and bucket 20 would hold value 3, and so on. In my tests, I have found it preferable to have about 10 “empty” buckets between values of significant difference. So if your data set had the values [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], you would need a resolution of 0.01.
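
If you want to sanity-check a resolution choice, a quick way (assuming NuPIC is installed; the values below are made up) is to ask the encoder directly which bucket each value lands in:

```python
from nupic.encoders.random_distributed_scalar import (
    RandomDistributedScalarEncoder)

values = [1, 2, 3, 4, 5]

for resolution in (1.0, 0.1):
    encoder = RandomDistributedScalarEncoder(resolution=resolution, seed=42)
    buckets = [encoder.getBucketIndices(v)[0] for v in values]
    print("resolution %s -> bucket indices %s" % (resolution, buckets))

# With resolution 1.0, consecutive integers land in adjacent buckets; with
# resolution 0.1 they land 10 buckets apart, leaving room for intermediate
# values like 1.1 through 1.9 in the buckets in between.
```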

My setup
I have created a template for easy all-in-one swarming, OPF running and result output to a CSV file, which can be downloaded here from my Google Drive. The template consists of a couple of things: a main processing file called process_input.py which does the all-in-one part, and a folder called model_0 containing a practically empty model_params.py file, to which the swarm’s model parameters will be written (plus a necessary __init__.py file). To run it all, simply put your input data set file inside the template folder, then open a terminal in Ubuntu, go to the template folder location and write

./process_input.py input_file_name predicted_field_name last_record date_type output_file_name swarm

The predicted field name is obviously the name of the metric which you wish to predict (in case you had more than one metric inside your input file). Last record says how many records (lines) you wish the swarm to run over; 3000 is preferable for decent results. Date type says whether you are using dates or plain numbers as your “index of reference” for your data set values: OFF if you are simply using numbers, US for American-style dates and EU for European-style dates. Swarm is True or False, depending on whether you want to swarm over the input file first. The point is that if you have already swarmed over an input file, which has filled in the model_params.py file, you may want to change some values inside model_params.py and immediately rerun HTM with that modified file without having to swarm all over again. Additionally, you shouldn’t have to fill out the above-mentioned swarm configuration file, as my program fills it out automatically when reading an input file. So an example run might be

./process_input.py input-file.csv metric1 3000 OFF output-file.csv True
or
./process_input.py input-file.csv metric1 3000 OFF output-file.csv False

The file was originally based on Matt’s examples, but I modified it heavily for my needs. It should hopefully be self-explanatory, but I have to admit that my Python skills are still quite basic.
