Understanding NuPIC and troubleshooting to get the best results

After almost half a year of using NuPIC and trying to get it to work as well as possible on different data sets, I have to admit that the swarm and the model_params.py file (which the swarm generates) have been a thorn in my side for quite a while. So I hope to alleviate some of the pain by understanding the technicalities (I'm not aiming to criticize anything or anybody).


## A

First, a couple of questions regarding the config file for the swarm:

SWARM_CONFIG = {
    "includedFields": [
        {
            "fieldName": predicted_field_name,
            "fieldType": predicted_field_type,
            "maxValue": max_value,
            "minValue": min_value
        }
    ],
    "streamDef": {
        "info": predicted_field_name,
        "version": 1,
        "streams": [
            {
                "info": predicted_field_name,
                "source": "file://" + input_file_name,
                "columns": ["*"],
                "last_record": last_record
            }
        ]
    },
    "inferenceType": "TemporalAnomaly",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": predicted_field_name
    },
    "iterationCount": -1,
    "swarmSize": "medium"
}
  1. What does the “columns” attribute mean, and what are the possible values apart from “*”?
  2. Does the “last_record” attribute tell the swarm how many lines of the predicted_field column to read when swarming over the input file? Is it correct that it should be at least 3000 for good results?
  3. The “inferenceType” attribute must be set to TemporalAnomaly in order to extract anomaly scores and anomaly likelihoods from the CLA, and the other available inference types are MultiStep and TemporalMultiStep, right? I know MultiStep is explained as the type to use when one needs predictions more than 1 step ahead. But what is TemporalMultiStep, and what is the difference between MultiStep and TemporalMultiStep?
  4. What is the “iterationCount” attribute and what are the possible values?

## B
Now, on to the swarm’s generated model_params.py file.

  1. I have sometimes experienced that the swarm cannot find good (or any) encoder parameters for the predicted field values in my input file. I either get

    'sensorParams': {'encoders': {u'my_predicted_field_name': None}}

    or get extremely bad encoder parameters that completely confuse the CLA by encoding almost all values identically:

     'sensorParams': {'encoders': {u'my_predicted_field_name': {'clipInput': True,
                                                                'fieldname': 'my_predicted_field_name',
                                                                'maxval': 84.4,
                                                                'minval': 0.0,
                                                                'n': 22,
                                                                'name': 'my_predicted_field_name',
                                                                'type': 'ScalarEncoder',
                                                                'w': 21}}}
    

    Am I correct that this happens, and not just to me?

  2. Sometimes, the swarm adds a “_classifierInput” encoder, like this (actual example taken from one of my swarm runs):

     'sensorParams': {'encoders': {u'TAM': {'clipInput': True,
                                            'fieldname': 'TAM',
                                            'maxval': 23.8,
                                            'minval': -15.0,
                                            'n': 148,
                                            'name': 'TAM',
                                            'type': 'ScalarEncoder',
                                            'w': 21},
                                   '_classifierInput': {'classifierOnly': True,
                                                        'clipInput': True,
                                                        'fieldname': 'TAM',
                                                        'maxval': 23.8,
                                                        'minval': -15.0,
                                                        'n': 156,
                                                        'name': '_classifierInput',
                                                        'type': 'ScalarEncoder',
                                                        'w': 21}}}
    

    What is this _classifierInput, why is it sometimes added, and what does it do? I’ve tried removing it in my tests, and my predictions and anomaly scores became very bad.

  3. If the swarm gives me bad encoder parameters, is it actually worth replacing the encoder with the newer RandomDistributedScalarEncoder and an appropriate resolution setting? That is, can I assume that the swarm found good enough values for the parameters other than the encoder parameters, or should I assume that the swarm didn’t manage to find good values for the rest either?
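To see concretely why parameters like n=22, w=21 confuse the CLA, here is a rough sketch in plain Python. It uses the simplified relation resolution = (maxval − minval) / (n − w) from the ScalarEncoder documentation, not NuPIC's actual implementation, so treat it as an illustration only:

```python
def scalar_buckets(values, minval, maxval, n, w):
    # Simplified ScalarEncoder bucketing: the bucket width ("resolution")
    # is (maxval - minval) / (n - w). Values closer together than that
    # width land in the same bucket and get near-identical encodings.
    resolution = (maxval - minval) / float(n - w)
    return [int((v - minval) / resolution) for v in values]

# The bad swarm output above: n=22, w=21 leaves n - w = 1, so the bucket
# width is the whole input range (84.4) and every value lands in bucket 0.
print(scalar_buckets([0.0, 10.0, 40.0, 84.0], 0.0, 84.4, n=22, w=21))
# -> [0, 0, 0, 0]

# The healthier TAM encoder: n=148, w=21 gives 127 buckets over the range,
# so different temperatures map to different buckets.
print(scalar_buckets([-15.0, 0.0, 10.0, 23.8], -15.0, 23.8, n=148, w=21))
```

In other words, with n only 1 greater than w the encoder has a single bucket for the entire range, which matches the "almost all values encoded the same" behaviour described above.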


## C
When I run the CLA (in TemporalAnomaly mode) on a data set with an extremely simple and repeating pattern, everything is fine and gets predicted correctly.
However, if I generate a data set with 5000 random (integer) values between 0 and 100, it struggles, which is to be expected. Because the swarm’s encoder parameters were very bad, I had to use the SDRE (RandomDistributedScalarEncoder) with varying resolutions (I tried 1.0, 0.5, 0.1, 0.05 and 0.01).
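For reproducibility, a data set like the one described can be generated with something like this (the file name, field names, and fixed seed are my own choices; the three header rows follow the usual NuPIC CSV convention of field names, field types, then special flags):

```python
import csv
import random

random.seed(42)  # fixed seed so the "random" file is reproducible

# NuPIC-style CSV: three header rows, then 5000 random integers in [0, 100].
with open("random_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_index", "value"])  # field names
    writer.writerow(["int", "int"])           # field types
    writer.writerow(["", ""])                 # special flags (none)
    for i in range(5000):
        writer.writerow([i, random.randint(0, 100)])
```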

  1. Is it correct that when the CLA doesn’t have enough data to make a prediction (right at the beginning, or when it receives a value it has never seen before), it simply predicts that same value back?
    Example:
    Time_index=1, Value=6.0, Prediction=6.0
    Time_index=2, Value=34.0, Prediction=34.0

    Time_index=140, Value=78 (never previously seen), Prediction=78

  2. I noticed that with larger resolution values such as 1.0 and 0.5, the CLA was highly unwilling to give almost any predictions; that is, it seemingly repeated back exactly the same values it received at that time step. According to my understanding, with a resolution of 1.0 each value between 0 and 100 would get its own bucket, and there would be a high level of overlap between adjacent buckets, so I believe there should have been plenty of attempted predictions. Any ideas as to why?

  3. However, it was when using the resolutions 0.1 and 0.05 that the CLA was most liberal in its predictions. Any ideas as to why? (I know that at resolutions of 0.05 and lower, the 100 values could not all get individual buckets, so that all values from a certain point upward (e.g. 37 and up) were crammed into bucket index 999.)

  4. I noticed that when I feed the CLA very complex data that follows a rough pattern, it seems to struggle in the same way as described in question 1: it simply repeats the same values back. This became even more apparent when I ran a MultiStep prediction multiple days in advance; then I get values like this (predictions 1, 2 and 3 days ahead):
    Index, TAM, P1, P2, P3
    0, -6.7, -6.70, -6.70, -6.70
    1, -10.2, -10.20, -10.20, -10.20
    2, -6.1, -6.10, -6.10, -6.10
    3, 1.5, 1.50, 1.50, 1.50
    4, 1.2, 1.20, 1.20, 1.20
    5, -2.1, -2.10, -2.10, -2.10
    6, 0.8, 0.80, 0.80, 0.80
    7, 0.0, 0.00, 0.00, 0.00
    8, -1.1, -1.10, -1.10, -1.10
    9, -1.1, -1.10, -1.10, -1.10
    10, -1.8, -1.80, -1.80, -1.80

    4440, 2.1, 1.71, 1.71, 2.07
    4441, 3.1, 3.42, 3.70, 3.70
    4442, 4.0, 3.42, 3.70, 3.70
    4443, 5.0, 4.80, 4.50, 4.01
    4444, 3.8, 3.42, 3.70, 4.01
    4445, 3.3, 3.73, 3.73, 3.73
    4446, 4.9, 3.73, 3.73, 4.01
    4447, 3.7, 3.38, 3.73, 3.73
    4448, 4.8, 3.72, 3.72, 4.01
    4449, 4.5, 3.72, 3.72, 4.01
    4450, 5.4, 4.82, 3.72, 4.50
    And later, as things go on, the predictions for the following days are very often identical, even though the actual data never behaved that way. Any ideas/tips?
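A rough way to reason about points 2 and 3 above: as I understand it, the RandomDistributedScalarEncoder assigns bucket indices by distance from the first value it sees, with the index range capped (the cap would explain the "crammed into bucket index 999" behaviour). The sketch below is my own simplification of that bucketing arithmetic, with an assumed centre offset of 500 and an assumed cap of 1000 buckets, not NuPIC's actual code:

```python
def rdse_bucket(value, first_value, resolution, offset=500, max_buckets=1000):
    # Simplified RDSE bucketing sketch: buckets are `resolution` wide,
    # centred on the first value the encoder ever saw, and clamped to
    # max_buckets indices. offset/max_buckets are assumptions.
    idx = offset + int(round((value - first_value) / resolution))
    return max(0, min(max_buckets - 1, idx))

for resolution in [1.0, 0.5, 0.1, 0.05]:
    buckets = [rdse_bucket(v, 0.0, resolution) for v in range(101)]
    clamped = sum(1 for b in buckets if b == 999)
    print("resolution %.2f: %3d distinct buckets, %d values clamped to 999"
          % (resolution, len(set(buckets)), clamped))
```

Two things fall out of this sketch: for integer inputs, resolutions below 1.0 don't create extra distinctions, they only push adjacent integers into buckets that are further apart (so their SDRs stop overlapping); and at small resolutions the cap kicks in, so a whole tail of values collapses into the last bucket index, consistent with what I observed.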

It may seem as if these questions are rather unrelated, but I find that it all comes down to the parameters in the model_params.py file. If they are good enough, the CLA manages to produce good prediction results and anomaly scores.


Ok, so following the age-old saying that ‘the Community helps those who help themselves’, I’ve found some answers to my questions.
Most of the answers to the section A questions were found here:

A1.

It would seem that the columns attribute tells the swarm which of the available columns to use as input when running models. However, I suspect it only matters when multiple columns have been declared in the ‘includedFields’ array. For example, if an input file consists of date, metricA and metricB, and all three were individually added to includedFields, then with columns set to * all three are fed as input to the models; but if only date and metricB are listed in ‘columns’, then only those two are fed as input, even though all three are declared inside includedFields.
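Under that interpretation, a streamDef restricting the input to just date and metricB might look like the fragment below. The field and file names are just the ones from my example, and I haven't verified this against the swarm code:

```python
# Hypothetical streamDef: metricA is declared in includedFields elsewhere,
# but "columns" restricts the model input to date and metricB only.
stream_def = {
    "info": "metricB",
    "version": 1,
    "streams": [
        {
            "info": "metricB",
            "source": "file://my_input.csv",  # assumed input file name
            "columns": ["date", "metricB"],   # metricA is left out here
            "last_record": -1
        }
    ]
}
print(stream_def["streams"][0]["columns"])
```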

A2.

The last_record field specifies how many records to run. Leave it out to run over the whole file, or set it to something like 100 if you want quicker runs for debugging.

A4.

The iterationCount value gives the maximum number of aggregated records to feed into the model. If your data file contains 1000 records spaced 15 minutes apart, and you’ve specified a 1 hour aggregation interval, then you have a max of 250 aggregated records available to feed into the model. Setting this value to -1 means to feed all available aggregated records.
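The arithmetic in that explanation can be sketched directly (numbers taken from the quoted example):

```python
# 1000 records spaced 15 minutes apart, aggregated into 1-hour intervals.
total_records = 1000
record_spacing_minutes = 15
aggregation_minutes = 60

records_per_interval = aggregation_minutes // record_spacing_minutes  # 4
aggregated_records = total_records // records_per_interval
print(aggregated_records)  # -> 250

# iterationCount = -1 means "feed all available aggregated records";
# a positive value caps how many are fed into the model.
iteration_count = -1
fed = aggregated_records if iteration_count == -1 else min(iteration_count,
                                                           aggregated_records)
print(fed)  # -> 250
```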


As for the questions in section C, I had forgotten about this video, which explains that NuPIC’s ability to predict future values is a purely engineered solution with no biological basis. That means NuPIC’s current prediction accuracy is technically not representative of the HTM’s/CLA’s ability to predict values (once a fully connected, multi-region CLA algorithm is hopefully made). While this doesn’t explain why I get bad values, I guess it points to the problem being that the data is too noisy/random for the CLAClassifier to correctly predict future values.


Hey @Setus, sorry I did not respond to you earlier. And also, thanks for digging up these answers and posting them for other users!
