After almost half a year of using NuPIC and trying to get it to work as well as possible on different data sets, I have to admit that the swarm, and the model_params.py file it generates, have been a thorn in my side for quite a while now. I hope to alleviate some of the pain by understanding the technicalities (I'm not aiming to criticize anything or anybody).
##A
First, a couple of questions regarding the config file for the swarm:
SWARM_CONFIG = {
    "includedFields": [
        {
            "fieldName": predicted_field_name,
            "fieldType": predicted_field_type,
            "maxValue": max_value,
            "minValue": min_value
        }
    ],
    "streamDef": {
        "info": predicted_field_name,
        "version": 1,
        "streams": [
            {
                "info": predicted_field_name,
                "source": "file://" + input_file_name,
                "columns": ["*"],
                "last_record": last_record
            }
        ]
    },
    "inferenceType": "TemporalAnomaly",
    "inferenceArgs": {
        "predictionSteps": [1],
        "predictedField": predicted_field_name
    },
    "iterationCount": -1,
    "swarmSize": "medium"
}
- What does the "columns" attribute mean, and what are the possible values apart from "*"?
- Does the "last_record" attribute tell the swarm how many lines of the predicted_field column to read when swarming over the input file? Is it correct that it should be at least 3000 for good results?
- The "inferenceType" attribute must be set to TemporalAnomaly in order to extract anomaly scores and anomaly likelihoods from the CLA, but the other available inference types are MultiStep and TemporalMultiStep, right? I know that MultiStep is meant to be used when one needs predictions more than 1 step ahead, but what is TemporalMultiStep, and what are the differences between MultiStep and TemporalMultiStep?
- What is the "iterationCount" attribute and what are its possible values?
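For context, this is roughly how I hand the config to the swarm (a minimal sketch; the worker count, output label and working directory here are assumptions, not my actual setup):

```python
# Minimal sketch of running the swarm with SWARM_CONFIG defined above.
# maxWorkers, outputLabel and permWorkDir are placeholder assumptions.
from nupic.swarming import permutations_runner

model_params = permutations_runner.runWithConfig(
    SWARM_CONFIG,
    {"maxWorkers": 4, "overwrite": True},
    outputLabel="my_swarm",        # assumed label for the generated artifacts
    permWorkDir="./swarm_work",    # assumed working directory
    verbosity=0)
# model_params is the dict that otherwise ends up in model_params.py
```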
##B
Now, on to the swarm's generated model_params.py file.
- I have sometimes experienced that the swarm cannot find good encoder parameters, or any at all, for the predicted field's values in my input file. I either get
'sensorParams': {'encoders': {u'my_predicted_field_name': None}}
or I get extremely bad encoder parameters that would completely confuse the CLA by encoding almost all values to be the same:
'sensorParams': {'encoders': {u'my_predicted_field_name': {'clipInput': True, 'fieldname': 'my_predicted_field_name', 'maxval': 84.4, 'minval': 0.0, 'n': 22, 'name': 'my_predicted_field_name', 'type': 'ScalarEncoder', 'w': 21}}}
Am I correct that this happens, and that it's not just me?
- Sometimes the swarm adds a "_classifierInput" encoder, like this (actual example taken from one of my swarm runs):
'sensorParams': {'encoders': {u'TAM': {'clipInput': True, 'fieldname': 'TAM', 'maxval': 23.8, 'minval': -15.0, 'n': 148, 'name': 'TAM', 'type': 'ScalarEncoder', 'w': 21}, '_classifierInput': {'classifierOnly': True, 'clipInput': True, 'fieldname': 'TAM', 'maxval': 23.8, 'minval': -15.0, 'n': 156, 'name': '_classifierInput', 'type': 'ScalarEncoder', 'w': 21}}}
What is this _classifierInput, why is it sometimes added, and what does it do? I've tried removing it in my tests, and my predictions and anomaly scores became very bad.
- If the swarm gives me bad encoder parameters, is it actually worth replacing the encoder with the newer RandomDistributedScalarEncoder and an appropriate resolution setting (along the lines of the sketch below)? That is, can I assume that the swarm found good enough values for the parameters other than the encoder parameters, or should I assume that the swarm didn't manage to find good values for the rest either?
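This is roughly how I do the replacement by hand (a minimal sketch; the field name and resolution value are placeholders, not values the swarm produced):

```python
# Sketch: swap the swarm's ScalarEncoder for an RDSE inside model_params.
# "my_predicted_field_name" and the resolution are placeholder assumptions.
model_params["modelParams"]["sensorParams"]["encoders"]["my_predicted_field_name"] = {
    "fieldname": "my_predicted_field_name",
    "name": "my_predicted_field_name",
    "type": "RandomDistributedScalarEncoder",
    "resolution": 0.1,  # chosen from the spread of the input values
    "seed": 42,
}
```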
##C
When I run the CLA (in TemporalAnomaly mode) on a data set with an extremely simple and repeating pattern, everything is fine and gets predicted correctly.
However, if I generate a data set with 5000 random (integer) values between 0 and 100 (roughly as in the sketch below), it struggles, which is to be expected. Because the swarm's encoder parameters were very bad, I had to use the RDSE with varying resolutions (I tried 1.0, 0.5, 0.1, 0.05 and 0.01).
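For reference, the random test file is generated along these lines (a sketch; the field name and file name are just examples, and I'm assuming a single scalar column with the usual three NuPIC header rows):

```python
# Sketch: write 5000 random integers in [0, 100] in NuPIC's CSV file format.
# Field and file names are illustrative, not taken from my real data.
import csv
import random

with open("random_data.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])   # field names
    writer.writerow(["float"])   # field types
    writer.writerow([""])        # special flags (none here)
    for _ in range(5000):
        writer.writerow([random.randint(0, 100)])
```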
- Is it correct that when the CLA doesn't have enough data to give any predictions (right at the beginning, or when it gets values it has never seen before), it simply predicts that same value?
Example:
Time_index=1, Value=6.0, Prediction=6.0
Time_index=2, Value=34.0, Prediction=34.0
…
Time_index=140, Value=78 (never previously seen), Prediction=78
- I noticed that with resolution values such as 1.0 and 0.5, the CLA was highly unwilling to give almost any predictions; that is, it seemingly repeats back exactly the same values it receives at that exact time step. According to my understanding, with a resolution of 1.0 each value between 0 and 100 would get its own bucket, and there would be a high level of overlap between adjacent buckets, so I believe there should have been plenty of attempted predictions. Any ideas as to why?
- However, it was when using resolutions 0.1 and 0.05 that the CLA was most liberal with its predictions. Any ideas as to why? (I know that at a resolution of 0.05 and lower, all 100 values could not get their own individual buckets, so that all values above a certain point, for example 37 and up, were crammed into bucket index 999; see the small encoder sketch at the end of this section.)
- I noticed that when I feed the CLA very complex data that follows a rough pattern, it seems to struggle in the same way as described in the first point above: it simply repeats the same values back. This becomes even more apparent when I run a MultiStep prediction for multiple days in advance; then I get values like this (predictions 1, 2 and 3 days ahead):
Index, TAM, P1, P2, P3
0, -6.7, -6.70, -6.70, -6.70
1, -10.2, -10.20, -10.20, -10.20
2, -6.1, -6.10, -6.10, -6.10
3, 1.5, 1.50, 1.50, 1.50
4, 1.2, 1.20, 1.20, 1.20
5, -2.1, -2.10, -2.10, -2.10
6, 0.8, 0.80, 0.80, 0.80
7, 0.0, 0.00, 0.00, 0.00
8, -1.1, -1.10, -1.10, -1.10
9, -1.1, -1.10, -1.10, -1.10
10, -1.8, -1.80, -1.80, -1.80
…
4440, 2.1, 1.71, 1.71, 2.07
4441, 3.1, 3.42, 3.70, 3.70
4442, 4.0, 3.42, 3.70, 3.70
4443, 5.0, 4.80, 4.50, 4.01
4444, 3.8, 3.42, 3.70, 4.01
4445, 3.3, 3.73, 3.73, 3.73
4446, 4.9, 3.73, 3.73, 4.01
4447, 3.7, 3.38, 3.73, 3.73
4448, 4.8, 3.72, 3.72, 4.01
4449, 4.5, 3.72, 3.72, 4.01
4450, 5.4, 4.82, 3.72, 4.50
And later, as things go on, the predictions for the following days are very often identical, even though the actual data never behaves that way. Any ideas/tips?
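For completeness, this is the kind of quick check I use to see how a given resolution buckets the raw values 0 to 100 (a sketch; the w and n values are defaults I assume here, not parameters taken from the swarm's model_params):

```python
# Sketch: inspect which RDSE bucket each raw value lands in for a given resolution.
# w=21 and n=400 are assumed values, not taken from my model_params.
from nupic.encoders.random_distributed_scalar import RandomDistributedScalarEncoder

rdse = RandomDistributedScalarEncoder(resolution=1.0, w=21, n=400)
for v in (0, 1, 37, 78, 100):
    print("%d -> bucket %s" % (v, rdse.getBucketIndices(v)))
```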
It may seem as if these questions are rather unrelated, but underneath it all I find that everything comes down to the parameters in the model_params.py file: if they are good enough, the CLA manages to give good prediction results and anomaly scores.