I have a problem with the predictions that NuPIC makes and I am trying to figure out if there is a mistake or not.
My predicted variable is set to be clipped by the encoder at some minimum and maximum values (MODEL_PARAMS). Nevertheless, NuPIC frequently yields predictions that lie outside of that range. As far as I understand the CLA classifier, this should not be possible. So my question is whether it is possible to have values predicted outside of the clipped range.
I also have a minor secondary question: is it possible to nullify the input space for one particular variable? I have some missing data, but want to make full use of the other variables at the same point in time. So instead of passing an encoded ‘0’, I would like to pass nothing at all for that area of the input space.
Can you paste your encoder parameters for this field from the model params just to verify?
I don’t think this should happen.
I suggest dealing with missing field data by re-using the last data point. It is better than an empty SDR or random noise.
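To illustrate the suggestion above, here is a minimal sketch of re-using the last data point before records are handed to the model. It is plain Python rather than anything built into NuPIC; the function name `fill_missing_with_last` and the use of `None` as the missing-value marker are assumptions made for the example.

```python
def fill_missing_with_last(records, field, missing=None):
    """Return copies of `records` in which a missing `field` value is
    replaced by the most recent non-missing value of that field.

    `records` is a list of dicts. The `missing` marker (None here) is an
    assumption for this sketch; a leading gap stays missing because there
    is nothing to re-use yet.
    """
    filled = []
    last_seen = missing
    for record in records:
        row = dict(record)  # copy so the caller's data is left untouched
        if row.get(field, missing) == missing:
            row[field] = last_seen
        else:
            last_seen = row[field]
        filled.append(row)
    return filled


rows = [
    {"a": 1.0, "b": 5},
    {"a": None, "b": 6},  # "a" is missing here, "b" is still usable
    {"a": 3.0, "b": 7},
]
filled_rows = fill_missing_with_last(rows, "a")
```

The other fields of the record are passed through unchanged, so the model still sees them at the same point in time.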
An architectural/procedural suggestion just came to mind about this.
Numenta might want to:
Come up with “known” solutions for these types of things, and build them into the Network API? For instance, for missing data there could be a symbol or other directive/syntactic mark one could put in their data file to indicate a missing data point. Then internally, the NAPI could see this and do whatever the “best practice” is for that known case, such as re-insert the last line as you suggested.
Start building best practices for known corner cases and handling them in such a way that the user doesn’t have to “figure out” the best way or experiment / fiddle with the API. This way, if a “best practice” changes for things such as missing data, then Numenta only has to change the use case implementation underneath, without the user having to be concerned?
As the API starts to become more “hardened” (i.e. approaches 1.0 status), these considerations should have formal treatments?
I believe this would take a change not only to the NAPI, but also the encoders API. Encoders would need to keep a 1-row history of seen data to provide this to the HTM to re-insert the last line.
Also, I’m not certain that re-inserting the last line is always going to be the best tactic in every case. I would not want to hard-code that behavior.
Perhaps the OPF could add this functionality, but I don’t think the NAPI is the right place.
I wasn’t suggesting “hard-coding” anything, really. My point is that minutiae of usage should start to migrate out of the user’s domain and into the API’s or infrastructure’s domain, so that eventually corner cases have a way to have their “treatments” evolve along with the software and current methodologies without the user ever having to be concerned. I couldn’t care less what the actual solution for the above problem is, just that such problems start to be captured in a way where their solutions move into the domain of the release.
I was using the “missing data” example only as a concrete example of when this type of handling might become beneficial as a generic way to evolve the software.
Not disagreeing with you, just saying I think this functionality would be better expressed as a part of a higher level API like the OPF.
Ah. Cool. Hmmm… I think I suggested the NAPI because there’s a corollary for it in HTM.Java, and also because this particular solution might involve a “marker” in a CSV file that gets interpreted by a “sensor”, which I believe is in the domain of the NAPI (underneath the OPF)? But I can’t really comment on the specific solution as I’m not as familiar with the Python or C++ NAPI code…
…but procedurally it might be added to a “list” of things to inboard?
Thank you for your help!
This is the entry for _classifierInput:
In this case, ‘topredict’ represents a ratio of some items on a balance sheet, sometimes exceeding 1 or dipping below -1.
Thank you for your suggestion on how to handle missing data, Rhyolight, but I think it might not be ideal for my particular situation. My data consists of separate sequences and not one continuous stream of data, so I will try out using noise, as you mentioned.
If the gap is a part of the sequence, you can just encode the gap as a unique input. For example, if you are encoding notes in a song, and the song includes rests, you should probably encode the rests in a similar fashion as the notes. This is easy when you are encoding categories, because there is no overlap between encodings. But if you have numerical encodings, you’ll have to try to decide what the gap in your data means, and how it should be represented in the semantic encoding.
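As a toy illustration of the category case, the sketch below gives every category, including the gap ("rest"), its own block of bits so that no two encodings overlap. This is not NuPIC's category encoder; `encode_category`, the category list, and the block width are all made up for the example.

```python
def encode_category(value, categories, width=3):
    """Toy non-overlapping encoding: each category owns `width` bits."""
    index = categories.index(value)
    bits = [0] * (len(categories) * width)
    for i in range(index * width, (index + 1) * width):
        bits[i] = 1
    return bits


# The rest is treated as just another category next to the notes.
NOTE_CATEGORIES = ["rest", "C", "D", "E"]

rest_bits = encode_category("rest", NOTE_CATEGORIES)
c_bits = encode_category("C", NOTE_CATEGORIES)

# Number of bit positions that are active in both encodings.
overlap = sum(a & b for a, b in zip(rest_bits, c_bits))
```

For numerical fields there is no equally obvious choice, because nearby values share bits on purpose and the gap would have to be placed somewhere relative to them.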
Yes, I understand.
I keep running into the same issue. Is it possible that the classifier uses the original value of the predicted variable instead of the clipped value? If I understand correctly, the classifiers associate each bucket with a moving average of values that fell into that bucket. Is the value used supposed to be the original value?
Have you tried using the SDR Classifier? This is more recent. (@ycui you might have some input on this discussion.)
Thank you for your reply. Yes, actually now, I am working with the SDR classifier. Or at least I think so, in my model params it looks like this:
For me, it is not really a problem; I just clip the values beforehand. Nonetheless, I thought it might be a detail that was overlooked and could be fixed, unless, of course, it is intentional.
The SDRClassifier predicts a probability distribution over a set of “buckets” that cover the input space. It also tracks the rolling average of the inputs that get assigned to each bucket. The average value of the most likely bucket is returned as the predicted value. I think this issue is due to how we track the average.
Let’s say that the input space goes from 0 to 100, and that the encoder divides that space into 10 buckets. If the classifier input is bucketIdx=10, actValue=200, the average value for inputs that lie in the 10th bucket can become larger than 100, because the unclipped actual value goes into the average. The next time the classifier predicts the 10th bucket, you will see a predicted value larger than 100.
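The effect can be reproduced with a small stand-alone sketch of a per-bucket rolling average. This is not the SDRClassifier code: the class `BucketAverages`, the exponential update rule, and the smoothing factor `alpha=0.3` are assumptions chosen only to mirror the description above.

```python
class BucketAverages(object):
    """Toy per-bucket rolling average of the actual (unclipped) values."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha  # assumed smoothing factor for this sketch
        self.values = {}

    def learn(self, bucket_idx, act_value):
        # The bucket index comes from the encoder, which clips to its range,
        # but the raw actual value is what gets averaged.
        if bucket_idx not in self.values:
            self.values[bucket_idx] = float(act_value)
        else:
            old = self.values[bucket_idx]
            self.values[bucket_idx] = (1.0 - self.alpha) * old + self.alpha * act_value

    def predicted_value(self, bucket_idx):
        return self.values[bucket_idx]


averages = BucketAverages()
averages.learn(10, 95.0)    # in range, falls into the top bucket
averages.learn(10, 200.0)   # out of range, clipped into the same bucket
leaked = averages.predicted_value(10)  # ends up above the encoder maximum of 100
```

Whenever this bucket is the most likely one, the value reported for it is the inflated average rather than anything bounded by the encoder range.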
I agree this is not ideal. For now, I suggest you clip the inputs as you described. In the future, we can add some logic in the Network API to prevent this from happening.
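A minimal sketch of that workaround, clipping the predicted field before the record reaches the model, could look like the following. The bounds of -1.0 and 1.0 and the field name `topredict` are placeholders taken from this thread's example; in practice they should match the encoder's minimum and maximum from the model params.

```python
# Assumed bounds for this sketch; use the encoder's own min and max instead.
MINVAL = -1.0
MAXVAL = 1.0


def clip(value, minval=MINVAL, maxval=MAXVAL):
    """Limit `value` to the closed range [minval, maxval]."""
    return max(minval, min(maxval, value))


def clip_record(record, field="topredict"):
    """Return a copy of `record` with `field` clipped to the encoder range."""
    clipped = dict(record)
    clipped[field] = clip(clipped[field])
    return clipped


raw = {"topredict": 1.7, "other": 42}
safe = clip_record(raw)
```

Because the classifier then only ever sees values inside the range, its per-bucket averages cannot drift outside it.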