Questions on continued learning with multiple inputs after invoking model.run()

I have multiple data inputs (let’s say 10 for example) and am doing anomaly detection and prediction of the data in column-2. Column 1 contains the time-stamp, column 2 contains the desired output, and columns 3-12 contain the 10 inputs HTM is analyzing to predict column 2.

I swarm over the data and save the resulting model. I then loop over each row in the data calling model.run() and passing in only two arguments: time-stamp and column 2 (desired-output). Everything seems to work fine, and it magically spits out predictions that are quite accurate. But, I don’t know how to feed in new data from the 10 inputs into the model, which wasn’t available when it swarmed over the data.

Should I pass any new data into model.run() as a part of the input record? Is Model.setFieldStatistics() the right method to use? What’s the best way to accomplish this?

Big question… do the model params created by the swarm include encoder configuration for any of those supporting columns (3-12)? If not, that means the swarm did not find them to be useful contributions to the prediction.

I’m confused by this statement. You should be sending in the same data structure to the model as you sent to the swarm.

It is usually helpful if you attach your swarm description, model params, and a short sample of the data.

Thanks for the response! I uploaded the swarm description, model params and the data file to my dropbox. You can download them here.

I’m having nupic predict time-series data. In order to “prove” to myself and others that it’s actually predicting, or has no look ahead bias, I’m swarming over the 2015 data to build the model. Then, I’m running the 2015 and 2016 data through the model, using model.run().

I’ll use a moving ‘percentage of accuracy’ window to compare predictions for 2015 with 2016. It’s my way of being objective about the accuracy of predictions on data the swarm has had a chance to optimize for VS data it’s never seen before. Of course, if there’s a better way, I’d love to hear it.
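
A moving accuracy window like this could be sketched as follows. This is only an illustration, not the poster's actual metric: the function name, the window size, and the 5% tolerance are all assumptions, and it assumes each prediction has already been lined up with the actual value it was trying to predict.

```python
from collections import deque


def rolling_accuracy(actuals, predictions, window=100, tolerance=0.05):
    """Return, for each step, the fraction of the last `window` predictions
    whose relative error was within `tolerance` (hypothetical metric)."""
    hits = deque(maxlen=window)  # oldest results fall off automatically
    scores = []
    for actual, predicted in zip(actuals, predictions):
        if actual == 0:
            # Relative error is undefined at zero, so require an exact match.
            hit = predicted == 0
        else:
            hit = abs(predicted - actual) / float(abs(actual)) <= tolerance
        hits.append(1 if hit else 0)
        scores.append(sum(hits) / float(len(hits)))
    return scores


# Tiny made-up example: one bad prediction out of four, window of two.
scores = rolling_accuracy([10, 10, 10, 10], [10, 20, 10, 10],
                          window=2, tolerance=0.1)
```

Running the 2015 rows and the 2016 rows through the same function gives two curves that can be compared directly.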

My intention is to eliminate any look ahead bias swarming over the 2016 data might introduce. I understand Nupic isn’t a traditional machine learning framework, but this is a hoop I need to jump through to prove the concept.

P.S. I experimented with changing the data fed into model.run(), because I noticed some inconsistent results. It’s entirely possible I flubbed up the swarm’s configuration. I’m still getting comfortable with Nupic’s framework.

In my many attempts, I observed that model.run() returned less accurate predictions when I fed in all of the data that was fed into the swarm. But, interestingly, when I fed in only the ‘set1’ column to model.run(), the accuracy of its predictions improved.

Yesterday I binge watched episodes 0-10 in HTM School. Really informative and entertaining stuff by the way, thanks so much for putting those out there! I now feel like I’m more able to understand what to tweak to boost (pun intended) accuracy of its predictions. But, I’m probably gonna need to watch them again and dive more into Nupic’s code. So if something I say seems like I don’t know what I’m doing, I might not! :wink:

Hey Mike, I saw your comments on the videos. Thanks for watching! Glad they were useful for you.

Is there any chance you can put your code up on Github? I would love to take a look at it to see if there is anything I can help with.

Thanks Matt! The code I’m working on has turned into an eight humped camel, held together with a bit of duct tape. I’m working on extracting the key parts of code into a new project and I’ll post that on Github.

For now, can I confirm my understanding of the data structure I should feed into the model.run() method?

The comments in CLAModel.run() say I should feed into the run() method an input record “formatted according to nupic.data.RecordStream.getNextRecordDict() result format”. It looks to me like getNextRecordDict() returns a Python dictionary, with each key set to a ‘fieldName’ specified in the swarm description, and each value being the corresponding value from my data (e.g. the CSV file I’m using as input). Is my understanding here correct?

Additionally, I fed ~46 columns of data into the swarm. I’ve observed that if I feed in all 46 columns of data, its predictions aren’t very good. But its predictions improved remarkably when I fed in only the columns titled ‘set1’ and ‘set11’ (at least I think it’s set11, it’s a bit late at the moment, and I get a bit slappy at this hour).

It’s my understanding the model is only using two columns to make its predictions, columns ‘set1’ and ‘set11’. I did a quick test just now and it looks like CLAModel.getFieldInfo() tells me which data from my CSV file the model is using. Is my understanding correct here and can I just feed in those two columns of data to CLAModel.run()?

That is correct. For example, if the CSV file you built for your swarm started like this:

timestamp,consumption,temp
datetime,float,float
T,,
7/2/10 0:00,21.2,65.3

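To feed this first row of data, assuming that timestamp, consumption, and temp are strings read from the CSV file:

import datetime

model.run({
    # parse the timestamp string into a datetime object
    "timestamp": datetime.datetime.strptime(timestamp, "%m/%d/%y %H:%M"),
    # convert the numeric strings to floats
    "consumption": float(consumption),
    "temp": float(temp)
})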
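
To do the same for every row, here is a rough sketch that reads a CSV with the three header rows shown above (field names, types, flags) and turns each data row into a record dict. The helper name iter_records, the converter table, and the date format are assumptions for illustration, not part of NuPIC; the sample data is just the one row from the example.

```python
import csv
import datetime
import io

DATE_FORMAT = "%m/%d/%y %H:%M"  # assumed from the sample row above


def iter_records(csv_file, date_format=DATE_FORMAT):
    """Yield one dict per data row, keyed by field name (hypothetical helper)."""
    reader = csv.reader(csv_file)
    names = next(reader)   # row 1: field names
    types = next(reader)   # row 2: field types
    next(reader)           # row 3: special flags, not needed here
    converters = {
        "datetime": lambda s: datetime.datetime.strptime(s, date_format),
        "float": float,
        "int": int,
        "string": str,
    }
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield {name: converters[kind](value)
               for name, kind, value in zip(names, types, row)}


# In-memory stand-in for the CSV file from the example.
sample = io.StringIO(
    u"timestamp,consumption,temp\n"
    u"datetime,float,float\n"
    u"T,,\n"
    u"7/2/10 0:00,21.2,65.3\n"
)
records = list(iter_records(sample))
```

Each dict in records has the same shape as the argument to model.run() above, so the loop body would just be model.run(record).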

Wow… how long did the swarm take? I would only recommend 8 fields of data per model.

If you look at the model params the swarm created, it only created encoder params for fields that it decided had an effect on the prediction of the predicted field. Any fields that don’t have encoders in the model params can be removed from the whole system. They are not being used except perhaps to contribute to the anomaly scores.
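
One way to check this by hand is sketched below. It assumes the usual layout of a swarm-generated model params dict, where modelParams → sensorParams → encoders maps each encoder name to either a settings dict (with a 'fieldname' key) or None when the swarm dropped it. The helper names and the example_params dict are made up for illustration; check them against your own model_params file.

```python
def fields_with_encoders(model_params):
    """Return the sorted input field names that have at least one encoder.

    Assumes the swarm-style layout described above (an assumption, verify
    against your own params)."""
    encoders = model_params["modelParams"]["sensorParams"]["encoders"]
    return sorted({enc["fieldname"]
                   for enc in encoders.values()
                   if enc is not None})


def keep_used_fields(record, used_fields):
    """Drop any keys from an input record that the model has no encoder for."""
    return {key: value for key, value in record.items() if key in used_fields}


# Made-up params fragment: the swarm kept 'consumption' and the time of day,
# and dropped 'temp' and the day-of-week encoding.
example_params = {
    "modelParams": {
        "sensorParams": {
            "encoders": {
                "consumption": {"fieldname": "consumption",
                                "type": "AdaptiveScalarEncoder"},
                "temp": None,
                "timestamp_timeOfDay": {"fieldname": "timestamp",
                                        "type": "DateEncoder"},
                "timestamp_dayOfWeek": None,
            }
        }
    }
}

used = fields_with_encoders(example_params)
trimmed = keep_used_fields(
    {"timestamp": "7/2/10 0:00", "consumption": 21.2, "temp": 65.3}, used)
```

Here used comes out as the two fields the model actually encodes, and trimmed is the record with temp removed. The predicted field should always stay in the record.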

It’s usually recommended to feed just the output field into the model rather than all those other fields, or try that first at least. It often works better to reduce the number of fields as much as possible, because NuPIC is designed to learn from the temporal patterns in a field, not the relationship between input field(s) and an output field as is common with other machine learning approaches.

As Matt described, the swarming process helps determine which of all those fields are actually useful for predicting the output field, so that’s worth doing. That said, they’ve found general-purpose model params that work well across domains, so you could even use those and skip the swarming process altogether. That way you wouldn’t have to worry about any look-ahead bias.

If I’m off base on any of this please someone let me know, though I’ve heard these things said and I’m pretty sure they’re generally true. I’m now using NuPIC without swarming by working inside the hotgym/anomaly/ directory, which comes with a set of parameters, rather than the hotgym/prediction folder which requires swarming. Just some thoughts,

– Sam

Thanks guys for the input!

In answer to @rhyolight's question, the model took ~1 hour 30 minutes to swarm.