How to handle missing data points?

I’ve been experimenting with making predictions from data where some of the data points don’t exist in the data source. Here’s an example of the data I’m using; this is just a small excerpt. In total, there are about 40 columns of data.

The data is coming from several different sources. Each data point is time stamped, but there are time stamps where a data point exists in one data source, but not another. I’m not sure how to best encode this data, but here’s what I’ve tried, with limited success.

  1. I filled in the blank cells with zeros, but the predictions have a high probability of also being zero, which makes sense, but isn’t what I’m looking for.

  2. I’m running an experiment now where I’ve left the cells blank. This seems like the most correct approach, but I’m just using my intuition here. I have no solid understanding to back up my “gut feeling”.

  3. I thought about aggregating the data into 1 min or 5 min time slices so there would be no blank cells, no missing data. But it seems like precision would be lost in regard to the time dimension. I haven’t tried this yet, and again, this is just my “gut feeling”.

Is NuPIC architected to handle missing data points? And if so, how should I encode the data to take full advantage of NuPIC’s predictive abilities?

I’m sure this question has come up before, but I would be interested in the recommended treatment too.

In a multiple field scenario like this, missing data in input rows should be replaced with the last known value of that data. NuPIC will not accept None as an input (at least not through the OPF). And using 0 is misleading because 0 is a valid input, so you would essentially be replacing “no data” with “wrong data”.

Using the last known value is the best option, IMO, because it is the least misleading. It basically says “nothing has changed in this field” which is not the whole truth, but the closest to the truth that you have.
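
A minimal sketch of that "last known value" fill, done as plain Python preprocessing before the rows ever reach the model. This is not a NuPIC API; the `forward_fill` helper, the field names, and the sample values are all made up for illustration, and missing cells are assumed to arrive as `None`.

```python
def forward_fill(rows, fields):
    """Return copies of rows with None replaced by the last known value per field."""
    last_known = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        for field in fields:
            if new_row.get(field) is None:
                # Gaps before the first real value stay None:
                # there is nothing earlier to carry forward.
                new_row[field] = last_known.get(field)
            else:
                last_known[field] = new_row[field]
        filled.append(new_row)
    return filled


# Hypothetical two-field excerpt with gaps.
rows = [
    {"timestamp": "12:00:00", "sensor_a": 1.5, "sensor_b": 10.0},
    {"timestamp": "12:01:00", "sensor_a": None, "sensor_b": 11.0},
    {"timestamp": "12:03:45", "sensor_a": 1.7, "sensor_b": None},
]
filled = forward_fill(rows, ["sensor_a", "sensor_b"])
```

Note that a gap at the very start of a field is still `None` after this pass, so those leading rows would need to be dropped or handled some other way.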

It’s possible. I did search the forums before posting, but couldn’t find anything specific.

Thanks @rhyolight. I was afraid that might be the answer, but I also don’t see any other logical solutions to this problem. It was a challenge in my work with ANNs as well.

It does seem to be causing problems with my model, because after I added a lot of new data that has a strong correlation to the solution I’m asking it to search for, the accuracy of its predictions went down from ~56% to 42.3%.

What do you think about aggregating the data into the nearest time unit? For example, all data points from 12:00:00 - 12:05:00 would be aggregated into one data point and timestamped with 12:05:00. I’ll probably try it, and post results of course, but I’m just curious if anyone has any other potential solutions to this challenge?
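
Roughly what I have in mind, as a sketch only. It assumes each bin is half-open (12:00:00 up to but not including 12:05:00), is stamped with its end time, and that each field is averaged over whatever values exist in the bin. The `aggregate` helper, the field names, and the values are hypothetical.

```python
from datetime import datetime, timedelta


def aggregate(rows, fields, minutes=5):
    """Average each field over fixed-width bins, labelling each bin by its end time."""
    width = timedelta(minutes=minutes)
    bins = {}
    for row in rows:
        ts = row["timestamp"]
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        index = (ts - midnight) // width          # which bin of the day this row falls in
        bin_end = midnight + (index + 1) * width  # label the bin with its end time
        bucket = bins.setdefault(bin_end, {f: [] for f in fields})
        for f in fields:
            if row.get(f) is not None:            # ignore missing cells
                bucket[f].append(row[f])
    result = []
    for bin_end in sorted(bins):
        out = {"timestamp": bin_end}
        for f in fields:
            values = bins[bin_end][f]
            # A field with no values in this bin is still missing.
            out[f] = sum(values) / len(values) if values else None
        result.append(out)
    return result


day = datetime(2020, 1, 1)  # arbitrary date, for illustration only
rows = [
    {"timestamp": day.replace(hour=12, minute=0, second=0), "sensor_a": 1.0, "sensor_b": None},
    {"timestamp": day.replace(hour=12, minute=1, second=0), "sensor_a": None, "sensor_b": 10.0},
    {"timestamp": day.replace(hour=12, minute=3, second=45), "sensor_a": 3.0, "sensor_b": None},
    {"timestamp": day.replace(hour=12, minute=3, second=57), "sensor_a": None, "sensor_b": 20.0},
    {"timestamp": day.replace(hour=12, minute=5, second=26), "sensor_a": 5.0, "sensor_b": None},
]
binned = aggregate(rows, ["sensor_a", "sensor_b"])
```

Aggregation removes most of the blanks but not necessarily all of them: a field with no readings inside a bin still comes out as `None`, and empty bins are simply absent from the output, so some other fill would still be needed for those.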

A good idea, you should certainly try it. Aggregation is a typical tactic we use to find that data “sweet spot”.

@rhyolight

Is it perhaps more biologically plausible to simply skip those records and just input what there is? For instance, if I pass out, bang my head, fall asleep for a second - then wake up and start reading something I was previously reading, my understanding doesn’t lapse, I just start from where I left off?

Interesting question…

For a single input field, yes. But @mellertson’s data has multiple inputs. There is no way to skip a row without skipping other input data along with it.

I think I’m going to try aggregating the data into consistent time intervals for my next experiment, maybe 5-minute data. I was thinking more about it and realized: if I’m inputting data at varying timestamp intervals, what timestamp will I be asking it to predict?

For example, if the input data has the following timestamps:

12:00:00
12:01:00
12:03:45
12:03:57
12:05:26

What timestamp will it be trying to predict if I ask it for 1 timestep into the future? I imagine it might try to estimate a non-linear function to predict the next timestamp, and use it to predict the next value. If that’s correct, it’d be making a prediction based on a prediction, which would be inherently less accurate.

Anyone know how NuPIC would determine what time stamp to make its prediction for?
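
To make the problem concrete, here is a quick check of the gaps between the example timestamps above. It is just standard-library Python for illustration, nothing NuPIC-specific.

```python
from datetime import datetime

stamps = ["12:00:00", "12:01:00", "12:03:45", "12:03:57", "12:05:26"]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]

# Seconds between each consecutive pair of records.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
print(gaps)  # [60.0, 165.0, 12.0, 89.0]
```

With gaps ranging from 12 to 165 seconds, "1 step ahead" does not correspond to any fixed amount of time.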

You should definitely feed data at the same time interval.

The only concept of time the HTM knows is the date encoding it’s getting. Other than that, it only knows that one point follows another. Your brain works similarly. It only knows that it gets a stream of input, that’s the only temporal structure inherent in the system. We add the idea of dates and times using encodings.

After a night’s sleep, you can’t tell what time it is until you get some contextual input from your environment (position of the sun, a watch). The HTM is the same way. The only way it knows about time is because we’re encoding time for it.

If you are not feeding in data at consistent time intervals (with consistent time interval date encodings in the input), it’s probably not going to make predictions the way you expect it to, because the prediction will also include a guess about what time interval encoding might be coming next. I’m not even sure if the classification step is going to work properly with staggered input timing.

(Hopefully I am right about this, please anyone correct me if I’m wrong.)

A post was split to a new topic: Getting predictions farther in the future than the input interval