NuPIC model better matches new data than data it learned on

Hi All,

So I’m training separate NuPIC TemporalAnomaly models each with separate data. There are 7 subjects with 7 distinct data sets, used to train 7 distinct models. I’m saving each model and then calling it to run on each data set, so model 1 was trained on subject 1’s data, and then run against subject 1’s data along with subject 2’s through subject 7’s. For each set of data (1-7) run against model 1, I record the average anomaly score (and anomaly likelihood), to get a sense of which of the 7 data sets model 1 was least surprised by.

My Intuition/Curiosity
Intuitively I would expect the average anomaly score to be lowest when model 1 is run against subject 1’s data, which is the same data it was trained on, though this actually isn’t the case. The model has a lower average anomaly score when run against subject 4’s data instead.

This isn’t totally shocking to me, though I wonder if there are more potential reasons for this than I realize. For one, the data is extremely noisy. I’m using the simple scalar encoder on real-valued inputs that seem to move pretty chaotically. The N and W for the metric (called ‘Response’) are 275 and 21 respectively, and 115 and 21 for the classifier.
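To make the role of N and W concrete, here is a minimal sketch of a non-periodic scalar encoding: N total bits, of which a contiguous run of W are active. It is a simplified illustration, not NuPIC's own ScalarEncoder arithmetic, and the 0 to 100 value range is a made-up assumption since the real range of ‘Response’ is not given in the post.

```python
def encode_scalar(value, minval, maxval, n, w):
    """Return a list of n bits with a contiguous run of w ones.

    Simplified, non-periodic scalar encoding (an illustration only):
    the value is clipped to [minval, maxval] and mapped onto one of
    n - w + 1 buckets.
    """
    if not (0 < w <= n):
        raise ValueError("need 0 < w <= n")
    if maxval <= minval:
        raise ValueError("need maxval > minval")
    num_buckets = n - w + 1
    clipped = min(max(value, minval), maxval)
    # Scale to [0, 1], then pick a bucket; the top value lands in the last one.
    fraction = (clipped - minval) / float(maxval - minval)
    bucket = min(int(fraction * num_buckets), num_buckets - 1)
    bits = [0] * n
    for i in range(bucket, bucket + w):
        bits[i] = 1
    return bits


def overlap(a, b):
    """Count the bit positions that are active in both encodings."""
    return sum(x & y for x, y in zip(a, b))


# Hypothetical range of 0..100 with the N and W from the post.
low = encode_scalar(0.0, 0.0, 100.0, n=275, w=21)
high = encode_scalar(100.0, 0.0, 100.0, n=275, w=21)
print(overlap(low, low), overlap(low, high))
```

In this sketch N = 275 and W = 21 give 255 buckets. Two values share active bits only if they fall within 21 buckets of each other, so noise that regularly jumps further than that produces inputs the model sees as unrelated.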

I figure that all this noise causes a lot of synapses of low permanence value to be formed, since there are many different sequences occurring that are likely not repeating with much regularity or even at all. With this big tangled mess of transitions learned, it seems reasonable to not expect the system to remember exactly what happened in time steps 1-10 when it’s just finished step 2000 with a ton of noise all along the way.

I imagine it like a sequence of letters that began, say, ‘ABC…’ but then over the next 2000 steps went something like ‘A%^*B–#@C…’. By the time it finished learning all those largely noisy sequences you couldn’t just expect it to predict a clean ‘ABC’ once the data is fed in again from the beginning.

The Data
I’m including a paste with the data from the first few time steps here. Column 1 marked ‘subject_1’ is the metric (what was originally fed in to train the model). Column 2 marked ‘subject_1_prediction’ is what the saved model predicted for each ‘subject_1’ value. Columns 3 and 4 are the anomaly score and anomaly likelihoods.

My Questions (finally):
– Does my theory about this make sense to you?
– Are there other aspects I’m not thinking about here?
– Might there be a more effective way to summarize the match between a saved model and a new data set than simply the average anomaly score or likelihood?

I really want to have as full an intuition as possible about how NuPIC learns, and how the resulting models are affected by the data and its noise level. I eagerly welcome anyone’s take on any of this. Thanks,

– Sam


Thanks for the detailed writeup and question. There are a number of ways to evaluate the performance of a model against a dataset; anomaly score is one. Have you considered evaluating performance based on prediction using any of the available error metrics?
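One common prediction-based error metric is the mean absolute percentage error (MAPE). The sketch below is a plain-Python illustration and not NuPIC's own metrics module. It assumes the predictions have already been aligned so that each one sits next to the actual value it was predicting, and it skips records whose actual value is zero (a design choice made here, not something from the thread).

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Records whose actual value is zero are skipped to avoid dividing by zero.
    """
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    errors = [abs(a - p) / float(abs(a))
              for a, p in zip(actual, predicted) if a != 0]
    if not errors:
        raise ValueError("no non-zero actual values to score")
    return 100.0 * sum(errors) / len(errors)


# Made-up numbers purely for illustration.
print(mape([100, 200], [110, 180]))
```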

Given that the data is noisy, chaotic, and irregular, your intuition is probably not far off, but there may be things you can do to improve the performance before jumping to any conclusions or changing the overall approach. For example, have you considered using Random Distributed Scalar Encoder (RDSE)? Are you swarming?

Can you also provide additional details of the network hierarchy and parameters you’re using?


If you take one of the data sets and run it through all trained models, the average anomaly score across models can be compared. I would be somewhat surprised if the model trained on that data did not have the lowest anomaly score (although it is possible, especially if the data sets are different sizes and have similar patterns).

However, you cannot compare the average anomaly score from one model on different data sets. One data set may be inherently very unpredictable so it will have a high average prediction error (raw anomaly score) regardless of whether the model was trained on it. Also, I’d recommend turning learning off while doing the comparisons. Otherwise, model 1 may learn repeating patterns in data 4 and start to predict it well part way through, giving it a low prediction error even though it wasn’t originally trained on that data.
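The comparison described above (fix one data set, then compare models on it) can be sketched as follows. The function name, the nested-dict layout, and the numbers are all hypothetical; the averages are assumed to have been collected with learning already turned off on each saved model.

```python
def best_model_per_dataset(avg_scores):
    """Map each dataset name to the model with the lowest average score on it.

    avg_scores is a dict of dicts: avg_scores[model_name][dataset_name] -> float.
    Only models scored on the same dataset are compared with each other.
    """
    datasets = set()
    for per_dataset in avg_scores.values():
        datasets.update(per_dataset)
    best = {}
    for dataset in sorted(datasets):
        candidates = [(scores[dataset], model)
                      for model, scores in avg_scores.items()
                      if dataset in scores]
        best[dataset] = min(candidates)[1]
    return best


# Made-up averages: model_1 scores lower on data_4 than on data_1,
# yet it is still the best model for data_1.
scores = {
    "model_1": {"data_1": 0.30, "data_4": 0.20},
    "model_4": {"data_1": 0.35, "data_4": 0.15},
}
print(best_model_per_dataset(scores))
```

In the made-up example, reading across model_1's row would suggest it "prefers" data_4, while reading down each column still matches every data set to the model trained on it.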

A couple other points:

  • The prediction error (raw anomaly score) isn’t a good comparison metric. You can use it to compare two models on the same dataset, but if you want to run multiple datasets through the same model and compare, then the likelihood would be much better (although it may still not work well; I’m not sure).
  • Given that you have a likelihood score of 0.5 for each metric, it seems like you might not have enough data. It would be best to have at least 1000 records per data set.

I’m not sure if that answers all of your questions so please follow up if anything is still unclear!

Hey Austin,

Thanks for your consideration here! To answer your questions (with more questions):

  • Average anomaly is the only metric I’ve used so far, though I will try others. I’m thinking of MAPE and T-tests, and have also been recommended KL-divergence. Any other suggestions/intuitions on how you might approach it?
  • I haven’t tried the RDSE yet, will do. I’ll check the github code on how and what the params should be.
  • I thought of swarming but haven’t been successful, I’m getting this error:
    .OperationalError: (2003, “Can’t connect to MySQL server on ‘localhost’ ([Errno 111] Connection refused)”).
  • I’ve attached the params file I’m using for the time being. I landed on N & W values of 275 and 21 respectively. I took these from other TemporalAnomaly params files I had for other data sets, though I don’t remember exactly how they got there (default possibly?). With some of my other params files the N was 275, though for others it was just 29. 29 seemed way too small so I went with 275. I also tried smaller and larger values for N (115 and 400) with less success than 275. Do they seem sane enough?

Of course any general intuitions you have are hugely welcome both to help with this application and more generally for building my own NuPIC intuitions/chops. Thanks again man!

Hey Scott,

Thanks for your thoughts! The data sets are all the same size, though you make an interesting point about the anomaly scores for the different data sets and I want to make sure I understand it correctly:

  • If a data set is inherently more unpredictable than another, it will yield higher anomaly scores when run against any given model, so even the model that was trained on it may have high anomaly scores, as would other models trained on other data. Is this what you’re saying?

  • In regards to turning learning off while running the previously trained models I am doing that, or at least I intend to. Here’s the loop I’m using within my modified run_model function.

  • In terms of the average anomaly likelihood I have tried that along with the average raw anomaly score and found it to actually do a worse job of matching the right data set to the right model. The likelihood is 0.5 for the first several hundred time steps but then jumps around. I assume the window length is such that it’s just ‘0.5’ until a certain time step. Each data set has about 6000 records.
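Since the likelihood sits at 0.5 for the first several hundred records, those records pull every average toward 0.5 no matter which data set is being scored. A minimal sketch of averaging only after that initial stretch is shown below; `warmup` is a hypothetical parameter, and its value should be read off your own output rather than taken from this example.

```python
def mean_after_warmup(values, warmup):
    """Average the values that come after the first `warmup` records."""
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    tail = values[warmup:]
    if not tail:
        raise ValueError("no records left after the warm-up period")
    return sum(tail) / float(len(tail))


# Made-up likelihoods: three warm-up records at 0.5, then real values.
likelihoods = [0.5, 0.5, 0.5, 0.9, 0.7]
print(mean_after_warmup(likelihoods, warmup=3))
```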

I figure I may as well attach plots showing the data itself. There are 7 subjects, each with their own file. You’ll see a similar general trend across subjects, as the data is their recorded responses to an identical game, though there are differences between them. I wonder if the data may be generally too similar along with being too chaotic to reliably distinguish between them. Thanks again for all your intuitions here!!

Thanks for posting the data. My immediate reaction is that it doesn’t look like 6000 records. There are flat sections where it looks like the same value for >100 records. If that is the case, I think the data is very oversampled. Imagine if you just saw one value at a time. If you saw 100 records in a row with the same value then it would be hard to keep track of what sequence you are in. And all of the different data samples would be very hard to tell apart.

This is a bit of a tricky dataset. In some ways it could be easy to model. If all the players regularly pause at the same value you may be able to tell players apart by comparing the average time the player stops for. But a general model for determining which player is which is harder because the data is so similar. You could try subsampling the data (keep fewer and fewer points until the charts start to look noticeably different) to see if that provides more easily distinguishable patterns. But I’m not sure how well this would work.

A great point, yes. The sampling rate is very high, around 65 samples per second of the subject’s responses. It had occurred to me that people can’t do much of anything in 1/65th of a second, so you wind up getting the same response value repeated something like 100 times in a row, because that’s only about 1.5 seconds of real time.

If I understand your suggestion right, I’m thinking of taking a subset of the current data, maybe every 3rd, 5th or even 10th data point. Keep taking a smaller subset until those charts start noticeably diverging from each other. The result would obviously be a data set 3, 5 or 10x smaller, though likely with more difference between them and shorter sequences of repeated values. Do I have this idea right? Thanks again for your input!!
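The every-Nth-point idea can be sketched in a few lines. The helper names are hypothetical, and `longest_run` is only a rough way to check how long the flat stretches still are after subsampling; judging when the charts "diverge enough" remains a visual call.

```python
def subsample(values, step):
    """Keep every `step`-th record, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return values[::step]


def longest_run(values):
    """Length of the longest stretch of identical consecutive values."""
    longest = current = 0
    previous = object()  # sentinel that equals no real value
    for v in values:
        current = current + 1 if v == previous else 1
        longest = max(longest, current)
        previous = v
    return longest


# Made-up data: one value held for 100 records, another for 50.
data = [3] * 100 + [5] * 50
for step in (1, 3, 5, 10):
    reduced = subsample(data, step)
    print(step, len(reduced), longest_run(reduced))
```

At roughly 65 samples per second, a step of 10 leaves about 6.5 samples per second, and a 6000-record file shrinks to 600 records.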