How to use NUPIC for simulated streaming data


#1

Hi,

I am a newbie to Numenta. I was able to install NuPIC and get the hot gym example up and running, and I used it for both prediction and anomaly detection. Now I have some artificially generated medical data. Each row has a timestamp and many other attributes, such as heartbeat readings and other sensor readings. Before moving on to multivariate analysis, I want to use just the timestamp and the heartbeat as my attributes. There are 3.85 million records, currently stored in a Cassandra cluster on my server machine. I want to simulate a streaming scenario where, let’s say, I extract 1 row per second from this Cassandra table and treat it as real-time streaming data, then run NuPIC to get anomaly scores on it. These are my questions:

Where do I stage the data that is read from the Cassandra table? In ~/nupic/examples/opf/clients/hotgym/prediction/one_gym there is the rec-center-hourly.csv file. Do I change the table name and put my data in that file, or is there a smoother way?

Let’s say the heartbeat readings fluctuate a lot, so I don’t want to be in a scenario where I swarmed just once over the data up to that point. I would rather have a system where, as new data flows in, the model automatically keeps updating and gives me the best anomaly score. What code changes are needed for this? If there is no way of doing this automatically, do I feed data into that rec-center-hourly.csv file, swarm over it, let’s say, every 1 min, and then compute the anomaly score? In that case there would be a lag, because of the time it takes to run the model, before I get the anomaly scores for the corresponding data points. I want to visualize my data, so I want to avoid this lag.

Please suggest the best way of doing this exercise.

Thanks


#2

You would store your data file in the ~/nupic/examples/opf/clients/hotgym/prediction/one_gym folder, where the rec-center-hourly.csv and ‘run.py’ files are. Once it’s there, go into the ‘run.py’ file and change ‘GYM_NAME’ to the name of the new .csv file, then make a params file called ‘new_csv_name_model_params.py’ in the model_params folder. I do this by duplicating a file already there, like ‘rec_center_hourly_model_params.py’, renaming it to match the new csv, then going in and changing ‘minval’, ‘maxval’ and ‘predicted field’ to match the new data.

In terms of swarming, I believe this is usually done just once for each data set, not continually re-run as the data streams in. Since you have so much data it may take a long time to swarm over it all, so I would limit the number of lines to swarm over. This can be done within the ‘swarm_description.py’ file in the same folder. At the very bottom of the file are these lines:

"iterationCount": -1,
"swarmSize": "medium"

The ‘-1’ tells it to swarm on all rows of data up to the last row, but you could replace this -1 with something else, say 5000, which will make the swarm run much faster. I may stand corrected but I believe I have this pretty right and hope it helps,
– Sam
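
For reference, a swarm description with the row limit applied might look roughly like the sketch below. The layout follows the hot gym example; the file name `heartbeat.csv` and the 30 to 220 range are placeholders I made up, so substitute your own values.

```python
# Sketch of a swarm_description.py, modeled on the hot gym example.
# The CSV name and the min/max values are placeholders, not real data.
SWARM_DESCRIPTION = {
    "includedFields": [
        {"fieldName": "timestamp", "fieldType": "datetime"},
        {
            "fieldName": "heartbeat",
            "fieldType": "float",
            "minValue": 30.0,   # assumed lower bound
            "maxValue": 220.0,  # assumed upper bound
        },
    ],
    "streamDef": {
        "info": "heartbeat",
        "version": 1,
        "streams": [
            {
                "info": "heartbeat readings",
                "source": "file://heartbeat.csv",  # hypothetical file
                "columns": ["*"],
            }
        ],
    },
    "inferenceType": "TemporalAnomaly",
    "inferenceArgs": {"predictionSteps": [1], "predictedField": "heartbeat"},
    "iterationCount": 5000,  # swarm over the first 5000 rows instead of all (-1)
    "swarmSize": "medium",
}
```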


#3

In the hot gym example, there is a loop that reads lines from the CSV file and runs them through NuPIC. You can feed data into NuPIC using any data source you like; it doesn’t have to be a CSV file. For example, you can change the loop to use a Cassandra cursor and feed in one row for each item in the data store.
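
As a rough sketch of that idea, the loop only needs something it can iterate over and a model with a `run` method. The names `stream_rows` and `EchoModel` are made up for illustration, and the field names `timestamp` and `heartbeat` are assumed from your description. `EchoModel` stands in for a real NuPIC model so the snippet runs on its own.

```python
import datetime

def stream_rows(rows, model, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Feed (timestamp_string, heartbeat) pairs to a model one at a time.

    `rows` can be any iterable: a csv.reader, a database cursor, a generator.
    `model` is anything with a run(dict) method.
    Yields (timestamp, heartbeat, result) for each row.
    """
    for timestamp_string, heartbeat in rows:
        timestamp = datetime.datetime.strptime(timestamp_string, timestamp_format)
        value = float(heartbeat)
        result = model.run({"timestamp": timestamp, "heartbeat": value})
        yield timestamp, value, result


class EchoModel(object):
    """Hypothetical stand-in for a NuPIC model; it returns the record it was given."""

    def run(self, record):
        return record


# A plain list plays the role of the Cassandra cursor here.
fake_cursor = [("2016-01-01 00:00:00", "72"), ("2016-01-01 00:00:01", "75")]
outputs = list(stream_rows(fake_cursor, EchoModel()))
```

Because `stream_rows` is a generator, each row is pushed through the model as soon as it arrives, so there is no need to stage the data in a file first.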

Here is a simple example of this using data gathered via HTTP from River View:


#4

@sheiser1 @rhyolight

Thanks Matt and Sam. I appreciate you taking your precious time to reply to this post.

I cannot just have a single swarm run over, say, 5000 points and expect the predictions and anomaly scores for 3.85 million data points to be accurate. Although the data is not random, it does vary a lot temporally. My best guess in this scenario was to swarm and run the models repeatedly at short intervals so that new sequences in the data are learned properly.

For example, let’s say the initial lot is 5000 points (timestamp, heartbeat). I run the swarm and then run the model, so now I have an initial set of 5000 anomaly scores. After that, let’s say I read the next 100 points from the Cassandra store. I want to swarm again over the entire 5100 data points and extract the anomaly scores for the latest 100 points, and keep doing this for, let’s say, every 100 points added to the data set. Is there any way to configure NuPIC so that every 100 data points a swarm runs automatically and the model gives me the anomaly scores for the latest set of points? If that is not possible, my approach would be to build a script that appends the data read from Cassandra, runs a swarm, builds a model, extracts the anomaly scores for the latest set of points, and then runs the swarm again over the whole data set including the latest points. Running scripts for all these activities would be a manual approach, so I wanted to know whether the entire thing can be automated through some configuration settings.

The other question that I wanted to ask is how I judge which attributes will be important for me. I was going through Subutai’s wiki page (https://github.com/subutai/nupic.subutai/tree/master/swarm_examples), which says that NuPIC will automatically select the best parameters for me and solve the multivariate problem of deciding which variables to include. So, correct me if I am wrong, I should just go ahead and include all the attributes (54 columns; heartbeat is my predicted field) and the model inherently chooses the best attributes.

I want to set up an alert system in which alerts are issued if the anomaly scores cross a particular threshold. My problem is how to calibrate this threshold. I was thinking that if the anomaly score of an individual data point crosses the anomaly likelihood, I can flag it as an alert. Is that a reasonable approach? If you have other good ideas, you are welcome to share them.

Thanks :smile:


#5

Keep in mind if you want anomaly detection (not prediction), you don’t need to swarm.


#6

No swarming is needed for anomaly detection because it uses preset parameter values, right @rhyolight? It seems that his concern is that the parameter values will become outdated over the course of learning on so much data. Do you think re-running the swarm every x-amount of rows throughout the learning process is a viable way to deal with this?

To my intuition it seems problematic to change parameter values during learning on the same model. Would there be problems similar to those felt when using the adaptive scalar encoder? If I have this right, changing the input bucket ranges associated with each encoder bit on the fly, as the adaptive scalar encoder does, would require that the model be reshaped to fit the new ‘n’ and ‘w’ parameters every time they change. Re-swarming on the fly would mean changing all model params (not just the encoder ‘n’ and ‘w’) during learning, requiring the model to adapt whatever it’s learned so far as if the new parameter values had been there from the beginning. This would also seem to detract from the online nature of learning, since re-running the swarm could take a while and knock it out of real time. I’m not sure exactly what your goal is, but it would seem that lagging behind real time could be really bad for heartbeat monitoring.

In terms of the multivariate problem, the first thing I would do is simply feed in the predicted field and timestamp without all the others. This will provide a baseline for performance that you can compare with other models that take in more fields. I’ve done this on a couple of sample data sets from River View and found the models trained on just the predicted field to perform just as well or better without all the noise brought by the other variables. However, the swarming process will also search over different combinations of input fields to look for those that most help performance. I think in order to test the most combinations of input fields you need to run a large swarm, which can really take a while. I’d say try doing that, but you’re going to have to swarm over some subset of your data. Running a large swarm over something like 4 million data points would take FOREVER.

I hope to help steer you in a good direction but I’m not the foremost expert, so if anyone on the inside would be willing to verify/correct me that’d be awesome. Thanks!

– Sam


#7

Typically, once you have model params that work well for a data set, you never need to change them unless the definition of the data changes. What I mean is that the model will adjust to changing patterns in the data just fine, but the model params may need adjustment if the data changes drastically, like starts exceeding the min/max values of the encoders, or if a data type changes entirely. This means a new swarm might be necessary, or just a simple adjustment to encoder parameters.

Re-running a swarm is not a typical solution to this problem.


#8

Right, it makes sense that you’d have to change the encoder params if all the values coming in were greater than the defined max or less than the min. This would lead the encoder to treat them all as the same value which you obv don’t want.
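
To picture why, here is a toy illustration of clipping. It is not NuPIC’s ScalarEncoder, just a simplified bucket calculation with a made-up range and bucket count, showing that every value above the max lands in the same bucket.

```python
def bucket_index(value, minval, maxval, num_buckets):
    """Map a value to a bucket after clipping it into [minval, maxval]."""
    clipped = max(minval, min(maxval, value))
    fraction = (clipped - minval) / float(maxval - minval)
    return int(round(fraction * (num_buckets - 1)))


# Assumed range of 40-180 beats per minute and 100 buckets.
in_range = bucket_index(90, 40, 180, 100)
above_a = bucket_index(200, 40, 180, 100)
above_b = bucket_index(500, 40, 180, 100)
```

Here `above_a` and `above_b` end up in the same top bucket even though the inputs are very different, which is the loss of information described above.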

@csbond007 are you able to check the overall min and max values for heartbeat from your big data set? I don’t know anything about this domain, though I’d imagine that there’s a realistic min and max for any human heart rate, like if any person’s heart was beating less than ‘x’ or more than ‘y’ beats per minute they’d pass out. No human heart can beat as fast as a hummingbird’s or as slow as a whale’s, for instance, so I’m thinking there must be some reasonable max value there.

My suspicion is that you may have success setting those min and max vals and swarming just once over a subset of the data. I know you’re not inclined to this; it just seems to me that our human heartbeats can’t really be all that different from each other in spatial and temporal patterns, which I believe implies that one set of parameters could work fine. Again, I know this isn’t what you have in mind, I just think it’d be worth a try and you may be pleasantly surprised. All the best with your cool application,
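
Checking this could be as simple as one pass over the heartbeat column. The sketch below assumes the values are available as an iterable of numbers (a list, a generator, a cursor). `padded_range` is a name I made up, and the 10% padding is an arbitrary choice to leave some headroom.

```python
def padded_range(values, padding_fraction=0.1):
    """Return (minval, maxval) over values, widened by padding on each side.

    Makes a single pass, so it also works on generators and cursors.
    Expects at least one value.
    """
    iterator = iter(values)
    lowest = highest = float(next(iterator))
    for value in iterator:
        value = float(value)
        lowest = min(lowest, value)
        highest = max(highest, value)
    padding = (highest - lowest) * padding_fraction
    return lowest - padding, highest + padding


# Made-up readings for illustration.
minval, maxval = padded_range([62, 75, 110, 58, 97])
```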

– Sam


#9

@sheiser1 @rhyolight

Thanks Matt and Sam.

There were 9 subjects in my data, so I have decided to split the data by individual subject rather than treat all the data as one cohort. There were on average 300,000 data points for each subject. I have decided to split this 70:30, treating 70% of the data as training data and 30% as data for running the model. So I would end up with 210,000 (0.7*300,000) data points on which to run the swarm just once, and I would use the model params it creates to run the model and output the values. Here are the problems that I am encountering right now:

Q1> Running small swarm
This is the file on which I run swarming : https://gist.github.com/csbond007/d27501133a1da4a1ce575b0262a260ea

Swarm Description : https://gist.github.com/csbond007/ee5fc0912be514b9a703fde23033dd3d

Model Params File created : https://gist.github.com/csbond007/895c6ed9f83b453291a8511f9b7f9bc7

'modelParams': { 'anomalyParams': { u'anomalyCacheRecords': None,
                                    u'autoDetectThreshold': None,
                                    u'autoDetectWaitRecords': None},
                 'clParams': { 'alpha': 0.050050000000000004,
                               'regionName': 'SDRClassifierRegion',
                               'steps': '1',
                               'verbosity': 0},
                 'inferenceType': 'TemporalAnomaly',
                 'sensorParams': { 'encoders': { u'V10': None,
                                                 u'V11': None,
                                                 u'V12': None,
                                                 u'V13': None,
                                                 u'V14': None,

Why are the encoders for all attributes other than the predicted “heartbeat” coming out as “None”?

Q2> Running medium swarm
Model params generated (it encodes attribute V13 with a scalar encoder, while the rest are still None)

On running this medium swarm, I get the following error (I haven’t changed any header rows; the only thing that I changed in the swarm_description was “small” to “medium”)

Q3> I see in the run.py file the following code

result = model.run({
  "timestamp": timestamp,
  "heartbeat": heartbeat
})

Why are we only feeding in “heartbeat”, which is my predicted field? Can we not include other useful fields? If so, which ones? (Should it be V13, since it gets an encoder from the medium swarm, so I assume it is important?) I am still not totally convinced about this multivariate case: does the swarming automatically do this for us if we feed in all the attributes?

Q4> Does the time interval between data points matter? For example, let us say that in the data I am feeding to the swarm the time interval between successive data points is 300 millisecs, whereas when I run the model the time interval between successive data points is just 100 millisecs. Will this inconsistent time interval give me poor results, or does the model just learn the sequences and not bother about the inherent time interval between recordings?

Thanks :slight_smile:


#10

Because a “small” swarm doesn’t do much. It only tests that the swarm config is correct and runs one model. Only use a small swarm for debugging your swarm description.

The error message is this:

ValueError: Unknown field name 'V13' in input record. Known fields are 'timestamp, heartbeat'.
This could be because input headers are mislabeled, or because input data rows do not contain a value for 'V13'.

That means the model is looking for a field labeled V13 in each row being sent into it, but the rows contain only timestamp and heartbeat. The field names in the input rows and in the model params need to agree.

Based on the model params you posted, the model is only going to pay attention to fields called timestamp, heartbeat, and V13. You could add data for V13 if you have that value:

result = model.run({
  "timestamp": timestamp,
  "heartbeat": heartbeat,
  "V13": v13Value,
})
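
A quick way to catch this kind of mismatch before calling `model.run` is to compare each record against the encoders in the model params. This is only a sketch: `missing_fields` is a helper name I made up, and it assumes the usual layout where the `encoders` section maps names either to encoder dicts with a `fieldname` key or to `None` for fields that are not encoded.

```python
def missing_fields(record, encoders):
    """Return sorted input field names the encoders need but the record lacks."""
    needed = set(
        params["fieldname"]
        for params in encoders.values()
        if params is not None  # None means the field is not encoded at all
    )
    return sorted(needed - set(record))


# Trimmed-down encoder section, for illustration only.
encoders = {
    "heartbeat": {"fieldname": "heartbeat", "type": "ScalarEncoder"},
    "timestamp_timeOfDay": {"fieldname": "timestamp", "type": "DateEncoder"},
    "V13": {"fieldname": "V13", "type": "ScalarEncoder"},
    "V10": None,
}
record = {"timestamp": "2016-01-01 00:00:00", "heartbeat": 72.0}
problems = missing_fields(record, encoders)
```

With the record above, `problems` contains only `V13`; `V10` is ignored because its encoder entry is `None`.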

No, as long as you are providing a date encoding for the data point, which you are.

This time resolution is so small our date-encoder won’t notice the difference. The current DateEncoder in NuPIC is not set up to resolve sub-second or even sub-minute values.

So you could throw out the timestamp completely, but then you would want the intervals between data points to be consistent.
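
If you do drop the timestamp, it is worth checking that the spacing between rows really is uniform. A small sketch, assuming the timestamps are Python `datetime` objects in a list; the helper names and the tolerance are my own arbitrary choices.

```python
import datetime

def interval_spread(timestamps):
    """Return (smallest, largest) gap in seconds between consecutive timestamps.

    Expects a list with at least two entries.
    """
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    return min(gaps), max(gaps)


def is_evenly_spaced(timestamps, tolerance_seconds=0.01):
    """True when all consecutive gaps differ by no more than the tolerance."""
    smallest, largest = interval_spread(timestamps)
    return (largest - smallest) <= tolerance_seconds


start = datetime.datetime(2016, 1, 1)
# Five readings 100 ms apart, then one extra reading 300 ms later.
steady = [start + datetime.timedelta(milliseconds=100 * i) for i in range(5)]
mixed = steady + [steady[-1] + datetime.timedelta(milliseconds=300)]
```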


#11

@rhyolight

Thanks a lot Matt. Really appreciate your help. Let me run the large swarm overnight and make the appropriate changes for V13 and then see what happens.
:slight_smile:


#12

@rhyolight

Hi,

I wanted to understand how the swarming process scales. For my dummy dataset with 5300+ rows, it took 1 min for a small swarm, 1 hr for a medium swarm, and 18 hr for a large swarm. For my actual dataset, with 175,000+ rows per subject (I have 9 subjects; this is only for 1 subject), it took 36 min for the small swarm. It looks like it scales linearly, which means a large swarm on this 175,000+ row dataset could take as long as 18*36 = 648 hrs = 27 days! My machine is a 4-core machine (there are 3 such servers for my use), so I cannot increase maxWorkers beyond 4. Any ideas on how I can distribute the work to make it run faster, and what would be a reasonable strategy for swarming over all 9 subjects, each containing on average 200,000+ rows?
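
Spelling out that estimate, under my assumption that the run time scales by the same factor for every swarm size:

```python
# Timings reported above; the linear-scaling assumption is mine.
small_dummy_minutes = 1.0     # small swarm, 5300+ row dummy dataset
small_actual_minutes = 36.0   # small swarm, 175,000+ row dataset
large_dummy_hours = 18.0      # large swarm, dummy dataset

scale_factor = small_actual_minutes / small_dummy_minutes
large_actual_hours = large_dummy_hours * scale_factor
large_actual_days = large_actual_hours / 24.0

print("%.0f hours, about %.0f days" % (large_actual_hours, large_actual_days))
```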

Thanks :slight_smile:


#13

There is no need to run that much data through a swarm. I would only run about 3000 rows of data. That should be enough of any data set for the model to adjust to it.
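
Cutting the file down is straightforward. A sketch, assuming the CSV uses the three header rows that the hot gym file has (field names, types, flags); `take_subset` is just an illustrative helper name.

```python
import itertools

def take_subset(lines, data_rows=3000, header_rows=3):
    """Keep the header rows plus the first data_rows rows of data.

    `lines` can be a list or an open file; only the needed lines are read.
    """
    return list(itertools.islice(lines, header_rows + data_rows))


# Tiny made-up sample in the three-header-row layout.
sample = [
    "timestamp,heartbeat",
    "datetime,float",
    "T,",
    "2016-01-01 00:00:00,72",
    "2016-01-01 00:00:01,75",
    "2016-01-01 00:00:02,71",
]
subset = take_subset(sample, data_rows=2)
```

Writing `subset` back out to a new file gives you a small CSV to point the swarm at, leaving the full data set for the model run itself.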


#14

@rhyolight

Q1> I am curious how only 3000 rows can be enough to create a fine-tuned model params file for a data file with 200,000+ rows?

Q2> I will take your advice and run the swarm on 3000 points, but in general how does swarming scale, both when distributing work across machines and on a single machine? Does it scale linearly, meaning that if 1000 points take 1 min, then 5000 points would take 5 mins on a single machine?

Thanks :slight_smile:


#15

2 posts were split to a new topic: NuPIC over HTTP?