Save/load the model in cloud storage

Hi, I am trying to build a distributed version of hotgym. The first thing I want to do is put the model in cloud storage so that all the computational nodes can access it. I see there has been some effort on serializing the model into the capnp format, but the CLAModel has not yet been ported to use it. From what I observed, it is still using the pickle format. Taking hotgym as an example, the model is saved as follows:

model.pkl --> which is the pickled model object
modelextradata --> folder to store extra data of the model
– network.yaml --> a network structure yaml file
– R0-pkl --> pickled RecordSensor object
– R1-pkl --> pickled SPRegion object
– R2-pkl --> pickled TPRegion object
– R3-pkl --> pickled CLAClassifierRegion object
– R4-pkl --> pickled KNNClassifierRegion object

If I have ~10k data points, the content of the modelextradata folder is about 57 MB, which means loading and saving the model to storage is quite heavy. Does anyone have an idea how to tackle this issue?
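For reference, this is roughly what I am doing: a minimal sketch assuming the hotgym OPF API, where the model_params module and the field name are placeholders and the import path may differ between NuPIC versions.

```python
# Minimal sketch: checkpoint a hotgym-style OPF model and measure what would
# be uploaded. The model_params module and the "consumption" field are
# placeholders; the import path may differ between NuPIC versions.
import os

from nupic.frameworks.opf.modelfactory import ModelFactory

import model_params  # hypothetical module containing the hotgym MODEL_PARAMS


def checkpoint_size_mb(checkpoint_dir):
    """Total size of model.pkl plus everything under modelextradata/."""
    total = 0
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({"predictedField": "consumption"})

# ... feed records with model.run({"consumption": value, ...}) ...

checkpoint_dir = os.path.abspath("model_checkpoint")
model.save(checkpoint_dir)  # writes model.pkl and the modelextradata folder
print("checkpoint size: %.1f MB" % checkpoint_size_mb(checkpoint_dir))

# Later, on another node, restore from the same directory:
restored = ModelFactory.loadFromCheckpoint(checkpoint_dir)
```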


From Numenta’s prior experience running a distributed service, we used S3 for the uploads. Uploads to S3 can be parallelized, which increases upload throughput significantly.

Thanks for the reply. Is there a code sample for uploading to S3? Also, how much do the model and the extra data grow when there are 10k, 100k, 1M, etc. data points fed into the engine?

Here is something I experimented with:

data points    model size
4.3k           43 MB
13k            57 MB
100k           67 MB
4.4M           122 MB

The models do take up a good amount of disk space / memory. If you can limit how often you pass models over the network, that will minimize the issue. If you copy a model across the network each time you feed a record through it, then you will almost certainly have network issues. It’s better to take the computation to the model rather than the model to the computation.


I see. And how would the model grow if the engine gets an infinite number of data points?

The models are intended to be fixed resources. Many parts allocate memory only when it is needed, so the size grows asymptotically towards some limit. You should be able to feed random values in and get a good idea of how big it will get for your parameters after a couple of minutes.

At one point, one of the new TM implementations grew unbounded because it did not clean up segments. Not sure if that is still a problem anywhere. @mrcslws?
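If it helps, here is a rough sketch of that "feed random values" approach, assuming the hotgym-style OPF API; the model_params module, the "consumption" field, and the value range are placeholders you would swap for your own.

```python
# Rough sketch: push random records through an OPF model and checkpoint
# periodically to watch the on-disk size level off. The model_params module,
# the "consumption" field, and the value range are hotgym-style placeholders.
import os
import random
import shutil

from nupic.frameworks.opf.modelfactory import ModelFactory

import model_params  # hypothetical module containing the hotgym MODEL_PARAMS


def checkpoint_size_mb(model, checkpoint_dir):
    if os.path.isdir(checkpoint_dir):
        shutil.rmtree(checkpoint_dir)  # clear the previous probe checkpoint
    model.save(os.path.abspath(checkpoint_dir))
    total = sum(os.path.getsize(os.path.join(root, name))
                for root, _dirs, files in os.walk(checkpoint_dir)
                for name in files)
    return total / 1e6


model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({"predictedField": "consumption"})

random.seed(42)
for i in xrange(1, 1000001):
    model.run({"consumption": random.uniform(0.0, 100.0)})
    if i % 100000 == 0:
        print("%7d records -> %.1f MB" % (i, checkpoint_size_mb(model, "size_probe")))
```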

So the algorithm is intended to expire outdated data (by cleaning up old segments) to keep the resources bounded. Do you know how large a model can get?

We might not have a code sample for parallelized upload to S3, but googling for “parallel upload to s3 boto” yields a number of hits.
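For what it’s worth, here is a minimal sketch using boto3 (the successor to boto), which parallelizes large uploads via multipart transfers; the bucket name and key below are placeholders.

```python
# Minimal sketch: tar the checkpoint directory and upload it to S3 with
# boto3 (successor to boto). TransferConfig enables concurrent multipart
# uploads. The bucket name and key below are placeholders.
import tarfile

import boto3
from boto3.s3.transfer import TransferConfig

# Pack model.pkl and modelextradata/ into a single archive.
with tarfile.open("model_checkpoint.tar.gz", "w:gz") as tar:
    tar.add("model_checkpoint", arcname="model_checkpoint")

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    max_concurrency=10,                   # number of parallel upload threads
)
s3.upload_file(
    "model_checkpoint.tar.gz",
    "my-model-bucket",                    # placeholder bucket
    "hotgym/model_checkpoint.tar.gz",     # placeholder key
    Config=config,
)
```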

I use the joblib library for compressing my sklearn models. It’s more efficient at handling numpy-style matrices.
https://pythonhosted.org/joblib/generated/joblib.dump.html
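A minimal sketch of that, assuming you have an in-memory model object; how much the compression helps for a CLAModel depends on how much of its state is numpy-backed.

```python
# Minimal sketch of the joblib suggestion: dump with compression enabled.
# `model` is assumed to be an in-memory model object (e.g. a CLAModel).
import joblib

joblib.dump(model, "model.joblib", compress=3)  # compression level 0-9
restored = joblib.load("model.joblib")
```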

For uploading to S3, you can use boto, S3 Browser, the AWS CLI, or some more basic Python.

Like Scott said above.

Yes, I actually tried the following:

number of data points    model size
4.3k                     43 MB
13k                      57 MB
100k                     67 MB
4.4M                     122 MB

But I still don’t know if there is an upper bound for the model size or not.

If you pushed 4.4 million random inputs into the model, I think what @scott is saying is that that’s about as large as it can possibly get. So you’ve probably found the upper bound for the model you created, at 122 MB. However, if you change the algorithm parameters, that upper bound will change (adding columns and cells, for example).

If you plot your few sample points, it looks like the model may still be growing, but it is hard to tell. The reason I recommended feeding random data is that if you are feeding learned sequences, the model may not be adding new segments, so it may appear to stop growing; but if new patterns show up later, it will continue to grow. If you are feeding in new patterns and the growth starts tapering off, then you have a good estimate.
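If you want to eyeball it, plotting the sizes you posted on a log scale makes the trend easier to see (a minimal sketch using matplotlib):

```python
# Plot the checkpoint sizes reported earlier in the thread on a log-x axis
# to see whether growth is tapering off.
import matplotlib.pyplot as plt

records = [4.3e3, 13e3, 100e3, 4.4e6]
size_mb = [43, 57, 67, 122]

plt.semilogx(records, size_mb, "o-")
plt.xlabel("records fed")
plt.ylabel("checkpoint size (MB)")
plt.title("hotgym model size vs. records")
plt.show()
```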
