Save/load the model in cloud storage

Hi, I am trying to build a distributed version of hotgym. The first thing I want to do is put the model in cloud storage so that all the computational nodes can access it. I see there has been some effort on serializing the model into the capnp format, but the CLAModel has not yet been ported to use it. From what I observed, it is still using the pickle format. Taking hotgym as an example, the model is saved as follows:

model.pkl --> which is the pickled model object
modelextradata --> folder to store extra data of the model
– network.yaml --> a network structure yaml file
– R0-pkl --> pickled RecordSensor object
– R1-pkl --> pickled SPRegion object
– R2-pkl --> pickled TPRegion object
– R3-pkl --> pickled CLAClassifierRegion object
– R4-pkl --> pickled KNNClassifierRegion object

If I have ~10k data points, the content of the modelextradata folder is about 57 MB, which means loading and saving the model to storage is quite heavy. Does anyone have an idea how to tackle this issue?
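For reference, this is roughly what I am doing: a minimal sketch assuming the hotgym OPF API, where the model_params module and the field name are placeholders and the import path may differ between NuPIC versions.

```python
# Minimal sketch: checkpoint a hotgym-style OPF model and measure what would
# be uploaded. The model_params module and the "consumption" field are
# placeholders; the import path may differ between NuPIC versions.
import os

from nupic.frameworks.opf.modelfactory import ModelFactory

import model_params  # hypothetical module containing the hotgym MODEL_PARAMS


def checkpoint_size_mb(checkpoint_dir):
    """Total size of model.pkl plus everything under modelextradata/."""
    total = 0
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({"predictedField": "consumption"})

# ... feed records with model.run({"consumption": value, ...}) ...

checkpoint_dir = os.path.abspath("model_checkpoint")
model.save(checkpoint_dir)  # writes model.pkl and the modelextradata folder
print("checkpoint size: %.1f MB" % checkpoint_size_mb(checkpoint_dir))

# Later, on another node, restore from the same directory:
restored = ModelFactory.loadFromCheckpoint(checkpoint_dir)
```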


From Numenta’s prior experience running a distributed service, we used S3 for the uploads. Uploads to S3 can be parallelized, which increases upload throughput significantly.

Thanks for the reply. Is there a code sample for uploading to S3? Also, how much do the model and the extra data grow when there are 10k, 100k, 1M, etc. data points fed into the engine?

Here is something I experimented with:

data points    model size
4.3k           43 MB
13k            57 MB
100k           67 MB
4.4M           122 MB

The models do take up a good amount of disk space / memory. If you can limit how often you pass models over the network, that will minimize the issue. If you copy a model across the network each time you feed a record through it, then you will almost certainly have network issues. It’s better to take the computation to the model rather than the model to the computation.


I see. And how would the model grow if the engine gets an infinite number of data points?

The models are intended to be fixed resources. Many parts allocate memory only when it is needed, so the size grows asymptotically towards some limit. You should be able to feed random values in and get a good idea of how big it will get for your parameters after a couple of minutes.

At one point, one of the new TM implementations grew unbounded because it did not clean up segments. Not sure if that is still a problem anywhere. @mrcslws?
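If it helps, here is a rough sketch of that "feed random values" approach, assuming the hotgym-style OPF API; the model_params module, the "consumption" field, and the value range are placeholders you would swap for your own.

```python
# Rough sketch: push random records through an OPF model and checkpoint
# periodically to watch the on-disk size level off. The model_params module,
# the "consumption" field, and the value range are hotgym-style placeholders.
import os
import random
import shutil

from nupic.frameworks.opf.modelfactory import ModelFactory

import model_params  # hypothetical module containing the hotgym MODEL_PARAMS


def checkpoint_size_mb(model, checkpoint_dir):
    if os.path.isdir(checkpoint_dir):
        shutil.rmtree(checkpoint_dir)  # clear the previous probe checkpoint
    model.save(os.path.abspath(checkpoint_dir))
    total = sum(os.path.getsize(os.path.join(root, name))
                for root, _dirs, files in os.walk(checkpoint_dir)
                for name in files)
    return total / 1e6


model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({"predictedField": "consumption"})

random.seed(42)
for i in xrange(1, 1000001):
    model.run({"consumption": random.uniform(0.0, 100.0)})
    if i % 100000 == 0:
        print("%7d records -> %.1f MB" % (i, checkpoint_size_mb(model, "size_probe")))
```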

So the algorithm is intended to expire outdated data (by cleaning up old segments) to keep the resources bounded. Do you know how large a model can get?

We might not have a code sample for parallelized upload to S3, but googling for “parallel upload to s3 boto” yields a number of hits.
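For what it’s worth, here is a minimal sketch using boto3 (the successor to boto), which parallelizes large uploads via multipart transfers; the bucket name and key below are placeholders.

```python
# Minimal sketch: tar the checkpoint directory and upload it to S3 with
# boto3 (successor to boto). TransferConfig enables concurrent multipart
# uploads. The bucket name and key below are placeholders.
import tarfile

import boto3
from boto3.s3.transfer import TransferConfig

# Pack model.pkl and modelextradata/ into a single archive.
with tarfile.open("model_checkpoint.tar.gz", "w:gz") as tar:
    tar.add("model_checkpoint", arcname="model_checkpoint")

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    max_concurrency=10,                   # number of parallel upload threads
)
s3.upload_file(
    "model_checkpoint.tar.gz",
    "my-model-bucket",                    # placeholder bucket
    "hotgym/model_checkpoint.tar.gz",     # placeholder key
    Config=config,
)
```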

I use the joblib library for compressing my sklearn models. It’s more efficient at handling numpy-style matrices.
https://pythonhosted.org/joblib/generated/joblib.dump.html
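A minimal sketch of that, assuming you have an in-memory model object; how much the compression helps for a CLAModel depends on how much of its state is numpy-backed.

```python
# Minimal sketch of the joblib suggestion: dump with compression enabled.
# `model` is assumed to be an in-memory model object (e.g. a CLAModel).
import joblib

joblib.dump(model, "model.joblib", compress=3)  # compression level 0-9
restored = joblib.load("model.joblib")
```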

For uploading to S3, you can use boto, S3 Browser, the AWS CLI, or some more basic Python.

Like Scott said above.

Yes, I actually tried the following:

number of data points    model size
4.3k                     43 MB
13k                      57 MB
100k                     67 MB
4.4M                     122 MB

But I still don’t know if there is an upper bound for the model size or not.

If you pushed 4.4 million random inputs into the model, I think what @scott is saying is that that’s about as large as it can possibly get. So you’ve probably found the upper bound for the model you created, at 122 MB. However, if you change the algorithm parameters, that upper bound will change (adding columns and cells, for example).

If you plot your few sample points, it looks like the model may still be growing, but it is hard to tell. The reason I recommended feeding random data is that if you are feeding learned sequences, the model may not be adding new segments, so it may appear to stop growing; but if new patterns show up later, it will continue to grow. If you are feeding in new patterns and the growth starts tapering off, then you have a good estimate.
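If you want to eyeball it, plotting the sizes you posted on a log scale makes the trend easier to see (a minimal sketch using matplotlib):

```python
# Plot the checkpoint sizes reported earlier in the thread on a log-x axis
# to see whether growth is tapering off.
import matplotlib.pyplot as plt

records = [4.3e3, 13e3, 100e3, 4.4e6]
size_mb = [43, 57, 67, 122]

plt.semilogx(records, size_mb, "o-")
plt.xlabel("records fed")
plt.ylabel("checkpoint size (MB)")
plt.title("hotgym model size vs. records")
plt.show()
```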
