Serialization sizes :(

We are using capnproto to serialize the HTMPredictionModel with the method below:

def writeToCheckpoint(self, checkpointDir):
    """Serializes model using capnproto and writes data to ``checkpointDir``"""
    proto = self.getSchema().new_message()

    self.write(proto)

    checkpointPath = self._getModelCheckpointFilePath(checkpointDir)

    # Clean up old saved state, if any
    if os.path.exists(checkpointDir):
      if not os.path.isdir(checkpointDir):
        raise Exception(("Existing filesystem entry <%s> is not a model"
                         " checkpoint -- refusing to delete (not a directory)") \
                          % checkpointDir)
      if not os.path.isfile(checkpointPath):
        raise Exception(("Existing filesystem entry <%s> is not a model"
                         " checkpoint -- refusing to delete"\
                         " (%s missing or not a file)") % \
                          (checkpointDir, checkpointPath))

      shutil.rmtree(checkpointDir)

    # Create a new directory for saving state
    self.__makeDirectoryFromAbsolutePath(checkpointDir)

    with open(checkpointPath, 'wb') as f:
      proto.write(f)
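
For context, this is roughly how we call it and measure the size on disk (a minimal sketch; the checkpoint directory path is hypothetical, and since the checkpoint file name is an internal detail we just sum everything under the directory):

import os

checkpointDir = "/tmp/htm-checkpoint"  # hypothetical location
model.writeToCheckpoint(checkpointDir)

# Sum the size of everything written under the checkpoint directory
totalBytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(checkpointDir)
    for name in names)
print("checkpoint size: %.2f MB" % (totalBytes / 1e6))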

When we look at the size of the serialized files, the total is over 9 MB; if we use to_bytes() instead of to_bytes_packed(), the size is 27 MB.
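
For reference, this is roughly how we compared the two encodings (a sketch using the standard pycapnp message API; `model` is the HTMPredictionModel instance from above):

# Build the capnp message the same way writeToCheckpoint() does,
# then compare the packed and unpacked encodings.
proto = model.getSchema().new_message()
model.write(proto)

unpacked = proto.to_bytes()
packed = proto.to_bytes_packed()
print("to_bytes():        %.2f MB" % (len(unpacked) / 1e6))
print("to_bytes_packed(): %.2f MB" % (len(packed) / 1e6))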

Can anyone explain why the serialized buffer is so large? I just want to point this out in case something is wrong. Can anyone confirm this is working the way it’s supposed to?


What are the model parameters you used to create this model? And how many rows of input data has it seen?

We are using the same model params as the hot gym example. We also used swarming and tried a lot of different model params; they don’t seem to change the resulting size.

But to clarify, we get the same results with the hot gym params. If this isn’t working the way it’s supposed to, please let me know and I will start looking into that code too.

It still depends on how much data the model has seen. It doesn’t surprise me that the model is that large if it has been running for a long time and has seen a lot of data. If that is unacceptable, you might be able to trim some segments before serialization and still retain the behavior you want.

The size we measured was after the model had computed only 300 samples.

I have also found that the serialized models are too large.
You can run a test: compress the model. If it gets much smaller, it’s NuPIC’s fault; if it stays roughly the same size, there is nothing you can do and the data is just that large.

@breznak

I was curious, so I compressed the raw bytes with zlib. An 8.99 MB model was compressed to 1.91 MB.
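
Roughly what I ran (a quick sketch; `proto` is the capnp message built as in the snippet above):

import zlib

# Compress the packed bytes to see how much redundancy they contain
raw = proto.to_bytes_packed()
compressed = zlib.compress(raw)
print("raw:        %.2f MB" % (len(raw) / 1e6))
print("compressed: %.2f MB" % (len(compressed) / 1e6))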

What do you think @scott?

To be fair, if it’s a new model, there are going to be a lot of zeros. Try it with a model that has learned (300 to 1000 records)…


I just talked to Scott, and he thinks that 10 MB is a typical size.