New Serialization Plan

Serialization is a core part of NuPIC. Saving trained models matters both for sharing within the community and for many potential applications. The key requirements are:

  • Speed - we want to be able to save to and load from disk as fast as possible.
  • Durability - a model should be completely identical after deserialization to the model that was saved. In some cases there are in-memory data structures that exist purely for optimization and do not need to be included in the serialized state. That is acceptable, but the behavior of the instance before serialization and the instance after deserialization should be identical.
  • Compatibility - we want a format that is easy to maintain as it evolves (previously saved models should still load) and that works across languages.
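
As an illustration of the durability requirement, here is a minimal sketch in which a derived cache is left out of the saved state but the restored object still behaves identically. The class ToyOverlapCounter and its to_state/from_state methods are hypothetical and exist only for this example; they are not part of NuPIC, and a plain dict stands in for the serialized form.

```python
class ToyOverlapCounter(object):
    """Hypothetical component; not part of NuPIC."""

    def __init__(self, weights):
        self.weights = list(weights)  # persistent state
        self._total = None            # derived cache, rebuilt on demand

    def total(self):
        # Compute the sum lazily and cache it as an optimization.
        if self._total is None:
            self._total = sum(self.weights)
        return self._total

    def to_state(self):
        # Only persistent state is saved; the cache is deliberately omitted.
        return {"weights": list(self.weights)}

    @classmethod
    def from_state(cls, state):
        # The cache starts empty and is rebuilt on first use.
        return cls(state["weights"])


original = ToyOverlapCounter([0.25, 0.5, 1.0])
original.total()  # populate the cache before saving
restored = ToyOverlapCounter.from_state(original.to_state())
```

The restored instance starts with an empty cache, yet restored.total() returns the same value as original.total(), which is the property the durability requirement asks for.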

New Format - Cap’n Proto

The new format uses Cap’n Proto. Serialization is typically implemented in read and write methods. These methods should primarily be implemented to take Reader or Builder instances as arguments, but they can additionally be implemented to take iostreams/file descriptors (C++) or file objects (Python). Cap’n Proto is currently integrated, but serialization is not yet implemented for all algorithm components (#1449).
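
The read/write pattern described above can be sketched roughly as follows. This is an assumption-laden illustration, not NuPIC code: ToyPooler and its fields are hypothetical, and a types.SimpleNamespace object stands in for a Cap’n Proto Builder/Reader so the example runs without the Cap’n Proto bindings installed.

```python
from types import SimpleNamespace


class ToyPooler(object):
    """Hypothetical component used only to illustrate the pattern."""

    def __init__(self, numColumns, potentialPct):
        self.numColumns = numColumns
        self.potentialPct = potentialPct

    def write(self, proto):
        """Copy this instance's state into a builder-like object."""
        proto.numColumns = self.numColumns
        proto.potentialPct = self.potentialPct

    @classmethod
    def read(cls, proto):
        """Create a new instance from a reader-like object."""
        # Explicit casts keep the attribute types predictable in Python.
        return cls(int(proto.numColumns), float(proto.potentialPct))


builder = SimpleNamespace()  # stand-in for a Cap'n Proto Builder (assumption)
pooler = ToyPooler(2048, 0.5)
pooler.write(builder)
restored = ToyPooler.read(builder)
```

Taking a builder or reader as the argument, rather than a file, lets a parent component pass a nested sub-message to each child, so file handling only needs to happen once at the top level.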

Schema

Cap’n Proto requires data formats to be specified in .capnp files. These should be placed in $NUPIC_CORE/src/nupic/proto/ (for C++ or shared files) or within the nupic Python package (for Python-only files). The nupic.core versions will be included in the nupic build at $NUPIC/nupic/bindings/proto/. To aid portability, imports in nupic.core schema files should be relative.

After adding a new schema file to nupic.core, add it to src/CMakeLists.txt in the section titled “Generate Cap’n Proto C++ files” so that it is included in the build.

For details on writing schemas, see the Cap’n Proto documentation.

Conventions

Please suffix schema names with Proto (like SpatialPoolerProto). There is an issue limiting our ability to wrap generated C++ code in namespaces, so the Proto suffix is important to avoid collisions with the corresponding C++ classes. Schema files should generally have the same name as the primary schema they contain. Sub-structs that don’t need to be exported can be nested inside the primary schema.

Also pay attention to data types. You may need to explicitly cast values when deserializing in Python. Notably, there is an issue with MSVC that requires you to explicitly cast string types.

Data Types

In general, use the minimum size needed but don’t be too aggressive. For general-purpose numeric values, the 32-bit versions usually make sense. Note that this may slightly change a value after deserialization, since Python (and some C++) attributes are stored using 64 bits.
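
The effect of storing a 64-bit value in a 32-bit field can be seen with a small sketch. It uses only Python’s standard struct module to emulate a 32-bit float field; the helper name roundtrip_float32 is made up for this example and no Cap’n Proto code is involved.

```python
import struct


def roundtrip_float32(value):
    """Pack a Python float (64-bit) into 32 bits and unpack it again."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 0.1
restored = roundtrip_float32(original)

# 0.1 is not exactly representable, so the 32-bit copy differs slightly.
print(original == restored)              # False
print(abs(original - restored) < 1e-7)   # True

# Values exactly representable in 32 bits survive unchanged.
print(roundtrip_float32(0.5) == 0.5)     # True
```

Tests that compare a model before and after serialization may therefore need a small tolerance for fields stored as 32-bit floats, rather than exact equality.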

Sample Integrations

Current/Old Custom Format

The current format is implementation-specific. It uses a combination of Python pickling (cPickle module) and direct file writing in C++. Models that use different implementations of the classifier or other components will not share a serialization format, and a saved model can only be deserialized into the same implementation that wrote it. This format is usually implemented in save and load methods. We hope to deprecate this format in early 2015.