Pearls from the ML

Tidbits that should not be lost, but we’re too lazy to properly classify

Encoder’s #ON bits (Subutai)

[w >= 3] … A reasonable value, though, would be much higher. We discussed this quite a bit at Grok over the years. We use a value of 21 in the OPF/Grok. The reason for this is that you need a sufficiently high number of ON bits to have columns with a solid match and really be able to discriminate inputs. If you used a value of 3 and just one field, there would be tons of columns with overlap=3 and the winner would basically be random. A small shift in the input would then cause a large change in the SDR output. With higher w’s, the winners will really match the input well, and a small change in the input should cause only a small change in the output.
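To make the effect of w concrete, here is a minimal sketch of a toy scalar encoder, written for this page and not taken from NuPIC. The names `encode` and `shared_fraction`, the 0 to 99 value range, and the fixed count of 100 buckets are all assumptions made for illustration. It only shows how many ON bits two nearby inputs share at the encoder level; it does not model the spatial pooler's column competition described above.

```python
def encode(value, w, min_val=0.0, max_val=99.0, buckets=100):
    """Toy scalar encoder (hypothetical, not the NuPIC ScalarEncoder).

    Returns the indices of the w contiguous ON bits for value. The total
    width is buckets + w - 1, so the resolution stays the same for every w.
    """
    if not min_val <= value <= max_val:
        raise ValueError("value out of range")
    fraction = (value - min_val) / float(max_val - min_val)
    start = int(round(fraction * (buckets - 1)))
    return frozenset(range(start, start + w))


def shared_fraction(a, b):
    """Fraction of the ON bits in a that are also ON in b."""
    return len(a & b) / float(len(a))


for w in (3, 21):
    base = encode(50, w)
    for shift in (1, 3):
        shifted = encode(50 + shift, w)
        print("w=%d shift=%d shared=%.2f" % (w, shift, shared_fraction(base, shifted)))
```

In this toy setup, shifting the input by three buckets leaves no shared ON bits when w=3, while w=21 still shares 18 of its 21 bits, so similar inputs stay similar in the encoding.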

NuPIC API (Subutai)

Algorithms - This level contains implementations of the Spatial and Temporal Poolers. If you just want to work with the raw algorithms, this is the easiest level to use. The file spatial_pooler.py contains a clean implementation of the spatial pooler that can be used directly. See hello_tp.py for how to use the temporal pooler directly. Matt used this for his nupic_nlp implementation.

Network API - This level formalizes the concept of “Networks” and “Regions”. This API allows you to string together multiple regions, including hierarchies. You can send the output of N Regions into higher-level Regions. It formalizes initialization, input/output vectors, a unified mechanism for setting/getting parameters, serialization, and the order in which compute is called on individual regions (this is very important for hierarchies). It is one level above the algorithms and is agnostic to the specific algorithm. For example, a Region can have a CLA implementation or a KNN implementation. It is 100% written in C++, and it is very small and clean. It can support multiple language bindings, with Python being the main one currently implemented.
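The ideas in this description (regions that are agnostic to their algorithm, links between regions, uniform parameter access, and a fixed compute order) can be sketched in a few lines of plain Python. This is a toy model only: the classes `Region` and `Network` below and all of their methods are hypothetical stand-ins written for this page, not the actual NuPIC Network API or its bindings.

```python
class Region(object):
    """Hypothetical region: wraps any algorithm behind one interface."""

    def __init__(self, name, fn, **params):
        self.name = name
        self.fn = fn              # the algorithm; the network never inspects it
        self.params = dict(params)
        self.output = None

    def set_parameter(self, key, value):
        self.params[key] = value

    def get_parameter(self, key):
        return self.params[key]

    def compute(self, inputs):
        self.output = self.fn(inputs, self.params)


class Network(object):
    """Hypothetical network: owns the regions, the links, and the compute order."""

    def __init__(self):
        self.regions = {}
        self.order = []           # compute order = order in which regions were added
        self.links = {}           # destination name -> list of source names

    def add_region(self, region):
        self.regions[region.name] = region
        self.order.append(region.name)

    def link(self, src, dest):
        # A source must be computed before its destination.
        assert self.order.index(src) < self.order.index(dest)
        self.links.setdefault(dest, []).append(src)

    def run(self, external_input):
        for name in self.order:
            sources = self.links.get(name, [])
            # Regions without incoming links read the external input.
            inputs = [self.regions[s].output for s in sources] or [external_input]
            self.regions[name].compute(inputs)
        return self.regions[self.order[-1]].output


def scale(inputs, params):
    return [x * params["factor"] for x in inputs[0]]


def total(inputs, params):
    return sum(sum(vec) for vec in inputs)


net = Network()
net.add_region(Region("low_a", scale, factor=2))
net.add_region(Region("low_b", scale, factor=3))
net.add_region(Region("top", total))
net.link("low_a", "top")
net.link("low_b", "top")
print(net.run([1, 2]))
```

Here two lower-level regions feed one higher-level region, and the network, not the regions, decides when each compute call happens. Swapping `scale` for a different function changes the algorithm without touching the wiring, which is the sense in which the layer is agnostic to the algorithm.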

OPF - The Online Prediction Framework is a client of the Network API and is used in Grok’s commercial product. It is designed for a very specific use case: small streaming data applications. The OPF reflects three years of exploring dozens of different industries and business models while we searched for commercial applications of the CLA. At least 40% of this code is not used anywhere anymore. We haven’t had time to clean it up. I would characterize this code as very powerful, very messy, and not easy to understand. The OPF includes encoders, the classifier, swarming, and the description file format. All this stuff is pretty specific to streaming data applications such as energy, IT data, etc. It does not support hierarchies, vision, and so on. The hotgym sample is an example of using the OPF.

Support - I include this because there is also a grab bag of other components that are used in various places. This includes the sparse math library, test routines, the build system, etc.

So what is the NuPIC API? My strawman proposal: the main API for NuPIC should be the Network/Regions API. It is very generic and clean. It is independent of specific algorithm implementations, and can support a very wide range of use cases, including streaming data, vision, audio, and hierarchies. It can also easily support other languages. It can support both experimentation and commercial uses.