I’d like to take a moment to introduce a new feature which I’ve added to the community fork of Nupic. Although the majority of the library is still under development, this piece is ready for the spotlight.
Sparse Distributed Representations (SDRs) are a critical component of all HTM models, and are used throughout Nupic. Numenta has done significant mathematical analysis of SDRs and their properties.
I made a new class, SDR, for working with Sparse Distributed Representations. The SDR class has methods to hold, measure, and manipulate them. Key features include (see the usage sketch after this list):
Seamless conversions between all commonly used SDR data formats (sparse, dense, coordinates)
Methods to generate random SDRs & add noise to SDRs
Methods to measure Sparsity, Overlap, Activation Frequency & Binary Entropy
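To give a flavor of how it’s used, here is a rough sketch. The method names below are my shorthand and may differ slightly from the final API:

```cpp
// Illustrative usage only: method names are approximate.
#include <nupic/types/Sdr.hpp>

int main() {
  nupic::SDR a({ 2000u });   // a 2000-bit SDR, initially all zeros
  a.randomize( 0.10f );      // activate a random 10% of the bits

  a.getDense();              // the same data as a vector of 0/1 bytes
  a.getFlatSparse();         // the same data as a list of active indices

  nupic::SDR b(a);           // copy it, then corrupt the copy:
  b.addNoise( 0.05f );       // move 5% of the active bits elsewhere
  return 0;
}
```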
Very cool! Looks like you put a lot of work into it.
You also might want to consider another option: a bit-level representation on the backend when compactness is important. In many cases the sparse representation is still larger than a binary-only dense representation. Consider an SDR of size 2000 with an activation rate of 10%:
dense byte: 2000 * 1 = 2000 bytes (1 byte per value)
sparse unsigned int: 2000 * 0.1 * 4 = 800 bytes (4 bytes per unsigned int)
dense bit: 2000 / 8 = 250 bytes (8 values per byte)
This is something I’ve always wanted, but I’ve settled for byte-level instead, which is 8 times larger. It’s just the bit-from-byte wrangling that I haven’t had time to implement yet, and I’m not sure whether it would add any other performance overhead.
Of course, overlap scores will still need an integer result for each position in the array.
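For what it’s worth, the bit-from-byte wrangling is mostly shifts and masks, and overlap can be counted a word at a time with popcount. A minimal sketch, purely illustrative (not from any existing PR):

```cpp
#include <cstdint>
#include <vector>

// Bit i of a packed buffer lives in byte i/8, at position i%8.
inline bool getBit(const std::vector<uint8_t> &v, size_t i) {
  return (v[i >> 3] >> (i & 7)) & 1u;
}
inline void setBit(std::vector<uint8_t> &v, size_t i, bool on) {
  if (on) v[i >> 3] |= uint8_t(1u << (i & 7));
  else    v[i >> 3] &= uint8_t(~(1u << (i & 7)));
}

// Overlap = number of positions active in both SDRs. AND the bytes and
// count the surviving 1s; the score is still an ordinary integer.
inline unsigned overlap(const std::vector<uint8_t> &a,
                        const std::vector<uint8_t> &b) {
  unsigned total = 0;
  for (size_t i = 0; i < a.size(); ++i)
    total += unsigned(__builtin_popcount(a[i] & b[i])); // GCC/Clang builtin
  return total;
}
```

Using 64-bit words instead of bytes would let each popcount handle 64 positions at once.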
Yeah, the bits-vs-bytes implementation is problematic. It would probably reduce memory usage at little to no performance cost. C++ even has a specialized vector for boolean values (std::vector<bool>, which bit-packs its contents); we looked into it, but it had a few caveats, so we settled for bytes instead.
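For reference, the caveats are the usual std::vector<bool> ones: because the elements are packed bits, you lose ordinary pointer and reference access to them.

```cpp
#include <vector>

int main() {
  std::vector<bool> v(2000, false);

  v[7] = true;          // OK, but operator[] returns a proxy object,
                        // not a bool&, since elements are packed bits
  bool copy = v[7];     // reading through the proxy is fine
  (void)copy;

  // bool *p = &v[0];   // will not compile: a packed bit has no address
  // v.data();          // will not compile: there is no contiguous
                        // bool array to point into
  return 0;
}
```

That lack of a raw buffer is what makes it awkward to hand the data to SIMD code or language bindings.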
This is a pretty clever idea for handling SDR formats. Sparse and dense vectors are each better suited to different HTM algorithms, so this is a good compromise rather than simply storing both formats and keeping them in sync.
I’m unsure whether bit-level representations are wise for GPU and CPU applications, because you need some overhead to access particular bits at particular addresses, whereas indexing into an array of 8-bit values is fast and cheap. Bit-level representations would work extremely well on FPGAs specifically designed to handle them.
Internally, the SDR class actually stores both the sparse and dense formats until one of them is assigned to. When a format is assigned to, the other formats are invalidated and are recalculated from the latest data the next time they’re accessed.
Previous Nupic would store only one of the formats (which one was somewhat arbitrary) and frequently converted between them, because it often did not have the right format for the job at hand.
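In outline, the scheme looks something like this simplified sketch (names here are illustrative; the real class also tracks the coordinate format the same way):

```cpp
#include <cstdint>
#include <vector>

class SdrSketch {
  size_t size_;
  mutable std::vector<uint8_t>  dense_;
  mutable std::vector<uint32_t> sparse_;
  mutable bool denseValid_  = true;
  mutable bool sparseValid_ = true;

public:
  explicit SdrSketch(size_t size) : size_(size), dense_(size, 0) {}

  // Assigning one format marks the other one stale...
  void setSparse(std::vector<uint32_t> indices) {
    sparse_ = std::move(indices);
    sparseValid_ = true;
    denseValid_  = false;
  }
  void setDense(std::vector<uint8_t> bits) {
    dense_ = std::move(bits);
    denseValid_  = true;
    sparseValid_ = false;
  }

  // ...and a stale format is rebuilt only when it is next asked for.
  const std::vector<uint8_t> &getDense() const {
    if (!denseValid_) {
      dense_.assign(size_, 0);
      for (auto i : sparse_) dense_[i] = 1;
      denseValid_ = true;
    }
    return dense_;
  }
  const std::vector<uint32_t> &getSparse() const {
    if (!sparseValid_) {
      sparse_.clear();
      for (size_t i = 0; i < size_; ++i)
        if (dense_[i]) sparse_.push_back(uint32_t(i));
      sparseValid_ = true;
    }
    return sparse_;
  }
};
```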
This is something I’m also eager to experiment with; we’re considering bit-vectors, even using half precision (instead of float32), etc.
This PR has two nice things for you:
nupic.cpp uses only Connections as its backend (so SP, TM, … all go through it), so there is one place to do these optimizations, and they propagate globally
this PR adds typedefs for specific data-types, so you can easily do these experiments
As David is saying, the memory footprint, and especially the cache footprint, can be improved, but CPUs are optimized for 32-bit aligned arithmetic. My experiments showed that using unsigned char instead of uint32 actually performed worse.
2 optimization paths:
(auto) vectorization:
we don’t do any specific work to vectorize, so we rely on the compilers’ auto-vectorization capabilities
https://www.andrew.cmu.edu/user/mkellogg/15-418/final.html
This, or Eigen, could be a good base for SIMD-vectorized data types.
Investigate whether Eigen (or a similar approach) can be used as the backend in SDR to improve performance (currently SDR holds plain std::vectors)
CPU cache footprint
smaller data types would mean more of the data fits in the CPU’s cache, but see the SIMD point above
one trick I want to try is using half precision (float16) for Permanence (now float32) in Connections
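To illustrate the typedef idea mentioned above (the aliases below are hypothetical, not nupic.cpp’s actual names), one central alias per type is all it takes to make such experiments a one-line change:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

using Permanence = float;        // today: float32
// using Permanence = _Float16;  // the experiment: half precision
                                 // (GCC/Clang extension; no standard
                                 // float16 type in C++ here)
using SynapseIdx = uint32_t;     // could likewise try uint16_t where
                                 // cell counts permit

struct Synapse {
  SynapseIdx presynapticCell;
  Permanence permanence;
};

int main() {
  std::vector<Synapse> segment(1024);
  // Fewer bytes per synapse means more synapses per cache line
  // (modulo struct padding), which is the point of the experiment.
  std::printf("bytes per synapse: %zu\n", sizeof(Synapse));
  return 0;
}
```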
Hi dmac, I tip my hat to the work that you’ve done here. Apologies in advance for asking too many “noob” questions up front. I figure I’m allowed 3 before I get on anyone’s nerves.
As far as coding style is concerned, why the use of so many typedefs? A char is a char is a char. Was it a personal choice based on your style, or was there (as I suspect) some compatibility reason with JavaScript or Python?
Do you think there would be any benefit to using Boost’s bitset data structure as opposed to a vector of chars/uints?
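For instance (assuming boost::dynamic_bitset is the structure in question), something like:

```cpp
#include <boost/dynamic_bitset.hpp>

int main() {
  boost::dynamic_bitset<> a(2000), b(2000); // packed: ~250 bytes each
  a.set(3);  a.set(44);
  b.set(44); b.set(512);

  // Set intersection and an integer overlap score come for free:
  auto score = (a & b).count();             // == 1 here
  (void)score;
  return 0;
}
```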
I’m gonna save this one for desperate times.
Thank you in advance for your response.
s/ Jeff