I’d like to take a moment to introduce a new feature which I’ve added to the community fork of Nupic. Although the majority of the library is still under development, this piece is ready for the spotlight.
Sparse Distributed Representations (SDRs) are a critical component of all HTM models, and are used throughout Nupic. Numenta has done significant mathematical analysis of SDRs and their properties.
I made a new class, SDR, for working with Sparse Distributed Representations. The SDR class has methods to hold, measure, and manipulate them. Key features include (see the usage sketch after this list):
Seamless conversions between all commonly used SDR data formats (sparse, dense, coordinates)
Methods to generate random SDRs & add noise to SDRs
Methods to measure Sparsity, Overlap, Activation Frequency & Binary Entropy
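To give a flavor of how it’s used, here is a rough sketch. The method names below are my shorthand and may differ slightly from the final API:

```cpp
// Illustrative usage only: method names are approximate.
#include <nupic/types/Sdr.hpp>

int main() {
  nupic::SDR a({ 2000u });   // a 2000-bit SDR, initially all zeros
  a.randomize( 0.10f );      // activate a random 10% of the bits

  a.getDense();              // the same data as a vector of 0/1 bytes
  a.getFlatSparse();         // the same data as a list of active indices

  nupic::SDR b(a);           // copy it, then corrupt the copy:
  b.addNoise( 0.05f );       // move 5% of the active bits elsewhere
  return 0;
}
```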
Very cool! Looks like you put a lot of work into it.
You also might want to consider another option: a bit-level representation on the backend when compactness is important. In many cases the sparse representation is still larger than a binary-only dense representation. Consider an SDR of size 2000 with an activation rate of 10%:
dense byte: 2000 * 1 = 2000 bytes (1 byte per value)
sparse unsigned int: 2000 * 0.1 * 4 = 800 bytes (4 bytes per unsigned int)
dense bit: 2000 / 8 = 250 bytes (8 values per byte)
This is something I’ve always wanted, but I’ve settled for byte-level instead, which is 8 times larger. It’s just the bit-from-byte wrangling that I haven’t had time to implement yet, and I’m not sure whether it would add any other performance overhead.
Of course, overlap scores will still need an integer result for each position in the array.
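For what it’s worth, the bit-from-byte wrangling is mostly shifts and masks, and overlap can be counted a word at a time with popcount. A minimal sketch, purely illustrative (not from any existing PR):

```cpp
#include <cstdint>
#include <vector>

// Bit i of a packed buffer lives in byte i/8, at position i%8.
inline bool getBit(const std::vector<uint8_t> &v, size_t i) {
  return (v[i >> 3] >> (i & 7)) & 1u;
}
inline void setBit(std::vector<uint8_t> &v, size_t i, bool on) {
  if (on) v[i >> 3] |= uint8_t(1u << (i & 7));
  else    v[i >> 3] &= uint8_t(~(1u << (i & 7)));
}

// Overlap = number of positions active in both SDRs. AND the bytes and
// count the surviving 1s; the score is still an ordinary integer.
inline unsigned overlap(const std::vector<uint8_t> &a,
                        const std::vector<uint8_t> &b) {
  unsigned total = 0;
  for (size_t i = 0; i < a.size(); ++i)
    total += unsigned(__builtin_popcount(a[i] & b[i])); // GCC/Clang builtin
  return total;
}
```

Using 64-bit words instead of bytes would let each popcount handle 64 positions at once.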
Yeah, the bits-vs-bytes implementation is problematic. It would probably reduce memory usage at little to no performance cost. C++ even has a specialized vector for boolean values (std::vector<bool>, which bit-packs its contents); we looked into it, but it had a few caveats, so we settled for bytes instead.
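For reference, the caveats are the usual std::vector<bool> ones: because the elements are packed bits, you lose ordinary pointer and reference access to them.

```cpp
#include <vector>

int main() {
  std::vector<bool> v(2000, false);

  v[7] = true;          // OK, but operator[] returns a proxy object,
                        // not a bool&, since elements are packed bits
  bool copy = v[7];     // reading through the proxy is fine
  (void)copy;

  // bool *p = &v[0];   // will not compile: a packed bit has no address
  // v.data();          // will not compile: there is no contiguous
                        // bool array to point into
  return 0;
}
```

That lack of a raw buffer is what makes it awkward to hand the data to SIMD code or language bindings.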
This is a pretty clever idea for handling SDR formats. Sparse and dense vectors are each better suited to different HTM algorithms, so this is a good compromise rather than simply storing both formats and keeping them in sync.
I’m unsure whether bit-level representations are wise for GPU and CPU applications, because you need some overhead to access particular bits at particular addresses, whereas indexing into an array of 8-bit values is fast and cheap. Bit-level representations would work extremely well on FPGAs specifically designed to handle them.
Internally, the SDR class actually stores both the sparse and dense formats until one of them is assigned to. When a format is assigned to, the other formats are invalidated and are recalculated from the latest data the next time they’re accessed.
Previous Nupic would store only one of the formats (which one was somewhat arbitrary) and frequently converted between them, because it often did not have the right format for the job at hand.
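In outline, the scheme looks something like this simplified sketch (names here are illustrative; the real class also tracks the coordinate format the same way):

```cpp
#include <cstdint>
#include <vector>

class SdrSketch {
  size_t size_;
  mutable std::vector<uint8_t>  dense_;
  mutable std::vector<uint32_t> sparse_;
  mutable bool denseValid_  = true;
  mutable bool sparseValid_ = true;

public:
  explicit SdrSketch(size_t size) : size_(size), dense_(size, 0) {}

  // Assigning one format marks the other one stale...
  void setSparse(std::vector<uint32_t> indices) {
    sparse_ = std::move(indices);
    sparseValid_ = true;
    denseValid_  = false;
  }
  void setDense(std::vector<uint8_t> bits) {
    dense_ = std::move(bits);
    denseValid_  = true;
    sparseValid_ = false;
  }

  // ...and a stale format is rebuilt only when it is next asked for.
  const std::vector<uint8_t> &getDense() const {
    if (!denseValid_) {
      dense_.assign(size_, 0);
      for (auto i : sparse_) dense_[i] = 1;
      denseValid_ = true;
    }
    return dense_;
  }
  const std::vector<uint32_t> &getSparse() const {
    if (!sparseValid_) {
      sparse_.clear();
      for (size_t i = 0; i < size_; ++i)
        if (dense_[i]) sparse_.push_back(uint32_t(i));
      sparseValid_ = true;
    }
    return sparse_;
  }
};
```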
This is something I’m also eager to experiment with; we’re considering bit-vectors, even using half precision (instead of float32), etc.
This PR has two nice things for you:
nupic.cpp uses only Connections as its backend (so SP, TM, … all go through it), so there is one place to do these optimizations, and they propagate globally
this PR adds typedefs for specific data-types, so you can easily do these experiments
As David is saying, the memory footprint, and especially the cache footprint, can be improved, but CPUs are optimized for 32-bit aligned arithmetic. My experiments showed that using unsigned char instead of uint32 actually performed worse.
2 optimization paths:
(auto) vectorization:
we don’t do any specific work to vectorize, so we rely on the compilers’ auto-vectorization capabilities
https://www.andrew.cmu.edu/user/mkellogg/15-418/final.html
This, or Eigen, could be a good base for SIMD-vectorized data types.
Investigate whether Eigen (or a similar approach) can be used as the backend in SDR to improve performance (currently SDR holds plain std::vectors)
CPU cache footprint
smaller data types would mean more of the data fits in the CPU’s cache, but see the SIMD point above
one trick I want to try is using half precision (float16) for Permanence (now float32) in Connections
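To illustrate the typedef idea mentioned above (the aliases below are hypothetical, not nupic.cpp’s actual names), one central alias per type is all it takes to make such experiments a one-line change:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

using Permanence = float;        // today: float32
// using Permanence = _Float16;  // the experiment: half precision
                                 // (GCC/Clang extension; no standard
                                 // float16 type in C++ here)
using SynapseIdx = uint32_t;     // could likewise try uint16_t where
                                 // cell counts permit

struct Synapse {
  SynapseIdx presynapticCell;
  Permanence permanence;
};

int main() {
  std::vector<Synapse> segment(1024);
  // Fewer bytes per synapse means more synapses per cache line
  // (modulo struct padding), which is the point of the experiment.
  std::printf("bytes per synapse: %zu\n", sizeof(Synapse));
  return 0;
}
```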
Hi dmac, I tip my hat to the work that you’ve done here. Apologies in advance for asking too many “noob” questions up front. I figure I’m allowed 3 before I get on anyone’s nerves.
As far as coding style is concerned, why the use of so many typedefs? A char is a char is a char. Was it a personal choice based on your style, or was there (as I suspect) some compatibility reason with JavaScript or Python?
Do you think there would be any benefit to using Boost’s bitset data structure as opposed to a vector of chars/uints?
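For instance (assuming boost::dynamic_bitset is the structure in question), something like:

```cpp
#include <boost/dynamic_bitset.hpp>

int main() {
  boost::dynamic_bitset<> a(2000), b(2000); // packed: ~250 bytes each
  a.set(3);  a.set(44);
  b.set(44); b.set(512);

  // Set intersection and an integer overlap score come for free:
  auto score = (a & b).count();             // == 1 here
  (void)score;
  return 0;
}
```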
I’m gonna save this one for desperate times.
Thank you in advance for your response.
s/ Jeff