Developing a better NuPIC. Need suggestions



Hi all NuPIC developers,
I have been using NuPIC.core for a while now and its a very powerful library. Yet I find a lot of problems inside it. Such as

  • Using vectors instead of some sort of tensor to store SDRs
    • The shape information need to be stored else where explicitly
    • Very annoying when access individual bits
  • Inconsistent API (especially Cells4 vs TemporalMemory)
  • Cells4 cannot be copied
  • Storing data as sparse matrix seems to be a bad idea (bad locality, high time complexity and not saving memory if SDR density is high)

Which most of them are problems within the fundamental structure of NuPIC,core. And thus I am developing a new library that works and feels better then NuPIC.core in my spare time. Which hopefully will accelerate the development of HTM based systems.

Here are some features I’m looking forward to implement

  • Using proper tensor library instead of vertors (planing on xtensor, Halide or TVM)
  • Consistent API
  • GPU support (over TVM or Halide)
  • Keras/PyTorch like (Functional) API
  • Serialize to MessagePack

Any other complains regarding NuPIC.core and suggestions on what else I can improve?


I for one would love the ability to easily define compositions of layers as one-line per layer, akin to Keras/PyTorch. Then an intro to nupic could be less intimidating to newcomers (which would probably help spread adoption of HTM-like systems).

Also like the idea of tensors. It’s already an established thing that seems to work in well over and over again in Deep Learning… why not let’s borrow a working wheel?


I too built my own utilities to manage SDRs. I will describe my solution:

I created a python class called “SDR” which is initialized with the dimensions for the SDR. The SDR class has three magic attributes:

  • sdr.dense - This is an array of boolean true/false values, one bit per location in the space.
  • sdr.index - This is a tuple of vectors of indices into the space, one vector in the tuple per dimension.
  • sdr.flat_index - This is a vector of indices into the flattened space.

Assigning to any of these magic attributes overwrites the SDR’s current value with a new value. After a value has been assigned to the SDR, it is available in all three formats. The format conversion is done when the attribute is accessed and the results are cached in case they’re used more than once.

Magic attributes are accomplished using pythons “@property

Over time, my implementation has accreted other useful features.
Unfortunately, I have an empty directory where I keep my unit tests…
My implementation can be found at:


I too had started trying to create my own SDR class.
I decided to forgo the Binary option, but kept the 2 other options, flat index and dimensional.
At the core, i used python sets to store just the flat_index of the ‘on’ bits. With SDR, the more sparse it is, the less data the set has to store. However, it loses some performance for dimensional indexes because they are calculated on-the-fly.
The advantages to this storage method are that you can use set intersection, union and “in” for comparison, manipulation and searching.

Unfortunately, I stalled out in my implementation so there is no working code I can share. Just ideas.


See also: & Other HTM projects for a list of HTM implementations that exist (in some form) already.


Just to be clear on what I mean “Storing data as sparse matrix seems to be a bad idea”.
I’ve written a benchmark in C++ comparing how fast it is to perform SDR union with BinarySDRs (sroring SDR as an array of unit_8) vs a IndexSDR (storing the index of on bits). The result is astonishing (I’m not expecting such speed difference). BinarySDR is around 1100x faster.

(Benchedmark at, SDR size = 8192, density = 0.5)
Source code avliable here:


So I found out that using logical or instead of binary or is alot slower. Thus iflogical or is used; the perfoemance difference dropes from 1100x to 220x. Yet, still way to much to ignore.


Sparse arrays emphasize compact storage and smaller memory footprint. Always a tradeoff between memory and processing speed.

This is one of the defacto experts on the matter: Daniel Lemire

Here is a link to some of his subject matter:

There are also white papers he’s written about it:


your benchmarked density is set to 0.5, which does not seem like a very ‘sparse’ DR. The more sparse, the more advantageous the indexed representation.
you may also have a copy of the resulting sdr::set in your loop, impairing the indexed case.
(as a side note, the binary version size is 8 times too large, and 8 times sparsier… but that won’t change much of anything anyway)

Test with 2% density results were more like, 57x faster for binary.

but more generally, @cogmission has it. Mem vs Speed is often a tradeoff. Also bear in mind, Mem begins impacting speed in a very significant way once in a real application and dealing with caching issues.
Anyway. More compact representation will have the edge once you need to serialize it. So, each of these structures will have its use case.


Thankyou, I should have set the density to 2%. Initially I want to bash the bash predictor more. I see why it is a mistake.

Anyway I also ran another test using vectors and linear search. Totally remving any data copying, and it’s around 14x slower. ( I know my binary SDRs are 8 times larger than what they should be. But that’s more realistic since TVM and xtensor can’t compress boolean arrays right now.

As for memory effieceny and caching issues. I have some oppositions. Firstly, storing indexes and doing a union operation require to perform a buch of linear seach. Breaking data locality. Secondly, Linear search requires way more code than doing binary or in a loop. Thus cannot fit in the L1 cache and possibly the L2 cache; making it slow. Thirdly sotring index as 32bit intengers (Let’s assume 16 bits is too tight). The maximal density a SDR can have inorder to save space is 3.1%. Higher than the recommendation of 4~5% in the HTM School series. And the limit becomes 1.6% when the indexes are stored as 64bit values.

Yet storing data as 32 bit integers at the density of 2% is 12.5 times more space efficent then storing as 8 bit booleans. Maybe the trade off is worth it? I need to rethink about it.


What you could try is benchmarking the impact of unpacking an indexed representation to binary, before using binary itself for computation.
Then, indexed representation fits better in a vector instead of a map set, as no random lookup is required.

(Note: 16b does not seem too tight for most purposes… you have larger than 64k SDRs ? … you could view a larger one as a concatenation of 64K-max ones anyway)


There definatelly will be a case where more then 16bits is needed. For example. (in NuPIC.core) in the Cells4 layer although a entire column is either on or off. The individual neuron still output a bit to the SDR. i.e. sending a SDR of length 10 to a Cells4 layer with column length of 16 results in a SDR of length 160.

I don’t know why it is. But that’s the case. Thus sending a SDR of length 4096 into a Cells4 layer of column depth 32 will result in a SDR of length 65536, exceeding the maximal value of a 16bit integer.


This is arguably one very large setup. Anyway, what I’m saying is that you can always view larger ones as a concatenation of 16bits-indexed chunks.
something like this:

std::vector<std::vector<unsigned short>> chunkAs16bIndexed(const std::vector<size_t>& indices, size_t maxIndex) {
	static const size_t chunkSize = 1<<16; 
	static const size_t chunkMask = chunkSize-1; 

	size_t chunkCount = 1 + (maxIndex >> 16);
	std::vector<std::vector<unsigned short>> result;

	for(auto it = indices.begin(); it != indices.end(); it++) {
		size_t index = *it;
		size_t chunk = index >> 16;
		unsigned short indexInChunk = (unsigned short)(index & chunkMask);
	return result;

Maybe that seems a lot of work for no good benefits, but if the point of the indexed stuff is, eg, packing for network transfer, that could make a difference:

a 1048576-cells bitwise representation holds in 128KB
a 1048576-cells SDR at 2% density, indexed 32bits holds in around 82KB.
a 1048576-cells SDR at 2% density, chunked into 16x indexed 16bits holds in around 41KB.

Seems like compression would increase, the smaller the index footprint, right ? But note that trying to use that chunking scheme towards 8b-indices will give you too much vector-boilerplate-padding around too few meaningful points, for this kind of density values. So. 16b looks like a sweet spot for most purposes.

Although there could be other sweet spots if you were to renounce to std::vector usage and bit-pack things in a homemade array… something like 12b-indices (allowing 4096 cells in one go) would be my best guess.
Maybe also plainly zipping the thing would be better than whatever I can come up with anyway… Benefit of chunking is to be still usable almost out-of-the box.


Just looked at ‘Roaring’ presentation. Interestingly they use 16b chunks :stuck_out_tongue:

After reading this, I’m realizing that a RLE could be a perfect fit for our SDRs, since we have some precise idea of our sparsity and can optimize its footprint.
Something like,

  • ‘0’+5b -> step of 32 to 63 zeroes until next ‘1’ — this is the most common case for our ~50 cells spacing on average. And holds on 6b.
  • ‘10’+5b -> step of 0 to 31 zeroes until next ‘1’ — holds on 7b
  • ‘110’+6b -> step of 64 to 127 zeroes until next ‘1’ — holds on 9b
  • ‘1110’+9b -> step of 128 to 639 zeroes until next ‘1’ — holds on 13b, presumably less common
  • ‘1111’+16b -> remaining cases of very large, 640+ zeroes until next ‘1’ — holds on 20b, very uncommon

(of course mostly for serialization purposes… @marty1885’s benchmark results show this is a no-brainer when it comes to actually perform computations on SDRs)


Performance is a big issue.

IMO, GPUs will be inefficient with HTM (not enough computational work --FP operations–, to compensate
the architectural limitations of the GPUs).

Concurrency need to be a first-class citizen. Thread Pool design pattern (or coroutines) can help a lot without adding much complexity.


Maybe. GPUs are not efficient when dealing with non-sequential computations. That’s why I propose to use binary SDRs and to use of TVM. Binary SDRs make the computation of HTM a lot more sequential and TVM allows us to execute HTM algorithms on not only GPUs but also TPU and FPGA (not saying it makes thing faster).

As for a thread pool… Why thread pool? Both OpenCL and CUDA provides NDRange; executation; which most likely are thread pools behind the scene.


I’m not very fond of Thread Pools as a solution.
But well managed parallel processing of, say, independent parts from an array of cells, could do. Between well identified execution points.
IIRC latest C++ standard additions allow std::parallel and such for a portable way to address these concerns.


The GPU is (more or less) a gigantic vector unit: mostly exploit SIMD. What they are doing is:

1.- Gather From memory
2.- SIMD compute
3.- Scatter to memory

If your compute part is small in the total program (Because the size of “composed” vector is small or because the computational cost of the operation is small), you can’t amortise the cost of memory accesses. In a GPU the memory hierarchy is quite “weak” (because in its usage scenarios the SIMD part is huge). In HTM SIMD cost is really small. TPU is a “specialised” unit in performing gigantic SIMD fused multiply add (FMADD). Perhaps FPGA… but i think that the memory requirements are huge for that.

My point is that HTM algorithms are not SIMD friendly… but the are lots of opportunity for MIMD! (true concurrency). For example if you use std::async() and synchronise via barriers (with std::futures) you can split the system in two independent threads. As trivial case you can run SP and TM is two threads (if you have only one region). This is like a few lines of code in your code (just as example, i don’t remember nupic.core API ):

t_SDR spOutput; //both regions are pipelined
std::vector< std::future<void> > barrier;
barrier.emplace_back(std::async(std::launch::async,&computeSP(),inSP, spOutput....));
barrier.emplace_back(std::async(std::launch::async,&computeTP(), spOutput, tmOutput....));
for ( auto &&done:barrier )

In the backstage the runtime is using a pool of threads to avoid thread creation destruction overheads: the thread is just waiting (in a a conditional var) for your std::async call. Beware, libstdc++ seems to be a bit slow at it. A “personalised” pool seems to perform better.

With coroutines this might be even simpler since SP-TM (producer-consumer) can interact directly. Unfortunately coroutines is part of C++20.


That might require to change the core algorithms. TP can be done with touching base code.