Performance optimizations

A highly desired topic for many of us.
I’ve started the work in

Related discussions:


  • [ ] set baseline benchmarking tests, the more, the better
    • micro benchmarks
    • IDE profiling
  • [ ] refactor code to use shared, encapsulated class for passing around data, “SDR type”
    • for now it could be typedef UInt*,
    • later wrap vector, add some methods,
    • even later wrap opt-Matrix type required by the library,…
  • [ ] identify bottlenecks
  • [ ] compare math library toolkits
    • the libraries have their own data types (Eigen matrices, etc.)
    • converting to/from them will kill the (gained) performance -> “SDR type”
  • [ ] iterative optimizations
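To make the “SDR type” item above more concrete, here is a minimal sketch of the staged plan (plain alias first, then a vector wrapper with a few methods). All names (`SDR`, `setSparse`, `overlap`, `dense`) are hypothetical and are not existing nupic.core API; the point is only that call sites depend on one type, so the storage behind it could later be swapped for a math library's matrix type without repeated conversions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not nupic.core code.
using UInt = std::uint32_t;

// Stage 1 ("for now"): a plain alias, so function signatures can migrate first.
// using SDR = UInt*;

// Stage 2 ("later"): wrap a vector of active-bit indices and add some methods.
class SDR {
public:
  explicit SDR(std::size_t size) : size_(size) {}

  // Store the indices of the active bits, sorted and without duplicates.
  void setSparse(std::vector<UInt> active) {
    std::sort(active.begin(), active.end());
    active.erase(std::unique(active.begin(), active.end()), active.end());
    if (!active.empty() && active.back() >= size_) {
      throw std::out_of_range("active index exceeds SDR size");
    }
    sparse_ = std::move(active);
  }

  const std::vector<UInt>& sparse() const { return sparse_; }
  std::size_t size() const { return size_; }

  // Dense 0/1 view, built on demand for code that still expects it.
  std::vector<unsigned char> dense() const {
    std::vector<unsigned char> d(size_, 0);
    for (UInt i : sparse_) d[i] = 1;
    return d;
  }

  // Number of active bits shared with another SDR (merge over sorted indices).
  std::size_t overlap(const SDR& other) const {
    std::size_t count = 0;
    auto a = sparse_.begin();
    auto b = other.sparse_.begin();
    while (a != sparse_.end() && b != other.sparse_.end()) {
      if (*a < *b) ++a;
      else if (*b < *a) ++b;
      else { ++count; ++a; ++b; }
    }
    return count;
  }

private:
  std::size_t size_;
  std::vector<UInt> sparse_;
};
```

Stage 3 (“even later”) would replace the private `std::vector` with whatever optimized type the chosen library requires, keeping the public methods unchanged.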


  • What do we want from the library?
  • max speed
  • multi-platform
  • sparse (memory efficient)
  • big user-base, popular
  • low code “intrusiveness”
  • CPU backend (SSE, OpenMP)
  • NVIDIA GPU backend (CUDA)
  • AMD GPU backend (OpenCL)
  • generic GPU backend (both AMD and NVIDIA; likely on OpenCL?)
  • open source
  • clean & lean API (ease of use, high level)
  • bindings/support for other languages (python,…)
  • used in other AI/ML frameworks (TensorFlow, scikit-learn, Torch,…)
  • I don’t need no optimizations

Considered toolkits:


As an alternative way of optimization, I came up with

I think the vote is missing the number one priority and that is “easy to understand”.

That one is “low code intrusiveness”, i.e. the minimum number of changes needed in the current code to implement the optimizations. (Or it could also be “I don’t need no optimizations”.)

@chhenning do you have an opinion on Eigen (vs Blaze), given that pybind11 has support for Eigen?

Also, I’m going to do some refactoring in the encoders, SP, and TM (and maybe touch regions) to introduce a common typedef for “SDRvector”. I should probably coordinate this work with your removal of some classes (after you have the PR).

@breznak I’m very open to discussing the selection of the right matrix lib. My thinking is that we need a proper benchmark to make the right choice. I’m willing to work on that.

proper benchmark to make the right choice. I’m willing to work on that.


  • bin/connections_performance_tests which show some numbers
  • I think we could do micro benchmarks just for some specific methods as we try to change them
  • a good overall benchmark is the “Hotgym anomaly example”; I don’t know if it’s in Python only, or if there’s a C++ variant.
    • we could either port it, or run from Py with cpp bindings.
    • I think it reflects very well the complex use of HTM (stresses almost all parts)
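For the per-method micro benchmarks mentioned above, a small timing helper might be enough to start with. This is only a sketch using `std::chrono`; the name `bestTimeMs` is made up for illustration and is not part of nupic.core or of `connections_performance_tests`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper: runs fn() `repeats` times and returns the best
// (smallest) wall-clock time in milliseconds. Taking the minimum reduces
// noise from other processes and from cold caches on the first run.
template <typename Fn>
double bestTimeMs(Fn&& fn, std::size_t repeats = 5) {
  using Clock = std::chrono::steady_clock;  // monotonic, safe for intervals
  double best = std::numeric_limits<double>::infinity();
  for (std::size_t r = 0; r < repeats; ++r) {
    const auto start = Clock::now();
    fn();
    const auto stop = Clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    best = std::min(best, ms);
  }
  return best;
}
```

The method under test would be wrapped in a lambda, with its result written to a `volatile` variable or otherwise used, so the compiler cannot optimize the call away. Comparing the number before and after a change gives a quick signal, while the Hotgym run remains the overall check.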

I’m ready with unit tests, so we have free hands to start experimenting!
I plot some plans here

  • would like to start with the “common type for regions I/O”, and vectorization… Your pybind changes don’t touch the individual files in src/nupic/algorithms/ , do they?

Do we know of more people interested in this change who would be willing to contribute? I can prepare/refactor the stuff above, but I don’t have much actual experience with the graphics/math libraries, so learning without guidance would take me longer…

Can you provide a link?

I’m fairly certain that’s only python. But for now python is the main user of nupic.core and maybe the benchmark should take that into account.

Anything Python is limited to the /nupic/python folder. There are some Python remnants in the engine code, but those will hopefully disappear soon.

So, no, no pybind stuff in the algorithms.

Good point,
We should have the releases now

And the bindings
So we could use the Py code that would itself call the c++ bindings - the most real-world tests.

I also have some private code for running the Hotgym in C++, so I’ll clean it up and we can set up pure C++ benchmarks (imho the easier and more foolproof option).

That’s something I’d like to understand in the PR: do you remove the Regions and NAPI “framework”/functionality? Sorry for being dense about your changeset, I just need to understand the direction.

Cool, so other tasks can go in parallel with the Py3 work!

Here are some more details:

  1. To be exact I have not removed “Regions”. The only region I have reimplemented is PyRegion using pybind11 functionality. I have named that class PyBindRegion…

  2. What I have further done is remove nupic’s py_support folder.

  3. Right now the engine is an integral part of nupic.core and supports the inclusion of python regions, like “” for instance. For that I need to use embedded python via pybind11.

  4. Until we refactor the “engine” we’ll have to build nupic.core with pybind11 and therefore python c api headers.

I hope this makes sense.

Thank you. That sounds very sensible and well designed. I was worried you just scratched the Region* classes :+1:

Funny the description for dlib is wrong in awesome-cpp. It should be:

“A toolkit for making real world machine learning and data analysis applications in C++”

Hi all,

I think I know how to make topology in the spatial pooler run a lot faster. I call it poor man’s topology: use many small spatial poolers arranged over a topological area, where each spatial pooler has global inhibition. It’s not a perfect solution, but it should run as fast as the no-topology case.

I have experimented with this method with my own HTM implementation with success on the MNIST dataset. I use a (10 x 10) grid of spatial poolers, each with 100 mini-columns, and an input radius of about 2.8 pixels for a total of 106 potential synapses to each mini-column. It takes 4 minutes to run through the MNIST dataset and it scores ~95%.
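As a rough illustration of the inhibition step in such a scheme (not the actual implementation described in this post), the sketch below treats the columns as consecutive tiles and runs a global top-k selection inside each tile independently. The function name, the flat tile layout, and the tie-breaking rule are all assumptions; computing the overlap scores from each tile's local input patch is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: `overlaps` holds one score per mini-column, laid out
// tile after tile (tile 0's columns first, then tile 1's, and so on).
// Each tile keeps its `activePerTile` best columns, i.e. global inhibition
// applied per small spatial pooler instead of a neighbourhood search.
std::vector<std::size_t> tiledGlobalInhibition(
    const std::vector<float>& overlaps,
    std::size_t columnsPerTile,
    std::size_t activePerTile) {
  std::vector<std::size_t> active;
  if (columnsPerTile == 0) return active;

  const std::size_t k = std::min(activePerTile, columnsPerTile);
  const auto kOffset = static_cast<std::ptrdiff_t>(k);
  std::vector<std::size_t> order(columnsPerTile);

  // Walk over complete tiles only; a trailing partial tile is ignored.
  for (std::size_t begin = 0; begin + columnsPerTile <= overlaps.size();
       begin += columnsPerTile) {
    std::iota(order.begin(), order.end(), begin);  // column ids of this tile
    std::partial_sort(order.begin(), order.begin() + kOffset, order.end(),
                      [&](std::size_t a, std::size_t b) {
                        if (overlaps[a] != overlaps[b]) {
                          return overlaps[a] > overlaps[b];  // higher wins
                        }
                        return a < b;  // assumed tie-break: lower index
                      });
    active.insert(active.end(), order.begin(), order.begin() + kOffset);
  }
  return active;
}
```

With the numbers from the experiment above, `overlaps` would have 100 tiles of 100 entries each. The cost per tile depends only on the tile size, which is why this should scale like the no-topology case.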

Any thoughts? Is this something you’d all want in the community fork?

A score of 95% is comparable to other HTM experiments …

nupic.cpp brought a lot of performance optimizations, partly due to cleaned-up code, and mainly thanks to the Connections optimization by @dmac!
Another great feature is that the Spatial Pooler (SP) now uses the optimized Connections structure (moved from SparseMatrix), which is shared with Temporal Memory and is highly optimized code.

SP is no longer a bottleneck in the HTM pipeline, and further optimizations are planned as stated above (especially to SP’s local inhibition).

Thanks breznak,

I now have a prototype for solving the MNIST dataset:

  • 95% accuracy
  • Uses local inhibition, with input radius of 4 pixels (MNIST uses images of 28x28 pixels)
  • Runs in 75 seconds (on my computer: Intel i7-4790K CPU @ 4.00GHz)
  • It is a prototype that lives in its own branch, not master. Not ready for general consumption yet.