Performance optimizations



A highly desired topic for many of us.
I’ve started the work in

Related discussions:


  • [ ] set baseline benchmarking tests, the more, the better
    • micro benchmarks
    • IDE profiling
  • [ ] refactor code to use a shared, encapsulated class for passing data around, an “SDR type”
    • for now it could be a typedef of UInt*,
    • later wrap a vector and add some methods,
    • even later wrap the opt-Matrix type required by the library, …
  • [ ] identify bottlenecks
  • [ ] compare math library toolkits
    • each library has its own data type (Eigen::Matrix, etc.)
    • converting to/from it would kill the (gained) performance -> “SDR type”
  • [ ] iterative optimizations
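
The staged “SDR type” refactor above could be sketched roughly like this. This is an illustrative mock-up only, assuming hypothetical names (`UInt`, `SDRvector`, `SDR`, `overlap`), not the actual nupic.core API:

```cpp
// Hypothetical sketch of the staged "SDR type" refactor:
// stage 1 is a plain typedef so call sites can migrate incrementally,
// stage 2 wraps a std::vector and adds the few helpers the algorithms need.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

using UInt = std::uint32_t;

// Stage 1: just a typedef.
using SDRvector = std::vector<UInt>;

// Stage 2: a thin wrapper owning the sorted indices of the active bits.
class SDR {
public:
    explicit SDR(SDRvector active = {}) : active_(std::move(active)) {
        std::sort(active_.begin(), active_.end());
    }

    std::size_t size() const { return active_.size(); }

    // Overlap = number of active bits shared with another SDR.
    std::size_t overlap(const SDR& other) const {
        SDRvector common;
        std::set_intersection(active_.begin(), active_.end(),
                              other.active_.begin(), other.active_.end(),
                              std::back_inserter(common));
        return common.size();
    }

    const SDRvector& indices() const { return active_; }

private:
    SDRvector active_; // sorted indices of the "on" bits
};
```

A later stage could swap the internal `std::vector` for whichever matrix library’s sparse type is chosen, without touching call sites again.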


  • what do we want from the library?
  • max speed
  • multi-platform
  • sparse (memory efficient)
  • big user-base, popular
  • low code “intrusiveness”
  • CPU backend (SSE, openMP)
  • nVidia GPU backend (CUDA)
  • AMD GPU backend (openCL)
  • generic GPU backend (both AMD, nV; likely on openCL?)
  • open source
  • clean & lean API (ease of use, high level)
  • bindings/support for other languages (python,…)
  • used in other AI/ML frameworks (TensorFlow, SCIKIT.learn, torch,…)
  • I don’t need no optimizations


Considered toolkits:


Memory reduction/optimization

As an alternative way of optimization, I came up with


I think the vote is missing the number one priority and that is “easy to understand”.


That one is “low code intrusiveness”, i.e. the minimum number of changes needed in the current code to implement the optimizations. (Or it could also be “I don’t need no optimizations”.)


@chhenning do you have a preference for Eigen (vs. Blaze), given that pybind11 has support for Eigen?

Also, I’m going to do some refactoring in encoders, SP, TM (and maybe touch regions) to introduce a common typedef for an “SDRvector”. I should probably coordinate this work with your removal of some classes (after you have the PR).


@breznak I’m very open for discussing the selection of the right matrix lib. My thinking is that we need a proper benchmark to make the right choice. I’m willing to work on that.




  • bin/connections_performance_tests, which shows some numbers
  • I think we could do micro benchmarks just for specific methods as we try to change them
  • a good overall bench is the “Hotgym anomaly example”; I don’t know if it’s Py only, or if there’s a cpp variant.
    • we could either port it, or run from Py with cpp bindings.
    • I think it reflects very well the complex use of HTM (stresses almost all parts)
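
For the per-method micro benchmarks mentioned above, even a minimal `std::chrono` harness would do as a first step (a real setup would likely use something like Google Benchmark). This is a generic sketch; the measured lambda is a stand-in, not an actual nupic.core method:

```cpp
// Minimal micro-benchmark helper: times a callable over N iterations
// and returns the average wall-clock milliseconds per call.
#include <chrono>
#include <numeric>
#include <vector>

template <typename Fn>
double benchmarkMs(Fn&& fn, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) fn();
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / iterations; // average ms per call
}
```

Usage would look like `benchmarkMs([&]{ sp.compute(input, output); }, 100)` for a hypothetical SP `compute` call, run before and after each optimization to track the delta.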

I’m ready with the unit tests, so we have free hands to start experimenting!
I posted some plans here

  • I would like to start with the “common type for regions I/O” and vectorization… Your pybind changes don’t touch the individual files in src/nupic/algorithms/, do they?


Do we know of more people interested in this change who would be willing to contribute? I can prepare/refactor the stuff above, but I don’t have much actual experience with the graphics/math libraries, so learning without guidance would take me longer…


Can you provide a link?

I’m fairly certain that’s Python only. But for now Python is the main user of nupic.core, and maybe the benchmark should take that into account.

Anything Python is limited to the /nupic/python folder. There are some Python remnants in the engine code, but those will hopefully disappear soon.

So, no, no pybind stuff in the algorithms.


Good point,
We should have the releases now

And the bindings
So we could use the Py code, which would itself call the C++ bindings: the most real-world of tests.

I also have some private code for running Hotgym in C++, so I’ll clean it up and we can set up pure C++ benchmarks (imho easier and more foolproof).

That’s something I’d like to understand in the PR: do you remove the Regions and NAPI “framework”/functionality? Sorry for being dense about your changeset, I just need to understand the direction.

Cool, so other tasks can go in parallel with the Py3 work!


Here are some more details:

  1. To be exact, I have not removed “Regions”. The only region I have reimplemented is PyRegion, using pybind11 functionality. I have named that class PyBindRegion…

  2. What I have further done is remove nupic’s py_support folder.

  3. Right now the engine is an integral part of nupic.core and supports the inclusion of Python regions, like “” for instance. For that I need to use embedded Python via pybind11.

  4. Until we refactor the “engine”, we’ll have to build nupic.core with pybind11 and therefore with the Python C API headers.

I hope this makes sense.


Thank you. That sounds very sensible and well designed. I was worried you had just scrapped the Region* classes :+1:



Funny the description for dlib is wrong in awesome-cpp. It should be:

“A toolkit for making real world machine learning and data analysis applications in C++”


Hi all,

I think I know how to make topology in the spatial pooler run a lot faster. I call it “poor man’s topology”: use many small spatial poolers arranged over a topological area, each with global inhibition. It’s not a perfect solution, but it should run as fast as the no-topology case.

I have experimented with this method in my own HTM implementation, with success on the MNIST dataset. I use a 10 × 10 grid of spatial poolers, each with 100 mini-columns and an input radius of about 2.8 pixels, for a total of 106 potential synapses per mini-column. It takes 4 minutes to run through the MNIST dataset and scores ~95%.
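
To make the idea concrete, here is a toy sketch of the per-tile step under this scheme: each small pooler sees only its local input patch and applies global inhibition (top-k selection) over its own columns. The scoring rule and all names (`tileCompute`, binary overlap) are simplified stand-ins, not the real SP implementation:

```cpp
// Toy "poor man's topology" tile: one small spatial pooler with global
// inhibition, operating on its local input patch only. A full system would
// run one such tile per position in the grid (e.g. 10 x 10 tiles).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the indices of the k columns with the highest overlap
// between their potential-synapse mask and the binary input patch.
std::vector<std::size_t> tileCompute(const std::vector<int>& patch,
                                     const std::vector<std::vector<int>>& columns,
                                     std::size_t k) {
    std::vector<std::size_t> order(columns.size());
    std::iota(order.begin(), order.end(), 0);
    auto overlap = [&](std::size_t c) {
        int sum = 0;
        for (std::size_t i = 0; i < patch.size(); ++i)
            sum += patch[i] & columns[c][i]; // count shared active bits
        return sum;
    };
    // Global inhibition within the tile: keep the k best-overlapping columns.
    std::partial_sort(order.begin(), order.begin() + k, order.end(),
                      [&](std::size_t a, std::size_t b) { return overlap(a) > overlap(b); });
    order.resize(k);
    return order;
}
```

Because each tile's inhibition is independent of all the others, the tiles can also run in parallel, which is part of why this should match the speed of the no-topology case.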

Any thoughts? Is this something you’d all want in the community fork?


A score of 95% is comparable to other HTM experiments …