Current Fastest HTM Implementation?

I’m currently writing a highly optimized HTM implementation in Swift, and wanted to make sure I benchmark it against the fastest. Would the C++ nupic.core be the current fastest implementation, or are there community implementations which are faster? Also, any caveats in how to build/invoke nupic.core to ensure it is optimal?

I’ve been running this benchmark to compare them btw: https://github.com/numenta/nupic/blob/8e40e7ad16fd3a04cc2a7d3d12174ccf3fa44daa/scripts/temporal_memory_performance_benchmark.py


Hi @PatrickPijnappel, yes, nupic.core C++ is your best bet. The benchmark will be helpful, thanks!


Hi!

nupic.cpp is your go-to C++ implementation :wink:
As far as I know, it is the fastest full-featured, API-compatible, ready-to-use, and maintained HTM implementation.

See our meta issue, and the more detailed issues, for the optimizations we’ve identified and/or done.
The performance difference between nupic.core and nupic.cpp is very significant!

We could probably join forces on this. It’d be nice if we could compile some sort of overall benchmark (a sketch of a possible common harness follows the list) covering:
  • numenta nupic.core
  • community nupic.cpp
  • specialized community forks (Swift, Java, Torch, …)
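As a concrete starting point, each port could expose one tiny adapter so a single harness can drive all of them on identical pre-generated input. This is purely a sketch; none of these types exist in any of the projects listed above:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical adapter for a cross-implementation TM benchmark.
// None of the projects above define these types; this only sketches how
// each port could be wrapped so a single harness can drive all of them.
struct TemporalMemoryAdapter {
    virtual ~TemporalMemoryAdapter() = default;
    // One timestep: indices of the active minicolumns, plus a learning flag.
    virtual void compute(const std::vector<uint32_t>& activeColumns, bool learn) = 0;
};

// Example: a do-nothing adapter standing in for a real wrapper around
// nupic.core, nupic.cpp, or one of the forks.
struct NullAdapter : TemporalMemoryAdapter {
    void compute(const std::vector<uint32_t>&, bool) override {}
};

int main() {
    NullAdapter tm;
    tm.compute({1, 7, 42}, true);  // the harness would replay recorded SDRs here
    return 0;
}
```

Replaying the same recorded input sequence through every adapter would keep encoder and RNG differences from skewing the comparison.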

I’m not sure that benchmark is valid; you’re better off using our real-life benchmarks, which we run in CI as well.

  • src/examples/mnist // SP + classifier
  • src/examples/hotgym // “full chain”: encoder -> SP -> TM -> anomaly

and micro benchmarks

  • in tests: class ConnectionsPerformanceTest

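For a quick ad-hoc number outside the test suite, the shape of such a micro-benchmark is simple: time a loop of TM compute() calls over random SDRs. Below is a minimal sketch assuming a nupic.cpp/htm.core-style C++ API; the header paths, the SDR::randomize() call, and the TemporalMemory constructor are from memory and may differ in your checkout:

```cpp
// Minimal TM timing loop, assuming a nupic.cpp/htm.core-style API.
// Exact headers and signatures may differ between versions; check your tree.
#include <htm/algorithms/TemporalMemory.hpp>
#include <htm/types/Sdr.hpp>

#include <chrono>
#include <iostream>

int main() {
    using namespace htm;

    TemporalMemory tm(/*columnDimensions=*/{2048});
    SDR activeColumns({2048});

    const int steps = 1000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < steps; ++i) {
        activeColumns.randomize(0.02f);          // ~2% of columns active
        tm.compute(activeColumns, /*learn=*/true);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << (seconds / steps) * 1e3 << " ms per compute() call\n";
    return 0;
}
```

Pre-generating the inputs outside the timed loop would give a cleaner number (the RNG cost is included here); this version just keeps the sketch short.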

You want to build in Release, but most common build setups do that by default (with CMake, `cmake -DCMAKE_BUILD_TYPE=Release`). You might consider adding some compiler flags (`-O3 -march=native`), but other than that I think you should stick with the common setup.
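If you want to double-check that the binary you’re timing really is an optimized build, a small compile-time probe can confirm it. This sketch relies only on NDEBUG (defined by CMake’s Release configuration by default) and __OPTIMIZE__ (a GCC/Clang predefined macro; MSVC works differently):

```cpp
// Sanity check that a benchmark binary was built with optimizations.
// NDEBUG comes from CMake Release builds; __OPTIMIZE__ from GCC/Clang at -O1+.
#include <iostream>

int main() {
#ifdef NDEBUG
    std::cout << "NDEBUG set: asserts compiled out (Release-style build)\n";
#else
    std::cout << "NDEBUG not set: this looks like a Debug build\n";
#endif
#ifdef __OPTIMIZE__
    std::cout << "__OPTIMIZE__ set: compiled with -O1 or higher\n";
#else
    std::cout << "no optimization: benchmark numbers will be misleading\n";
#endif
    return 0;
}
```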

There’s also a bunch of specialized HTM implementations that would be much faster than any general C++ implementation for a SPECIFIC task or set of conditions.
Maybe there’s a thread for these, but off the top of my head:

I’m working on an implementation in OpenCL and FPGA. I think Numenta is working on one in Torch.

@breznak Thanks for the in-depth answer! Cloned & built it, but I’ll have to wait until the weekend to look at it in more depth.

@marty1885 Is there any working GPU-based implementation at the moment?

No… yes, but both are currently private work and not publicly available. @jacobeverist has one, and mine still needs more work. I’m now working on adding OpenCL<->CPU interop and making room for an FPGA backend. I’ll release and open-source mine in the next few months.

And to reply to the topic: I think my tiny-htm is the fastest right now. It is a minimal HTM implementation designed to be fast, and it is 14x faster than NuPIC.cpp (single-threaded). But tiny-htm is not fully compliant with the standard TM algorithm, has limited features, and is potentially buggy.

(image: tiny-htm vs. NuPIC.cpp benchmark chart)


htm.cuda is a GPGPU SpatialPooler

@marty1885

How far along are you with OpenCL and FPGA? I think it’s been a couple months since our last post.

We have some people looking at the OpenCL->FPGA compilation problem right now. I believe we’re starting with the FPGA dev environment provided by Amazon Web Services.


Long time!
We have two different codebases for OpenCL and FPGA, as we are now working towards building a Verilog HTM core for embedded systems. Though an OpenCL-based FPGA design is still on the roadmap for high-performance situations.

Yeah, I agree that OpenCL on FPGA is kind of difficult to work with; I’m sure you feel the pain.
Try not to synthesize the bitstream every time your code changes. Use the simulator; it produces accurate performance numbers in a relatively short time. (I’m not sure if this is doable on AWS, though.)

An update on this thread: I have released my GPU-accelerated HTM framework, which is a few times faster than tiny-htm. See: https://discourse.numenta.org/t/releasing-etaler-a-very-fast-and-flexable-htm-framework-with-full-gpu-support/


Damn, 2-3 µs for the spatial pooler :open_mouth: And I thought shrinking my time down from 1 second to 200-300 µs for my Hex Grid pooler through some janky parallelisation was good lol :rofl:

Can’t wait to see the Python wrapper when it’s ready.

Momiji

~65 µs for my FPGA implementation. The system clock is 100 MHz.
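(For scale: at a 100 MHz clock, ~65 µs works out to 65 × 10⁻⁶ s × 100 × 10⁶ cycles/s ≈ 6,500 clock cycles per step.)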