Current Fastest HTM Implementation?

I’m currently writing a highly optimized HTM implementation in Swift, and wanted to make sure I benchmark it against the fastest. Would the C++ nupic.core be the current fastest implementation, or are there community implementations which are faster? Also, any caveats in how to build/invoke nupic.core to ensure it is optimal?

I’ve been running this benchmark to compare them btw: https://github.com/numenta/nupic/blob/8e40e7ad16fd3a04cc2a7d3d12174ccf3fa44daa/scripts/temporal_memory_performance_benchmark.py


Hi @PatrickPijnappel, yes, nupic.core C++ is your best bet. The benchmark will be helpful, thanks!


Hi!

nupic.cpp is your go-to C++ implementation :wink:
As far as I know, it is the fastest full-featured, API-compatible, ready-to-use, and maintained HTM implementation.

See our meta issue, and the more detailed issues, for the optimizations we’ve identified and/or done.
The performance difference between nupic.core and nupic.cpp is very significant!

We could probably join forces on this. It’d be nice if we could compile some sort of overall benchmark (a sketch of a possible common harness follows the list) covering:
  • numenta nupic.core
  • community nupic.cpp
  • specialized community forks (Swift, Java, Torch, …)
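As a concrete starting point, each port could expose one tiny adapter so a single harness can drive all of them on identical pre-generated input. This is purely a sketch; none of these types exist in any of the projects listed above:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical adapter for a cross-implementation TM benchmark.
// None of the projects above define these types; this only sketches how
// each port could be wrapped so a single harness can drive all of them.
struct TemporalMemoryAdapter {
    virtual ~TemporalMemoryAdapter() = default;
    // One timestep: indices of the active minicolumns, plus a learning flag.
    virtual void compute(const std::vector<uint32_t>& activeColumns, bool learn) = 0;
};

// Example: a do-nothing adapter standing in for a real wrapper around
// nupic.core, nupic.cpp, or one of the forks.
struct NullAdapter : TemporalMemoryAdapter {
    void compute(const std::vector<uint32_t>&, bool) override {}
};

int main() {
    NullAdapter tm;
    tm.compute({1, 7, 42}, true);  // the harness would replay recorded SDRs here
    return 0;
}
```

Replaying the same recorded input sequence through every adapter would keep encoder and RNG differences from skewing the comparison.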

I’m not sure that benchmark is valid; you’re better off using our real-life benchmarks, which we run in CI as well.

  • src/examples/mnist // SP + classifier
  • src/examples/hotgym // “full chain”: encoder -> SP -> TM -> anomaly

and micro benchmarks

  • in tests: class ConnectionsPerformanceTest

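For a quick ad-hoc number outside the test suite, the shape of such a micro-benchmark is simple: time a loop of TM compute() calls over random SDRs. Below is a minimal sketch assuming a nupic.cpp/htm.core-style C++ API; the header paths, the SDR::randomize() call, and the TemporalMemory constructor are from memory and may differ in your checkout:

```cpp
// Minimal TM timing loop, assuming a nupic.cpp/htm.core-style API.
// Exact headers and signatures may differ between versions; check your tree.
#include <htm/algorithms/TemporalMemory.hpp>
#include <htm/types/Sdr.hpp>

#include <chrono>
#include <iostream>

int main() {
    using namespace htm;

    TemporalMemory tm(/*columnDimensions=*/{2048});
    SDR activeColumns({2048});

    const int steps = 1000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < steps; ++i) {
        activeColumns.randomize(0.02f);          // ~2% of columns active
        tm.compute(activeColumns, /*learn=*/true);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << (seconds / steps) * 1e3 << " ms per compute() call\n";
    return 0;
}
```

Pre-generating the inputs outside the timed loop would give a cleaner number (the RNG cost is included here); this version just keeps the sketch short.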

You want to build in Release, but most common build setups do that by default (with CMake, `cmake -DCMAKE_BUILD_TYPE=Release`). You might consider adding some compiler flags (`-O3 -march=native`), but other than that I think you should stick with the common setup.
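If you want to double-check that the binary you’re timing really is an optimized build, a small compile-time probe can confirm it. This sketch relies only on NDEBUG (defined by CMake’s Release configuration by default) and __OPTIMIZE__ (a GCC/Clang predefined macro; MSVC works differently):

```cpp
// Sanity check that a benchmark binary was built with optimizations.
// NDEBUG comes from CMake Release builds; __OPTIMIZE__ from GCC/Clang at -O1+.
#include <iostream>

int main() {
#ifdef NDEBUG
    std::cout << "NDEBUG set: asserts compiled out (Release-style build)\n";
#else
    std::cout << "NDEBUG not set: this looks like a Debug build\n";
#endif
#ifdef __OPTIMIZE__
    std::cout << "__OPTIMIZE__ set: compiled with -O1 or higher\n";
#else
    std::cout << "no optimization: benchmark numbers will be misleading\n";
#endif
    return 0;
}
```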

There’s also a bunch of specialized HTM implementations that would be much faster than any general C++ implementation for a SPECIFIC task or set of conditions.
Maybe there’s a thread for these, but off the top of my head:

I’m working on an implementation in OpenCL and FPGA. I think Numenta is working on one in Torch.

@breznak Thanks for the in-depth answer! Cloned & built it, but I’ll have to wait until the weekend to look at it in more depth.

@marty1885 Is there any working GPU-based implementation at the moment?

No… yes, but both are currently private work and not publicly available. @jacobeverist has one, and mine still needs more work. I’m now working on adding OpenCL<->CPU interop and making room for an FPGA backend. I’ll release and open-source mine in the next few months.

And to reply to the topic: I think my tiny-htm is the fastest right now. It is a minimal HTM implementation designed to be fast, and it is 14x faster than NuPIC.cpp (single-threaded). But tiny-htm is not fully compliant with the standard TM algorithm, has limited features, and is potentially buggy.

(image: tiny-htm vs. NuPIC.cpp benchmark chart)


htm.cuda is a GPGPU SpatialPooler

@marty1885

How far along are you with OpenCL and FPGA? I think it’s been a couple months since our last post.

We have some people looking at the OpenCL->FPGA compilation problem right now. I believe we’re starting with the FPGA dev environment provided by Amazon Web Services.


Long time!
We have two different codebases for OpenCL and FPGA, as we are now working towards building a Verilog HTM core for embedded systems. Though an OpenCL-based FPGA design is still on the roadmap for high-performance situations.

Yeah, I agree that OpenCL on FPGA is kind of difficult to work with; I’m sure you feel the pain.
Try not to synthesize the bitstream every time your code changes. Use the simulator; it produces accurate performance numbers in a relatively short time. (I’m not sure if this is doable on AWS, though.)

An update on this thread: I have released my GPU-accelerated HTM framework, which is a few times faster than tiny-htm. See: https://discourse.numenta.org/t/releasing-etaler-a-very-fast-and-flexable-htm-framework-with-full-gpu-support/


Damn, 2-3 µs for the spatial pooler :open_mouth: And I thought shrinking my time down from 1 second to 200-300 µs for my Hex Grid pooler through some janky parallelisation was good lol :rofl:

Can’t wait to see the Python wrapper when it’s ready.

Momiji

~65 µs for my FPGA implementation. The system clock is 100 MHz.
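(For scale: at a 100 MHz clock, ~65 µs works out to 65 × 10⁻⁶ s × 100 × 10⁶ cycles/s ≈ 6,500 clock cycles per step.)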