SDR overlap speed tests numba vs

Here-s a performance test on the basic counting overlaps of two SDRs

import numba
from htm.bindings.sdr import SDR
import numpy as np

# This is a "naive" python implementation which "compiles" well in numba
def py_overlap(n1,n2):
    out = 0
    i1, i2 = 0, 0
    while i1 < n1.size and i2 < n2.size:
        v1, v2 = n1[i1], n2[i2]
        if v1 == v2:
            out += 1
        elif v1 > v2:
            i2 += 1
            continue
        i1 += 1
    return out

# This is first step of compilation, next one is to call it once so numba gets its input types
overlap = numba.njit(py_overlap)


sparse_random = lambda size, nbits: np.random.permutation(size).astype(np.uint16)[:nbits]

# Generate two random SDRs with 100/2000 sparsity
a = sparse_random(2000,100)
b = sparse_random(2000,100)
a.sort(), b.sort()  # numba version needs them sorted

sdr_a = SDR(2000)
sdr_a.sparse = a
sdr_b = SDR(2000)
sdr_b.sparse = b 

# Check they give same result
print(
     sdr_a.getOverlap(sdr_b) == overlap(a,b)
) 

Sorry for copy&paste, I found no means to attach .py files.
Assuming the above is saved in sdr_overlap_test.py :

$ ipython -i sdr_overlap_test.py
...
True

In [1]: %timeit overlap(a,b) 
978 ns ± 9.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In[2]: %timeit sdr_a.getOverlap(sdr_b)
3.81 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I wonder how other implementations perform.
E.G. brainblocks I think uses a low level C bitarray and popcount() ?

(I’m not a fan of micro benchmarks, but…) I translated py_overlap into Chez Scheme (as used for implementation of HTM-scheme) as follows:

(let* (
    [sparse-random (lambda (size nbits)
                     (list-sample (iota size) nbits))]
    [a (sparse-random 2000 100)]
    [b (sparse-random 2000 100)])
  (define (overlap n1 n2)
    (let loop ([n1 n1] [n2 n2] [out 0])
      (cond
        [(or (null? n1) (null? n2)) out ]
        [(fx=? (car n1) (car n2))
          (loop (cdr n1) (cdr n2) (fx1+ out)) ]
        [(fx<? (car n1) (car n2))
          (loop (cdr n1) n2 out) ]
        [else
          (loop n1 (cdr n2) out) ])))
  (time
    (do ([i 1000000 (fx1- i)]) ((fxzero? i))
      (set! ov (overlap a b))))

Time is about 317ns per loop.

Of course, overlap calculation as performed in HTM-scheme is embedded in a larger piece of code (maybe like getOverlap? - which implementation is that?)

1 Like

Cool, that’s much faster. I assume Chez Scheme itself is compiled?

The htm.core I imported should be the one from GitHub - htm-community/htm.core: Actively developed Hierarchical Temporal Memory (HTM) community fork (continuation) of NuPIC. Implementation for C++ and Python

numba being as fast and even faster than compiled python libraries isn’t that uncommon, one interesting case is pynndescent which is one of the fastest near neighbors search libraries for python.

htm core source code

It’s comparing the dense representations of the SDRs, instead of using the sparse index lists.
It could be rewritten to be faster.


Hi cezar,

I rewrote the C++ code using your example, and now the two implementations perform almost exactly the same.

Numba
6.137309419980738

C++
6.247461127000861

Percent difference: C++ is 1.8% slower than numba.

The difference in execution speed is probably because the C++ version has some book-keeping logic and assert statements to verify that the SDRs are the same size.


Update: I made a mistake. I improved the C++ implementation without also improving the python version. After fixing this issue the performance gap is even larger.

The improved python code is:

def py_overlap(n1,n2):
    out = 0
    i1, i2 = 0, 0
    while i1 < n1.size and i2 < n2.size:
        v1, v2 = n1[i1], n2[i2]
        if v1 == v2:
            out += 1
            i1 += 1
            # I added the following line, which merges the next
            # loop iteration into this one.
            i2 += 1
        elif v1 > v2:
            i2 += 1
        else:
            i1 += 1
    return out

Numba: 5.447209668986034

C++: 5.751909351005452

Percent difference: C++ is 5.6% slower than numba.

Ha, that’s great.

Just matching the ON bits (what numba version does) doesn’t break anything if SDRs are different sizes, which might be good or bad, depending on the context.

Actually I like that but trying to sound neutral.

PS. yes, thanks, I noticed too late that i2 +=1 would indeed add a slight improvement.

IIRC: the reason that I originally wrote the htm-core code to compare dense representations is because at the time the sparse data was not guaranteed to be sorted. Later on in development we (I credit breznak for this suggestion) changed SDR::getSparse() to always be sorted (or else it’s an error, but it’s only checked in debug mode).

IMO: Checking that the SDR’s are the same size is a good thing. Getting the overlap between two SDRs of different sizes is almost certainly an error on the users part. SDRs will only have an overlap if they come from similar or the same sources.

Hi @dmac, what are the units here? (5.44 xs/y, what are x and y?)

Is the following the intended output ?

v1 = 11111111 v2 = 01010101 out = 4
v1 = 10111111 v2 = 01010101 out =5
v1 = 11110111 v2 = 01010101 out =4 (position of 0)

v1 = 00101010 v2 = 11110001 out = 3
v1 = 01101010 v2 = 11110001 out = 5 (1 bit change)

v 1= 01101010 v2 = 11110XXX out =5 (value of X has no influence on output)
v 1= 01101010 v2 = 111101XX out =4 (value of X has no influence on output)

The units are seconds, and it ran for 10 million iterations.

Here is the code snippet I used:

from timeit import timeit
timeit(lambda: overlap(a,b) , number=int(1e7))
timeit(lambda: sdr_a.getOverlap(sdr_b), number=int(1e7))

Cezar’s function accepts its inputs in a different format than you are showing.
It accepts a list of indexes of the non-zero elements in the array.

The inputs to the function should be:
n1 = [0 2 3 4 5 6 7]
n2 = [1 3 5 7]

Which returns an overlap of 3.


The purpose of the SDR class in htm-core is to save you from having to worry about this type of thing. :slight_smile:

$ scheme --program overlap.wp
(time (do ((...)) ...))
    no collections
    2.769824135s elapsed cpu time
    2.770227000s elapsed real time
    0 bytes allocated

So Chez Scheme could be ~twice speed of Numba/C++ (modulo machine clock)?
(This was fastest of 7 runs, slowest was 2.94) I tweaked code slightly:

(let* (
    [_ (random-seed (+ 1 (modulo (time-second (current-time)) 1000)))]
    [sparse-random (lambda (size nbits)
                     (u32-sample (iota size) nbits))]
    [a (sort fx<? (sparse-random 2000 100))]
    [b (sort fx<? (sparse-random 2000 100))])
  (define (overlap n1 n2)
    (let loop ([n1 n1] [n2 n2] [out 0])
      (if (or (null? n1) (null? n2))  out
        (let ([v1 (car n1)] [v2 (car n2)])
          (cond
            [(fx=? v1 v2)
              (loop (cdr n1) (cdr n2) (fx1+ out)) ]
            [(fx<? v1 v2)
              (loop (cdr n1) n2 out) ]
            [else
              (loop n1 (cdr n2) out) ])))))
  (time
    (do ([i 10000000 (fx1- i)]) ((fxzero? i))
      (overlap a b)))

I think you try to run it on the “dense” representation of a sdr.
e.g.
dense: 0010010110001
sparse: 2,5,7,8,12
Sparse is an ordered lists of positions set “ON”

1 Like

@rogert,
I think to do a fair comparison you would need to also run the numba performance test as well, in order to re-establish the baseline. Everyone has different computer hardware and software so results from my computer are not directly comparable with results from yours. I’m running an old intel i7 from 2014 btw.

I always think if you don’t understand the logic in depth then it becomes a black box you have blind faith in and that’s where you end up with critical errors being made later on when it’s used for completely the wrong purpose. dense<>sparse.

This is a bad omen for C++. C++ had two things going for it, that no other programming language did:

  1. High level abstractions
  2. Performance

The most important “abstraction” is of course a resizable list. C does not have one, and so every large C program starts by defining one - usually badly. In C’s defense it was a third draft from the 1970’s. C++ added classes and the standard template library on top of C, giving it a winning combination of usability and speed.

For a long time no other language could compete with C++ in this space, but recently two new contenders have emerged: Rust and Numba. Rust is better than C++ and it has reached a “critical mass” of users. But there is room for more than one programming language in the world and so Rust does not necessarily spell doom for C++. However, numba makes comedy out of C++.

A sign of the times?

Just to be clear numba isn’t a language, only can accelerate python here and there. It isn’t by far a general python accelerator, It has its own quirks setbacks.
It might not handle well lots of other python code.
What it seems to be good at is at number vectors. Like a low-level toolkit to make your own array processing function that isn’t already in numpy/scipy or other data processing libs.

1 Like

One interesting contender here is python’s bitarray external library.
As its name implies it implements arrays of bits but unlike python or numpy’s arrays of bools it actually uses bit representation internally as a “dense” encoding. That saves both space and speeds up certain operations as seen in the following example.

Not only that is faster than the above numba code, but its body is extremely compact and clear, as python should be:

from bitarray import bitarray
import numpy as np

SDR_SIZE = 2048

a = bitarray(list(np.random.randint(0,2,SDR_SIZE)))
b = bitarray(list(np.random.randint(0,2,SDR_SIZE)))

def overlap(x,y): return (x&y).count()

#
# in ipython: 

In [78]: overlap(a,b)
Out[78]: 464

In [79]: %timeit overlap(a,b)
709 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [80]: %timeit (a&b).count()
585 ns ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

100 ns lost by wrapping it in a function. A lambda version doesnt improve much.

1 Like

ISTM that the Scheme overlap code shows one extreme of representations for the SDR abstraction and related code, and how using language-native features can benefit. Chez Scheme generates 18 X86-64 instructions for the 9 lines of Scheme forming the body of the loop. With the bit array representation the Scheme is one line ((bitwise-bit-count (bitwise-and n1 n2))), but the object code to call those library routines and their run time are both about 50% higher.

2 Likes

I ran the same test in Julia and it got ~50% higher speed than numba version.

Julia is a very interesting beast, at first sight it looks like a modern mashup of python (both conciseness and abilities), JIT compilation, native numpy-style array operations, interactive REPL with lots of Ipython-like features, jupyter notebook included but has its own more interactive notebook too.
And support for bit arrays that work the same as normal arrays!

Its %timeit equivalent even charts a histogram! :

julia> @benchmark overlap(a,b)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
Range (min … max): 468.357 ns … 1.997 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 468.474 ns ┊ GC (median): 0.00%
Time (mean ± σ): 486.995 ns ± 91.603 ns ┊ GC (mean ± σ): 0.00% ± 0.00%

█ ▁ ▂ ▁ ▁ ▁
███▇▅███▇█▇▇▇▇▆▆▆▄▆▅▆▅▅▄▅▅▆▅▄▆▄▅▆▆▅▆▅▅█▅▄▅▅▅▅▄▄▃▄▃▄▃▃▁▄▁▁▄▁█ █
468 ns Histogram: log(frequency) by time 849 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

2 Likes

Julia is the language I want to love but just can’t commit to yet, despite positive experience playing with it. Would love to see your implementation, if you’ve got that committed somewhere. :slight_smile: