Here's a performance test of the basic operation of counting the overlap of two SDRs:
import numba
from htm.bindings.sdr import SDR
import numpy as np
# This is a "naive" python implementation which "compiles" well in numba
def py_overlap(n1, n2):
    out = 0
    i1, i2 = 0, 0
    while i1 < n1.size and i2 < n2.size:
        v1, v2 = n1[i1], n2[i2]
        if v1 == v2:
            out += 1
        elif v1 > v2:
            i2 += 1
            continue
        i1 += 1
    return out
# This is the first step of compilation; the next is to call it once so numba can infer its input types
overlap = numba.njit(py_overlap)
sparse_random = lambda size, nbits: np.random.permutation(size).astype(np.uint16)[:nbits]
# Generate two random SDRs with 100/2000 sparsity
a = sparse_random(2000,100)
b = sparse_random(2000,100)
a.sort(), b.sort() # numba version needs them sorted
sdr_a = SDR(2000)
sdr_a.sparse = a
sdr_b = SDR(2000)
sdr_b.sparse = b
# Check they give the same result
print(
    sdr_a.getOverlap(sdr_b) == overlap(a, b)
)
Sorry for the copy&paste; I found no way to attach .py files.
Assuming the above is saved as sdr_overlap_test.py:
$ ipython -i sdr_overlap_test.py
...
True
In [1]: %timeit overlap(a,b)
978 ns ± 9.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [2]: %timeit sdr_a.getOverlap(sdr_b)
3.81 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I wonder how other implementations perform.
E.g. brainblocks, I think, uses a low-level C bit array and popcount()?
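I don't know brainblocks' internals, but for intuition, here is a sketch of the AND-plus-popcount idea with Python's arbitrary-precision ints standing in for a C bit array (int.bit_count() needs Python 3.10+; to_mask and popcount_overlap are my own hypothetical names):

# Sketch only: Python ints standing in for a C bit array.
def to_mask(indices):
    # Set bit i of an integer for every ON position i.
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask

def popcount_overlap(mask1, mask2):
    # AND the two bit masks, then count the surviving ON bits.
    return (mask1 & mask2).bit_count()  # popcount; Python 3.10+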
Of course, the overlap calculation as performed in HTM-scheme is embedded in a larger piece of code (maybe like getOverlap? which implementation is that?)
Numba being as fast as, or even faster than, compiled Python libraries isn't that uncommon; one interesting case is pynndescent, which is one of the fastest nearest-neighbor search libraries for Python.
It’s comparing the dense representations of the SDRs, instead of using the sparse index lists.
It could be rewritten to be faster.
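For illustration (my sketch, not the actual htm.core code), a dense-representation overlap in numpy is essentially an element-wise AND followed by a count; reusing a and b from the test script above:

import numpy as np

# Sketch of a dense overlap: AND the two dense 0/1 vectors, count ON bits.
def dense_overlap(d1, d2):
    return int(np.count_nonzero(np.logical_and(d1, d2)))

# Build dense vectors from the sparse index lists of the earlier test:
dense_a = np.zeros(2000, dtype=np.uint8)
dense_a[a] = 1
dense_b = np.zeros(2000, dtype=np.uint8)
dense_b[b] = 1
assert dense_overlap(dense_a, dense_b) == overlap(a, b)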
Hi cezar,
I rewrote the C++ code using your example, and now the two implementations perform almost exactly the same.
Numba: 6.137309419980738
C++: 6.247461127000861
Percent difference: C++ is 1.8% slower than numba.
The difference in execution speed is probably because the C++ version has some book-keeping logic and assert statements to verify that the SDRs are the same size.
Update: I made a mistake. I improved the C++ implementation without also improving the Python version. After fixing this issue, the performance gap is even larger.
The improved Python code is:
def py_overlap(n1, n2):
    out = 0
    i1, i2 = 0, 0
    while i1 < n1.size and i2 < n2.size:
        v1, v2 = n1[i1], n2[i2]
        if v1 == v2:
            out += 1
            i1 += 1
            # I added the following line, which merges the next
            # loop iteration into this one.
            i2 += 1
        elif v1 > v2:
            i2 += 1
        else:
            i1 += 1
    return out
Numba: 5.447209668986034
C++: 5.751909351005452
Percent difference: C++ is 5.6% slower than numba.
Just matching the ON bits (which is what the numba version does) doesn't break anything if the SDRs are different sizes, which might be good or bad depending on the context.
Actually I like that, but I'm trying to sound neutral.
PS: yes, thanks, I noticed too late that i2 += 1 would indeed add a slight improvement.
IIRC: the reason that I originally wrote the htm-core code to compare dense representations is because at the time the sparse data was not guaranteed to be sorted. Later on in development we (I credit breznak for this suggestion) changed SDR::getSparse() to always be sorted (or else it’s an error, but it’s only checked in debug mode).
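(For what it's worth, that sortedness invariant is cheap to verify in numpy; a sketch of a debug-only check, with check_sorted being my own name:)

# Sketch of a debug-only invariant check: sparse indices must be
# strictly increasing, i.e. sorted with no duplicates.
def check_sorted(sparse):
    assert (sparse[1:] > sparse[:-1]).all(), "sparse indices must be sorted"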
IMO: Checking that the SDRs are the same size is a good thing. Getting the overlap between two SDRs of different sizes is almost certainly an error on the user's part. SDRs will only have an overlap if they come from similar or the same sources.
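In the numba version, one could mimic that guard by carrying the SDR sizes alongside the index lists, e.g. (a sketch, not part of the original test; overlap_checked is my own name):

def overlap_checked(n1, n2, size1, size2):
    # Mirrors the C++ book-keeping: overlap between SDRs of different
    # sizes is almost certainly a user error, so fail loudly.
    assert size1 == size2, "SDRs must be the same size"
    return overlap(n1, n2)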
I think you're trying to run it on the "dense" representation of an SDR.
e.g.
dense: 0010010110001
sparse: 2,5,7,8,12
Sparse is an ordered list of the positions that are set "ON".
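A quick numpy sketch of the two views and the conversion between them:

import numpy as np

# Sparse view: ordered positions of the ON bits.
sparse = np.array([2, 5, 7, 8, 12])

# Dense view: one 0/1 flag per bit position.
dense = np.zeros(13, dtype=np.uint8)
dense[sparse] = 1  # dense is now 0010010110001

# Converting back: the nonzero positions recover the sparse list.
assert (np.flatnonzero(dense) == sparse).all()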
@rogert,
I think that to do a fair comparison you would need to also run the numba performance test, in order to re-establish the baseline. Everyone has different computer hardware and software, so results from my computer are not directly comparable with results from yours. I'm running an old Intel i7 from 2014, btw.
I always think that if you don't understand the logic in depth, it becomes a black box you have blind faith in, and that's how you end up with critical errors later on, when it's used for completely the wrong purpose (dense vs. sparse, for example).
This is a bad omen for C++. C++ had two things going for it that no other programming language did:
High level abstractions
Performance
The most important "abstraction" is of course a resizable list. C does not have one, and so every large C program starts by defining one, usually badly. In C's defense, it was a third draft from the 1970s. C++ added classes and the standard template library on top of C, giving it a winning combination of usability and speed.
For a long time no other language could compete with C++ in this space, but recently two new contenders have emerged: Rust and Numba. Rust is better than C++ and it has reached a “critical mass” of users. But there is room for more than one programming language in the world and so Rust does not necessarily spell doom for C++. However, numba makes comedy out of C++.
Just to be clear, numba isn't a language; it can only accelerate Python here and there. It is by no means a general Python accelerator, and it has its own quirks and setbacks.
It might not handle lots of other Python code well.
What it seems to be good at is numeric vectors: like a low-level toolkit for writing your own array-processing functions that aren't already in numpy/scipy or other data-processing libs.
One interesting contender here is Python's external bitarray library.
As its name implies, it implements arrays of bits, but unlike Python or numpy arrays of bools it actually uses a bit representation internally as a "dense" encoding. That saves space and speeds up certain operations, as seen in the following example.
Not only is it faster than the above numba code, but its body is extremely compact and clear, as Python should be:
from bitarray import bitarray
import numpy as np
SDR_SIZE = 2048
a = bitarray(list(np.random.randint(0,2,SDR_SIZE)))
b = bitarray(list(np.random.randint(0,2,SDR_SIZE)))
def overlap(x,y): return (x&y).count()
#
# in ipython:
In [78]: overlap(a,b)
Out[78]: 464
In [79]: %timeit overlap(a,b)
709 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [80]: %timeit (a&b).count()
585 ns ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
About 100 ns are lost by wrapping it in a function; a lambda version doesn't improve much.
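To put the bitarray version on the same footing as the earlier 100/2000 test, here is a sketch (using only documented bitarray calls; sparse_to_bitarray is my own helper name) that builds a bitarray from a sorted sparse index list:

from bitarray import bitarray

# Sketch: build a dense bitarray from a sparse index list, so the
# bitarray overlap can be timed on the same SDRs as the numba test.
def sparse_to_bitarray(indices, size):
    bits = bitarray(size)  # allocates 'size' uninitialized bits
    bits.setall(0)
    for i in indices:
        bits[i] = 1
    return bits

bits_a = sparse_to_bitarray(a, 2000)  # 'a' and 'b' from the first script
bits_b = sparse_to_bitarray(b, 2000)
assert (bits_a & bits_b).count() == overlap(a, b)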
ISTM that the Scheme overlap code shows one extreme of representations for the SDR abstraction and related code, and how using language-native features can benefit. Chez Scheme generates 18 x86-64 instructions for the 9 lines of Scheme forming the body of the loop. With the bit-array representation the Scheme is one line, (bitwise-bit-count (bitwise-and n1 n2)), but the object code to call those library routines and their run time are both about 50% higher.
I ran the same test in Julia and it got ~50% higher speed than the numba version.
Julia is a very interesting beast: at first sight it looks like a modern mashup of Python (both its conciseness and its abilities), JIT compilation, native numpy-style array operations, and an interactive REPL with lots of IPython-like features; Jupyter notebooks are supported, but it also has its own, more interactive notebook.
And it supports bit arrays that work the same as normal arrays!
Julia is the language I want to love but just can’t commit to yet, despite positive experience playing with it. Would love to see your implementation, if you’ve got that committed somewhere.