I’m trying to optimize a bit the proximal and distal overlap computations [i.e. vector intersections] (seems to be a mayor performance hog), with SSE instructions. When using sorted vectors, accordingly [1], vector intersection can be optimized. The idea is to use the PCMPESTRM instruction, which is part of SSE4.2 (and allow to compare at once 8-entry vectors of 16-bits).

Unfortunately, after implement it, my pair of currently available processors and compilers shows little to no benefit. My guess is that PCMPESTRM implementation is not that good or that the compiler really does a good job with a conventional scalar approach [2]. Other more advanced approaches such as [3] are also useless.

I wonder if anyone around has some experience with code vectorization in NUPIC?

Maybe the union find algorithm on all the elements of the first set to make those elements into one connected component. Then check all the elements of the second set to see if they are in the connected component. I guess there would a problem with the comparison of the root of the connected component and a corresponding element in the second set. I presume there would be a way around that if such a scheme would do at all. Or would just hash tables be okay? Like a 50% loaded Robin Hood hash table. Or (again) like are you saying the cardinally of the sets is at most 2^16 in which case write in a 65536 byte array, read (check if present) in the array would maybe fit in the L1 cache of a higher end CPU. Maybe I misunderstand the problem?

Well, you can keep a list of current activations (as a sorted vector of id to the origin cell of the activations) and a sorted vector of connected synapses (indexing by the id of the origin cell of the synapses) for the segment. To compute the overlap you have just to intersect both vectors.

You are not restricted to 2^16 ids. You can partition the vector in 2-level hierarchy to use the PCMPESTRM. in the second one. In my case, Im using a hierarchical structure using the MSB of in the first level and the 16 LSB of the id in the second level. [1] uses a more “convoluted” approach…

Perhaps compacting somehow the vectors will improve cache efficiency. In any case, I guess that hardware prefetchers are working good (since we are using vectors), therefore there will little impact (i guess).

I might add that deletion and insertion to sorted distal segments would result in O(logn) (vanilla is O(1)) and from my CPU measurements, keeping potential and connected synapses of distal segments sorted is more costly than the actual overlap calculation in the long run if you have for example 100000 distal segments. HTM constantly adds and removes new synapses. At least in my implementation.

Segments with potential synapses that are on hash tables is pretty bad performance wise on insertion and deletion even if they are O(1). Probably because of rehashing due to collisions as the segments are variable size. Mind you this was using C++ std::unordered_map, so your results may vary.

You need to keep only a list of the connected synapses. My experience is that the total number of synapses per epoch that change from connected to non connected (or viceversa) is quite low. You only need to do the sort at the end of the epoch.

In my particular case i keep also all synapses in a sorted vector. I do this way because the saving in memory is quite large (like 2x less memory from std::unordered_map to vector). Since you only insert new synapses where there is a miss predicted activation, which should be infrequent in steady state, the performance loss is compensated by the memory savings.

On the other hand, the overlap with current distal activations, should be done for every segment in every cycle. In N is the number of activations and M the number of connected synapses in the current segment, this is O(N log M), (assuming sorted maps). Using sorted vectors, the complexity of the overlap is O(N+M). For regular sizes of N and M (40<N<100 and 16<M<128) seems like N+M<Nlog(M) and N+M < Mlog(N).

You have the activation vector sorted by source id. You have connected synapses vector sorted by source id. To calculate overlaps, you iterate through activations of N and find the corresponding connected synapse with the same source id among M. Am I wrong? If not, shouldn’t it be O(N log M) at best if you do a binary search (logM)? Can you directly access a connected synapse through only source id? If so, how is that a vector? Surely I am missing something here.

Yep… when you have two sorted vectors, the sorting helps you to infer what comes next. Lets assume that first1 is the pointer to first element in the activations vector and first2 is the connected synapses. Just the plain scalar algorithm is:

while (first1!=last1 && first2!=last2)
{
if (*first1<*first2) ++first1;
else if (*first2<*first1) ++first2;
else {
++overlap; ++first1; ++first2;
}
}

This is O(last1-first1+last2-first2). There is only one “loop”.

This can be improved to ease conditional branches for the branch prediction, vectorized comparisons, ease cache locality, etc… This algorithm is in the hearth of many DB and compression problems. Because of that, there is a substantial work on it.

Huh, that looks too fundamental not to know. Glad I asked.

Since you are specifically interested in cutting down overlap calculation costs, here’s my take on it:

Store the actual connected synapses on the source cells in a contiguous vector; not on the distal segments. Store pointers to them on connected synapses of the distal segments. On overlap computation phase, just access the cells in the activation vector and do the usual overlap increment thing. You are down to O(N) with solid cache locality in terms of synaptic data access. Of course you have to store parent segment information per synapse, but this is a nice trade-off.

That was my implementation, outlined in the thesis Section 5.4.3. So it is tested and fastest strategy I have tried yet, according to the Visual Studio profiler.

I think you are fighting the L1/L2/DRAM caching system and even if you went for a GPU solution such issues would still be the rate limiting step.
You could try a very L1 cache sensitive bit array solution. For example since you say the data is sorted then create the bit array by scanning backward the array for set 1, and forwards for the array for set 2. Since the CPU can potentially issue 2 or more instructions at the same time you might try hinting to the compiler to have 2 dependency chains in motion at once. Eg. by processing array elements pairwise.
Maybe you need one of these to really address the problem: http://www.micronautomata.com/

Actually, hardware prefetchers are really good at vectors. So, I don’t think that memory hierarchy is an issue here. The profiler don’t show a big performance impact for memory.

[3] tries to do that, i.e. avoid hard-to-predict branches to feed the processor pipeline. The truth is that even in toy simple examples the benefit of doing that is marginal.

HTM algorithms seems to be not really GPU friendly. Too sparse light computations and intensive communications. I don’t have FPGA based acceleration experience, but seems like the will have the same problem. Looks like the most viable solution is ASIC