Some experience with AVX512 vectorization

I’d like to share some experience with vectorization of the SP and TP with the AVX512 instruction set. I didn’t want to hijack another discussion (Some experience with SIMD optimization?) because I’m only considering AVX512-BW (AVX-512 Byte and Word Instructions). I know not many of us may have access to a processor with such an instruction set, but life is too short to program in assembly for ‘ancient’ technologies :unamused:. At the end of this post I have some general questions.

Good news and bad news: The bad news is that the speed of the TP and SP algorithms is dominated by gathers, and the only way to get a true performance increase out of gathers is to get rid of them. (I have some ideas on how to do that partially, but that is for another discussion.) Why are gathers snails? Loads with irregular or unpredictable strides cannot be prefetched. Additionally, the memory needed for the synapses generally does not fit in the L3 cache (8 MB, shared between all cores), and the latency of an L3 cache miss is a staggering 42 cycles plus 51 ns on my Skylake-X.

Here is some C++ code taken from my GitHub repo (see below) that calculates the overlap in the SP; as you can see, deep in the loop sits a single and yet deadly gather. (P holds static parameters.)

template <typename P>
void calc_overlap(
    const Layer<P>& layer,
    const Dynamic_Param& param,
    const typename Layer<P>::Active_Sensors& active_sensors,
    //out
    std::vector<int>& overlaps)
{
    for (auto column_i = 0; column_i < P::N_COLUMNS; ++column_i)
    {
        const auto& permanence = layer.sp_pd_synapse_permanence[column_i];
        const auto& synapse_origin = layer.sp_pd_synapse_origin_sensor[column_i];

        int overlap = 0;
        for (auto synapse_i = 0; synapse_i < P::SP_N_PD_SYNAPSES; ++synapse_i)
        {
            if (permanence[synapse_i] > P::SP_PD_PERMANENCE_THRESHOLD)
            {
                const auto origin_sensor = synapse_origin[synapse_i];
                overlap += active_sensors.get(origin_sensor); // gather here!
            }
        }
        overlaps[column_i] = (overlap < P::SP_STIMULUS_THRESHOLD) ? 0 : overlap;
    }
}

Some good news: with the introduction of AVX2, gather instructions were added to the ISA. Yet no need to cheer: these instructions do not make the gather (i.e. the memory access) much faster; they merely get memory content to the right position in a register.
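For illustration, a minimal AVX2 gather sketch (the name and wrapper are mine, not repo code): the intrinsic routes eight 32-bit loads base[indices[0..7]] into one ymm register, but each lane still pays the full cache or memory latency of its own load.

#include <immintrin.h>

//illustrative sketch, not repo code
__m256i gather_epi32_example(const int* base, const __m256i indices)
{
    //scale 4: indices are counted in 32-bit ints
    return _mm256_i32gather_epi32(base, indices, 4);
}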

What I have tried (in a nutshell): a permanence is stored in an 8-bit integer, and 64 consecutive elements are loaded into a zmm register (that is one cache line; you could stream it such that you do not pollute the cache, since you won’t need these values anymore). Compare this register with the connected threshold, which yields a 64-bit mask. Load 16 indices of sensors, use these indices to gather the 16 bits indicating whether the 16 sensors are on or off, and add these bits up. Do this step 4 times, then do the next 64 synapses. Good news: this is 2.4 times faster than the code shown above, but not the 16 times faster one would have hoped for from AVX512.

//forward declaration; definition below
__m512i get_sensors_epi32(__mmask16 mask, __m512i origin_epi32, const void * active_sensors_ptr);

template <typename P>
void calc_overlap_avx512(
    const Layer<P>& layer,
    const Dynamic_Param& param,
    const typename Layer<P>::Active_Sensors& active_sensors,
    //out 
    std::vector<int>& overlaps) //size = P::N_COLUMNS
{
    const __m512i connected_threshold_epi8 = _mm512_set1_epi8(P::SP_PD_PERMANENCE_THRESHOLD);
    auto active_sensors_ptr = active_sensors.data();
    const int n_blocks = htm::tools::n_blocks_64(P::SP_N_PD_SYNAPSES); //number of 64-synapse blocks

    for (auto column_i = 0; column_i < P::N_COLUMNS; ++column_i)
    {
        auto permanence_epi8_ptr = reinterpret_cast<const __m512i *>(layer.sp_pd_synapse_permanence[column_i].data());
        auto origin_epi32_ptr = reinterpret_cast<const __m512i *>(layer.sp_pd_synapse_origin_sensor[column_i].data());

        __m512i overlap_epi32 = _mm512_setzero_epi32();

        for (int block = 0; block < n_blocks; ++block)
        {
            const __m512i permanence_epi8 = _mm512_load_si512(&permanence_epi8_ptr[block]); //load 64 permanence values
            const __mmask64 connected_mask_64 = _mm512_cmpgt_epi8_mask(permanence_epi8, connected_threshold_epi8); //one bit per synapse: connected or not
            for (int i = 0; i < 4; ++i)
            {
                const __mmask16 mask_16 = static_cast<__mmask16>(connected_mask_64 >> (i * 16)); //take the next 16 of the 64 mask bits
                const __m512i origin_epi32 = _mm512_load_si512(&origin_epi32_ptr[(block * 4) + i]); //load 16 sensor indices
                overlap_epi32 = _mm512_add_epi32(overlap_epi32, get_sensors_epi32(mask_16, origin_epi32, active_sensors_ptr)); //the gather happens in here
            }
        }
        const int overlap_int = _mm512_reduce_add_epi32(overlap_epi32);
        overlaps[column_i] = (overlap_int < P::SP_STIMULUS_THRESHOLD) ? 0 : overlap_int;
    }
}

__m512i get_sensors_epi32(
    const __mmask16 mask,
    const __m512i origin_epi32,
    const void * active_sensors_ptr)
{
    const __m512i int_addr = _mm512_srli_epi32(origin_epi32, 5); //sensor_i / 32: which 32-bit word holds the bit
    const __m512i pos_in_int = _mm512_and_epi32(origin_epi32, _mm512_set1_epi32(0b11111)); //sensor_i % 32: bit position inside that word
    const __m512i sensor_int = _mm512_mask_i32gather_epi32(_mm512_setzero_epi32(), mask, int_addr, active_sensors_ptr, 4); //the expensive gather
    const __m512i sensors = _mm512_srlv_epi32(sensor_int, pos_in_int); //shift the wanted bit down to position 0
    return _mm512_and_epi32(sensors, _mm512_set1_epi32(1)); //isolate it: 1 if the sensor is active
}

There are a number of things that I have considered; e.g., if the number of sensors is smaller than (or equal to) 512, then the sensor activity can be squeezed into one zmm register, and you can do a gather on this register, which is, as expected, very fast; sadly, the number of sensors is not often that small.
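For what it’s worth, a minimal sketch of that in-register gather, assuming all sensor activity fits in one zmm register (name and signature are mine, mirroring get_sensors_epi32 above, not code from the repo):

#include <immintrin.h>

//hypothetical sketch: with at most 512 sensors, their activity bits fit in
//one zmm register and the memory gather becomes an in-register permute
__m512i get_sensors_in_register_epi32(
    const __mmask16 mask,          //lanes of connected synapses
    const __m512i origin_epi32,    //16 sensor indices, each < 512
    const __m512i active_sensors)  //512 sensor activity bits in one register
{
    const __m512i word_idx = _mm512_srli_epi32(origin_epi32, 5); //sensor_i / 32
    const __m512i words = _mm512_maskz_permutexvar_epi32(mask, word_idx, active_sensors); //select per-lane words in-register, no memory access
    const __m512i bit_pos = _mm512_and_epi32(origin_epi32, _mm512_set1_epi32(0b11111)); //sensor_i % 32
    return _mm512_and_epi32(_mm512_srlv_epi32(words, bit_pos), _mm512_set1_epi32(1)); //isolate the activity bit
}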

All the code can be found here.

If you have any experience with vectorization of the HTM code, or with streamlining the memory access patterns, please let me know what your experiences or insights are. It does not need to be AVX512.


I don’t know if your SISD code is optimal. If you use the std::set_intersection algorithm (from the standard library), you will achieve O(N+M) [where N is the number of columns and M the number of synapses in your code]. Can you compare your AVX implementation with that? To do that you need both lists to be sorted (perhaps by id).
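A minimal sketch of this suggestion, with illustrative names (per column, the overlap is then the size of the intersection of two sorted id lists):

#include <algorithm>
#include <iterator>
#include <vector>

//illustrative sketch: overlap of one column as the size of the intersection
//of the sorted active-sensor ids and the sorted connected-synapse origin ids
int overlap_by_intersection(
    const std::vector<int>& sorted_active_sensors,
    const std::vector<int>& sorted_connected_origins)
{
    std::vector<int> common;
    std::set_intersection(
        sorted_active_sensors.begin(), sorted_active_sensors.end(),
        sorted_connected_origins.begin(), sorted_connected_origins.end(),
        std::back_inserter(common));
    return static_cast<int>(common.size());
}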

Nevertheless, at least in my experience, it is much faster to compute the overlap proactively, i.e. have a list of the postsynaptic cells connected to each “axon” (i.e. perm > thr) and increment an overlap counter in the source cell each time the axon is triggered.

Interesting dilemma. It may be that for a small number of columns (5,000 or fewer) it is faster to calculate the intersection. But I don’t like sorting because I’m aiming for 1M to 5M columns. Then again, I also don’t like wasting memory filled with zeros. I’m assuming sparsity (of active columns and sensor data) below 1/32, which is where sparse representations become more memory efficient than bitsets (a 32-bit index costs 32 bits per active element, versus 1 bit per element in a bitset). I’ll do some experiments.

Yes, I agree that using the sparsity of the sensors does indeed give a much faster algorithm; I assume that is what you suggested. (This approach is not used in HTM.core.) I found that the calculation of the overlap can indeed be done a lot faster (simply because far fewer gathers are needed, see the code below), but other code (e.g. the prediction of the next active sensor) became much slower. Is your code available?

//Calculate overlap iterator over sensors: sensor backward (sb)
template <typename P>
void calc_overlap_sb_ref(
    const Layer<P>& layer,
    const Dynamic_Param& param,
    const typename Layer<P>::Active_Sensors& active_sensors,
    //out
    std::vector<int>& overlaps)
{
    //note: overlaps is assumed to be zero-initialized by the caller
    for (auto sensor_i = 0; sensor_i < P::N_SENSORS; ++sensor_i)
    {
        if (active_sensors.get(sensor_i))
        {
            const auto& destination_columns = layer.sp_pd_destination_column_sb[sensor_i];
            const auto& permanences = layer.sp_pd_synapse_permanence_sb[sensor_i];

            for (auto i = 0; i < layer.sp_pd_synapse_count_sb[sensor_i]; ++i)
            {
                if (permanences[i] > P::SP_PD_PERMANENCE_THRESHOLD)
                {
                    const auto column_i = destination_columns[i];
                    overlaps[column_i]++; //gather (and scatter) here!
                }
            }
        }
    }
}

I see.

However, you will only have approx. 2% active columns. Therefore, for your system size, you will have up to 100K active cells. I think the NUPIC code computes the overlap this way.

Note that to use std::set_intersection you only need to sort such a list once per epoch (if you share the activated-cell list across dendrites). The number of synapses per dendrite might be fairly high in the SP, though, but you can sort the list on demand (i.e. when you insert or delete synapses), as sketched below.
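A minimal sketch of the on-demand sorting idea, with illustrative names: keep each origin list sorted at insertion time, so it is always ready for std::set_intersection.

#include <algorithm>
#include <vector>

//illustrative sketch: insert a new synapse origin at its sorted position,
//so the list never needs a full re-sort before the intersection
void insert_synapse_sorted(std::vector<int>& sorted_origins, const int new_origin)
{
    const auto pos = std::lower_bound(sorted_origins.begin(), sorted_origins.end(), new_origin);
    sorted_origins.insert(pos, new_origin);
}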

Unfortunately it is a WIP :frowning: I hope to be able to release it soon. It’s a one-man war and I don’t have enough time for it :slight_smile: In any case, my code is fairly dissimilar to NUPIC. It’s focused on simulating a hypothetical HW implementation and is really tied to the physical distribution. There is a hierarchy of components (from cells to hypercolumns) connected like in HW simulators.

Are you writing in VHDL, Verilog or SystemC? I used to have FPGAs for breakfast, and I don’t mind code that does not compile or synthesize.

Heheh… aren’t FPGAs a bit hard to chew? Actually it is just an architectural simulator. Just C++… nothing to synthesize (yet) :slight_smile:

I wrote something like a year ago, to get HTM’s teeth in [0]. Just to get an idea of what I want to do. Now that thing has “grown” a bit… and it’s an ugly, unstable beast (quite hard to handle). I hope to be able to finish soon what I have in hand (and clean up the code a bit). I’ll keep you posted.

[0] https://arxiv.org/abs/1604.05897