I'd like to share some experience with vectorizing the SP and TP with the AVX512 instruction set. I didn't want to hijack another discussion (Some experience with SIMD optimization?) because I'm only considering AVX512-BW (AVX-512 Byte and Word Instructions). I know not many of us may have access to a processor with this instruction set, but life is too short to program in assembly for 'ancient' technologies. At the end of this post I have some general questions.
Good news and bad news. The bad news is that the speed of the (TP and SP) algorithms is dominated by gathers, and the only way to get a true performance increase out of gathers is to get rid of them. (I have some ideas on how to do that partially, but that's for another discussion.) Why are gathers snails? Loads with irregular or unpredictable strides cannot be prefetched. Additionally, the memory needed for the synapses generally does not fit in L3 cache (8 MB, shared by all cores), and the latency of an L3 cache miss is a staggering 42 cycles plus 51 ns on my Skylake X.
Here is some C++ code taken from my GitHub repo (see below) that calculates the overlap in the SP; as you can see, down in the inner loop sits a sole and yet deadly gather. (P holds static parameters.)
template <typename P>
void calc_overlap(
    const Layer<P>& layer,
    const Dynamic_Param& param,
    const typename Layer<P>::Active_Sensors& active_sensors,
    //out
    std::vector<int>& overlaps)
{
    for (auto column_i = 0; column_i < P::N_COLUMNS; ++column_i)
    {
        const auto& permanence = layer.sp_pd_synapse_permanence[column_i];
        const auto& synapse_origin = layer.sp_pd_synapse_origin_sensor[column_i];
        int overlap = 0;
        for (auto synapse_i = 0; synapse_i < P::SP_N_PD_SYNAPSES; ++synapse_i)
        {
            if (permanence[synapse_i] > P::SP_PD_PERMANENCE_THRESHOLD)
            {
                const auto origin_sensor = synapse_origin[synapse_i];
                overlap += active_sensors.get(origin_sensor); // gather here!
            }
        }
        overlaps[column_i] = (overlap < P::SP_STIMULUS_THRESHOLD) ? 0 : overlap;
    }
}
Some good news: with the introduction of AVX2, gather instructions were added to the ISA. Yet no need to cheer: these instructions do not make the gather (i.e. the memory access) itself much faster; they are there to get memory content into the right position in a register.
What have I tried (in a nutshell): a permanence is stored in an 8-bit integer, and 64 consecutive elements are loaded into a zmm register (that is one cache line; you could stream it such that you do not pollute the cache, since you won't need the data again). Compare this register against the connected threshold, which yields a 64-bit mask. Load 16 sensor indices, use these indices to gather the 16 bits indicating whether those 16 sensors are on or off, and add these bits up. Do this step 4 times, then move on to the next 64 synapses. Good news: this is 2.4 times faster than the code shown above, but not the 16 times faster one would have hoped for from AVX512.
template <typename P>
void calc_overlap_avx512(
    const Layer<P>& layer,
    const Dynamic_Param& param,
    const typename Layer<P>::Active_Sensors& active_sensors,
    //out
    std::vector<int>& overlaps) //size = P::N_COLUMNS
{
    const __m512i connected_threshold_epi8 = _mm512_set1_epi8(P::SP_PD_PERMANENCE_THRESHOLD);
    auto active_sensors_ptr = active_sensors.data();
    const int n_blocks = htm::tools::n_blocks_64(P::SP_N_PD_SYNAPSES);
    for (auto column_i = 0; column_i < P::N_COLUMNS; ++column_i)
    {
        auto permanence_epi8_ptr = reinterpret_cast<const __m512i *>(layer.sp_pd_synapse_permanence[column_i].data());
        auto origin_epi32_ptr = reinterpret_cast<const __m512i *>(layer.sp_pd_synapse_origin_sensor[column_i].data());
        __m512i overlap_epi32 = _mm512_setzero_epi32();
        for (int block = 0; block < n_blocks; ++block)
        {
            const __m512i permanence_epi8 = _mm512_load_si512(&permanence_epi8_ptr[block]); //load 64 permanence values
            const __mmask64 connected_mask_64 = _mm512_cmpgt_epi8_mask(permanence_epi8, connected_threshold_epi8);
            for (int i = 0; i < 4; ++i)
            {
                const __mmask16 mask_16 = static_cast<__mmask16>(connected_mask_64 >> (i * 16));
                const __m512i origin_epi32 = _mm512_load_si512(&origin_epi32_ptr[(block * 4) + i]);
                overlap_epi32 = _mm512_add_epi32(overlap_epi32, get_sensors_epi32(mask_16, origin_epi32, active_sensors_ptr));
            }
        }
        const int overlap_int = _mm512_reduce_add_epi32(overlap_epi32);
        overlaps[column_i] = (overlap_int < P::SP_STIMULUS_THRESHOLD) ? 0 : overlap_int;
    }
}
__m512i get_sensors_epi32(
    const __mmask16 mask,
    const __m512i origin_epi32,
    const void * active_sensors_ptr)
{
    const __m512i int_addr = _mm512_srli_epi32(origin_epi32, 5); // which 32-bit word holds the sensor bit
    const __m512i pos_in_int = _mm512_and_epi32(origin_epi32, _mm512_set1_epi32(0b11111)); // bit position within that word
    const __m512i sensor_int = _mm512_mask_i32gather_epi32(_mm512_setzero_epi32(), mask, int_addr, active_sensors_ptr, 4); // the deadly gather
    const __m512i sensors = _mm512_srlv_epi32(sensor_int, pos_in_int);
    return _mm512_and_epi32(sensors, _mm512_set1_epi32(1));
}
There are a number of things that I have considered. E.g., if the number of sensors is smaller than (or equal to) 512, then the sensor activity can be squeezed into a single zmm register, and you can do the gather on this register. That is, as expected, very fast; sadly, the number of sensors is not often that small.
All the code can be found here.
If you have any experience with vectorization of the HTM code, or with streamlining its memory access patterns, please let me know what your experiences or insights are. It does not need to be AVX512.