I managed to scrape some time together to compile one of Etaler’s OpenCL kernels for an FPGA. Each compilation takes hours on a large server, so I only have a little to show. Hopefully someone in the future will find this post useful.
I’m using Altera/Intel’s OpenCL HLS compiler to compile OpenCL for the FPGA. No attempt was made to optimize the kernels for FPGA; they are used as-is. I’m targeting the quite old DE5-Net board, which is the largest and fastest I have. (I have never actually run HTM on it; I’m only using the synthesizer reports.)
FPGA: Stratix V GX
The general trend is that we are completely bound by memory access. Also, the compiler cannot pipeline the HTM algorithms because of their nested loops.
For area analysis, most of the area is used by the LSUs (load-store units) that access the DDR memory. Next is the local memory that Etaler uses to accelerate operations. It would be possible to share the local buffers/LSUs across kernels to reduce resource use (since the kernels are unlikely to run at the same time). Unfortunately, sharing local buffers is outside the compiler’s spec; we might encounter undefined behaviour there.
Under the baseline setup, the core can run at 200MHz at most (the FPGA’s fabric frequency, not including propagation delays) and issues at most 8 bytes of memory access per cycle. The core is far from saturating the memory bus (which runs at 1600MT/s).
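To put rough numbers on "far from saturating", here is a back-of-the-envelope calculation. The 200MHz clock, 8 bytes per cycle and 1600MT/s come from the report figures above; the 64-bit DDR bus width is my assumption (it is the usual width for a DDR3 module, but I have not checked it against the board manual), and the function names are just for illustration.

```python
def kernel_bandwidth(clock_hz: float, bytes_per_cycle: int) -> float:
    """Peak bytes per second the kernel can request from memory."""
    return clock_hz * bytes_per_cycle


def ddr_bandwidth(transfers_per_s: float, bus_width_bits: int = 64) -> float:
    """Peak bytes per second of one DDR channel.

    bus_width_bits=64 is an assumption, not a number from the report.
    """
    return transfers_per_s * bus_width_bits / 8


kernel = kernel_bandwidth(200e6, 8)   # 200 MHz fabric clock, 8 bytes per cycle
ddr = ddr_bandwidth(1600e6)           # 1600 MT/s, assumed 64-bit bus
utilization = kernel / ddr

print(f"{kernel / 1e9:.1f} GB/s of {ddr / 1e9:.1f} GB/s ({utilization:.1%})")
```

Under that assumption the baseline kernel can ask for at most about an eighth of what a single channel could deliver, and that is before counting stalls.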
I tried asking the compiler for SIMD optimization. This tells the compiler to process multiple work-items under the same control flow. Fortunately, this optimization uses barely any extra area, but it might not be optimal when the control flow diverges (not a problem in inference, but quite a problem in learning).
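A toy model of why divergence hurts SIMD: if all lanes in a group share one control flow, the whole group has to run as long as its slowest lane. This is a generic lock-step model, not a description of what the Intel compiler actually generates, and the trip counts below are made-up numbers standing in for "uniform" and "divergent" loops.

```python
def lockstep_utilization(trip_counts, simd_width):
    """Fraction of lane-cycles doing useful work when lanes run in lock-step.

    trip_counts: loop iterations each work-item needs (hypothetical values).
    simd_width:  number of work-items sharing one control flow.
    """
    useful = sum(trip_counts)
    spent = 0
    for i in range(0, len(trip_counts), simd_width):
        group = trip_counts[i:i + simd_width]
        # Every lane in the group is occupied until the longest one finishes.
        spent += max(group) * simd_width
    return useful / spent if spent else 1.0


uniform = [16] * 8                       # every work-item loops 16 times
divergent = [16, 2, 9, 1, 16, 3, 1, 4]   # data-dependent trip counts

print(lockstep_utilization(uniform, 4))    # 1.0
print(lockstep_utilization(divergent, 4))  # 0.40625
```

With uniform work every lane stays busy; with uneven work more than half of the lane-cycles in this example are wasted waiting.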
The Multi-Compute Unit optimization works in a similar way to SIMD, but it allows divergent control flow. It is more flexible but also uses a lot more area; the synthesis report supports this.
Then this is the hand-wavy part, based on the reports and the reports only. I’d suggest wide SIMD optimization for all HTM algorithms except the learning-related ones (permanence update, growing new synapses). I’d also suggest that HTM, including the learning algorithms, can fit nicely into an embedded-grade FPGA (e.g. the Cyclone V) and could easily saturate the memory bandwidth there (50MHz bus, 400MT/s DDR3).