Hand-wavy analysis of HTM on FPGA from synthesizer reports

I managed to scrape together some time to compile one of Etaler’s OpenCL kernels for FPGA. These compilations take hours on a large server, so I only have a little to show. Hopefully someone in the future will find this post useful.

Setup

I’m using Altera/Intel’s OpenCL HLS compiler to compile OpenCL for FPGA. No optimization was attempted to make the kernels suitable for FPGA; they are used as-is. I’m also targeting the rather old DE5-Net board, but it is the largest and fastest I have. (I have never actually run HTM on it; I’m only using the synthesizer reports.)

Quick specs:

- Compiler version: v16.0
- FPGA: Stratix V GX
- RAM: DDR3
- Board: DE5-Net

Analysis

The general trend is that we are completely bound by memory access, and the HTM algorithms cannot be pipelined by the compiler because of their nested loops.
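
To make the nested-loop point concrete, here is a minimal sketch of the kind of loop nest involved. This is a hypothetical kernel of my own, not Etaler’s actual code; all the names (`overlap`, `synapses_per_column`, etc.) are made up.

```c
// Hypothetical overlap-style kernel (NOT Etaler's actual code).
// One work-item per mini-column; the inner loop walks that column's
// synapses. The scattered gather loads from __global memory are what
// keep the design memory-bound, and this kind of loop nest is what
// the HLS compiler struggles to pipeline.
__kernel void overlap(__global const int*   restrict synapses,
                      __global const uchar* restrict input,
                      __global int*         restrict overlaps,
                      const int synapses_per_column)
{
    const int col = get_global_id(0);
    int sum = 0;
    for (int s = 0; s < synapses_per_column; ++s) {
        const int target = synapses[col * synapses_per_column + s];
        if (target >= 0 && input[target]) // scattered (random) DDR read
            sum += 1;
    }
    overlaps[col] = sum;
}
```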

For area, most of it is consumed by the LSUs (load-store units) used to access the DDR memory, followed by the local memory Etaler uses to accelerate operations. It would be possible to share the local buffers/LSUs across kernels to reduce resource use (since we are unlikely to use them at the same time), but sharing local buffers is unfortunately outside the compiler’s spec; we might run into undefined behaviour there.

Under the baseline setup, the core runs at most at 200 MHz (the FPGA’s fabric frequency, not including propagation delays) and issues at most 8 bytes of memory access per cycle. That is far from saturating the memory bus (which runs at 1600 MT/s).
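
For a rough sense of scale (my own back-of-the-envelope numbers, assuming a 64-bit DDR3 channel): 200 MHz × 8 B/cycle ≈ 1.6 GB/s issued by the core, while DDR3-1600 on a 64-bit channel peaks at 1600 MT/s × 8 B ≈ 12.8 GB/s. So the baseline core demands only about an eighth of a single channel’s bandwidth.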

I tried asking the compiler for SIMD optimization, which tells it to perform multiple work-items under the same control flow. Fortunately this optimization barely uses any extra area, but it might not be optimal when the control flow diverges (not a problem in inference, but quite a problem in learning).
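
For reference, this is roughly what that request looks like with the Intel FPGA SDK for OpenCL kernel attributes. The kernel below is a placeholder of mine, and the sizes are examples:

```c
// Sketch of the SIMD request (body omitted; sizes are examples).
// num_simd_work_items must evenly divide the required work-group
// size, and the vectorized work-items share one set of control
// logic -- which is why divergent branches hurt.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(8)))
__kernel void overlap_simd(__global const int*   restrict synapses,
                           __global const uchar* restrict input,
                           __global int*         restrict overlaps,
                           const int synapses_per_column)
{
    /* same body as the overlap sketch above */
}
```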

The multi-compute-unit optimization works similarly to SIMD, but it allows divergent control flow. It is more flexible but also uses a lot more area; the synthesis report supports this.
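
The corresponding attribute, again on a placeholder kernel of mine with an arbitrarily chosen count:

```c
// Sketch of the multi-compute-unit request. Each compute unit gets
// its own full pipeline (and its own LSUs), so area grows roughly
// linearly with the count, matching what the report shows.
__attribute__((num_compute_units(2)))
__kernel void learn(__global float*       restrict permanences,
                    __global const uchar* restrict input,
                    const int synapses_per_column)
{
    /* learning body omitted; work-groups are distributed across
       the compute units by the hardware scheduler */
}
```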

Now for the hand-wavy part, based on the reports and the reports only: I’d suggest wide SIMD optimization for all HTM algorithms besides the learning-related ones (permanence updates, growing new synapses). I’d also suggest that HTM, including the learning algorithms, can fit nicely into an embedded-grade FPGA (e.g. the Cyclone V) and could easily saturate the memory bandwidth there (50 MHz bus, 400 MT/s DDR3).


Hello from the future! Yes, this was very useful, thanks.

I’ve been experimenting with HDL designs of components of the HTM algorithms using MyHDL, a Python modeling language that can generate VHDL and Verilog from your designs. It’s handy if your Python is strong but you’ve never used VHDL or Verilog for anything of note. Of course, you have to understand digital design too.

I can get a lot of mileage out of computing spatial pooler activations in parallel, but the learning is a different beast. I’ve been trying to think of clever design approaches that ignore some synaptic inputs when computing activations but can still follow up on them when learning.

It’s a fun distraction, but I haven’t been pursuing it in earnest.
