Hand-wavy analysis of HTM on FPGA synthesiser report

I managed to scrape some time together to compile one of Etaler’s OpenCL kernel into FPGA. These compilation takes hours on a large server. So I only have little to show. Hopefully some one in the future will find this post useful.

Setup

I’m using Altera/Intel’s OpenCL HLS Compiler to compile OpenCL into FPGA. And no optimization is attempted on making the kernels suitable for FPGA; they are used as-is. Also I’m targeting the quite old DE5-Net board, but it is the largest and the fastest I have. (I havce never actually ran HTM on it. I’m only using the synthesizer reports)

Quick specs:

Compiler version v16.0
FPGA Stratix V GX
RAM DDR3
Board DE5-Net

Analysis

The general trend is that we are totally bounded by memory access. And HTM algorithms are not pipeline-able by the compiler due to the nested loops.

For area analysis, we find that most of the area is used by the LSU(load store unit), used to access the DDR memory. Then is the local memory used by Etalter to accelerate operations. It would be possible to share the local buffer/LSU across kernels to reduce resource use (since we are unlikely to use them at the same time). But sharing local buffer is unfortunately out of the compiler’s spec; we might encounter undefined behaviour there.

Under the baseline setup, the core can at most run at 200MHz (The FPGA’s fabric frequency, not including propagation delays) and issues at most 8 bytes of memory access every cycle. The core is far from saturating the memory bus (runs at 1600MT/s).

I tried asking the compiler for SIMD optimization. This tells the compiler to perform multiple work under the same control flow. Fortunately this optimization barely uses any extra area. But might no be optimal when the control flow diverges (not a problem in inference, but quite a problem in learning).

The Multi-Comput Unit optimization works the same way SIMD does. But it allows differed control flow. It is more flexible but also used up a lot more area. - The synthesizing report supports this.

Then this is the hand wavy part. Based on the report and the reports only. I’ll suggest doing wide SIMD optimization for all HTM algorithms besides learning related ones (permanence update, growing new synapses). Also that HTM, including the learning algorithms, can nicely fit into an embedded grade FPGA (ex: The Cyclone V) and could saturate the memory bandwidth there easily (50MHz bus, 400MT/s DDR3).

8 Likes

Hello from the future! Yes, this was very useful thanks.

I’ve been experimenting with HDL designs of components of the HTM algorithms using MyHDL, which is a Python modeling language that can generate VHDL and Verilog from your designs. It’s handy if your python is great but you never used the VHDL or Verilog for anything of note. Of course, you have to understand digital design too.

I can get a lot of mileage making parallel computations of spatial pooler activations, but the learning is a different beast. I’ve been trying to think of clever design approaches to ignore some synaptic inputs when computing activation, but be able to follow up when learning.

It’s a fun distraction, but I haven’t been pursuing it in earnest.

3 Likes

Hi Jacob!

I’m working on a similar thing in parallel. Maybe we should merge forces?

Do you have a repo anywhere with your project?

I’m more familiar with verilog (system verilog) myself, and will be spending the next couple of days putting serious thought time into this, such as what is required from a chip to carry out learning operations.

My main idea is that you won’t be able to keep state for the htm on an FPGA (just doesn’t have enough memory for that, nor should it). But instead looking at the operations that are required for HTM. I’m envisioning more of an HTM co-processor that works in concert with a CPU (across multiple threads). I’ve outlined how it can be done with current traditional hardware.

In my approach, the “Pool” is just a referee keeping track of which minicolumns best match the current input, picking winning minicolumns. It also keeps track of the winning cells from each of those minicolumns, so that the next timestep’s winning cells know who to update. So there are only four pieces of information that need to be worked on or updated by some device: previouswinners, currentwinners, minicolumn connection strength, and cell connections (not their strengthd… only their connections… brain doesn’t remember “strength” of proximal connections, just that they are there).
So my approach, to break it down:

  1. Find overlap scores for all columns in a pool.
  2. Pick “currentwinners!”… top 2%, for example.
  3. Have currentwinners strengthen their connections to input space
  4. Have currentwinners update the lists of previous previouswinners so that next time previouswinners “win”, they call out to and “vote” for “currentwinners” (potentially putting them into a predictive state).
  5. Have currentwinners check their own call list and vote for who they think will win the next round.

For example, for checking the overlap score between a minicolumn and an input space:

  • register to store column address in RAM (where results will be shipped out to)
  • binary comparator(s) to AND the inputspace bit and column mapping bit
  • an accumulators to increment when the AND output is HIGH (tracking score)
  • a counter in increment each, up to the input space size value (tracking bit index)

For this step, all we care about is finding overlap score across the columns; results in the counter would be stored out to RAM when index counter reaches its determined value, as the next column’s connection map is loaded in.

For cell selection and voting:
hardware setup (FPGA resources):

  • Previous winnercells register (memory addresses to their call-list arrays)
  • Current winnercells register (to be filled)
  • Registers for cell votes (register size matching number of bits required to hold all the vote values)
  • Registers for corresponding cells’ memory addresses

Procedure:

  • Load up previous winnercells register with previous timestep’s winners (
  • Load up columncell registers with votes and votebox addresses
  • Check if column was predictive (could be as simple as a bit flag)
  • If predictive, find cell with highest vote count, adding its address to call lists of previouswinners
  • If not predictive, find cell with lowest vote count, adding its address to the call lists of previous winners
  • Store winning cells call list addresses and current array index value to winnercells register

With these two pieces, it should be possible to implement HTM on an FPGA (at least it seems in my mind…).

I would focus first on getting an implementation of a spatial pooler working before tackling the much harder challenge of the temporal sequence learner. You will find that all of your memory and compute growth will come from that. If you can make a tight and efficient spatial pooler, you will be able to leverage the tools you built and make a lean sequence learner.

For an efficient implementation, you need to make several assumptions that don’t cost you too much.

  • represent activations at the bit level, 8 neuron winners per byte. Not only does this reduce your memory footprint by 1/8, but it also allows you to use efficient binary AND operations like you previously identified
  • take advantage of wide AND operations, if you can use 32 or 64 bit ANDs when applying a bitmask representing synapse connectivity with the input space, you’ve got your neuron’s synapse input very quickly
  • use POPCOUNT algorithms used for computing a neurons overlap score. You can find a lot of discussion about this on Wikipedia under the Hamming Weight
  • use unsigned integer words for your overlap scores. the size of the word can be determined by the size of your “potential pool” of synapses. you shouldn’t need more than 16 bits
  • use unsigned 8-bit integers for your synapse permanences. There’s no reason you should need any more resolution than that
  • unlike in the nupic implementation that has a potential pool of 80-90%, your potential pool should be much smaller, at around 10-20%. you can make up for it by adding more neurons
  • the big challenge is managing the synaptic addresses. you may actually find that it is faster and cheaper to represent the synaptic connectivity as a bit string instead of a list of addresses. Even the potential pool of permanences may be better represented as an array of uint8. It depends on the tradeoff of a larger memory footprint vs. the compute cost of pointer dereferencing.
  • avoid boosting. I’ve never figured out a way to make it stable, and it just adds floating points into the mix.
2 Likes