After some analysis, we found that HTM really needs an entirely new architecture to be fast. Slapping an FPGA on the problem doesn't fix anything: the performance bottleneck is DRAM access.
FPGAs are good when the calculation-to-memory-access ratio is high, like what a DNN needs. But HTM only does simple calculations with lots of RAM access.
Using the traditional memory/processor architecture, our best case is:
- 16-bit 800MHz DDR3 Memory
- FPGA @ 200MHz (almost the limit on an embedded system)
- assuming DMA transfers have zero delay while gathering 4 bytes per cycle (100% memory bandwidth)
- CPU does nothing and uses no memory bandwidth
- All synapses live in DRAM, stored as 4-byte integers (no permanence values, inference only)
- All TM state stored in on-chip block RAM (1-cycle simultaneous read/write)
So to run a TM with 2048 columns, 16 cells per column, and 256 synapses per cell, there are 2^23 synapses. Since we can fetch one synapse per cycle, we need 2^23 cycles, or 0.04194304s, to perform one TM prediction. And this is the absolute optimal speed; in reality we will be lucky to get even 50% of the memory bandwidth due to the CPU and other interference. Sure, I don't need to access the entire synapse list, only the connected synapses, but I'm keeping the calculation simple here.
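The back-of-envelope arithmetic above can be sketched as a few lines of Python, using the idealized assumptions from the list (one 4-byte synapse fetched per FPGA cycle, 100% memory bandwidth):

```python
# Estimate the DRAM-bound time for one TM prediction,
# under the idealized assumptions listed above.

columns = 2048          # TM columns
cells_per_column = 16   # cells per column
synapses_per_cell = 256 # synapses per cell
fpga_clock_hz = 200e6   # FPGA @ 200 MHz

synapses = columns * cells_per_column * synapses_per_cell
assert synapses == 2**23  # 8,388,608 synapses

# DMA delivers 4 bytes (one synapse) per FPGA cycle, so the
# cycle count equals the synapse count in the best case.
cycles = synapses
seconds = cycles / fpga_clock_hz
print(f"{synapses} synapses -> {seconds:.8f} s per prediction")
# -> 8388608 synapses -> 0.04194304 s per prediction
```

At a more realistic 50% of the memory bandwidth, simply double the result.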
I think a more neuromorphic design is needed to run HTM fast. Maybe a lot of small cores, each with its own SRAM to simulate columns, connected via a NoC. But I'm not sure whether I have enough time to implement that for my contest. Getting such a design working could well be a Master's thesis on its own.