Ideas for HTM on FPGAs

marty1885 · March 7, 2019, 10:29am

My friends and I will be implementing the Temporal Memory algorithm on an FPGA for an embedded AI contest in the next few months. I’m feeling excited! But before any of you fellow brain lovers get too excited: the HTM core I’ll be working on might not have access to the main DRAM due to some limitations of the embedded platform. I can only hope we can solve it, or that it ends up not being an issue. Unfortunately, because we are on an embedded platform, we can’t just port TM to OpenCL and call it a day.

Anyway, are there any ideas or features that you want to see of a TM on an FPGA? Any possible architecture, any optimizations, or crazy things you want a TM to do? I’ll try my best to include them.

(I really want to get back to the 2D object project.)

michaelklachko · March 12, 2019, 9:46pm

What task are you planning to do?

marty1885 · March 13, 2019, 12:18am

Just plain Temporal Memory for now. But the same structure can also be used for SP and other newer algorithms.

Edit: I think I misunderstood your question. I’m planning on doing anomaly detection on an embedded platform.

marty1885 · March 13, 2019, 5:37am

After some analysis, we found that HTM really needs an entirely new architecture to be fast. Slapping an FPGA on the problem doesn’t fix anything. The performance is bottlenecked by DRAM access.

FPGAs are good when the ratio of calculation to memory access is high, like what a DNN needs. But HTM only does simple calculations with lots of RAM access.
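To illustrate why the workload is memory-bound, here is a minimal Python sketch of an inference-only prediction pass. It is a simplification made for illustration, not the actual implementation: each cell has a single flat list of presynaptic cell indices, there are no permanence values, and the names `predict_cells` and `threshold` are hypothetical.

```python
def predict_cells(synapses, active_cells, threshold):
    """Return the cells whose number of active presynaptic cells reaches threshold.

    synapses[c] is the list of presynaptic cell indices for cell c.
    active_cells is the set of currently active cell indices.
    """
    predictive = []
    for cell, presynaptic in enumerate(synapses):
        overlap = 0
        for source in presynaptic:      # one memory read per synapse
            if source in active_cells:  # one lookup and at most one add
                overlap += 1
        if overlap >= threshold:
            predictive.append(cell)
    return predictive


# Tiny example: 3 cells with 3 synapses each.
demo_synapses = [[0, 1, 2], [3, 4, 5], [0, 4, 5]]
demo_result = predict_cells(demo_synapses, active_cells={0, 1, 4}, threshold=2)
print(demo_result)
```

Every synapse costs one read from the synapse list but only a lookup and an increment, so the arithmetic units sit idle while waiting on memory.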

Using the traditional memory/processor architecture, our best case is:

16-bit 800MHz DDR3 memory
FPGA @ 200MHz (almost the limit on an embedded system)
DMA transfers have zero delay while gathering 4 bytes per cycle (100% memory bandwidth)
The CPU does nothing and uses no memory bandwidth
All synapses live in DRAM, stored as 4-byte integers (no permanence values, inference only)
All TM state is stored in on-chip block RAM (simultaneous read/write in 1 cycle)

So to run a TM with 2048 columns, 16 cells per column and 256 synapses per cell, there are 2^23 synapses. Since we can get 1 synapse per cycle, we need 2^23 cycles, or 0.04194304s at 200MHz, to perform one TM prediction. And this is the absolute optimum; in reality we will be lucky to get even 50% of the memory bandwidth due to the CPU and other interference. Sure, I don’t need to read the entire synapse list, only the connected synapses, but this keeps the calculation simple.

I think a more neuromorphic design is needed to run HTM fast. Maybe a lot of small cores, each with its own SRAM attached, to simulate columns and connected via a NOC. But I’m not sure whether I have enough time to implement that for my contest. Getting such a design working might as well be a Master’s thesis.
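The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers from this post; the function name `tm_pass_seconds` and the `bandwidth_efficiency` parameter are made up for illustration.

```python
def tm_pass_seconds(columns, cells_per_column, synapses_per_cell,
                    clock_hz, synapses_per_cycle=1.0, bandwidth_efficiency=1.0):
    """Return (total synapses, seconds for one full pass over all synapses)."""
    total_synapses = columns * cells_per_column * synapses_per_cell
    cycles = total_synapses / (synapses_per_cycle * bandwidth_efficiency)
    return total_synapses, cycles / clock_hz


# Best case from the post: 1 synapse (4 bytes) per cycle at 200 MHz.
total, best_case = tm_pass_seconds(2048, 16, 256, clock_hz=200e6)

# Same design if only 50% of the memory bandwidth is actually available.
_, half_bandwidth = tm_pass_seconds(2048, 16, 256, clock_hz=200e6,
                                    bandwidth_efficiency=0.5)

print(total, best_case, half_bandwidth)
```

With these inputs the best case works out to roughly 24 predictions per second, and about half that at 50% bandwidth.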

sunguralikaan · March 13, 2019, 1:01pm

Totally agreed.

Sounds like we need a brain to simulate the brain properly

MAK · March 13, 2019, 4:05pm

Hi,
why not use a GPU for that task?
I think the new GPU architectures answer your wish for a

“lot of small cores with their own SRAM attached to simulate columns and connected via a NOC”

marty1885 · March 13, 2019, 4:21pm

Well… good thought! But no. Although GPUs are massively parallel and do indeed contain a NOC, there’s no way to programmatically send messages across the NOC to other cores. Furthermore, although GPUs are advertised as having 4096 cores, they in fact have several core clusters, in which the cores generally perform the same task. (That’s where the name Vega 64 comes from for AMD’s high-end card: there are 64 clusters.) But for HTM, we want the cores to be capable of doing different things.

Something like the Epiphany III processor would be what I’m looking for. The cores can operate and communicate individually. But unfortunately the company behind the processor is dead now.

Anyway, the project is for an embedded system innovation contest sponsored by Xilinx in my country. So FPGAs it is.

Bitking · March 13, 2019, 4:24pm

My thought on the GPU thing is to communicate the state as a BLIT move and ship data around between processors as small images.

marty1885 · March 13, 2019, 4:31pm

I think it is not doable on modern architectures without going through the VRAM. Local registers/RAM are not accessible from the outside (at least on Vega, ARM Mali and VideoCore IV; not sure about Turing and Intel’s upcoming one). I really hope someone can prove me wrong.

Bitking · March 13, 2019, 4:48pm

You say that like it is a bad thing.

What is the drawback of using the VRAM as a scratchpad?
The newer Maxwell-based GTX cards have multiple GB of local storage and thousands of cores.

marty1885 · March 13, 2019, 4:59pm

Well… It takes ~100 cycles to read/write from VRAM. And since HTM typically reads from a random address (at least the access pattern is not linear), the very small cache on a GPU doesn’t help. And core-to-core synchronization is not really doable on large GPUs today. As I described above, we’ll have to synchronize core clusters, which is expensive and wastes a lot of cycles.

Oops, sorry for the confusion. GPU cores don’t really have direct access to the VRAM; data is brought into the core’s L1 cache by the texture unit. (To some degree, VRAM access is like file I/O for a CPU.) By local storage I mean the few KB of registers on the core and the shared SRAM within the cluster (if there is one).

MaxLee · March 13, 2019, 9:26pm

Looks like they’ve gone the IP licensing route; a ton of their stuff was open-sourced wherever they could get away with it (see the footnotes on the paper for the version V).

Bitking · March 13, 2019, 9:57pm

If you are going with a custom chip anyway - you could use a cheap, smaller memory (or several to get a wider/faster memory interface) and a collection of smaller gate arrays. These could communicate with a high-speed matrix interface. The seams “between” local maps would be the biggest trouble area.

This has the charm that the connections around a given column should be mostly local anyway.

marty1885 · March 14, 2019, 4:34pm

I went through both Xilinx’s and Altera’s OpenCL programming guides. Unfortunately, neither of them says anything about having persistent on-chip RAM. So the design I’ll be using for this contest will either not be implementable in OpenCL on FPGAs or will have to rely on undefined behaviour.

Cairo · March 15, 2019, 6:32am

You may want to contact Paul Franzon. He has implemented both Numenta HTMs and Gerard Rinkus’s Sparsey algorithm on ASIC, SIMD and GPU architectures. I’ve been working closely with Dr. Rinkus on his algorithm and he has sent me several papers on Paul Franzon’s results. A very basic outline of the results is here. I’m not sure if I’m allowed to show the papers I have, but you can talk to Franzon at least. I’m sure he can give you some pointers on where to start and fixes for the problems he had when designing the test.

I do know that, at least with the ASIC implementation, HTMs had to be redesigned for new tasks (if I understand the paper correctly), as opposed to Sparsey, which is more generic and capable of being re-purposed. So it’s possible that you may have to deal with task-specific tricks to get around the RAM issues. He did mention that RAM was the biggest weakness of both algorithms, as relatively large amounts were needed. Regardless, he may have some insight into what you’ll need to do.

Good luck.
