My friends and I will be implementing the Temporal Memory algorithm on an FPGA for an embedded AI contest over the next few months. I’m feeling excited! But before any of you fellow brain lovers get too excited too: the HTM core I’ll be working on might not have access to main DRAM due to limitations of the embedded platform. I can only hope we can solve that, or that it ends up not being an issue. And unfortunately, since we’re on an embedded platform, we can’t just port TM to OpenCL and call it a day.
Anyway, are there any ideas or features you’d like to see in a TM on an FPGA? Any possible architectures, optimizations, or crazy things you want a TM to do? I’ll try my best to include them.
(I really want to get back to the 2D object project.)
After some analysis, we found that HTM really needs an entirely new architecture to be fast. Slapping an FPGA on the problem doesn’t fix anything: the performance bottleneck is DRAM access.
FPGAs shine when the ratio of computation to memory access is high, which is what a DNN needs. But HTM only does simple calculations with lots of RAM access.
Using the traditional memory/processor architecture, our best case looks like this:
16-bit 800 MHz DDR3 memory
FPGA @ 200 MHz (close to the limit on an embedded system)
DMA transfers assumed to have zero delay, gathering 4 bytes per cycle (100% of the memory bandwidth)
The CPU does nothing and uses no memory bandwidth
All synapses live in DRAM, stored as 4-byte integers (no permanence values, inference only)
All TM state stored in on-chip block RAM (simultaneous 1-cycle read/write)
So to run a TM with 2048 columns, 16 cells per column, and 256 synapses per cell, there are 2^23 synapses. Since we can fetch one synapse per cycle, we need 2^23 cycles, or 0.04194304 s at 200 MHz, to perform one TM prediction. And that’s the absolute best case; in reality we’ll be lucky to get even 50% of the memory bandwidth due to the CPU and other interference. Sure, I don’t need to access the entire synapse list, only the connected synapses, but I’m keeping the calculation simple here.
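For anyone who wants to check or tweak the numbers, here’s the same back-of-envelope arithmetic as a tiny C program (the geometry is from my example above; the 200 MHz clock and the 4-bytes-per-cycle figure are just the assumptions listed earlier):

```c
#include <stdio.h>

int main(void) {
    /* TM geometry from the example above */
    const long columns  = 2048;
    const long cells    = 16;    /* cells per column  */
    const long synapses = 256;   /* synapses per cell */

    const long total_synapses = columns * cells * synapses;  /* 2^23 = 8,388,608 */

    /* Assumptions from the list above: one 4-byte synapse fetched per cycle at 200 MHz */
    const double fpga_clock_hz = 200e6;
    const double seconds_per_prediction = (double)total_synapses / fpga_clock_hz;

    printf("total synapses         : %ld\n", total_synapses);
    printf("cycles (1 syn/cycle)   : %ld\n", total_synapses);
    printf("time per TM prediction : %.8f s\n", seconds_per_prediction);  /* 0.04194304 s */
    return 0;
}
```

Since the relation is linear, the realistic case of only getting half the bandwidth simply doubles that time.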
I think a more neuromorphic design is needed to run HTM fast. Maybe lots of small cores, each with its own SRAM, simulating columns and connected via a NoC. But I’m not sure whether I’ll have enough time to implement that for the contest. Getting such a design working may as well be a Master’s thesis.
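Just to make that idea a bit more concrete, here’s a purely hypothetical sketch in C of what one such “column core” might do each time step. None of these names are a real API (the noc_* functions, the threshold, the data layout are all made up for illustration); the only point is that the synapse table lives in the core’s private SRAM, so the inner loop never touches DRAM:

```c
#include <stdint.h>
#include <string.h>

#define COLUMNS              2048
#define CELLS_PER_COLUMN     16
#define SYNAPSES_PER_CELL    256
#define TOTAL_CELLS          (COLUMNS * CELLS_PER_COLUMN)
#define ACTIVATION_THRESHOLD 13   /* illustrative value only */

/* Hypothetical NoC interface -- stubs so the sketch compiles. */
static void noc_receive_active_sdr(uint8_t *sdr, size_t bytes) { memset(sdr, 0, bytes); }
static void noc_send_predictive(int column_id, int cell) { (void)column_id; (void)cell; }

/* Per-core state, meant to live entirely in the core's local SRAM:
 * for each cell, the indices of its presynaptic cells (~8 KB here). */
static uint16_t presyn[CELLS_PER_COLUMN][SYNAPSES_PER_CELL];

static int sdr_bit(const uint8_t *sdr, uint16_t idx)
{
    return (sdr[idx >> 3] >> (idx & 7)) & 1;
}

static void column_core_step(int column_id)
{
    uint8_t active[TOTAL_CELLS / 8];                 /* this step's active-cell SDR   */
    noc_receive_active_sdr(active, sizeof active);   /* 1. pull it off the NoC        */

    for (int c = 0; c < CELLS_PER_COLUMN; c++) {     /* 2. overlap vs. local synapses */
        int overlap = 0;
        for (int s = 0; s < SYNAPSES_PER_CELL; s++)
            overlap += sdr_bit(active, presyn[c][s]);
        if (overlap >= ACTIVATION_THRESHOLD)         /* 3. announce predictive cells  */
            noc_send_predictive(column_id, c);
    }
}

int main(void)
{
    column_core_step(0);   /* one core handling column 0, for one time step */
    return 0;
}
```

With the 2-byte indices used in the sketch, one column’s synapses fit in roughly 8 KB of SRAM, which is what makes the per-core-memory idea look plausible in the first place.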
Well… good thought! But no. Although GPUs are massively parallel and do indeed contain a NoC, there’s no way to programmatically send messages across the NoC to other cores. Furthermore, although GPUs are advertised as having 4096 cores, they actually consist of several core clusters, within which the cores all perform essentially the same task. (That’s where the name Vega 64 comes from for AMD’s high-end card: there are 64 clusters.) But for HTM, we want cores that are capable of doing different things.
Something like the Epiphany III processor would be what I’m looking for. The cores can operate and communicate individually. But unfortunately the company behind the processor is dead now.
Anyway, the project is for an embedded system innovation contest sponsored by Xilinx in my country. So FPGAs it is.
I think it is not doable on modern architectures without going through the VRAM. Local registers/RAM are not accessible from the outside (at least on Vega, ARM Mali and VideoCore IV; not sure about Turing and Intel’s upcoming one). I really hope someone can prove me wrong.
Well… it takes ~100 cycles to read/write VRAM. And since HTM typically reads from random addresses (at least the access pattern is not linear), the very small cache on a GPU doesn’t help. And core-to-core synchronization is not really doable on large GPUs today; as I described above, we’d have to synchronize core clusters, which is expensive and wastes a lot of cycles.
Oops, sorry for the confusion. GPU cores don’t really have direct access to the VRAM; data is brought into the core’s L1 cache by the texture unit. (VRAM access is, to some degree, quite like file I/O for a CPU.) By local storage I mean the few KB of registers in each core and the shared SRAM within the cluster (if there is one).
Looks like they’ve gone the IP licensing route; a ton of their stuff was open-sourced wherever they could get away with it (see the footnotes in the paper for version V).
If you are going with a custom chip anyway, you could use a cheap, smaller memory (or several, to get a wider/faster memory interface) and a collection of smaller gate arrays. These could communicate over a high-speed matrix interface. The seams “between” local maps would be the biggest trouble area.
This has the charm that the connections around a given column should be mostly local anyway.
After going through both Xilinx’s and Altera’s OpenCL programming guides: unfortunately, neither says anything about persistent on-chip RAM. So the design I’ll be using for this contest either won’t be implementable in OpenCL on FPGAs or will have to rely on undefined behaviour.
You may want to contact Paul Franzon. He has implemented both Numenta’s HTM and Gerard Rinkus’s Sparsey algorithm on ASIC, SIMD and GPU architectures. I’ve been working closely with Dr. Rinkus on his algorithm, and he has sent me several papers on Paul Franzon’s results. A very basic outline of the results is here. I’m not sure if I’m allowed to share the papers I have, but you can at least talk to Franzon. I’m sure he can give you some pointers on where to start and fixes for the problems he ran into when designing the test.
I do know that, at least with the ASIC implementation, HTM had to be redesigned for new tasks (if I understand the paper correctly), as opposed to Sparsey, which is more generic and capable of being re-purposed. So it’s possible you may have to resort to task-specific tricks to get around the RAM issues. He did mention that RAM was the biggest weakness of both algorithms; relatively large amounts were needed. Regardless, he may have some insight into what you’ll need to do.