HTM on VLIW will (possibly) be slow

marty1885 · June 14, 2020, 7:48am

Just sharing in case someone is looking into the same thing.

I mentioned in my project report "Hierarchical Temporal Memory Agent in standard Reinforcement Learning Environment" that HTM on a VLIW may be a good idea, because the instructions can be scheduled well into consistent bundles of 3~4 instructions. And I proved it by writing the assembly by hand.

Yesterday I found out that Compiler Explorer supports Kalray’s VLIW processor, so I put the SP overlap algorithm into it and tried. The result is disappointing. Their compiler spits out code that effectively turns a VLIW into a single-issue in-order processor (a lot of single-instruction bundles). I’m not sure if this is caused by their VLIW architecture or by the compiler not being able to find an optimal schedule. In any case, HTM on their processor with the current compiler will be quite slow. I still hope HTM on VLIW can be fast.

Source code: https://godbolt.org/z/owBBpr
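For anyone who does not want to open the link, the SP overlap step is roughly the loop below. This is only a minimal sketch, not the code behind the Compiler Explorer link: the `Synapse` struct, the per-column synapse lists, and the `connected_threshold` parameter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: every column owns its own list of synapses.
struct Synapse {
    std::uint32_t input_index; // which input bit this synapse connects to
    float permanence;          // connection strength
};

// For each column, count the connected synapses whose input bit is active.
std::vector<std::uint32_t> compute_overlap(
    const std::vector<std::uint8_t>& input,
    const std::vector<std::vector<Synapse>>& columns,
    float connected_threshold)
{
    std::vector<std::uint32_t> overlap(columns.size(), 0);
    for (std::size_t c = 0; c < columns.size(); ++c) {
        std::uint32_t score = 0;
        for (const Synapse& s : columns[c]) {
            assert(s.input_index < input.size());
            // An indirect load followed by a data-dependent branch.
            if (s.permanence >= connected_threshold && input[s.input_index] != 0) {
                ++score;
            }
        }
        overlap[c] = score;
    }
    return overlap;
}
```

Each iteration of the inner loop is independent of the others apart from the final sum, which is why packing several of them into one bundle looks plausible on paper.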

AMZ · June 14, 2020, 12:12pm

RISC V may be a better choice because it is quite flexible!

marty1885 · June 14, 2020, 12:37pm

It’s totally true. But flexibility has its cost, mainly power and chip area. It’ll be great if we can live without it. Then computing HTM will be a lot faster.

AMZ · June 14, 2020, 4:40pm

Agreed, but any undesired increase in power and chip area can be overcome by using a smaller technology node.

vpuente · June 15, 2020, 7:44am

I think SPMD ISA extensions are the weapon of choice here. Do you know https://ispc.github.io/ ?

Dependencies (as usual) will make VLIW quite hard to schedule statically. In contrast, current OoO processors and state-of-the-art SPMD AVX512 (which is a beast) can fly. If you combine that with SMT,…

marty1885 · June 15, 2020, 11:19am

Yes, I agree SIMD is the way to go. But it’ll be great if VLIW does the trick (see the image, some hypothetical code I wrote a while ago). VLIW is a lot less power hungry than OoOE and uses less chip area than SIMD. If VLIW works, then we could build a cluster of DSPs to run HTM. It’ll be faster, cheaper and cooler than any other method.

Regarding SIMD: I tried to get Intel’s ICC compiler to generate SIMD code, but it always generates scalar code even if a fixed number of cells and synapses is provided. I think manual vectorization is required.
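As an illustration of what such manual restructuring could look like, here is a sketch that trades the per-synapse index lookup for a dense, contiguous layout. Everything here is an assumption made for the example (the function name, the row-major `permanences` array, inputs stored as 0/1 bytes, a threshold greater than zero so that empty slots never count). It only removes the indirect load and the branch; whether a particular compiler actually vectorizes the loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense layout: permanences[c * input_size + i] holds the
// permanence of column c toward input bit i (0 where no synapse exists).
std::vector<std::uint32_t> compute_overlap_dense(
    const std::vector<std::uint8_t>& input,  // must contain only 0 or 1
    const std::vector<float>& permanences,
    std::size_t num_columns,
    float connected_threshold)               // assumed to be > 0
{
    const std::size_t input_size = input.size();
    assert(permanences.size() == num_columns * input_size);

    std::vector<std::uint32_t> overlap(num_columns, 0);
    for (std::size_t c = 0; c < num_columns; ++c) {
        const float* row = permanences.data() + c * input_size;
        std::uint32_t score = 0;
        // Contiguous loads, no indirection, no branch: a plain reduction.
        for (std::size_t i = 0; i < input_size; ++i) {
            score += static_cast<std::uint32_t>(row[i] >= connected_threshold) & input[i];
        }
        overlap[c] = score;
    }
    return overlap;
}
```

The obvious cost is memory: a dense row per column wastes space when each column only connects to a small part of the input.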

dmac · June 15, 2020, 3:15pm

This is not a useful suggestion, but on the topic of VLIW, have you seen this: https://millcomputing.com/ ?

vpuente · June 15, 2020, 3:38pm

Automatic vectorization is quite limited. ISPC is a high-level toolkit for using the Intel extensions without entering the compiler-intrinsics nightmare. (You might need to “rethink” the data structures to fully exploit it, though.)

VLIW will not work, I’m afraid. The system will presumably require a large amount of memory, so misses across the memory hierarchy will be frequent. Static scheduling is rather weak in that context.

marty1885 · June 16, 2020, 1:32am

Yes, I know that project. But it seems to be in the research stage and not yet available to outside parties.
