HTM on VLIW will (possibly) be slow

Just sharing in case someone is looking into the same thing.

I mentioned in my project report, Hierarchical Temporal Memory Agent in standard Reinforcement Learning Environment, that running HTM on a VLIW processor may be a good idea, because the instructions can be scheduled into consistent bundles of 3~4 instructions each. I checked this by writing the assembly by hand.

Yesterday I found out that Compiler Explorer supports Kalray’s VLIW processor, so I put the SP overlap algorithm into it and tried it. The result is disappointing. Their compiler emits code that effectively turns the VLIW into a single-issue in-order processor (a lot of single-instruction bundles). I’m not sure whether this is caused by their VLIW architecture or by the compiler not being able to find an optimal schedule. In any case, HTM on their processor with the current compiler will be quite slow. I still hope HTM on VLIW can be fast.

Source code: https://godbolt.org/z/owBBpr
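For readers who don't want to open the link, here is a minimal sketch of what an SP overlap kernel typically looks like. It is not the code behind the link above: the `Synapses` layout, the name `compute_overlap`, and the fixed synapse count per column are assumptions made for illustration. For each column it counts the connected synapses (permanence at or above a threshold) whose input bit is active.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: every column owns the same number of potential
// synapses, stored contiguously column after column.
struct Synapses {
    std::size_t synapses_per_column;       // fixed count per column (assumption)
    std::vector<std::uint32_t> input_idx;  // input bit each synapse reads
    std::vector<float> permanence;         // synapse permanence in [0, 1]
};

// Overlap score of each column: number of connected synapses
// (permanence >= connected_threshold) whose input bit is active.
std::vector<std::uint32_t> compute_overlap(const Synapses& syn,
                                           const std::vector<std::uint8_t>& input,
                                           float connected_threshold)
{
    assert(syn.synapses_per_column > 0);
    assert(syn.input_idx.size() == syn.permanence.size());
    const std::size_t columns = syn.input_idx.size() / syn.synapses_per_column;

    std::vector<std::uint32_t> overlap(columns, 0);
    for (std::size_t c = 0; c < columns; ++c) {
        std::uint32_t score = 0;
        for (std::size_t s = 0; s < syn.synapses_per_column; ++s) {
            const std::size_t k = c * syn.synapses_per_column + s;
            assert(syn.input_idx[k] < input.size());
            // Branch-free accumulate: both comparisons evaluate to 0 or 1.
            const std::uint32_t active = input[syn.input_idx[k]] != 0;
            const std::uint32_t connected = syn.permanence[k] >= connected_threshold;
            score += active & connected;
        }
        overlap[c] = score;
    }
    return overlap;
}
```

The inner loop is a gather (`input[syn.input_idx[k]]`) followed by a reduction into `score`. Both the indirect load and the serial accumulation are the kind of dependency a static scheduler has to work around, which may be part of why the bundles come out so thin.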


RISC-V may be a better choice because it is quite flexible!


That’s totally true. But flexibility has its cost, mainly power and chip area. It would be great if we could live without it; then computing HTM would be a lot faster.

Agreed, but any undesired increase in power and chip area can be overcome by using a smaller technology node.

I think SPMD ISA extensions are the weapon of choice here. Do you know https://ispc.github.io/ ?

Dependencies (as usual) will make VLIW quite hard to schedule statically. In contrast, a current OoO processor with state-of-the-art SPMD on AVX-512 (which is a beast) can fly. If you combine that with SMT,…


Yes, I agree SIMD is the way to go. But it would be great if VLIW did the trick (see the image, some hypothetical code I wrote a while ago). VLIW is a lot less power hungry than OoOE and uses less chip area than SIMD. If VLIW works, then we could build a cluster of DSPs to run HTM. It would be faster, cheaper and cooler than any other method.

[image: hypothetical hand-scheduled VLIW code]

Regarding SIMD: I tried to get Intel’s ICC compiler to generate SIMD code, but it always generates scalar code, even when a fixed number of cells and synapses is provided. I think manual vectorization is required.


This is not a useful suggestion but on the topic of VLIW, have you seen this: https://millcomputing.com/

Automatic vectorization is quite limited. ISPC is a high-level toolkit for using Intel’s vector extensions without descending into the compiler-intrinsics nightmare. (You might need to “rethink” the data structures to fully exploit it, though.)
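To make the “rethink the data structures” point concrete, here is one possible sketch in plain C++ (not ISPC, and not anyone’s actual implementation in this thread). It assumes the input SDR and each column’s connected synapses are packed into 64-bit words, so the overlap becomes a bitwise AND followed by a population count. The names `overlap_bits` and `compute_overlap_packed` and the mask layout are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Overlap for one column: popcount(connected_mask AND input), word by word.
std::uint32_t overlap_bits(const std::uint64_t* mask,
                           const std::uint64_t* input,
                           std::size_t words)
{
    std::uint32_t score = 0;
    for (std::size_t w = 0; w < words; ++w) {
        // std::bitset::count is portable; compilers commonly lower it to a
        // hardware popcount instruction when the target has one.
        score += static_cast<std::uint32_t>(
            std::bitset<64>(mask[w] & input[w]).count());
    }
    return score;
}

// masks holds one connected-synapse bit mask per column, each input.size()
// words long, stored back to back (hypothetical layout).
std::vector<std::uint32_t> compute_overlap_packed(
    const std::vector<std::uint64_t>& masks,
    const std::vector<std::uint64_t>& input)
{
    assert(!input.empty());
    assert(masks.size() % input.size() == 0);
    const std::size_t words = input.size();
    const std::size_t columns = masks.size() / words;

    std::vector<std::uint32_t> overlap(columns, 0);
    for (std::size_t c = 0; c < columns; ++c)
        overlap[c] = overlap_bits(&masks[c * words], input.data(), words);
    return overlap;
}
```

The trade-off is that the connected mask has to be rebuilt whenever permanences cross the threshold during learning, and a dense mask wastes memory when each column only sees a small part of the input. In exchange, the inner loop has no indirect loads, which tends to suit both vector units and static schedulers better.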

VLIW will not work, I’m afraid. The system will presumably require a large quantity of memory, so misses across the memory hierarchy will be frequent. Static scheduling is rather weak in that context, because the compiler cannot know at compile time which loads will miss.


Yes, I know that project. But it still seems to be in the research stage and is not yet available to outside parties.