Using HTM for branch prediction

Thank you!

First of all, thanks for the suggestion. I have not implemented this encoder, but I thought about it a bit and I don’t think this encoding could be performed as fast as I would need for this application. Computing the pseudo-random generator on the fly, several times for each SDR, seems very hard to optimize. Please correct me if I’m mistaken.
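To make the cost concrete, here is a minimal sketch of the kind of encoder I mean: seed a generator with the (PC, history) pair and draw one pseudo-random index per active bit. The splitmix64-style hash and the sizes are placeholders of mine, not the actual proposal:

```c
#include <stdint.h>

#define SDR_BITS    1024   /* SDR width: placeholder             */
#define ACTIVE_BITS   20   /* number of active bits: placeholder */

/* splitmix64 finalizer, standing in for "the pseudo-random generator". */
static uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* One hash evaluation per active bit -- the per-SDR cost in question.
 * (Duplicate indices are possible and ignored here for simplicity.) */
static void encode(uint64_t pc, uint64_t history, uint16_t active[ACTIVE_BITS]) {
    uint64_t seed = mix64(pc ^ (history << 1));  /* seed from the input */
    for (uint64_t i = 0; i < ACTIVE_BITS; i++)
        active[i] = (uint16_t)(mix64(seed + i) % SDR_BITS);
}
```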

I was not even thinking about a table, but about a hardwired solution. Since the synapses in the network are much more expensive to implement than the encoding process, I didn’t think it would be a problem to use this method.

A HW random generator can be really simple (e.g. using XOR gates), although not a very good one. In any case, TM will take you more than 100 clock cycles. Since the CLA algorithm is easily pipelineable, the encoding should not be an issue.
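For example, a Fibonacci LFSR is just a shift register plus a few XOR gates per step; this is the standard 16-bit, maximal-length tap configuration (fine to illustrate the point, but a weak RNG):

```c
#include <stdint.h>

/* 16-bit Fibonacci LFSR, taps 16/14/13/11: one shift and three XORs per
 * step, giving a maximal-length (2^16 - 1) sequence. Cheap in hardware,
 * statistically weak. Seed must be non-zero. */
uint16_t lfsr_step(uint16_t s) {
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}
```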

The benefit of this is that it can substantially increase the amount of PC history that can be used.

That might be impractical if you want some flexibility, for example if you need static partitioning (say, because you are using SMT). Additionally, wiring is not free: it might require less area, but it will certainly have a negative impact on power.

In any case, the work is really nice.


Oh, are you sure there is no way to make it faster? Even if it’s pipelineable, I don’t think it would make sense to use TM for branch prediction if it takes that long to run.

That’s true, but according to my tests, using more history does not provide that much benefit =/

I’m sorry, I’m not familiar with this.

Yeah, it would require some really nice improvements in materials in order to work well.

Thank you very much! I’m glad you liked it.

I don’t think so. With 1024 mini-columns there is a lot of work to do each cycle. With a large area and power budget it could be improved, but it will hardly be faster than that. In any case, the LSTM paper presents this as a “theoretical” exercise… my understanding is that your assumptions point in the same direction. [And LSTM inference would be millions of clock cycles per “prediction” :slight_smile: and with offline training.]

The only way to make it faster is to use smaller systems (a few tens of mini-columns) and build a hierarchy :wink:

SMT stands for Simultaneous Multithreading. Some processors, such as Intel’s, statically partition the branch predictor tables (and the history register) per thread when you enable SMT (Intel calls it Hyper-Threading).
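In code terms, static partitioning just reserves some index bits for the hardware thread id, so each thread gets a private slice of the table. A hypothetical gshare-style sketch (the table size and hash are made up, not Intel’s actual design):

```c
#include <stdint.h>

#define TABLE_BITS 14   /* 16K 2-bit counters: illustrative size */
#define TID_BITS    1   /* 2-way SMT -> one thread-select bit    */

static uint8_t counters[1u << TABLE_BITS];   /* 2-bit saturating counters */

/* The top index bit is the hardware thread id, so each thread owns a
 * private half of the table; the low bits hash the PC with that
 * thread's own (also partitioned) global history. */
static uint32_t bp_index(uint32_t pc, uint32_t ghist, uint32_t tid) {
    uint32_t lo = (pc ^ ghist) & ((1u << (TABLE_BITS - TID_BITS)) - 1u);
    return (tid << (TABLE_BITS - TID_BITS)) | lo;
}

/* Predict taken when the counter's upper bit is set. */
static int bp_predict(uint32_t pc, uint32_t ghist, uint32_t tid) {
    return (counters[bp_index(pc, ghist, tid)] >> 1) & 1;
}
```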


My understanding of the topic was a little off, then hahaha

Yes, they were, but I didn’t expect it to be by that much hahaha

Right, I can see how hardwired solutions would prevent that.