Using HTM for branch prediction

@Jose_Cueto and @Charles_Rosenbauer (as there are some overlapping ideas from both of you)

Until now I was just thinking of using the TM to capture the kinds of patterns a human could spot fairly quickly just by looking at the data. The idea of using more not-so-human-friendly data to make these predictions would require an SP and a different way to represent the output, right? More like a continuous output rather than a binary one.

Also, let me better explain the problem with using the data I was referring to: on a local predictor, nearby-instruction data would not be useful, since it does not change. So only register data could be useful (because even the cache would take too long to deliver anything), but by the time the predictor needs this data, the registers’ contents do not correspond to that specific instruction but to instructions further along in the pipeline, so the TM would probably have to spot an indirect pattern here (sorry, I know it is still confusing; I hope you understand it). However, a global predictor could perhaps make use of the instructions to get some insights. It would be interesting to test.

@Charles_Rosenbauer Thank you so much for the detailed post. Your comment has actually hurt my hopes of getting interesting results (or any at all), but the earlier I discover this, the better; I can even change the topic of my thesis if needed. I already knew most of what you wrote in the first paragraphs, but the final takeaways were really valuable.

So, I have some questions. I don’t expect you to answer them in great detail (as @Jose_Cueto said, it’s my job to investigate further; I just need a path =])

Q: Is it possible that a TM could replace a Two-Level Predictor table with less hardware and more or less the same results, while learning in fewer cycles than a perceptron?
Expected answer: No, a TM would require more hardware.

Q: Is it too difficult (or even impossible) to rewind a TM? In other words, would the solution for rewinding be to keep copies of all the model’s parameters? If so, hardware will definitely be a big problem.
Expected answer: there is a way, but it would be very difficult to implement (isn’t the default TM difficult enough already?).

Q: Since I expect ‘No’ for the first question, a TM would only make sense as a global predictor, right? One that could be reused, forgetting about instructions not seen for some time. My idea is to combine it with a 2-bit saturating counter to help the TM when it’s not so sure (or, from another perspective, use a TM to improve the 2-bit saturating counter when possible).

Q: (From above) The idea of using more not-so-human-friendly data to make these predictions would require an SP and a different way to represent the output, right? More like a continuous output rather than a binary one.

Q: How can Memory Prefetching be modeled as a temporal problem? (don’t waste too much time on this one haha) [EDIT: @vpuente has linked an interesting paper below. I think I can find the answer reading it beyond the abstract, which I’ll do later]

Thanks so much for everyone that is taking some time to guide me on this topic.

2 Likes

I would look at prefetchers. LSTM seems to work there ( https://arxiv.org/pdf/1803.02329.pdf ). HTM should be better at that. My experience is that HTM is reasonably good at predicting the PC, but the load-address stream can have a bigger impact on performance than BP. Be aware that conventional BPs are already pretty good.

PS: a “small” HTM can be done in HW. I think that an LSTM is impossible (at least from the perspective of the core).

1 Like

Depends on how long the history is. If it’s a small one, no. 8 history bits and 2048 entries I think are pretty common numbers for Two-Level Predictors in modern, high-end processors. That’s 128kB of internal state, or ~6.3 million transistors, which is quite a bit to play with. If you have an HTM network with say, 256 neurons, 512 synapses per neuron, and 3 bits per synapse, that would require ~2.4 million transistors for the memory. Not sure about the logic, but you could probably make it work.
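For what it’s worth, here is a quick back-of-the-envelope check of those numbers in Python, assuming roughly 6 transistors per SRAM bit (a standard 6T cell); the HTM parameters (256 neurons, 512 synapses, 3 bits each) are the hypothetical ones from the paragraph above.

```python
# Rough sizing check; assumes ~6 transistors per bit of SRAM (6T cell).
SRAM_TRANSISTORS_PER_BIT = 6

# Two-level predictor: 2048 entries, 8 history bits -> 2^8 patterns each,
# with a 2-bit saturating counter per pattern.
two_level_bits = 2048 * (2 ** 8) * 2                 # 1,048,576 bits = 128 kB
two_level_transistors = two_level_bits * SRAM_TRANSISTORS_PER_BIT   # ~6.3 M

# Hypothetical HTM layer: 256 neurons, 512 synapses each, 3 bits per synapse.
htm_bits = 256 * 512 * 3                             # 393,216 bits
htm_transistors = htm_bits * SRAM_TRANSISTORS_PER_BIT               # ~2.4 M

print(two_level_bits / 8 / 1024, "kB,", two_level_transistors / 1e6, "M transistors")
print(htm_bits / 8 / 1024, "kB,", htm_transistors / 1e6, "M transistors")
```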

I’m not sure. I think what you could do here would be to only actually adjust synapse weights after a branch is retired (meaning it has been verified to be correct). The advantage of this would be that you’d only have to store which neurons were active during each timestep (a timestep probably being every point at which a new branch shows up). In that case, if your pipeline can handle, say, 15 branches in flight, you’d have to store all the neuron activity from the previous 15 or 16 branches. That’s not much; if you only have 256 neurons and you need 1 bit per neuron per timestep, that’s ~25k transistors for 16 timesteps. Practically nothing compared to the rest of the predictor.

The only potential problem is the impact of not learning until after a branch is confirmed to have been correct, though my intuition tells me this might be negligible. Especially because if it got the branch right, its model probably doesn’t need much adjusting. If it got the branch wrong, it’ll have to throw away all the subsequent branches anyway, so it doesn’t matter what it predicted for them.
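Here is a rough Python sketch of the deferred-learning bookkeeping described above. The `tm_learn` callback and the exact data passed around are hypothetical; the point is just how little state is needed per in-flight branch.

```python
from collections import deque

MAX_IN_FLIGHT = 16          # branches the pipeline can have in flight

# One activity record per in-flight branch; learning is deferred to retirement.
pending = deque(maxlen=MAX_IN_FLIGHT)

def on_branch_predicted(active_neurons, predicted_taken):
    # Remember which neurons fired when this prediction was made
    # (1 bit per neuron per timestep in hardware).
    pending.append((frozenset(active_neurons), predicted_taken))

def on_branch_retired(actual_taken, tm_learn):
    # Only adjust synapses once the branch outcome is known.
    active_neurons, predicted_taken = pending.popleft()
    correct = (predicted_taken == actual_taken)
    tm_learn(active_neurons, reinforce=correct)   # hypothetical learning hook
    if not correct:
        # A misprediction squashes the younger in-flight branches anyway.
        pending.clear()
```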

I haven’t heard of any CPUs that have done much in terms of global predictors before. Most CPUs will have a separate predictor for each core because switching to another program is a great way to scramble all the data in your predictor, slowing your program to a crawl.

That said, perhaps an HTM network with sufficient capacity could do a better job there. The main problem is latency. CPUs have multi-level caches for a reason. Grabbing a value from an L1 cache might take 3 cycles. L2 might take 15. L3 might take 50. RAM will take ~300. The problem is that the further away something is from a core, the higher the latency of the communication.

It’s likely that a global predictor on a modern CPU would take far too long to respond to be useful for branch prediction.

If you have some way of getting the data out of HTM, I don’t see any reason why you couldn’t just use normal HTM. If you can get reinforcement learning working on an HTM network, everything should be just fine.

Well, if you can get reinforcement learning working with HTM, you could train it to issue prefetches based on the current instruction pointer, recent successful prefetches, etc. Perhaps have it issue prefetches within a certain range of the most recent load; basically, just have it select the offset from the most recent load/prefetch. I’m not entirely certain what the best reward function would be, but it should probably account both for cases where the prefetcher fails to grab the right data and for cases where it grabs data that isn’t needed. Both are situations that are good to avoid.
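As a concrete (and very hand-wavy) illustration of that reward shaping, something like the following could be a starting point; the weights are arbitrary and not from any paper.

```python
def prefetch_reward(was_used, caused_pollution_miss):
    """Toy reward for an RL-driven prefetcher (weights are arbitrary)."""
    reward = 1.0 if was_used else -0.5   # useful prefetch vs. wasted bandwidth/space
    if caused_pollution_miss:
        reward -= 1.0                    # evicted data that was needed later
    return reward
```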

The impact of a prefetcher would probably be quite a bit larger than that of a branch predictor, especially as we start scaling to larger numbers of cores; scaling core count is much easier than scaling memory bandwidth, so managing on-chip memory more intelligently is generally a good idea.

1 Like

OK, so we are working on a local BP; cheers for mentioning it. If I may ask, just to clarify things: given your ideas on how to use the TM (ignoring HW roadblocks for now), what is/are your input(s) to the TM? What’s on your mind so far? Could you please explain? The reason I’m asking is to evaluate again whether this input might be enough to exhibit moderately lasting sequences that a TM can predict.

You’ll need an SP if you need to classify your inputs into more robust representations and allow the TM to work on these inputs effectively.

1 Like

@Charles_Rosenbauer thanks a lot, especially for the notion of the necessary hardware and the tips for investigating Memory Prefetching. I still don’t know what you mean when you talk about using reinforcement learning to solve some problems, but that’s because I haven’t done the necessary reading.

I’ll definitely look more into Memory Prefetching. If it’s a better suited task for an HTM system, I’m willing to switch my focus since it’s a related topic somehow.

@Jose_Cueto I’m sorry, various topics were raised throughout this thread, so the ideas I had in mind were iterated on without a proper explanation of each of them.
I’ll summarize the final forms of the ideas brought into the discussion so far and my conclusions about them.

Idea #1: multiple local predictors, one for each branch (up to a certain limit). Input: 1 or 0 (branch taken or not). The prediction should not be difficult to extract from the model. These HTM implementations would have to be very small and optimized in order to compete with a Two-Level Predictor. Conclusion: too much hardware would be needed.

Idea #2: a single global predictor for all branches. Input: instruction addresses. The prediction would not be as easy to extract. It would allow for a bigger HTM implementation, but maybe not big enough. Conclusion: multithreading would ruin the patterns unless the predictor keeps track of each thread and deals with them separately, which would in practice probably result in multiple predictors, each predicting for one thread. Again, probably too much hardware.

Idea #3: one single predictor working with local data. Input: instruction addresses and local branching history. The patterns would not be trivial to spot, and the prediction would also be harder to extract. No conclusions.

Idea #4: one single predictor working with global data. Input: instruction addresses, maybe local branching history, nearby instructions, register data, etc. Same characteristics as #3. No conclusions. (A sketch of a possible input encoding for ideas #3/#4 follows below.)
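To make the inputs for ideas #3/#4 a bit more concrete, here is one hypothetical way the branch address and local history could be turned into a sparse set of active input bits. The hashing scheme and all sizes are illustrative guesses, not something from the thread or from the HTM literature.

```python
import hashlib

SDR_SIZE = 1024                         # input bits of a hypothetical TM
HISTORY_LEN = 8                         # recent local outcomes to include
PC_SPACE = SDR_SIZE - 2 * HISTORY_LEN   # bits reserved for the hashed address
PC_ACTIVE_BITS = 20                     # active bits contributed by the address

def encode_input(branch_pc, local_history):
    """Return the set of active input bits for one prediction.

    The branch address is spread over the input space via a few hashed
    buckets; each recent outcome gets a dedicated pair of bits
    (not taken / taken).
    """
    active = set()
    for i in range(PC_ACTIVE_BITS):
        digest = hashlib.blake2b(f"{branch_pc}:{i}".encode(), digest_size=4).digest()
        active.add(int.from_bytes(digest, "little") % PC_SPACE)
    for i, taken in enumerate(local_history[-HISTORY_LEN:]):
        active.add(PC_SPACE + 2 * i + int(taken))
    return active

# Example: encode the branch at 0x400a3c with the last outcomes 1, 1, 0, 1.
bits = encode_input(0x400A3C, [1, 1, 0, 1])
```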

I think part of what makes HTM difficult to use and hard to adopt is the fact that it’s hard to get useful information out of it. An HTM network is forming its own representation of your data, and doesn’t care about making that representation easy for you to interpret. Which neuron will eventually learn which patterns is pretty much random.

So you might ask yourself; if neural activity is more or less random (in terms of which neurons map to which patterns), how does the brain perform complex actions? How does it coordinate neurons to fire in specific patterns?

The solution is reinforcement learning. This is something that Numenta seems to be putting off researching (probably because you can run into roadblocks when doing certain things that require all the sensory-motor inference stuff they’ve been working on over the past few years). It’s quite clear that the part of the brain performing this is most likely the Basal Ganglia, and I happen to have written a pretty lengthy set of posts a while back about how well-studied circuitry in the Basal Ganglia likely interacts with HTM to create this.

Reinforcement Learning, put simply, is training a machine learning algorithm to solve a task by giving it a “reward / punishment” signal. If it’s doing a good job, you give it a “reward” signal that tells it to keep doing what it’s doing. If it’s doing a bad job, you give it a “punishment” signal that makes it less likely to respond that way in that situation in the future.

For example, if you’re trying to get HTM to learn to predict branches, you need some population of neurons that fires when you expect the branch to be taken and another that fires when you expect it not to be taken. There may also be a group of neurons that doesn’t impact this at all (this happens in the brain and likely keeps track of patterns that don’t have a strict impact either way, but are still necessary for good predictions). What you need to do is introduce bias signals, likely through targeted inhibition, to get the outputs you want. Then, to make a decision, you just count the number of neurons in each group that are firing. If there are more “taken” neurons firing than “not taken” neurons, you predict that the branch will be taken, and vice versa.
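A minimal sketch of that population vote, assuming the “taken” / “not taken” group memberships are already known (in the RL setup they would be shaped by the reward-driven bias/inhibition signals):

```python
def predict_branch(active_neurons, taken_group, not_taken_group):
    """Compare how many 'taken' vs. 'not taken' neurons are firing.

    Neurons outside both groups (the context population mentioned above)
    simply don't vote.
    """
    taken_votes = len(active_neurons & taken_group)
    not_taken_votes = len(active_neurons & not_taken_group)
    return taken_votes > not_taken_votes   # True -> predict taken
```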

The way I suggest it in that post would require a second HTM layer (a hierarchy, I guess), for modelling a Striatum. This would probably be necessary to get prefetching working, but branch prediction could probably be done with just a Striatum layer (which is pretty similar, except that you have neurons segregated into 3 groups that respond to reward signals differently).

1 Like

Oh, I didn’t know getting the prediction out of the network was such a problem. If it’s an important issue by itself, it would make more sense to work on that alone, without an additional layer on top of it (branch prediction or memory prefetching). What are your thoughts? Could a CS student work on it in less than 10 months, even though it’s not based on biology, which makes things much harder?

I’ll read your posts on reinforcement learning with HTM. Thanks again.

Like I said, you could probably do without a second layer for branch prediction. Prefetching is a different story; the cost of a cache miss makes it a much bigger deal in most cases. It’s mostly just that prefetching requires a lot more than a binary (taken / not taken) output. It’s also a bit harder because you have to prefetch data long enough before the CPU requests it that it’s there in time, but not so far in advance that it takes up space in the cache that could be used for more important data.

Not sure, maybe. I’m really bad with predicting project timelines.

Yeah, no problem. Those posts are quite old, and I refined the details a lot through the discussion with other users there. If anything there is confusing, feel free to ask for clarification.

1 Like

I see it. I was thinking about workarounds like the following:

Imagine an HTM system has learned the behaviour of, say, energy consumption, so it can do anomaly detection. We could then get its best prediction for the next timestep by guessing different values and seeing which of them is least surprising to the network. The process could be repeated with values around the best guesses until a desired precision is reached.

Did I misunderstand anything? Is it a silly suggestion? If it’s not, I’m pretty sure many people have already thought about it. Sure, it’s also a very expensive process.
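In code, that search-based workaround might look roughly like the following, assuming a hypothetical `model.anomaly_score(value)` hook that returns the network’s surprise for a candidate input without advancing its state:

```python
def predict_by_search(model, candidates, refine_steps=2):
    """Pick the candidate the network finds least surprising, then refine
    the search around it (the expensive brute-force idea described above)."""
    best = min(candidates, key=model.anomaly_score)
    step = (max(candidates) - min(candidates)) / max(len(candidates), 1)
    for _ in range(refine_steps):
        step /= 2.0
        nearby = [best - step, best, best + step]
        best = min(nearby, key=model.anomaly_score)
    return best
```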

No problem = )

Thanks.

With branch prediction you could probably just provide two results (taken, not taken) and have the hardware check both in parallel. That would probably be pretty cheap, and a lot of circuitry could likely be shared if designed well. It would probably work pretty well.

With prefetching, the outputs you’d need would likely be much more complicated. You’d need to predict exactly which addresses to prefetch ~300 cycles in advance. You’d also need to have many in flight. I don’t see how that would work well.

Not to mention, how would you test the values? You’d have a large range of possible locations in memory that could be loaded in. Then you’d need to be able to either check them all in parallel (expensive!) or test them all sequentially (slow!).

I’m still not sure what the best set of commands that an HTM agent could issue for prefetching would be. I’ll think about it a bit more later.

1 Like

No worries at all, and thanks for your summary; it’s useful information for all posters.

Note: at the end of the day you will have to experiment with the TM somehow, even if only at an emulator level; this will give you more insight into how the TM can predict patterns in your specific problem. My suggestions below are an oversimplification, just for evaluation purposes, without going into details that need testing anyway.

It is possible for the TM to provide a solution if the patterns are guaranteed to repeat most of the time (see @Paul_Lamb’s examples above). Otherwise, the TM will be useless.

So, assuming the patterns are guaranteed to repeat, the possible sequences in this case would be 0->0, 0->1, 1->1, and 1->0. You could probably create a synapse with a 2-bit permanence value per sequence (region) and increment/decrement them as the TM learns. That would mean at least 2^4 bits of space. Each local BP can be assigned one of these regions. You would then have common circuitry that can perform one epoch of the TM (predict, increment/decrement). I have not tried this, so be aware.
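In case it helps to picture it, here is a tiny software sketch of that per-branch scheme: one 2-bit saturating “permanence” per transition, incremented for the observed transition and decremented for its alternative. It is untested, as noted above, and the details (tie-breaking, initial values) are arbitrary choices of mine.

```python
class TinyTransitionPredictor:
    """One 2-bit saturating permanence per transition (prev -> next outcome)."""

    def __init__(self):
        # Permanences in 0..3, starting in the middle; prev holds the last outcome.
        self.perm = {(p, n): 2 for p in (0, 1) for n in (0, 1)}
        self.prev = 0

    def predict(self):
        # Predict the successor of the previous outcome with the stronger
        # permanence (ties go to 'taken').
        return 1 if self.perm[(self.prev, 1)] >= self.perm[(self.prev, 0)] else 0

    def update(self, outcome):
        # Strengthen the observed transition, weaken the alternative.
        other = 1 - outcome
        self.perm[(self.prev, outcome)] = min(3, self.perm[(self.prev, outcome)] + 1)
        self.perm[(self.prev, other)] = max(0, self.perm[(self.prev, other)] - 1)
        self.prev = outcome
```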

I do not know your HW requirements, but using instructions/data as part of the input will blow up the size requirements; however, some existing techniques may be used to conserve space. It really depends on your requirements.

1 Like

@Charles_Rosenbauer right, it’s for sure not an option for memory prefetching, but I’m glad to know it can work in some other domains.

Memory prefetching looks very hard, actually.

@Jose_Cueto

=)

It depends on the code and data, but the idea is to take some advantage when a pattern does exist. Common patterns are actually [1111…] or [0000…], so I don’t expect any huge gains here. Now I’m just hoping for not-too-bad results haha.

Yeah, I guess it’s really time to test things out.

1 Like

The TM transitions to a new state every epoch. When it receives an input (e.g. 1 or 0) it strengthens/weakens its synapses and transitions to the next state. I’m using CS terms here to simplify things; the state is basically a representation of the TM’s synapses. Depending on how the TM is wired and how the inputs are encoded, there are at least two possible side effects that can happen when an input is received. One is that a previously learned pattern can be unlearned after a sequence of mutually exclusive inputs (this is what @marty1885 meant about cancelling, I believe); the second is that a new pattern can be learned without unlearning the previously learned patterns after a sequence of mutually inclusive inputs (this can be temporary). These properties are important to know so that one can anticipate, at some level, how the TM may perform given a particular set of inputs.

2 Likes

These seem like important properties; thanks for explaining them better. I don’t know how much of an impact they will have, since I also don’t know the inputs very well, but it will be important to be aware of these issues when running the tests.

Alright, after a long time I’m here to announce that I did it. My graduation thesis had the following title: “Investigating Hierarchical Temporal Memory networks applied to dynamic branch prediction”. I passed with a good grade, but I’m not really sure how well I did, since I think I need feedback from people who know HTM deeply, and none of the professors on my board of examiners knew anything about HTM prior to my work.

I’ll say upfront that the results themselves were not very good, but I’d like to know how well I did at explaining the reasons for that and how relevant the work is to the HTM community. Here’s the link if anyone is willing to take a look: https://drive.google.com/file/d/1HGO25YvkP3UQ8jV_MEoUFowZBRyphMSe/view

I know there’s too much to read, so I’ll suggest some main topics:
1 INTRODUCTION
2.1.3 Types of branch instructions and how to predict them
2.1.4 Challenges in the design of dynamic branch predictors
2.1.5 Main dynamic branch prediction techniques, except after 2.1.5.3 Combining global and local branching histories maybe
2.1.6 Tolerating prediction latency and making multiple predictions per cycle
3 GENERAL OPERATION OF AN HTM BRANCH PREDICTOR
4 RESEARCH METHODS
5 RESULTS
6 DISCUSSION

It’s still too much, but yeah auehuaeh

I’d love to get your feedback.

@marty1885 I’m sorry for only updating you this late.

8 Likes

Really nice!

Perhaps, encoding might be improved. Take a look at Fig 3 in

Besides being HW friendly (it doesn’t need tables), it might be more flexible.

2 Likes

Thank you!

First of all, thanks for the suggestion. I have not implemented this encoder, but I thought about it a bit and I don’t think it would be possible to perform this encoding process as fast as I would need for this application. Computing the pseudo-random generator on the fly, several times for each SDR, seems very hard to optimize for. Please correct me if I’m mistaken.

I wasn’t even thinking about a table, but about a hardwired solution. Since the synapses in the network are much more expensive to implement than the encoding process, I didn’t think it would be a problem to use this method.

A HW random generator can be really simple (e.g. using XOR gates) [although not very good]. In any case, the TM will take you >100 clock cycles. Since the CLA algorithm is easily “pipelineable”, the encoding should not be an issue.
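For reference, the kind of XOR-gate random generator being alluded to is a linear-feedback shift register; in software it is just a couple of lines (the taps below are one standard maximal-length choice for 16 bits):

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13, 11).

    In hardware this is a shift register plus a few XOR gates, i.e. a very
    cheap (if statistically weak) pseudo-random source.
    """
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

# Example: a short pseudo-random bit stream from a non-zero seed.
state, bits = 0xACE1, []
for _ in range(8):
    state = lfsr16_step(state)
    bits.append(state & 1)
```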

The benefit of this is that it can increase the history of the PC substantially.

That might be impractical if you want some flexibility, for example, if you need static partitioning (let’s say because you are using SMT). Additionally, wiring is not free. It might require less area, but it will certainly have a negative impact on power.

In any case, the work is really nice.

1 Like

Oh, are you sure there is no way to make it faster? Even if it’s “pipelineable”, I don’t think it would ever make sense to consider using a TM for branch prediction if it takes that long to run.

That’s true, but according to my tests, the use of more history does not provide that much benefit = /

I’m sorry, I’m not familiar with this.

Yeah, it would require some really nice improvements in materials in order to work well.

Thank you very much! I’m glad you liked it.

I don’t think so. With 1024 mini-columns there is a lot of work to do each cycle. If you have a large area and power budget, it could be improved, but it will hardly be faster than that. In any case, the LSTM paper presents this as a “theoretical” exercise… my understanding is that your assumptions go in the same direction. [And LSTM inference will be millions of clock cycles per “prediction” :slight_smile: and with offline training.]

The only way to make it faster is to use smaller systems (a few tens of mini-columns) and have a hierarchy :wink:

SMT stands for Simultaneous Multithreading. Some processors, such as Intel’s, statically partition the per-thread branch predictor tables (and the history register) when you enable SMT (Intel calls it Hyper-Threading).

2 Likes