Using HTM for branch prediction


#21

I don’t know if operand values would help or if they would make some patterns even more difficult to learn.

Also, the TM would have to make a prediction looking at data from several instructions behind the branch, since that’s the only data it would have access to in the moment it needs to make the prediction. The hardware would also get a lot more expensive, as you mentioned.

I’ll think longer for what kind of context could be used.


#22

I believe you have to use a spatial pooler for this to recognize patterns from the operand/opcode combination and generate more consistent input patterns for the TM.

If you have the knowledge about the task of manually reverse-engineering executable code for some purpose such as cracking a license key or password for example, then the involved inference process of this task is actually similar to what I’ve suggested. The whole idea is to provide more context in the input and increase the chance of repeating patterns. I’m just thinking out-of-the-box as IMO using HTM to solve this problem is also an out-of-the-box type of thinking in the first place. I’m even curious if given the idea I’ve mentioned, instead of predicting if the branch will be taken or not, why not predict the next address jump and from this branching can be inferred quite easily? Again I’m thinking out-of-the-box and just throwing ideas here.

The TM already does this regardless of inputs. The way I look at the TM is like a classical finite-state automaton that reconfigures its transition function in real-time. The reconfiguration is an emergent behavior due to its learning mechanism based on HTM theory. What it learns is the transition function of another automaton that is capable of classifying an input spatially (without human meanings). In a full HTM system’s case (with SP) the learned automaton is the spatial pooler (SP), SP is engineered to be robust in representing patterns spatially (read more on SP and SDRs). In the 1’s and 0’s case, the SP is absent (at least in this discussion) and that the branching history table is the input automaton or at least represents the transition function. State 1 = 1, State 2 = 0, Input = Latest sequence of branching decisions (1’s and 0’s with width W). What if this input automaton is lacking states to represent the branching decisions at a specific point in time? In this case the TM will keep on forgetting and learning repeatedly and would be useless. On the idea of using operand/opcodes, there will potentially be more states that can represent the next address/instruction and thus branching can possibly be inferred. On the HW issue, given the same idea, there will be a trade-off between space and size of inputs, more space is needed to store the opcode/operand (depending on the encoding), but it could be that the size of the TM input will be smaller due to better representation of states.


#23

Hey! I know a fair bit about HTM, and know a fair bit about microarchitecture! I’ll throw in my two cents here! Feel free to ask me any questions as well!

First off, it seems like most people here don’t have a great idea of how modern branch predictors work, and what actually makes them work as well as they do.

These predictors actually don’t just consist of one predictor. They’re actually a large table of thousands of predictors normally. Every branch gets mapped to a hash table entry based on its memory address, which then learns patterns for that specific branch (and any others that get mapped to the same entry).

A hash table of 2-bit saturating counters (literally working based on “which way has this branch gone the past couple times it appeared?”), can actually get up to about 93% accuracy on most code.

There are more sophisticated systems as well. For example, most Intel CPUs from the past decade or so have used Two-Level Predictors. Instead of the address mapping to a single predictor in a 1D table, it maps to a row of predictors in a 2D table, and the path taken from the previous N branches (the history) is used to select exactly which predictor from that row (which column).

The reason why AMD is using neural networks now is because the whole problem here is trying to get as much predictive power out of as little circuitry as possible. The problem with Two-Level Predictors is that the size of each row in the table scales exponentially with the length of the history (for N history bits, you need 2^N predictors per row). Meanwhile, if you throw a perceptron at the problem, you can get similar predictive power while requiring only linear scaling (one weight per history bit). It’s more expensive to implement per predictor, but it brings things back to linear rather than exponential scaling.

It’s also important to remember that these aren’t very sophisticated neural networks. Basically they’re just “take the last N bits, multiply each bit by a corresponding weight, sum them up, and check if they’re positive or negative. Positive = branch taken, negative = branch not taken. Adjust weights accordingly based on the actual result.” No deep neural networks, no fancy backprop, just a single-layer perceptron. Just a weighted sum of the last N branches taken.


It’s also important to bring up some other aspects here. There are many cases where branches are unpredictable. Say we have the following code:

array arr;
for(int i = 0; i < length(arr); i++){
    if(arr < n){
        //do something
    }else{
        //do something else
    }
}

If the values of arr are randomly distributed around n, then the branch predictor can’t do anything to learn this because the pattern to be learned isn’t in the code, it’s in the data. Any benefit you gain here is basically from the branch predictor equivalent of overfitting, and even then it’ll only actually work if arr is short and frequently iterated over.


So it’s not really a matter of learning that “the code goes 0011001100110011 through branches”. It’s about learning complex patterns of how each branch in the code is likely to influence each other branch. And, doing all of that with as few transistors as possible.

If you wanted to use HTM for branch prediction, here’s how it would have to work:

  • The HTM network should be quite compact; no wasted transistors! You can’t throw a million-neuron network with 10k synapses per neuron at this problem. You might only be able to manage a few hundred neurons each with a couple hundred synapses before this thing gets more expensive than even the largest traditional predictors. If anyone here is willing to learn some Verilog or VHDL, you can try making one.

  • Encoding the history bits isn’t enough. You need to be able to give the predictor hints about exactly where in the code the current branch is. Some other hints, such as a few bits from registers and what kinds of instructions are around might be useful too, though branch predictors never take those into account as far as I’m aware.

  • You’ll need to have some way to rewind the network to a previous state in case of a mispredict. Modern CPUs are quite deeply pipelined, so it might not know that a branch was mispredicted until after it’s already predicted the 5-10 branches that come afterward. Not to mention, the whole reason for making more accurate predictors is to make longer pipelines more feasible.

  • You’ll have to get useful data out of the network as well regarding exactly which way to go, and do this on a tight transistor budget. This might be a case where reinforcement learning has to come in.

I’m not confident that HTM would actually provide a large competitive advantage here, but hey, if anyone wants to learn some Verilog/VHDL and embed it in a RISC-V sim or something, I’d love to see it. If it is useful, it’ll probably be from making it easier to incorporate other information from around the CPU into making better decisions.


Then of course, all of this is completely ignoring Branch Target Buffers, where HTM would likely provide much less of an advantage, and Memory Prefetching, where HTM would likely actually do a better job (if done right), and would make a far bigger impact than a better branch predictor. After all, a branch mispredict costs 15-20 cycles. A cache miss costs about 300.


Deep Predictive Learning: A Comprehensive Model of Three Visual Streams
#24

Maybe but I disagree, I think most people just don’t have the time to define the problem space (esp. if its for a thesis or homework) as it is out of this forum’s scope. IOW it is the OP’s homework to define the problem space and present it and iterate on it as needed. In this case you have done a lot on this part of the OP’s homework. The OP should thank you for it.


#25

@Jose_Cueto and @Charles_Rosenbauer (as there are some overlapping ideas from both of you)

Until now I was just thinking to use the TM in order to get those patterns that could be spotted by a human just by looking at it in not much time. The ideia of using more not-so-human-friendly data to make these predictions would require a SP and a different way to represent the output, right? More like a continuous output, rather than a binary one.

Also, I’ll better explain the problem with using these data I was refering to: on a local predictor, nearby instructions data would not be useful, since they do not change. So only registors data could be useful (because even cache would delay too much to deliver anything), but by the time the predictor needs this data, the content on them do not refer to that specific instruction, but to some instructions before (the ones further in the pipeline), so the TM would probably have to see a indirect pattern here (sorry, I think it still is confusing, hope you understand it). However, a global predictor could perhaps make use of the instructions to get some insights. It would be interesting to test.

@Charles_Rosenbauer Thank you so much for the detailed post. Your comment has actually hurt my hope on getting interesting results (or any at all), but the earlier I discover this, the better, so I can even change the topic of my thesis if needed. I knew most of what you wrote on the first paragraphs, but the last takeaways were really valuable.

So, I have some questions. I don’t expect you to answer them with so many details (as @Jose_Cueto said, it’s my job to make further investigations, I just need a path =])

Q: Is it possible that a TM can replace a Two-Level Predictor table with less hardware and more of less the same results, but learning in less cycles than a perceptron?
Expected answer: No, a TM would require more hardware.

Q: Is it too difficult (or even impossible) to rewind a TM? So, is the solution for rewinding be using copies of all the parameters of the model? If it is, hardware will be definely a great problem.
Expected answer: there is a way, but it would be very difficult to implement (isn’t default TM too difficult already?).

Q: Since I expect ‘No’ for the first question, a TM would only make sense as a global predictor, right? One that could be reused, forgetting about instructions not used in some time). My idea is to combine a 2-bit saturating counter to help the TM when it’s not so sure (or, from other perspective, use a TM to improve the 2-bit saturating counter when possible)

Q: (From above) The ideia of using more not-so-human-friendly data to make these predictions would require a SP and a different way to represent the output, right? More like a continuous output, rather than a binary one.

Q: How can Memory Prefetching be modeled as a temporal problem? (don’t waste too much time on this one haha) [EDIT: @vpuente has linked an interesting paper below. I think I can find the answer reading it beyond the abstract, which I’ll do later]

Thanks so much for everyone that is taking some time to guide me on this topic.


#26

I would look at prefetchers. LSTM seems to work there ( https://arxiv.org/pdf/1803.02329.pdf ). HTM should be better at that. My experience is that HTM is reasonably good predicting the PC, but load addresses stream can have a superior impact on performance than BP. Be aware that conventional BP are already pretty good.

PS: a “small” HTM can be done in HW. I think that an LSTM is impossible (at least from the perspective of the core).


#27

Depends on how long the history is. If it’s a small one, no. 8 history bits and 2048 entries I think are pretty common numbers for Two-Level Predictors in modern, high-end processors. That’s 128kB of internal state, or ~6.3 million transistors, which is quite a bit to play with. If you have an HTM network with say, 256 neurons, 512 synapses per neuron, and 3 bits per synapse, that would require ~2.4 million transistors for the memory. Not sure about the logic, but you could probably make it work.

I’m not sure. I think what you could do here would be to only actually adjust synapse weights after a branch is retired (meaning that it has been verified to have been correct). The advantage of this would be that you’d only have to store which neurons were active during each timestep (timesteps probably being every point when a new branch showed up). In that case, if your pipeline can handle say, 15 branches in-flight, you’d have to store all the neuron activity from the previous 15 or 16 branches. That’s not much; if you only have 256 neurons, and you need 1 bit per neuron per timestep, that’s ~25k transistors for 16 timesteps. Practically nothing compared to the rest of the predictor.

The only potential problem is the impact that it might have of not learning until after a branch is confirmed to have been correct, though my intuition tells me this might be negligible. Especially because if it got the branch correct, then it’s model probably doesn’t need too much adjusting. If it got the branch wrong, it’ll have to throw away all the subsequent branches anyway, so it doesn’t matter what it predicted for them.

I haven’t heard of any CPUs that have done much in terms of global predictors before. Most CPUs will have a separate predictor for each core because switching to another program is a great way to scramble all the data in your predictor, slowing your program to a crawl.

That said, perhaps an HTM network with sufficient capacity could do a better job there. The main problem is latency. CPUs have multi-level caches for a reason. Grabbing a value from an L1 cache might take 3 cycles. L2 might take 15. L3 might take 50. RAM will take ~300. The problem is that the further away something is from a core, the higher the latency of the communication.

It’s likely that a global predictor on a modern CPU would take way too long to get a response from to be useful for branch prediction.

If you have some way of getting the data out of HTM, I don’t see any reason why you couldn’t just use normal HTM. If you can get reinforcement learning working on an HTM network, everything should be just fine.

Well, if you can get reinforcement learning working with HTM, you could train it to issue prefetches based on the current instruction pointer, recent successful prefetches, etc. Have it so that it can issue prefetches based on a certain range from the most recent load perhaps. Basically, just have it select the offset from the most recent load/prefetch. Not entirely certain what the best reward function would be, but it should probably take into account for both cases where the prefetcher fails to grab the right data, and when it grabs data that isn’t needed. Both are situations that are good to avoid.

The impact of a prefetcher would probably be quite larger than that of a branch predictor, especially as we start scaling to larger numbers of cores; scaling core count is much more easier than scaling memory bandwidth, so managing on-chip memory more intelligently is generally a good idea.


#28

Ok so we are working on a local BP cheers for mentioning it. So if I may ask just to clarify things, given your ideas on how to use TM (ignoring HW roadblocks first), what is/are your input to the TM? What’s on your mind so far, could you please explain? The reason why I’m asking is to evaluate again if this input might be enough to show moderately lasting sequences which can be predictable by a TM.

You’ll need an SP if you need to classify your inputs into more robust representations and allow the TM to work on these inputs effectively.


#29

@Charles_Rosenbauer thanks a lot, specially on the notion of necessary hardware and tips for investigating Memory Prefetching. I still don’t know what you mean when you talk about using reinforcement learning to solve some problems, but that’s because I haven’t done the necessary reading.

I’ll definitely look more into Memory Prefetching. If it’s a better suited task for an HTM system, I’m willing to switch my focus since it’s a related topic somehow.

@Jose_Cueto I’m sorry, various topics were pointed out through this thread, so the ideas I had in mind were iterated over without a proper explanation for each of them.
I’ll make a summary of the final forms of the ideas brought into the discussion until now and my conclusions about them.

Idea #1: multiple local predictors, one for each branch (up to a a certain limit). Input: 1 or 0 (branch taken or not). The prediction should not be difficult to extract from the model. These HTM implementations should be very small and optimized in order to compete with a Two-Level Predictor. Conclusion: too much hardware would be needed.

Idea #2: a single global predictor for all branches. Input: instruction addresses. The prediction would not be as easy to extract. It would allow for a bigger HTM implementation, but maybe not enough. Conclusion: multithreading would ruin patterns unless the predictor keeps track of each thread and deal with them separately, which probably would result in multiple predictors actually, each of them predicting for one thread. Again, probably too much hardware.

Idea #3: one single predictor working with local data. Input: instruction addresses and local branching history. The patterns would not be trivial to spot, and the prediction would also be harder to extract. No conclusions.

Idea #4: one single predictor working with global data. Input: instruction addresses, maybe local branching history, nearby instructions, registers data, etc. Same caracteristics as #3. No conclusions.


#30

I think part of what makes HTM difficult to use and hard to adopt is the fact that it’s hard to get useful information out of it. An HTM network is forming its own representation of your data, and doesn’t care about making that representation easy for you to interpret. Which neuron will eventually learn which patterns is pretty much random.

So you might ask yourself; if neural activity is more or less random (in terms of which neurons map to which patterns), how does the brain perform complex actions? How does it coordinate neurons to fire in specific patterns?

The solution is reinforcement learning. This is something that Numenta seems to be putting off researching (probably because you can run into roadblocks when doing certain things that require all the sensory-motor inference stuff they’ve been working on over the past few years). It’s quite clear that the part of the brain performing this is most likely the Basal Ganglia, and I happen to have written a pretty lengthy set of posts a while back about how well-studied circuitry in the Basal Ganglia likely interacts with HTM to create this.

Reinforcment Learning, put simply, is training a machine learning algorithm to solve a task by giving it a “reward / punishment” signal. If it’s doing a good job, you give it a “reward” signal that tells it to keep doing what it’s doing. If it’s doing a bad job, you give it a “punishment” signal that makes it less likely to respond that way in that situation in the future.

For example, if you’re trying to get HTM to learn to predict branches, you need some population of neurons that fire when you expect the branch to be taken and another that fire when you expect it to not be taken. There may also be a group of neurons that don’t impact this at all (this happens in the brain and likely keeps track of patterns that don’t have a strict impact either way, but are still necessary for good predictions). What you need to do is introduce bias signals, likely through targeted inhibition, to get the outputs you want. Then to make a decision, you just count the number of neurons in each group that are firing. If there are more “taken” neurons firing than “not taken” neurons, then you predict that the branch will be taken, and vice-versa.

The way I suggest it in that post would require a second HTM layer (a hierarchy, I guess), for modelling a Striatum. This would probably be necessary to get prefetching working, but branch prediction could probably be done with just a Striatum layer (which is pretty similar, except that you have neurons segregated into 3 groups that respond to reward signals differently).


#31

oh, I didn’t know getting the prediction from the network was so much of a problem. If it’s an important issue by itself, it would make more sense to work without an additional layer on top of it (branch prediction or memory prefetching). What are your thoughts? Can a CS student work on it in less than 10 months? Even though it’s not based on biology, which makes things much harder.

I’ll read your posts on reinforcement learning with HTM. Thanks again.


#32

Like I said, you could probably do without a second layer for branch prediction. Prefetching is a different story; the cost of a cache miss makes it a much bigger deal in most cases. It’s mostly just that Prefetching requires a lot more than just a binary (taken/not taken) output. It also is a bit harder because you have to prefetch data long enough before the CPU requests it that it’s there in time, but not too far in advance that it takes up space in the cache that could be used for more important data.

Not sure, maybe. I’m really bad with predicting project timelines.

Yeah, no problem. Those posts are quite old, and I refined the details a lot through the discussion with other users there. If anything there is confusing, feel free to ask for clarification.


#33

I see it. I was thinking about workarounds like the following:

Imagine an HTM system has learned the behaviour for energy consumption. Then, it can do anomaly detection. So we can get its best prediction for the next timestep by guessing different values and seeing which of them is less of a surprise for the network. Then, the process can be repeated with values around the best bets until a desired precision.

Did I misunderstand anything? Is it a silly suggestion? If it’s not, I’m pretty sure many people have already thought about it. Sure, it’s also a very expensive process.

No problem = )

Thanks.


#34

With branch prediction you could probably just provide two results (taken, not taken), and have the hardware be able to check both in parallel. That would be probably be pretty cheap, and a lot of circuitry could likely be shared if designed well. It would probably work pretty well.

With prefetching, the outputs you’d need would likely be much more complicated. You’d need to predict exactly where potential prefetches could be ~300 cycles in advance. You’ll also need to have many going in-flight. I don’t see how that would work well.

Not to mention, how would you test the values? You’d have a large range of possible locations in memory that could be loaded in. Then you’d need to be able to either check them all in parallel (expensive!) or test them all sequentially (slow!).

I’m still not sure what the best set of commands that an HTM agent could issue for prefetching would be. I’ll think about it a bit more later.


#35

No worries at all and thanks for your summary it’s information for all posters.

Note: At the end of the day you will have to somehow try and experiment with the TM even at an emulator level only, this will give you more insights to how TM can predict patterns in your specific problem. My suggestions below are an oversimplification just to evaluate without going into so much details that need testing anyways.

It is possible for the TM to provide a solution if the patterns are guaranteed to repeat most of the time(see @Paul_Lamb’s examples above). Otherwise, TM will be useless.

So assuming it is guaranteed to repeat patterns then, in this case, the possible sequences would be 0->0, 0->1, 1->1, and 1->0. You could probably create a synapse with 2-bit permanence value per sequence (region) and increment/decrement them as the TM learns. So that would mean at least 2^4 bits space. Each local BP can be assigned with this region. You will then have a common circuitry that can perform one epoch of a TM (predict, increment/decrement). I have not tried this so be aware.

I do not know your HW requirements but using instructions/data as part of the input will blow up the size requirements, however, some techniques out there may be utilized to conserve space. It really depends on your requirements.


#36

@Charles_Rosenbauer right, it’s for sure not an option for memory prefetching, but I’m glad to know it can work on some other domains.

Memory prefetching looks very hard, actually.

@Jose_Cueto

=)

It depends on the code and data, but the ideia is to take some advantage when a pattern do exists. Common patterns are actually [1111…] or [0000…], so I don’t expect any huge gains here. Now I’m just expecting not too bad results haha.

Yeah, I guess it’s really time to test things out.


#37

The TM transitions to a state for every epoch. When it receives an input (e.g. 1 or 0) it either strengthens/weakens its synapses and transitions to the next state. I’m using CS terms here to simplify things, the state is basically a representation of the TM synapses. Depending on how the TM is wired, and the encoding of inputs, there are at least 2 possible side-effects that can happen when an input is received. One is that the previous pattern learned can be unlearned after a few sequence of mutually exclusive inputs(this is what @marty1885 meant about cancelling I believe), second is that a new pattern can be learned without unlearning the previous learned patterns after a few sequence of mutually inclusive inputs (this can be temporary only). These properties are important to know so that one may know how the TM may perform at some level give a particular set of inputs.


#38

These seem like important properties, thanks for explaining them better. I don’t know how much of an impact it can cause, since I also don’t know the inputs very well, but it will be important to be aware of these issues when doing the tests.