That's a pretty good point.
However, it might be much less memory-intensive at the lowest level of computation than you'd think. CPU branch prediction works because the branches a program actually takes tend to be far more regular than random, so it's possible a network on the GPU could exploit the same kind of regularity, placing neurons near the ones that tend to activate together so the later stages of the memory hierarchy stay warm. Also, storing established connections and neuron activations as single bits would shrink the footprint while letting bitwise operations process many of them at once. A 4 MB last-level cache could hold the 1-bit neuron and connection activations for a surprisingly large TM (about sqrt(4*8*2^20) ≈ 5792 neurons, in the case of a densely connected network), which would work if learning was turned off.
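To make the bit-packing idea concrete, here's a minimal CPU-side sketch (my own illustration, not taken from any existing HTM codebase): a densely connected network of 5792 neurons with 1-bit synapses packed into 64-bit words, where each neuron's overlap is just a bitwise AND plus a popcount. The names and layout are hypothetical; the point is the footprint and the arithmetic.

```c
/* Sketch only: dense 1-bit connectivity for N neurons packed into N*N bits.
 * With N = 5792 the whole matrix is just over 4 MB (padding each row to
 * whole 64-bit words adds a little), which is where the
 * sqrt(4*8*2^20) ~= 5792 figure above comes from. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define N 5792                          /* neurons */
#define WORDS_PER_ROW ((N + 63) / 64)   /* 64-bit words per neuron's row */

/* Count how many of neuron `post`'s presynaptic connections are to currently
 * active neurons: AND the connection row with the activity bit-vector, then
 * popcount.  __builtin_popcountll is a GCC/Clang builtin. */
static int overlap(const uint64_t *row, const uint64_t *active)
{
    int count = 0;
    for (int w = 0; w < WORDS_PER_ROW; w++)
        count += __builtin_popcountll(row[w] & active[w]);
    return count;
}

int main(void)
{
    /* N rows of N 1-bit synapses, plus one activity bit per neuron. */
    uint64_t *connections = calloc((size_t)N * WORDS_PER_ROW, sizeof(uint64_t));
    uint64_t *active      = calloc(WORDS_PER_ROW, sizeof(uint64_t));
    if (!connections || !active) return 1;

    /* Toy data: neuron 0 connects to neurons 3 and 70, both currently active. */
    connections[0] |= (1ULL << 3);
    connections[1] |= (1ULL << (70 - 64));
    active[0]      |= (1ULL << 3);
    active[1]      |= (1ULL << (70 - 64));

    printf("synapse matrix: %zu bytes\n",
           (size_t)N * WORDS_PER_ROW * sizeof(uint64_t));
    printf("overlap of neuron 0: %d\n", overlap(connections, active));

    free(connections);
    free(active);
    return 0;
}
```

With learning off, nothing in that matrix ever changes, so the only traffic is reads, which is exactly the case where keeping it resident in the last-level cache pays off.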
I'll admit I'm speculating about hacking something together specifically for GPUs at this point, but I still think it's worth looking into. If OpenCL on a GPU doesn't pan out, I remember there was a company offering OpenCL support for a certain line of FPGAs, so that's another path. And now that you mention the Xeon Phi processors, those could be a third option worth trying.
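For what it's worth, the same overlap computation translates almost directly into an OpenCL C kernel, which is part of why GPU vs. FPGA vs. Xeon Phi feels more like a deployment decision than a redesign. This is only a sketch under the same assumptions as above; the host-side setup, buffer creation, and clBuildProgram boilerplate are omitted, and the kernel and argument names are my own placeholders.

```c
/* Hypothetical OpenCL C kernel: one work-item per postsynaptic neuron,
 * computing its overlap with the active bit-vector.  popcount() is an
 * OpenCL 1.2 built-in. */
#define WORDS_PER_ROW 91   /* ceil(5792 / 64) 64-bit words per neuron */

__kernel void tm_overlap(__global const ulong *connections, /* N x WORDS_PER_ROW */
                         __global const ulong *active,      /* WORDS_PER_ROW */
                         __global int *overlaps)            /* N results */
{
    const size_t post = get_global_id(0);
    __global const ulong *row = connections + post * WORDS_PER_ROW;

    int count = 0;
    for (int w = 0; w < WORDS_PER_ROW; w++)
        count += (int)popcount(row[w] & active[w]);
    overlaps[post] = count;
}
```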
Right now, though, I already have a GPU, so I might see about helping @Jonathan_Mackenzie with their library. Then I can check whether there are cases where a GPU offers a significant speedup, and why it does or doesn't. Thanks for the whitepaper; I'll probably read through a good bit of it.