I want to run HTM on a GPU. Are there any packages available that support GPUs?
There’s no official library. But the community is working on a few implementations, which you can help out with:
I’m working on an OpenCL version: https://GitHub.com/JonnoFTW/htm-cl
There’s also a TensorFlow version:
Both of these are still a way off being finished though.
Actually, I don’t think the TensorFlow version uses the GPU. The TensorFlow op examples show separate CUDA files that include work group information, but the HTM TensorFlow code doesn’t seem to have any of that.
You might have the only GPU HTM code right now.
@SimLeek thanks for the clarification, I’m not familiar with TF. Currently my version is a long way from replicating the full functionality; implementing the major algorithms is the current goal:
- Spatial pooler with global inhibition
- Temporal memory
- SDR and CLA Classifiers
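For readers unfamiliar with the first item, here is a minimal pure-Python sketch of the overlap and global inhibition steps of a spatial pooler. It is not code from htm-cl or NuPIC; the function names, the sizes, and the tie-breaking rule (lower column index wins) are assumptions made for illustration, and boosting, stimulus thresholds, and learning are left out.

```python
import random

def compute_overlaps(connected, active_inputs):
    """Overlap of a column = number of its connected inputs that are active."""
    return [len(conn & active_inputs) for conn in connected]

def global_inhibition(overlaps, num_active):
    """Keep the num_active columns with the highest overlap across the whole layer.

    Ties are broken by column index (an assumption of this sketch).
    """
    ranked = sorted(range(len(overlaps)), key=lambda c: (-overlaps[c], c))
    return sorted(ranked[:num_active])

# Hypothetical toy sizes, far smaller than a real 2048-column region.
rng = random.Random(42)
n_inputs, n_columns = 64, 16
connected = [set(rng.sample(range(n_inputs), 8)) for _ in range(n_columns)]
active_inputs = set(rng.sample(range(n_inputs), 12))

overlaps = compute_overlaps(connected, active_inputs)
winners = global_inhibition(overlaps, num_active=4)
print("overlaps:", overlaps)
print("active columns:", winners)
```

The overlap step is independent per column, which is the part that maps most naturally onto one work-item per column; the global ranking is the part that needs a reduction or sort across the whole layer.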
You’re very welcome.
The CLA is not very friendly to DLP (data-level parallelism) and does not require compute-intensive ops (such as multiplications). DNNs have both properties, which is why GPUs make sense there.
IMHO, GPUs will be less efficient and slower than conventional CPUs when running the CLA.
What about when using localization?
I can see DLP not being there for networks not using local inhibition, learning, etc., but it should work fine for both networks with localization and a lot of small networks. I don’t know whether networks with localization would have intensive ops such as lots of multiplications though, unless localization is calculated realistically instead of with squares/cubes.
I think GPUs will be much more efficient in cases using localization or multiple networks, and those are useful for vision and many other spatial problems. I’m interested in some of those problems, so I’d be interested in using a GPU for nupic eventually.
You might be right but…
I think that the problem is not the lack of spatial locality (… and distal activity should still be global). In fact, most GPUs don’t use on-chip networks capable of taking advantage of it. In any case, the interconnection network is probably a small fraction of the total power of the GPU. The problem is the sparsity of the activity: a GPU needs dense (and regular) activity in order to be efficient. You can think of it as a “massive” vector unit (in fact, it is a vector unit with gather and scatter). Therefore, you need a large number of operations “per cycle” in order to keep the functional units busy. Higher utilization of the functional units means lower power cost per operation (because idle functional units still consume power).
Additionally, the CLA spends most of its time on comparisons (checking whether a segment is active and computing overlaps). These are really cheap (in contrast with the FP multiplications needed by DNNs). That means the processing cores in the GPU will spend most of their time accessing memory. GPUs aren’t particularly good at that, because their on-chip memory hierarchy is rather small. For example, the latest P100 has only 4MB of LastLevelCache.
If you are using multiple regions (hypercolumns?), my experience is that TLP (thread-level parallelism) fits just fine. You can run each region on a separate core. They only need to synchronize at the end of the epoch (which is a tiny fraction of the execution time). This allows really good scaling. It can be done easily with the thread pool design pattern and can be extended to multi-node clusters using message passing.
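A minimal sketch of that thread pool pattern, assuming one task per region and a barrier at the end of each epoch. The `Region` class and its `compute` method are hypothetical placeholders, not NuPIC classes. Note also that in CPython, threads only give a real speedup if the per-region work releases the GIL (for example native code); otherwise a process pool would be the closer equivalent.

```python
from concurrent.futures import ThreadPoolExecutor

class Region:
    """Hypothetical stand-in for one HTM region (not a NuPIC class)."""

    def __init__(self, region_id):
        self.region_id = region_id
        self.steps = 0

    def compute(self, inputs):
        # Placeholder work: a real region would run its SP and TM here.
        self.steps += 1
        return sum(inputs) + self.region_id

def run_epoch(pool, regions, inputs):
    # One task per region; regions do not talk to each other mid-epoch.
    futures = [pool.submit(region.compute, inputs) for region in regions]
    # Waiting on every result is the end-of-epoch synchronization point.
    return [future.result() for future in futures]

regions = [Region(i) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = [run_epoch(pool, regions, [1, 0, 1]) for _ in range(3)]

print(outputs)
```

The only shared point is the list of results collected at the end of `run_epoch`, which is what keeps the synchronization cost small relative to the per-region work.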
Perhaps other accelerators such as the Xeon Phi might make sense, but I can’t see how to deal with the high inter-core communication that you need there.
That’s a pretty good point.
However, it might tend to be much less memory intensive at the last level of computation than you’d think. CPUs have branch prediction because the instructions going into them tend to be much more regular than random, so it’s possible a network on the GPU could take advantage of the same phenomenon, placing neurons that are near the activated ones in the later stages of memory. Also, using operations where some solid connections and neurons are stored as single bits could lower storage while increasing processing. A 4MB LastLevelCache could hold a lot of 1-bit neurons and connection activations in TM (about sqrt(4*8*2^20) ≈ 5792 neurons, in the case of a densely connected network), which would work if learning was turned off.
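As a quick check of that arithmetic, and a sketch of what 1-bit connections could look like: a 4MB cache holds 4*8*2^20 bits, and a fully connected N-by-N matrix of 1-bit connections fits when N^2 is at most that. The packing helpers below are illustrative assumptions using Python integers as bit vectors, not code from any HTM library.

```python
import math

# Capacity estimate: 4 MB of cache, 8 bits per byte, one bit per connection.
cache_bits = 4 * 8 * 2**20           # 33,554,432 bits
max_neurons = math.isqrt(cache_bits)  # largest N with N*N <= cache_bits
print(max_neurons)                    # 5792

def pack(indices):
    """Pack a collection of bit positions into one integer bit vector."""
    word = 0
    for i in indices:
        word |= 1 << i
    return word

def active_connection_count(connection_row, active_neurons):
    """Count connections whose presynaptic neuron is active: AND, then popcount."""
    return bin(connection_row & active_neurons).count("1")

row = pack([0, 3, 5, 9])      # neurons this segment is connected to
active = pack([3, 4, 9, 10])  # neurons active on this step
print(active_connection_count(row, active))  # 2 (neurons 3 and 9)
```

With this layout the overlap of a segment is a bitwise AND followed by a population count, which is the kind of cheap, regular operation the estimate relies on. It only holds while permanences are not being updated, which matches the "learning turned off" condition.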
I’ll admit, I’m speculating about hacking something together specifically for GPUs at this point, but I still think it’s worth looking into. If using OpenCL with a GPU fails, I remember there was a company using OpenCL for a certain line of FPGAs, which could be an alternative. And now that you’ve mentioned the Xeon Phi processors, I think those might be worth a look too.
Right now though, I already have a GPU, so I might see about helping @Jonathan_Mackenzie with their library. Then I can see if there are any cases where a GPU offers significant speedup, and why it does/doesn’t work. Thanks for the whitepaper though. I’ll probably read through a good bit of that.
As far as I know, the SpiNNaker chip has very good infrastructure for integrating NuPIC software, so it would be a good platform to focus on.
That would be super helpful. Currently I’m trying to figure out ways of ensuring that I’m extracting more performance out of the GPU than we can achieve with the CPU. The skeleton is there for all the classes, and I’ve written a few tests to compare different implementations of the various algorithms, which is pretty much necessary. Here’s a rundown of how I want the code to look:
- All buffers stay on the GPU so we just pass around buffer pointers.
- Only copy data off the GPU when necessary or when the operation would be faster on CPU.
Looking ahead, I can see the GPU outperforming the CPU implementation on VERY large layer sizes, but given the standard region size of 2048 columns, it remains to be seen whether the GPU will be superior. Or perhaps it’s just a matter of flipping the algorithm around so that the work can be subdivided more efficiently to take advantage of the 64 work-item wavefront in OpenCL, since a lot of operations in NuPIC require branching if you skip a lot of preprocessing.
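To put rough numbers on the branching concern, here is a back-of-the-envelope sketch. It assumes each work-item in a 64-wide wavefront independently hits the "active" branch with probability equal to the activity sparsity, and it uses 2% sparsity as an illustrative figure; both are assumptions, and real activity may be clustered rather than independent.

```python
def wavefront_branch_probability(p_active, width=64):
    """Probability that at least one lane takes the branch, so the whole
    wavefront has to execute it."""
    return 1.0 - (1.0 - p_active) ** width

def expected_useful_lanes(p_active, width=64):
    """Average number of lanes doing useful work inside that branch."""
    return p_active * width

sparsity = 0.02  # assumed activity level, for illustration only
print(f"wavefronts executing the branch: {wavefront_branch_probability(sparsity):.1%}")
print(f"useful lanes per wavefront: {expected_useful_lanes(sparsity):.2f} of 64")
```

Under these assumptions most wavefronts still pay for the branch while only one or two lanes in each do useful work, which is one argument for preprocessing that compacts the active items into dense batches before launching the kernel.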