I think that to get a 100x improvement in speed you will need either different algorithms or better hardware. A 100x speedup is usually not attainable from small software optimizations.
Two projects in particular come to mind:
(Disclaimer: I have not used either of these.)
BrainBlocks advertises that it has novel algorithms which run faster while doing essentially the same thing as an HTM.
Etaler advertises that it can run an HTM on a graphics card.