Speed of Spatial Pooler and Temporal Memory

I’ve hacked together a proof of concept script from the hot gym example using the HTM community fork of htm.core. My goal is to eventually use HTMs to look for anomalies in multiple streams of data. The first hurdle I have come across is that htm.core is very slow. Here is a single step of the learning / predicting process:

    def _time_step(self, index: int, s: float) -> None:
        """
        Takes a time step in the data.

        Parameters
        ----------
        index : int
            The number of this time step.
        s : float
            The value of the signal.
        """
        # perf_counter has much finer resolution than time.time,
        # which matters when individual steps take microseconds
        t0 = time.perf_counter()
        # create an SDR structure for this data point
        activeColumns = SDR(self._pipeline['pooler'].getColumnDimensions())
        t1 = time.perf_counter()
        self.results['t_active'][index] = t1 - t0

        # encode the value s
        encoding = self._pipeline['encoder'].encode(s)
        self._metrics['encoder'].addData(encoding)
        t2 = time.perf_counter()
        self.results['t_encode'][index] = t2 - t1

        # execute spatial pooling over the encoded input data
        self._pipeline['pooler'].compute(input=encoding, learn=True,
                                         output=activeColumns)
        self._metrics['pooler'].addData(activeColumns)
        t3 = time.perf_counter()
        self.results['t_pool'][index] = t3 - t2

        # execute temporal memory over the active mini-columns
        self._pipeline['memory'].compute(activeColumns=activeColumns,
                                         learn=True)
        activeCells = self._pipeline['memory'].getActiveCells()
        self._metrics['memory'].addData(activeCells.flatten())
        t4 = time.perf_counter()
        self.results['t_memory'][index] = t4 - t3

        # ---- predictions ----
        # predict what will happen for each horizon
        pdf = self._pipeline['predictor'].infer(activeCells)
        for steps in pdf.keys():
            if pdf[steps]:
                self.results['predictions'][steps][index] = (
                        np.argmax(pdf[steps])
                        * self.args['predictor']['resolution']
                )
            else:
                self.results['predictions'][steps][index] = float('nan')

        tma = self._pipeline['memory'].anomaly
        self.results['anomalyScore'][index] = tma
        self.results['anomalyProb'][index] = \
            self._pipeline['anom_history'].compute(tma)
        t5 = time.perf_counter()
        self.results['t_predict'][index] = t5 - t4

        # let the predictor learn
        self._pipeline['predictor'].learn(
            index, activeCells,
            int(np.abs(s) / self.args['predictor']['resolution'])
        )
        t6 = time.perf_counter()
        self.results['t_learn'][index] = t6 - t5

        self.results['runTime'][index] = t6 - t0

I’ve timed each of the steps, which you can see in the code above. When I plot the time to run each step, I see the following:
[Figure_2: time spent in each step, per time step. Note: the x-axis should read microseconds, not milliseconds.]

It is clear that the runtime is dominated by the spatial pooler (t_pool) and temporal memory (t_memory) steps, which are at least an order of magnitude slower than the others. Since the HTM School videos give only vague descriptions of the mathematics of HTM learning, I have little idea what is going on under the hood. Is it possible to speed up these two processes by 10-100x?


Hi,

I think that to get a 100x improvement in speed you will need either different algorithms or better hardware. A 100x speedup is usually not attainable from small software optimizations.

Two projects in particular come to mind:
(disclaimer, I have not used either of these)

  • BrainBlocks advertises that it has novel algorithms which run faster while doing essentially the same thing as an HTM.
  • Etaler advertises that it can run an HTM on a graphics card.

I tried simply removing the spatial pooler, because someone in one of the threads you reference said they didn’t see much difference with or without it.

Here is my original plot, with fixed y-axis limits:
[Figure_2: per-step timings with the spatial pooler included.]

Here is the plot with the spatial pooler removed (but still timed):
[Figure_3: per-step timings with the spatial pooler removed.]

The big difference is that the temporal memory step (t_memory) becomes much, much faster: the average computation time drops from 20 ms to about 10 microseconds (2000x faster). The bottleneck now appears to be the predictor learning step (t_learn).

I don’t have a working way to evaluate the performance of the two different algorithms, in part because it is so slow, but manual inspection of some results suggests that eliminating the spatial pooler isn’t a big loss.
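For anyone wanting to try the same bypass, the idea can be sketched without htm.core at all. The encoder below is a toy stand-in (its name, parameters, and behaviour are illustrative, not htm.core's ScalarEncoder API), but it shows what "use the encoding as the active columns" means: nearby values share most of their active bits, and those bits can be handed to the TM directly.

```python
import numpy as np

def encode_scalar(s, min_val=0.0, max_val=100.0, size=400, active_bits=21):
    """Toy scalar encoder: a contiguous run of active bits whose position
    tracks the value (similar in spirit to a classic HTM scalar encoder,
    but not htm.core's API)."""
    s = np.clip(s, min_val, max_val)
    start = int((s - min_val) / (max_val - min_val) * (size - active_bits))
    sdr = np.zeros(size, dtype=bool)
    sdr[start:start + active_bits] = True
    return sdr

# The active bits of the encoding double as the TM's active columns,
# so no spatial pooling pass is needed.
active_columns = np.flatnonzero(encode_scalar(42.0))   # 21 column indices
```

The trade-off is that the TM's column space is now the encoder's bit space, so encoder size and sparsity choices directly shape the TM.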


I am confused. How are the ‘active columns’ computed without using the SpatialPooler compute?
Are they just random, or does the TM call similar dendrite de/activation code under the hood?

The spatial pooler's input and output are just SDRs in htm.core. If you are using an SDR to encode your scalar, you can supply that encoding SDR to the TM routine directly.


Sorry, my knowledge of TM is quite limited. I understood that TM used the trained activations from the SP in a parallel set (a sequence) to do temporal prediction.
Does this mean that TM also updates the active columns and changes weights on compute?

The TM doesn’t update the active columns, those are chosen by the SP alone.
The TM does 2 basic things:

  1. Chooses the winner cell to activate within each active column

  2. Builds & updates the synaptic connections and their permanences between cells - this determines which cell(s) are made predictive by the activation of the winner cells
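Those two steps can be sketched in a few lines of plain NumPy. This is a toy illustration of the mechanism only, not htm.core's actual data structures or learning rules (which also handle dendritic segments, bursting, and synapse punishment):

```python
import numpy as np

rng = np.random.default_rng(0)
N_COLS, CELLS_PER_COL = 64, 4
PERM_INC, CONNECTED = 0.10, 0.50

n_cells = N_COLS * CELLS_PER_COL
# permanences[i, j]: synapse strength from presynaptic cell i to cell j
permanences = rng.random((n_cells, n_cells)) * 0.3   # all start unconnected

def tm_step(active_columns, prev_active_cells):
    """One toy TM step: pick a winner cell per active column (step 1),
    then reinforce synapses from the previously active cells (step 2)."""
    winners = []
    for col in active_columns:
        cells = np.arange(col * CELLS_PER_COL, (col + 1) * CELLS_PER_COL)
        # Step 1: the cell with the most connected input from the previous
        # timestep wins; with no connected input, the first cell wins.
        support = [(permanences[prev_active_cells, c] > CONNECTED).sum()
                   for c in cells]
        winners.append(int(cells[int(np.argmax(support))]))
    # Step 2: Hebbian-style permanence increase onto each winner cell.
    for c in winners:
        permanences[prev_active_cells, c] += PERM_INC
    np.clip(permanences, 0.0, 1.0, out=permanences)
    return winners

winners = tm_step(active_columns=[1, 5, 9], prev_active_cells=[0])
```

Repeating the same transition strengthens the same cell-to-cell synapses, which is how the sequence eventually becomes predictable.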


Thanks. Very helpful.
If I understand that properly, the TM can’t function without at least one call to SP compute. Otherwise there are no active columns and hence no cells (within an active column) to win or be updated.

The TM does need a set of active columns – though it doesn’t necessarily need the SP to get them. You could just use the active bits from the encoding as the active columns, and bypass the SP entirely.

The encoding vectors are smaller & more dense than those output from SP (which does ~2000 columns & 2% active by default I think) but still contain distributed semantic meaning, so it is viable in theory as @TheFinn showed above.
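Putting rough numbers on that comparison (the sizes below are illustrative assumptions, not htm.core defaults):

```python
# Typical encoder output vs. a default-ish SP output (illustrative numbers).
enc_size, enc_active = 400, 21        # e.g. a scalar encoder
sp_cols, sp_sparsity = 2048, 0.02     # "~2000 columns & 2% active"

sp_active = int(sp_cols * sp_sparsity)
print(f"encoder: {enc_active}/{enc_size} bits active "
      f"({enc_active / enc_size:.1%})")
print(f"SP:      {sp_active}/{sp_cols} columns active ({sp_sparsity:.0%})")
```

Feeding the encoding straight to the TM therefore means fewer, denser columns, which also shrinks the TM's cell and segment tables accordingly.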

Neat.
Rejoining the main thread now: perhaps that's how BrainBlocks (and others) get some of their speed-up, with a highly optimised SP-like function used very selectively, given @TheFinn's results above.
However, I'll go read their stuff instead of randomly speculating.

NuPIC's TM implementation is super slow. I think there's an algorithmic inefficiency somewhere. Both BrainBlocks' and Etaler's implementations seem to have a lower big-O complexity.

The main performance bottleneck of HTM is memory bandwidth, at least for Etaler on both CPU and GPU. A large TM can easily use up all the bandwidth of the memory bus (as evidenced by hyperthreading not helping performance at all).
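A back-of-the-envelope estimate makes the bandwidth argument concrete. All sizes below are assumed for a mid-sized TM; they are not measured from Etaler or htm.core:

```python
# Rough synapse-state footprint of a mid-sized TM (assumed sizes).
columns, cells_per_col = 2048, 32
segments_per_cell, synapses_per_segment = 128, 40
bytes_per_synapse = 8                  # roughly: a cell index + a permanence

cells = columns * cells_per_col
synapses = cells * segments_per_cell * synapses_per_segment
total_gib = synapses * bytes_per_synapse / 2**30
print(f"{synapses:,} synapses, ~{total_gib:.1f} GiB of synapse state")
# Streaming even a fraction of this every step saturates a typical
# 20-50 GB/s CPU memory bus long before the cores run out of compute,
# which is consistent with hyperthreading not helping at all.
```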

@danml @marty1885 @sheiser1 @TheFinn

Yes, our SP, which we call a PatternPooler (we don't use any topology), is mostly not used, since we have good encoders and push their output directly in to select the activated columns in our SequenceLearner.

The SP is essentially a learned remapping from one representation to another. Getting a handle on that process, understanding how it works, and configuring it for your problem has been more trouble than it's worth in my experience. It functions better as a model of biological processes than as a useful practical tool. Using the encoded output directly gives you better control and understanding of how your system is working, and you can even interpret the outputs of temporal memory in a human-intuitive sense.
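That "learned remapping" can be sketched as a random projection plus k-winners-take-all with a Hebbian update. This is a toy illustration of the idea only, not BrainBlocks' PatternPooler or htm.core's SpatialPooler (no boosting, no topology, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
IN_BITS, COLUMNS, K = 400, 1024, 20    # toy sizes
CONNECTED, INC, DEC = 0.50, 0.03, 0.01
# permanences start near the connection threshold so learning moves them
perms = rng.random((COLUMNS, IN_BITS)) * 0.3 + 0.35

def pool(input_sdr, learn=True):
    """Toy spatial pooling: overlap with connected synapses,
    k-winners-take-all, then a Hebbian permanence update."""
    overlaps = ((perms > CONNECTED) & input_sdr).sum(axis=1)
    winners = np.argsort(overlaps)[-K:]            # top-K columns win
    if learn:
        on = np.flatnonzero(input_sdr)
        off = np.flatnonzero(~input_sdr)
        perms[np.ix_(winners, on)] += INC          # reinforce active inputs
        perms[np.ix_(winners, off)] -= DEC         # weaken inactive inputs
        np.clip(perms, 0.0, 1.0, out=perms)
    return winners

x = np.zeros(IN_BITS, dtype=bool)
x[100:121] = True
cols = pool(x)
```

After a few presentations of the same input, the winning columns' synapses onto the active bits all become connected, so the mapping stabilises.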

In BrainBlocks we took out a lot of the biological idiosyncrasies and focused on the practical and human-understandable capabilities. Our PatternPooler is one of those things we've struggled to find a use for.
