TM escalation by failure prediction


Here's a proposal for stacking several TM blocks at different time frames, in a manner similar to STREAMER (recently mentioned here) and Ogma's SPH, as a means to:

  1. deal with the limited learning capacity of any single TM block.
  2. have an upper TM block operate at a slower time frame than the block(s) below it.
  3. enable incremental learning by ascending the hierarchy.
  4. as a bonus, provide a means to “shield” a TM block from unfamiliar patterns.

(I will detail 1…4 above after I explain how it works.)

First, a short reminder of TM block architecture/function - it is organized as a matrix of columns × cells, one column per input SDR bit, each column stacking a fixed number of cells. Each cell learns to anticipate whether its corresponding input SDR bit will be 1 (turned ON) at the following (t+1) time step. For example's sake, take a 1000-column × 30-cell TM block.

Failure to correctly predict any column's state at (t+1) is what drives the learning mechanism, which produces changes in the corresponding cells' synapse connections via bursting (an unpredicted active column bursts, activating all of its cells).

An important notion to take away is the receptive field - the set of sources from which any given cell can connect its input synapses. The receptive field at any time t is the union of the input SDR and the cell activation states at the previous time step (t-1).

Before cutting to the chase, a (not too) short reminder of the general problems any ML model (not only TM) struggles to handle (or skip to the next message if already bored):

  • sample efficiency and catastrophic forgetting - skip these for now, since TM handles them pretty well.
  • handling out-of-training-dataset (aka real-world) cases. The obvious way TM avoids this issue is with continual learning, yet that in itself does not shield it from:
  • the trade-off between resource capacity (memory, compute) and problem complexity (in space and time) - which is commonly handled by scaling up: (use more hardware to) make a similar, bigger, (hopefully) faster model and begin its training from scratch.

What aggravates this is:

  • it is hard to know in advance what level of resources is needed to handle a certain problem decently (== economically).
  • any increment in memory (# of neurons/parameters/synapses/etc.) requires an excessive increase in the training compute needed to learn from a dataset.
  • any increase in problem complexity requires an excessive increase in training dataset size and training time.

Now, back to the TM block:

  • N bits of input SDR, one frame at a time - T0, T1, …, Tforever, Tforever+1, …
  • N columns, H cells per column, N × H cells in total.
  • a receptive field of N × (H+1) bits for any possible synapse within the block (N input bits plus N × H cell states).

And, instead of scaling it up later, let's begin by scaling it down (e.g. a 500×10 matrix instead of 1000×30), which has the benefit of roughly an order of magnitude fewer resources to learn with, and the downside of limited learning capacity.
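To put rough numbers on that claim, here is a minimal back-of-envelope sketch comparing the two block sizes, assuming worst-case full connectivity (every cell could synapse onto every source in its receptive field); the function and its accounting are illustrative assumptions, not a real HTM API:

```python
# Back-of-envelope resource budgets for a 1000x30 TM block vs the
# scaled-down 500x10 "baby". Worst-case full connectivity is assumed.

def tm_budget(n_cols, n_cells_per_col):
    """Return (total cells, receptive-field size, worst-case synapse count)."""
    cells = n_cols * n_cells_per_col
    # Receptive field = input SDR bits + all cell states at t-1: N x (H+1).
    receptive_field = n_cols * (n_cells_per_col + 1)
    # Worst case: every cell can connect to every source in the field.
    max_synapses = cells * receptive_field
    return cells, receptive_field, max_synapses

big = tm_budget(1000, 30)   # (30000, 31000, 930000000)
baby = tm_budget(500, 10)   # (5000, 5500, 27500000)
print(f"worst-case synapse ratio: {big[2] / baby[2]:.0f}x")  # ~34x
```

By this (admittedly crude) count, the downscaled block needs roughly 6× fewer cells and over 30× fewer potential synapses.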

Yet if you've read the Great Book of SDR-ology enough times, you will have learned that scaling down an SDR doesn't entirely ruin it; and if you also have (as most data scientists do) > 80% faith in Pareto's Principle, you will expect the downscaled TM to catch a significant share of the spatio-temporal patterns in the input SDR stream.

The good thing is a “baby TM” will quickly learn the most obvious, less complex, low-hanging patterns; the not-so-good thing is it will fail to learn the more complex ones.

Now that we have a TM which saturates relatively fast, what next, instead of a bigger TM (and a bigger, faster machine to handle it)?

Let's freeze the baby TM and add, at its side, a single baby watcher column, which:

  • doesn't necessarily have the same number of cells/segments/synapses
  • has the same receptive field: the baby TM's activation state + the input SDR at t-1
  • takes 1 as its input signal when the baby TM fails in its prediction, and 0 when it succeeds.

So the watcher column's purpose is to learn to anticipate when the baby TM will succeed or fail.
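A sketch of how the watcher's binary input signal could be derived each step: 1 when the baby TM's column-level prediction failed on the current frame, 0 otherwise. The column-set representation and the "any burst = failure" criterion are assumptions for illustration, not a specific HTM library's API:

```python
# Derive the watcher column's 0/1 input signal from the baby TM's
# column-level prediction vs the columns actually activated by the input.

def watcher_signal(predicted_cols: set, active_cols: set) -> int:
    """Return 1 if the baby TM failed on this frame, else 0.

    predicted_cols: columns the baby predicted would turn ON at t.
    active_cols:    columns actually ON at t (from the input SDR).
    Any active-but-unpredicted column means a burst, i.e. a failure;
    a softer criterion (e.g. an overlap ratio threshold) would also work.
    """
    return int(bool(active_cols - predicted_cols))

print(watcher_signal({3, 17, 42}, {3, 17, 42}))  # 0 - perfect prediction
print(watcher_signal({3, 17}, {3, 17, 42}))      # 1 - column 42 burst
```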

What is this information useful for?

  • when it predicts failure, it skips evaluating the baby TM entirely and escalates the current time step to a higher-level handler TM block (see below for what and when this handler sees).
  • which shields the baby TM from the effort and anguish of handling input SDRs it cannot handle.
  • we get a routing switch for handling increasing complexity - the baby sees mostly time frames it has learned to predict, while the handler sees the remaining frames.
  • very importantly, feeding a TM time frames it cannot anticipate usually messes up its future predictions, because feeding it “noise” changes its activation/prediction state. By skipping the processing of “the unknowns in the universe” it preserves its current anticipation for future knowns.
  • the top-level handler TM will learn to handle what the baby cannot. Just as important, it will not have to handle what the baby can. Which means we can double the learning capacity without doubling the compute.
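The routing described above can be sketched as a small control loop. The classes below are toy stand-ins (hypothetical, not any real HTM library) whose only purpose is to make the escalation logic concrete: familiar frames go to the frozen baby, predicted-unfamiliar frames bypass it entirely so its state is never corrupted:

```python
class BabyTM:
    """Frozen baby TM stand-in: it 'knows' the frames in `known`."""
    def __init__(self, known):
        self.known = set(known)
        self.state = None              # last activation state (toy: last frame)

    def compute(self, frame):
        self.state = frame             # state changes only on frames it sees
        return frame not in self.known # True = prediction failure

class Watcher:
    """Watcher column stand-in: predicts whether the baby will fail."""
    def __init__(self):
        self.failed_before = set()
    def predict_failure(self, frame):
        return frame in self.failed_before
    def learn(self, frame, failed):
        if failed:
            self.failed_before.add(frame)

class HandlerTM:
    """Upper-level handler stand-in: sees only escalated frames."""
    def __init__(self):
        self.seen = []
    def compute(self, frame, baby_state):
        self.seen.append(frame)

def run(stream, baby, watcher, handler):
    for frame in stream:
        if watcher.predict_failure(frame):
            # Escalate: the baby is skipped and its state stays intact.
            handler.compute(frame, baby.state)
        else:
            failed = baby.compute(frame)
            watcher.learn(frame, failed)   # watcher learns from its own misses
            if failed:
                handler.compute(frame, baby.state)  # late escalation on a miss

baby, watcher, handler = BabyTM(known={"A", "B"}), Watcher(), HandlerTM()
run(["A", "X", "B", "X", "A"], baby, watcher, handler)
print(handler.seen)  # ['X', 'X'] - only unfamiliar frames reach the handler
```

Note the "late escalation" branch: the first time an unfamiliar frame appears, the watcher hasn't learned it yet, so the baby does get disturbed once; from then on the watcher routes that frame around it.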

And the game can continue: if/whenever the second-level TM saturates, a new watcher column and a third-level handler can be added on top of it. IF needed.

The handler TM block has in its receptive field the input SDR, the baby's columns' last activation states, plus its own columns' last activation states.
So its complexity increases even without having more cells, but on the plus side it only learns and responds to a subset of the input SDR stream - those frames the baby TM cannot handle.
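In the same back-of-envelope spirit as before, the handler's enlarged receptive field is just the concatenation of those three sources; the flat bit-vector accounting below is an illustrative assumption:

```python
# Size of the handler's receptive field: input SDR bits + baby cell
# states + the handler's own cell states, all at t-1.

def handler_rf_size(n_cols, baby_cells_per_col, handler_cells_per_col):
    input_bits = n_cols
    baby_state_bits = n_cols * baby_cells_per_col
    handler_state_bits = n_cols * handler_cells_per_col
    return input_bits + baby_state_bits + handler_state_bits

# e.g. 500 columns, baby and handler both at 10 cells/column:
print(handler_rf_size(500, 10, 10))  # 10500, vs the baby's 500 * 11 = 5500
```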

That's about it for now… the “machine” can also grow wider, but I won't abuse your patience (maybe later).

Seems like a cool idea.