(Submitted on 28 Sep 2015 (v1), last revised 8 Oct 2015 (this version, v2))
In the decade since Jeff Hawkins proposed Hierarchical Temporal Memory (HTM) as a model of neocortical computation, the theory and the algorithms have evolved dramatically. This paper presents a detailed description of HTM’s Cortical Learning Algorithm (CLA), including for the first time a rigorous mathematical formulation of all aspects of the computations. Prediction Assisted CLA (paCLA), a refinement of the CLA, is presented, which is both closer to the neuroscience and adds significantly to the computational power. Finally, we summarise the key functions of neocortex which are expressed in paCLA implementations.
Having implemented my own version of HTM, I am having a hard time identifying where exactly paCLA makes a difference according to the paper. From what I understand, section 4.3 is the main section that should explain this, but it is so interleaved with the discussion of Numenta’s implementation that the actual difference becomes vague. What I understand is that paCLA adds the distal predictions to the feedforward overlaps before inhibition, so that there is a bias towards predicted columns among the columns that remain active after inhibition (I sketch my reading below). There is even the possibility of a column becoming active without any actual feedforward input. While that is interesting, I have read in a few places that the cortex tries to decouple the real signals from its internally biased representations, and that it is important not to create a spiral towards a biased reality that worsens over time (even though some part of me knows this is already true for us). I believe I read that this decoupling is achieved by separating the feedforward and feedback pathways in the hierarchy itself. By the way, I would appreciate it if you could point out the exact places in the sources that make you believe this is closer to the biology, if there are any obvious ones.
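To check my reading of section 4.3, here is a rough sketch of what I think the pre-inhibition step looks like. All of the names and the weighting parameter are my own guesses, not from the paper:

```python
import numpy as np

def pacla_column_scores(ff_overlap, distal_prediction, prediction_weight=0.3):
    """My reading: per-column distal prediction scores are added to the
    feedforward overlaps *before* inhibition, biasing the competition
    towards predicted columns. The weighting is an arbitrary placeholder."""
    return ff_overlap + prediction_weight * distal_prediction

def inhibition(total_scores, num_active):
    """k-winners-take-all over the combined scores: a column with no
    feedforward overlap could still win purely on its prediction score."""
    return np.argsort(total_scores)[-num_active:]
```

Is that roughly the mechanism you intend?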
Oh, and thank you for the effort you put into producing this work.
As I said in the thread this came from, nobody disputes that pyramidal cells combine both distal (recurrent and feedback) and proximal (feedforward) inputs every time they fire. As you say, this will somewhat alter the exact SDR produced for each feedforward input, which is one of the reasons why it’s difficult or impossible to change NuPIC to run paCLA.
As explained in the paper, however, paCLA allows the SDR to represent its inputs as two orthogonal vectors: one for the intersection of the prediction and the feedforward representation, and the other for the feedforward input that disagrees with the predictions. In addition, the full version of paCLA has a number of cells active in bursting columns, which changes the sparsity of part of the cellular SDR dramatically. This complex vector representation is not available without paCLA, and I argue that this combination of a sparse predicted SDR and a less sparse bursting error SDR is the currency of HTM, e.g. allowing Temporal Pooling cells to succeed when their inputs are sparse and fail when other cells’ inputs are bursting.
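As a minimal sketch of the decomposition I mean (the helper name and set-of-pairs representation are illustrative, not from the paper or NuPIC):

```python
def split_cellular_sdr(active_cells, predictive_cells):
    """Decompose the cellular SDR into two orthogonal parts: cells that were
    both predicted and driven by feedforward input, and cells active in
    columns where the feedforward input disagreed with the prediction
    (bursting). Both arguments are sets of (column, cell) pairs."""
    predicted_part = active_cells & predictive_cells   # sparse: roughly one cell per column
    bursting_part  = active_cells - predictive_cells   # less sparse: several cells per column
    return predicted_part, bursting_part
```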
It’s not (explicitly) in the paper, but another consequence of the less sparse SDR output of paCLA is that it helps the downstream layer reacquire its lock on the sequence it should be in, again using paCLA. The downstream layer also has recurrent and feedback inputs (and possibly other feedforward inputs), and the many bursting inputs will tend to promote new cells which have the best combination of evidence, at the expense of previously pooling cells which are no longer correct. There is then a transient disturbance of the sparse pooled SDR, followed by a resettling towards a new one, helped by feedback to the lower layer which gradually reduces its prediction error and hence bursting. These ideas are explained at a more abstract level in my later paper, Symphony from Synapses: Neocortex as a Universal Dynamical Systems Modeller using Hierarchical Temporal Memory.
Isn’t this the case with Numenta’s implementation too? The implementation activates all the cells in a column when feedforward activity is not predicted and there aren’t any matching (a looser version of predictive) distal dendrites. Am I missing something? You emphasised that paCLA has less sparse bursting, yet it has only a few cells bursting compared to Numenta’s implementation, which bursts all the cells in the case of a false negative. Shouldn’t it be the other way around?
After reading the paper one more time, I haven’t seen any mention of synapse bumps or cell boosting to help inactive cells become potential participants when they never get used. Is that the case, or did you just not mention it? If it is the case, how do you create the competition needed to make the most of all the columns?
I believe the reason for this is temporal pooling. My question would be: why do you explicitly need two vectors? As long as you know whether an active cell is bursting or predicted, you can adapt the temporal pooling synapses. Is there any other use for the two separate vectors?
By the way, I am not trying to find holes in the work, just trying to understand. I have been reading the mailing lists for over two years now, and you are one of the few people I know by name because of your comments.
NuPIC activates all the cells in bursting (unpredicted) columns, but then chooses one to do learning on, and this cell only does distal learning (to reinforce its distal inputs). paCLA (not yet fully implemented) has a variable number of bursting cells, which are those whose total depolarisation exceeds that of the column (minus its bonus).
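Roughly, and only as an indicative sketch (the threshold arithmetic here is a paraphrase of the rule, not the paper’s exact formulation):

```python
def bursting_cells_pacla(cell_depolarisations, column_depolarisation, column_bonus):
    """paCLA (my paraphrase): a variable number of cells burst in an
    unpredicted column, namely those whose total depolarisation exceeds
    the column's depolarisation minus its bonus."""
    threshold = column_depolarisation - column_bonus
    return [i for i, d in enumerate(cell_depolarisations) if d > threshold]

def bursting_cells_nupic(cells_per_column):
    """NuPIC: all cells in an unpredicted column become active, and a single
    best-matching cell is then chosen for distal learning."""
    return list(range(cells_per_column))
```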
More importantly, NuPIC layers output the columnar SDR as input to the next layer, so its sparsity is fixed. paCLA outputs the cellular SDR (from which you can derive the columnar SDR), whose sparsity varies between 1/cols and cellsPerColumn/cols per active column, i.e. from one active cell per column up to cellsPerColumn active cells per column.
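For concreteness, a small sketch of deriving the columnar SDR and of the sparsity bounds (the names are illustrative, not from any implementation):

```python
def columnar_from_cellular(active_cells):
    """The columnar SDR is recoverable from the cellular SDR: it is simply
    the set of columns containing at least one active cell."""
    return {col for (col, cell) in active_cells}

def cellular_sparsity_range(num_active_columns, cells_per_column, total_cells):
    """Cellular sparsity ranges from one active cell per active column
    (fully predicted input) up to cellsPerColumn active cells per active
    column (fully bursting input)."""
    lowest = num_active_columns / total_cells
    highest = num_active_columns * cells_per_column / total_cells
    return lowest, highest
```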
I didn’t mention many engineering details such as boosting, as the fundamental details of the algorithms are difficult enough to digest.
On your second question, the one (cellular) SDR is the sum of two orthogonal vectors, predicted and bursting. You can tell which is which from how many cells share a column index, and you can identify the per-column prediction error by the same counting method.
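A small sketch of the counting idea, assuming the output is a set of (column, cell) pairs and treating a single active cell per column as the predicted case:

```python
from collections import Counter

def per_column_prediction_error(active_cells):
    """Count active cells per column: one active cell marks a correctly
    predicted column, several mark a bursting column, and the count itself
    serves as a per-column measure of prediction error."""
    counts = Counter(col for (col, cell) in active_cells)
    predicted_columns = {c for c, n in counts.items() if n == 1}
    bursting_columns = {c for c, n in counts.items() if n > 1}
    return predicted_columns, bursting_columns, counts
```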
Temporal Pooling is a client of this functionality, not its implementation. In addition, every layer does paCLA but not every layer does extensive or dominant Temporal Pooling.
I would love to know if there are some practical outcomes of this. On paper, it looks like something drastic, and it would definitely change temporal representations fundamentally, but I cannot say exactly what kind of change it will make. Do you have a good analogy/story that we can relate to?
Oh, I see, so we also have more information to work with for anything to do with classification. This could also be incorporated into adapting synapses. Does paCLA use the per-column prediction error directly in adapting synapses in any way?
This is how I do it in my implementation too. On top of the sparsity, the inputs encapsulate temporal information this way, and similar representations become more decoupled because of it.