Column Pooler as a Temporal Memory

I hypothesize that the column pooler can be modified to function as a temporal memory. I hope to test this soon. The column pooler, if you aren’t familiar, is described at:

The column pooler combines the proximal dendrites of a spatial pooler with the distal dendrites of a temporal memory, and integrates them to achieve sensory-motor inference. This post describes an extension to the column pooler which integrates the proximal & distal inputs in a different way, with the hypothesized result of forming mini-columns on an ad-hoc basis.

What is a mini-column?

  • A group of cells which respond to the same feed forward input. There could be an explicit mechanism causing this, or it could be an emergent or learned property. Is there evidence that cells belong exclusively to one mini-column?
  • If any cells in the group are predicted, then they fire first and suppress the unpredicted cells. This results in variable sparsity, and mini-columns can “burst” with activity, which allows the TM to represent a union of contexts.

I will model this as two rounds of competition in the column pooler, which I will call the early and late phases:

  • Early phase competition is for predicted cells only.
  • Late phase competition is for all cells which didn’t activate in the early phase.

The two rounds of competition have different parameters. The results of the early phase control the parameters of the late phase competition (a rough sketch follows the list below).

  • Sparsity is the primary link between the two rounds:
    latePhaseMaxSparsity = f ( earlyPhaseResults )
    • Early phase can suppress late phase, allowing for variable sparsity and the bursting behavior.
    • Late phase can be guaranteed a minimum number of activations, for the situation where the full quota of predicted cells activate, but unpredicted cells have significantly greater overlap.
  • Activation thresholds are also different, and the early phase affects the late phase threshold.
    • If the early phase is too sparse, then lower the late phase threshold so that it will “burst” with activity.
    • If the early phase is very active, raise the threshold to suppress late phase cells.
    • The early phase overlaps can set the late phase activation threshold, to ensure that predicted activations suppress unpredicted cells with significantly less overlap.
  • The learning rates of cells firing in the two phases can also differ and interact.
    • This could allow late phase bursting cells to learn the mini-column structure. Bursting is the only time that all of the cells in a mini-column are active at the same time, making it the ideal time to learn the feed forward input pattern.

This extension can coexist with the current notion of a column pooler. Simply set all early and late phase parameters to the same values and use the following equation for sparsity, and this extension goes away.
latePhaseMaxSparsity = maxSparsity - earlyPhaseSparsity
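
To make the coupling concrete, here is a minimal sketch of one possible early/late competition in Python. The function name, the threshold_margin parameter, and the particular choice of f(earlyPhaseResults) are illustrative assumptions, not a finished design:

import numpy as np

def two_phase_activate(overlaps, predicted,
                       max_sparsity=0.06, min_sparsity=0.02,
                       threshold_margin=0.5):
    # overlaps:  feed forward overlap score for every cell (1D float array)
    # predicted: boolean mask of cells predicted by their distal input
    n_cells    = len(overlaps)
    max_active = int(max_sparsity * n_cells)
    min_active = int(min_sparsity * n_cells)

    # Early phase: only predicted cells compete, up to the minimum quota.
    early_candidates = np.where(predicted)[0]
    early_sorted     = early_candidates[np.argsort(-overlaps[early_candidates])]
    early_active     = early_sorted[:min_active]

    # The early results set the late phase budget and threshold:
    # latePhaseMaxSparsity = f(earlyPhaseResults)
    late_budget = max(max_active - len(early_active), 0)
    if len(early_active) > 0:
        # Predicted winners raise the bar: unpredicted cells need comparable overlap.
        late_threshold = threshold_margin * overlaps[early_active].min()
    else:
        # No predictions at all: drop the threshold so the layer can "burst".
        late_threshold = 0.0

    # Late phase: every cell that did not fire early competes under the new rules.
    late_candidates = np.setdiff1d(np.arange(n_cells), early_active)
    late_candidates = late_candidates[overlaps[late_candidates] >= late_threshold]
    late_sorted     = late_candidates[np.argsort(-overlaps[late_candidates])]
    late_active     = late_sorted[:late_budget]

    return early_active, late_active

In this sketch the late phase budget already follows latePhaseMaxSparsity = maxSparsity - earlyPhaseSparsity; the extension lives in the threshold coupling, which goes away when threshold_margin is zero and both phases share the same parameters.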


This is an intriguing concept.

I would invite you to consider the biological correlates of what you are proposing: two populations of inhibitory interneurons. One is local and fast-acting, mediating mini-column predictive behavior. A longer-reaching / slower-acting interneuron population mediates the macro-column formation (for me, the hex-grid formation).

I would invite you to examine the cortical neural zoo and see whether you can match up the theoretical formulation with the known inhibitory interneurons.
Pulling these from prior posts (images not reproduced here): a simplified schematic, relative scaling, and known connectivity of the cortical inhibitory interneurons.

The schematic images are taken from:
Cortical Neurons and Circuits: A Tutorial Introduction
http://www.mrc.uidaho.edu/~rwells/techdocs/Cortical%20Neurons%20and%20Circuits.pdf


Here is an interesting paper:

  • “How fast is neural winner-take-all when deciding between many options?” Birgit Kriener, Rishidev Chaudhuri, and Ila R. Fiete, http://dx.doi.org/10.1101/231753. This paper explains, from a computational standpoint, why the competition works well. Is the competition fast enough to run two distinct competitions inside of every alpha wave?

I hypothesize that the sparsity competition runs at the gamma (40 Hz) rate and is expressed at the alpha (10 Hz) rate. There are differing opinions on how that sparsity is expressed.

Sorry, I don’t understand. How can the competition “run” at one time and be “expressed” at a different time? How can sparsity be “expressed” other than as a fraction of neurons activating?

First - some background:

Numenta has tried to extract the key features of the cortical mechanisms, and I do think that they have hit many of the important behaviors of the cortical mini-column. I agree in general with the paper A Theory of How Columns in the Neocortex Enable Learning the Structure of the World, and my quibbles are minor in nature. I think that this is a very important paper and every student of HTM should study it closely.

As a thought experiment I would suggest that when you read it you contemplate what would change if there was no “location” signal involved. I see that it still works very well.

Note that in the Numenta paper The HTM Spatial Pooler—A Neocortical Algorithm for Online Sparse Distributed Coding, they skip right past how the spatial pooler does what it does and substitute the k-winners-take-all selection mechanism.

In the columns paper, they consider that lateral connections are important for binding the columns together, but I do not see where they describe the interaction of those connections with the spatial pooler, or the implications of the mean lengths of the lateral connections and how those constrain the geometry of the activation patterns.

This is the heart of most of my quibbles with the Numenta papers: I have been studying these neural mechanisms for years; the k-winner selection is a gross simplification of the actual neural mechanisms involved. The differences between k-winners and the underlying neural hardware have important implications for how these systems work. The details of the lateral connections are highly dependent on the interactions with the inhibitory interneurons and the lengths of these connections. This is a very important point - minor variations in these interactions can result in gross changes in system capacity, convergence, and stability.

Any dedicated student of HTM theory should dig into these neuroscience mechanisms at some point to learn which things Numenta has captured faithfully and which parts they are ignoring to make the modeling easier. In my opinion, if I had to summarize this, Numenta has had a laser focus on the pyramidal cell within the mini-column to the exclusion of all other cell types in the cortex, and of the interactions between these other cells and the lateral connections between mini-columns.

From what I have seen published, Numenta is just starting to take on the interactions with the thalamus. It is my hope that they move on past trying to fit everything to the object model and start looking at how these parts are involved with more mundane functions like Recurrent thalamo-cortical resonance.

The mini- and macro-column are composed of many components: vertical connections within the column, lateral connections in between the columns, and lateral connections between neighboring columns. I include the thalamus as a “7th layer” of the cortex; in reality, it should get at least 3 layers’ worth of function.

Note that there is more than one cortical-thalamic/thalamic-cortical signal path.

To your question:

(Gamma / 40 Hz) The components that work together to enforce sparsity are the inhibitory interneurons and the lateral connections between mini-columns.

Some of the inputs to these gamma rate components are rising axonal projections from distant maps, rising axonal projections from within this map, rising thalamic projections, partial depolarization from prior activations (predictions), neural fatigue (too much winning!), winning mini-column competition, cortical-thalamic-cortical resonance projections, and RAC activation signals. There are vertical connections between various layers within the mini-column as different layers process their part of this mix of inputs.

(Alpha / 10 Hz) These components act on the rest of the winning macro-column to signal the winning sparse state through L2/3 intermap connections and L5 & L6 projections and cortical-thalamic/thalamic-cortical projections. This activation is a sparse summation of both spatial and temporal recognition.


Hi,

Also, you might want to take a look at some work Fergal Byrne (one of the HTM community pundits from way back) has done on prediction-assisted spatial pooling…


They didn’t go into a lot of detail about the neuroscience, and I think that’s intentional. Instead they talk about the computations in the brain and the mathematics behind them. I see these papers as introducing a new type of math (SDR) and defining some operators for it (SP, TM, etc.). At least that’s what I learned from them.

K-winners is a mathematical tool for approximating a competition. There are other ways to model a competition, such as using a homeostatic activation threshold or modelling the inhibitory cells, but they all require simulating at a higher time resolution, such as 10 ms instead of 100 ms. I guess it’s a matter of how much fidelity you think you need. The rest of the HTM is absurdly low fidelity: binary synapses, binary activations, one proximal dendrite, etc. I don’t think it’s that much of a stretch to think that the competition can work in a simplified way too.
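
For comparison, the whole k-winners abstraction fits in a few lines (the function name and this numpy implementation are mine, not from any particular HTM codebase):

import numpy as np

def k_winners(overlaps, sparsity=0.02):
    # Approximate the inhibitory competition by keeping the k cells with
    # the largest feed forward overlap.
    k = max(1, int(sparsity * len(overlaps)))
    winners = np.argpartition(-overlaps, k - 1)[:k]
    active = np.zeros(len(overlaps), dtype=bool)
    active[winners] = True
    return active

Replacing this single sort with a homeostatic threshold or explicit inhibitory cells is what forces the ~10 ms simulation time step mentioned above.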


Yes, this is true.

On the other hand - there are important differences.

Since these are foundational papers, everything after them is an extension of their assumptions, and the differences accumulate as each subsequent paper makes further simplifications in the models.

This is unfortunate as I think that it is misleading and preventing some important progress in the HTM model.

For example - in your proposed model, you have two phases of sparsification, early and late.
You depend heavily on the difference in sparsification - the delta between predicted and not predicted - as an error signal to drive training. The validity of this depends on what it actually means when the only thing that is happening is a lack of prediction.

If what is really happening is as I suspect - both spatial and temporal learning distributed over an ongoing rolling window (10 Hz) of activation resulting from much higher speed filtering (40 Hz) in an attractor model - then you won’t get the same thing as I will from my more biologically accurate model.

In particular, the deficit will be in the interactions between columns, which are a key principle in the column learning paper. The patterns won’t bind properly. Look at the soup of inputs and see that the temporal prediction is only one of the signals being learned. The lateral interaction that forms the larger spatial pattern is not accounted for. The cells that are being suppressed by the gross k-winner selection can be part of what should be activated for the next part of the pattern transition sequence in a biologically accurate model.

It might be computationally useful, but not very predictive of what the brain is really doing. I do like the Numenta premise of trying to reverse engineer the brain and see this as an important guiding principle.

If there are no predictions then:

  • The early phase has no activations
  • The late phase is then allowed many activations (mini-column bursts)
  • Winner cells are chosen to learn the temporal context.

This can happen during both the early & late phases.
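
In terms of the sketch posted earlier in this thread, the no-prediction case is just an all-false prediction mask (two_phase_activate is the hypothetical function from that sketch):

import numpy as np

overlaps  = np.random.rand(6000)         # feed forward overlap for each cell
predicted = np.zeros(6000, dtype=bool)   # no distal predictions at all
early, late = two_phase_activate(overlaps, predicted)
# early is empty, the late phase threshold drops, and the late phase fills the
# full max-sparsity quota - a "burst" - whose winner cells then learn the
# temporal context.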

True.

I understand the theory of how these happen in the brain and why they’re important. I just don’t think they’re critical for demonstrating that a temporal memory can work without explicitly defined mini-columns.

I have a hypothesis about how this happens; I’ve talked about it in the thread: Prototype of Stability Mechanism for Viewpoint Invariance

I think the ideas about hierarchy are predictive of what the brain is doing, and they stem from ideas about connections within the same layer, both within a region and between different regions. I submitted a manuscript partially based on HTM’s alternative to the conventional hierarchy.

Based on the barrel cortex, it doesn't make sense for levels of the hierarchy to process the sensory input serially from one level to the next.

Each level needs to process it in parallel, like each region tries to recognize objects in parallel in HTM theory.

In the barrel cortex, the barrel domain is a primary region, and the septal domain is higher order. Corticothalamic projections from L5 and L6 both indicate that the septal domain is higher order. The two domains seem to merge somewhat, even in L4, because gap junction networks cross domains and the two domains are physically interwoven a lot. This contradicts serial hierarchical processing because, if there are lateral connections in the same layer between levels, the second level of the hierarchy starts processing at the same time as the first.

This helps explain why higher order thalamic nuclei can receive direct sensory inputs and why koniocellular cells in LGN target V1 as if they were higher order. Each level processes in parallel, so hierarchy is flexible, e.g. input from L5 to koniocellular cells isn’t required for them to be higher order. The evidence from L6 CT projections suggests many primary regions might have higher order components.


There are normal minicolumns in the layer currently thought to act as the column pooler (L5st, I think). I don’t understand the details of your proposal, so the following suggestion might not make sense. Maybe a modified TM could disambiguate objects / vote by predicting which objects (minicolumn-selected cells) will be consistent with the next feature.

I experimentally tested this idea. It appears to work, but does not outperform the state-of-the-art spatial pooler & temporal memory combination.

Methods

I trained the model to recognize text. The dataset is the names of the 50 United States. Each character is associated with a random SDR, which serves as its encoding for this experiment. The model is shown each character in a name, one at a time, and is reset between names. After seeing the final character of each name, an SDR Classifier either trains on or tests the model’s output. The dataset is shuffled and shown to the model 100 times.
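
For reference, the character encoding is roughly the following (the encoder size and bits per character here are assumptions; all that matters is that each character maps to a fixed random SDR):

import random

ENCODER_BITS  = 1024    # assumed SDR size per character
BITS_PER_CHAR = 20      # assumed number of active bits per character
ALPHABET      = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

rng = random.Random(42)
# Each character is bound to a fixed, randomly chosen set of active bits.
char_sdr = {c: sorted(rng.sample(range(ENCODER_BITS), BITS_PER_CHAR))
            for c in ALPHABET}

# Training/testing outline: for each of 100 shuffled passes over the 50 state
# names, reset the model, feed char_sdr[c] for each character in the name, and
# after the last character let the SDR Classifier train on or test the result.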

Methods of evaluating mini-columns:

  1. Given an input with no context, the cells which activate are the mini-columns which respond to that input.
  2. Given an input with context, verify that the active cells are part of the expected mini-columns. A subset of the mini-column cells should activate. Compare the overlap between the active cells and all known mini-columns to check recognition of the feed forward input (see the sketch after this list).
  3. Given a prediction failure (caused by unexpected input) verify that the correct mini-columns burst. Verify that it’s able to recover when the unexpected input is the start of a known sequence.
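
Evaluation method 2 amounts to an overlap ranking; a minimal sketch, with hypothetical data structures:

def rank_minicolumns(active_cells, minicolumns):
    # active_cells: iterable of the currently active cell indices
    # minicolumns:  dict mapping each input to the set of cells that responded
    #               to it with no context (method 1 above)
    active = set(active_cells)
    ranked = sorted(minicolumns.items(),
                    key=lambda item: len(active & item[1]),
                    reverse=True)
    return [label for label, cells in ranked]

# The 89% / 98% results below correspond to the correct input being first,
# or within the top 3, of such a ranking.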

Key parameters:

  • cells: 6000
  • max_sparsity: 6%
  • min_sparsity: 2%

Results

Mean & Standard Deviation of classification accuracy over 100+ test runs: 91% & 3.7%

Percent of cycles where the correct mini-columns have maximum overlap with the activations : 89%

Percent of cycles where the correct mini-columns are within the top 3 overlapping with activations: 98%

Variable sparsity demo. Word: “MISSISCALIF”

M :   Sparsity 0.06 	Anomaly: 1.0
I :   Sparsity 0.02 	Anomaly: 0.0
S :   Sparsity 0.02 	Anomaly: 0.0
S :   Sparsity 0.02 	Anomaly: 0.0
I :   Sparsity 0.02 	Anomaly: 0.0
S :   Sparsity 0.02 	Anomaly: 0.0
C :   Sparsity 0.041 	Anomaly: 0.5083333333333333
A :   Sparsity 0.02 	Anomaly: 0.0
L :   Sparsity 0.02 	Anomaly: 0.0
I :   Sparsity 0.02 	Anomaly: 0.0
F :   Sparsity 0.02 	Anomaly: 0.0

@dmac: thanks for your nice report.
About your character encoder: why didn’t you use a ScalarEncoder for each character, using its ASCII code? I believe it would be good for classification and prediction too…

What I used is equivalent to a Random Distributed Scalar Encoder.

Any potential performance advantages?
