Why Does the Neocortex Have Layers and Columns, A Theory of Learning the 3D Structure of the World

theory
sensorimotor

#21

The reason we concluded that this is happening in each cortical column is that allocentric location has to be calculated for each local patch of sensory cortex. In V1 and S1 there need to be hundreds, probably thousands of separate allocentric locations. To calculate allocentric location you need local sensory input as well as motor input and a fair amount of intercolumn communication. There are no other locations in the brain that have the required amount of neural tissue and have the necessary anatomical connections. The cerebellum would be the best candidate but it just doesn’t have the connections needed (plus you can live an almost normal life without a cerebellum). That doesn’t mean the cerebellum doesn’t do something like this, just that the cortex isn’t relying on it.

Each cortical column has a large amount of neural machinery which is completely unexplained. We are working our way through how layers 5 and 6 (which are actually multiple layers each) perform the allocentric calculation. So far the anatomy and physiology is matching well.

What role allocentric location plays in audition (and higher thought) is an interesting question. I have some ideas on this for a future discussion.


#22

Perhaps the cerebellum is “not critical” because the cortex is flexible enough to take over cerebellar functions if it fails? In any case, the DCN seems to be a fundamental piece for understanding language. For example, the degree of success of Auditory Brainstem Implants (ABIs) in children is really low [1] (ABIs are used when cochlear implants are not possible due to a lack of, or damage to, the auditory nerve or cochlea).

The dendritic tree connectivity of Purkinje cells is really massive. It seems to be some sort of bus (each input bit can be combined with a large number of state signals [such as the cortex prediction, body position, etc.]). Pyramidal cells seem to be more like a point-to-point network. For this kind of all-to-all problem it is more efficient to use a bus. For example, many supercomputers (such as the BlueGene) have a point-to-point network for regular communication and a bus-like network (a tree) for broadcast communications.

Certainly language (ASR) is really interesting but hard. Just mapping different speakers’ phonemes onto the same set of columns is a tough problem :slight_smile:

[1] E. P. Wilkinson et al., “Initial Results of a Safety and Feasibility Study of Auditory Brainstem Implantation in Congenitally Deaf Children,” Otol. Neurotol., vol. 38, no. December, pp. 121–220, 2016.


#23

I suspect the answer to this question will involve motor transforms. In sensory regions, motor signals change the orientation and state of objects (this is something we are currently trying to understand). In higher regions, the same mechanisms might allow the transformation of ideas in an abstract space. For example, when I work on problems, I feel as if the problem I am working on has structure, almost like a physical object, but not quite. I speculate that when we manipulate ideas they have their own space and “motor” capabilities and rely on these to find a model that fits the data. This probably sounds vague, because it is, but I think we can figure this out.

I actually have an experiment in mind that would exercise this idea. I need to write it up. It blurs the line between motor commands and conceptual understanding.


#24

Hi David @cogmission,

The full video is on the Numenta YouTube channel: https://www.youtube.com/watch?v=fhnMUc36opI

As @lscheinkman mentioned, it’s also available as supplementary material supporting the paper on the bioRxiv entry: http://www.biorxiv.org/content/early/2017/07/12/162263, if you’re interested in downloading it. But if you just want to view it, the YouTube link has the full 4.5 minutes.

Thanks for the note!

Christy


#25

I noticed in your recent papers that the hierarchical aspects of the CLA have been left aside. Are there particular reasons for this in your thinking, beyond just putting the bigger picture on hold for now? I get the impression that hierarchy is considered less important now than in earlier presentations of HTM theory. From reading On Intelligence, for instance, I would have concluded that the locally predicted features pertaining to allocentric location would be strengthened and fully dealt with by the higher-level context feedback.


#26

Our view of hierarchy has changed, somewhat. We now realize that individual regions are far more powerful than we previously thought. It doesn’t make sense to focus on hierarchy until we completely understand what regions do. For example, there are several different sources of both feedforward and feedback signals in the hierarchy. These originate and terminate in different cellular layers. We need to understand what type of information these layers represent to understand what we gain from a hierarchy.

We are actually getting close to understanding all/most of this, in fact knowing that some layers project up or down the hierarchy helps us determine what those signals are, but we are not ready to focus on hierarchy yet.


H is for Hierarchy, but where is it?
#27

BTW @jhawkins, have you seen this yet? https://physics.aps.org/synopsis-for/10.1103/PhysRevLett.119.038101
The arXiv link is https://arxiv.org/abs/1706.06914


#28

About whether or not L6a and L5 form another instance of the input/output layer model:

I think L5 is similar to L2/3a, but it solves a related problem. L5 generates behavior by predictively firing. The way I see it, L2/3a does something similar to predictive firing, because it represents the features it might sense on the object if the sensors were to touch the correct location. So both output layers sort of predictively fire, although it’s not the same as activating predicted cells in temporal memory.

The thick-tufted/pyramidal tract cells in L5 don’t seem to represent entire external objects. Before a saccade, each cell stops responding to the features in the receptive field and starts responding to the features expected to be in the receptive field after the saccade (Shin and Sommer, 2012). Only ~10/50 cells in that study stopped responding to the current receptive field and started responding to the post-saccade receptive field (most did one or the other). Still, it doesn’t make sense to keep responding to the current feature to generate a saccade to another feature.

That doesn’t mean L5 should be an input layer. It’s useful to narrow down possibilities using lateral connections between columns for any sort of prediction. The ability to narrow down possibilities is also useful for action selection. By narrowing down the represented set of features, it can narrow down what it wants to perceive as a result of moving sensors or changing the environment. There is biological evidence that this happens.

Maimon and Assad, 2006 suggest that a reverberant circuit involving the cortex and the basal ganglia causes ramping pre-movement activity, which triggers a burst of activity once it passes a threshold to cause movement. In the period leading up to a saccade, some neurons in the superior colliculus, mediodorsal thalamus, and the frontal eye field (at least pyramidal tract cells) increase their firing rates until just before the saccade, reaching firing rates above 100 Hz (Sommer and Wurtz, 2004). This is important because bursting above 100 Hz might play roles in pyramidal tract cell plasticity, lateral communication, and corticothalamic signaling, and bursting is controlled by complex disinhibitory and winner-take-all circuits.

In addition to the ramping activity of cells which will trigger the movement, cells which do not contribute to the movement probably gradually decrease their firing rates (see Maimon and Assad, 2006), especially since pyramidal tract cells stop responding to the current visual input as part of presaccadic predictive remapping and the current visual input is not the saccade target. Since the basal ganglia also exhibits ramping activity before a movement (Lee and Assad, 2003), it could selectively gate a feedback loop from FEF to SC to MD to FEF to narrow down options based on reward learning. It could also control timing based on the amount of positive feedback (and thus rate of ramping) it allows.

Another difference L5 has with L2/3a is that it should only learn to represent possibilities that the brain can cause by behavior. Otherwise, it might try to cause predicted external events. Thick-tufted cells in L5 respond well to sensory input without behavior (Oberlaender et al., 2011). However, slender-tufted cells seem to track behavior because they increase their firing rates during behavior on average, and those in barrel cortex are modulated by the phase of the whiskers (Oberlaender et al., 2011). They also receive strong input from L4 (Schubert et al., 2006) and they might be highly sensitive to the combination of self-motion and sensory input, although I’m not sure which cells this study recorded (Turner and DeLong, 2000). It seems that slender-tufted L5 cells copy L4, except they do not respond to sensory input alone, so they might track the sequence of behavior and behavioral results.

Slender-tufted cells project to thick-tufted cells, but not really vice-versa (Naka and Adesnik, 2016). Perhaps slender-tufted cells act as another input layer to thick-tufted cells. Regardless of their projection to thick-tufted cell basal dendrites, they also project to the apical tuft of thick-tufted L5 cells. Oberlaender et al., 2011 propose that this projection allows thick-tufted cells to perform coincidence detection between behavior and sensory input, since thick-tufted cells respond to temporally associated proximal and tuft input by bursting. Since pyramidal tract cells probably burst to generate movement, this is a possible mechanism to ensure that they do not burst to cause an external event.

They still might predictively fire in anticipation of an external event, which might allow them to model objects at the same time as they generate behavior. This could explain why only ~10/50 cells both stopped responding to the current input and started responding to the post-saccade input. Alternatively, since bursting might be required for LTP everywhere on the cell (Kampa et al., 2006; Ramaswamy and Markram, 2015) and cells might continue their burst for a bit after the end of the movement, thick-tufted L5 cells might learn not to predict things which behavior cannot cause.

There are other possibilities for preventing bursting unless an event was caused by behavior, such as input from motor thalamus (or e.g. POm) or inhibition of bursting by martinotti cells during quiet periods. However, those both have possible problems.

I’m not sure how the thick-tufted cells or subcortical structures convert predictions into behavior to cause those predictions. If bursting depends on slender-tufted cell input to the apical tuft, then pyramidal tract cells might learn to only ramp activity and burst if they target the correct subcortical cells to trigger the result.

I’m going to quote some parts of the paper.

Recent experimental studies found that the axons of L6 CT neurons densely ramified within layer 5a in both visual and somatosensory cortices of the mouse, and activation of these neurons generated large excitatory postsynaptic potentials (EPSPs) in pyramidal neurons in layer 5a (Kim et al., 2014) [. . .] There are three types of pyramidal neurons in L5 (Kim et al., 2015). Here we are referring to only one of them, the larger neurons with thick apical trunks that send an axon branch to relay cells in the thalamus (Ramaswamy and Markram, 2015).

In barrel cortex, thick-tufted cells are mostly confined to L5b, whereas slender-tufted cells are preferentially but not exclusively in L5a (Naka and Adesnik, 2016; Groh et al., 2009). Slender-tufted cells do not project to the thalamus, only the striatum (Oberlaender et al., 2011). I think L6 probably targets another input layer, the slender-tufted cells (which receive thalamic input from POm, part of which is primary sensory thalamus for self-movement), equivalent to L6 targeting L4. So maybe the two input/output layer pairs are equivalent, but they share an input from L6.

Although slender-tufted cells communicate between columns, they only do so weakly (Oberlaender et al., 2011), so the lateral connectivity is probably for predictive depolarization like in temporal memory. Also, even though they receive thalamic input from multiple whiskers via POm, they have narrow receptive fields (Bureau et al., 2006). They also communicate in both directions with L4 (Schubert et al., 2006), so they might serve similar functions.

It’s also possible that L6 really is the input layer to the L5 output layer, but slender-tufted cells are an intermediate step. This extra step could reflect the requirement to only predict results of behavior, whereas L2/3a can predict both externally- and behaviorally-caused input.

However, there is also empirical evidence our model does not map cleanly to L6a and L5. For example, (Constantinople and Bruno, 2013) have shown a sensory stimulus will often cause L5 cells to fire simultaneously or even slightly before L6 cells, which is inconsistent with the model.

As far as I know, somatosensory cortex (or at least barrel cortex) is the only region where thalamus prominently drives firing in layer 5 thick-tufted cells. Those cells only receive input from narrow receptive fields of the whiskers via VPM, yet they have wide receptive fields (Manns et al., 2004). Oberlaender et al., 2011 suggest that their wide receptive fields result from strong lateral connectivity. Even though thalamus can drive their activity before L6, it probably doesn’t drive most of their responses. Perhaps a little thalamic input to the output layer helps represent the possibilities indicated by the initial input, or perhaps it helps prevent directly sensed features from getting ruled out because they aren’t considered consistent with the other features.

Sources
Division of labor in frontal eye field neurons during presaccadic remapping of visual receptive fields (Shin and Sommer, 2012)
Parietal Area 5 and the Initiation of Self-Timed Movements versus Simple Reactions (Maimon and Assad, 2006)
The time course of perisaccadic receptive field shifts in the lateral intraparietal area of the monkey (Kusunoki and Goldberg, 2002)
Putaminal Activity for Simple Reactions or Self-Timed Movements (Lee and Assad, 2003)
What the brain stem tells the frontal cortex. I. Oculomotor signals sent from superior colliculus to frontal eye field via mediodorsal thalamus (Sommer and Wurtz, 2004)
Three-dimensional axon morphologies of individual layer 5 neurons indicate cell type-specific intracortical pathways for whisker motion and touch (Oberlaender et al., 2011)
Morphology, electrophysiology and functional input connectivity of pyramidal neurons characterizes a genuine layer Va in the primary somatosensory cortex (Schubert et al., 2006)
Corticostriatal Activity in Primary Motor Cortex of the Macaque (Turner and DeLong, 2000)
Inhibitory Circuits in Cortical Layer 5 (Naka and Adesnik, 2016)
Requirement of dendritic calcium spikes for induction of spike-timing dependent synaptic plasticity (Kampa et al., 2006)
Anatomy and physiology of the thick-tufted layer 5 pyramidal neuron (Ramaswamy and Markram, 2015)
Sub- and suprathreshold receptive field properties of pyramidal neurones in layers 5A and 5B of rat somatosensory barrel cortex (Manns et al., 2004)
Cell-Type Specific Properties of Pyramidal Neurons in Neocortex Underlying a Layout that Is Modifiable Depending on the Cortical Area (Groh et al., 2009)
Interdigitated paralemniscal and lemniscal pathways in the mouse barrel cortex (Bureau et al., 2006)


#29

TL;DR: What if location signals also have the purpose of defining the next move to make to gather input data? In that case, they would be useful for abstract objects too, in order to guide the choice of mental manipulations to apply.

I have always considered the brain a machine to act, rather than a machine to perceive. After all, natural selection selected us based on our actions and reactions, not on our perceptions (perceptions only mattered evolutionarily to the extent that they allowed us to perform better actions). Following this philosophy, for each part of the brain I always ask myself, “How does it help us perform better actions?” (if it didn’t, it would have been erased by evolution).

For the areas which recognize concrete objects, I believe the location signal does not only enable the recognition of 3D objects via the mechanism described in the paper this thread is about; it also helps the recognition process by suggesting actions that would modify perception so as to gather the sensory data needed to discriminate between similar objects. For example, let’s say that I close my eyes and touch a cup with my finger. As per the model described in the paper, my finger will recognize a feature/location pair, and the column relative to my finger will generate predictions about other feature/location pairs possessed by the objects which include the first pair. Now, my finger will have to touch the object in another location in order to generate another feature/location pair to help discriminate between objects. How does my brain decide how to move the finger? My guess is that it will try to move it so as to reach one of the locations belonging to one of the predicted feature/location pairs. In other words, the location signal also serves the purpose of helping choose the manipulation needed to proceed in the inference task performed by the column.
Proceeding by analogy, I would suggest that for abstract concepts the location signal serves the purpose of facilitating the choice of the next manipulation to apply in order to complete whatever task the column is performing (I explain the concept of manipulations in more detail here: https://medium.com/thought-models/mental-manipulations-3b7797343aa2).
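The narrowing-and-next-touch loop I describe above can be sketched in a few lines. This is purely my own toy illustration (the object names, features, and locations are invented, and real columns work with SDRs, not Python sets):

```python
# Toy sketch of prediction-guided sensing: predictions of feature/location
# pairs double as suggestions for where to move the sensor next.
# All objects, features, and locations here are hypothetical.

def candidate_objects(objects, observed):
    """Objects consistent with every (location, feature) pair sensed so far."""
    return {name for name, pairs in objects.items()
            if observed <= set(pairs.items())}

def next_location(objects, observed):
    """Pick the unvisited location whose predicted outcomes best split the candidates."""
    cands = candidate_objects(objects, observed)
    visited = {loc for loc, _ in observed}
    locations = {loc for name in cands for loc in objects[name]} - visited
    # A location is informative when different candidates predict different
    # features there (including "no feature at all", i.e. None).
    def outcomes(loc):
        return {objects[name].get(loc) for name in cands}
    return max(locations, key=lambda loc: len(outcomes(loc)))

objects = {
    "cup":  {"rim": "smooth-edge", "side": "curved", "handle": "loop"},
    "bowl": {"rim": "smooth-edge", "side": "curved", "base": "flat"},
}
observed = {("rim", "smooth-edge")}
print(candidate_objects(objects, observed))   # both objects still possible
print(next_location(objects, observed))       # 'handle' or 'base', never 'side'
```

After feeling only the rim, both objects remain candidates, and the most informative next move is toward a location where the candidates disagree (the handle or the base), not the uninformative side.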


#30

Quick question: since all the minicolumns consume the same feedforward input, is there a preference for one particular type of dendrite? For example, do these feedforward inputs connect only to the proximal basal synapses on the neurons of these minicolumns?


#31

The feedforward inputs to cells in a minicolumn (such as in layer 4) make large synapses that are close to the cell body (proximal). These synapses are estimated to be less than 10% of the total number of synapses.


#32

Very basic questions (I’m only half way through the article):

  1. Once an object is learned, can you pick up any sequence in the middle, or backwards? For example, let’s suppose you touch an object from left to right, and from top to bottom. On a recognition trial, though, you touch it in a different order. Would it still be recognized? Would an SDR learned by touching in one order be different from an SDR learned by touching in a different order?

  2. If four (let’s say) adjacent columns are involved in recognizing the same object, I notice that the SDR that their sets of minicolumns converge to is different for each column. Is each column a stand-alone representation of the same object? If two of the columns disappeared, would you have the same functionality (though slower to recognize an object)?

  3. Is the location signal coming into the input layer an SDR?
    Is the feature signal coming into the input layer also an SDR?

  4. The authors talk about the size of the output layer and its effects on learning, and they also mention the number of cortical columns and its effect on learning. What is the difference between the meanings of the two numbers? I thought the size of the output layer was related only to the number of columns (and maybe the number of minicolumns within them). Is that not correct?

Thanks.


#33

Hi @gidmeister - these are good detailed questions.

  1. The system just uses knowledge of the object and the upcoming location to make its predictions. As a result, the temporal order does not matter. This is an important property that distinguishes temporal sequences from sensorimotor sequences. In our experiments the order during inference is different from the order during training.

  2. The exact set of cells in the SDR for an object will be different in each column. Each column independently recognizes objects, but all the cells for a given object reinforce each other within and across columns. Yes, it would work fine if some of the columns disappeared; it would just be a bit slower.

  3. Yes, both inputs are SDRs. Those are the only two inputs into each column.

  4. Sorry for the confusion. By size of output layer we meant the number of cells in the output layer of each cortical column. You are right that the output size of a full network would be this number multiplied by the number of cortical columns.
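The order-independence in point 1 can be seen in a minimal sketch (my own toy code, not the implementation from the paper; the objects and their pairs are invented). Because recognition intersects the sets of objects consistent with each (location, feature) pair, the order of sensations never changes the outcome:

```python
import random

# Hypothetical objects as sets of (location, feature) pairs.
OBJECTS = {
    "mug":    {(0, "curve"), (1, "handle"), (2, "flat")},
    "sphere": {(0, "curve"), (1, "curve"), (2, "curve")},
}

def recognize(sensations):
    """Intersect, pair by pair, the objects consistent with each sensation."""
    candidates = set(OBJECTS)
    for pair in sensations:
        candidates &= {name for name, pairs in OBJECTS.items() if pair in pairs}
    return candidates

touches = [(0, "curve"), (1, "handle"), (2, "flat")]
for _ in range(5):
    random.shuffle(touches)            # sense the same pairs in a new order
    assert recognize(touches) == {"mug"}
print("order of touches never mattered")
```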

Hope this helps!


#34

Thank you Hawkins et al. for the publication, an exciting read.

Also, @Casey, thanks for presenting that research. Since the authors of the paper did not respond to your detailed explanation:

From what I understood, there isn’t any strong contradiction between the paper and your findings, but you seem to have a different understanding of the exact role of L6 on L5. To summarize, you hypothesize that L6 has a modulatory role, rather than a driving one, which narrows L5 predictions to the ones only achievable through behavior. This is actually also implied by the authors where they quote Constantinople and Bruno, 2013. Is that correct?


#35

The paper says that when a new object is encountered, there is a learning phase where: “a sparse set of cells in the output layer is randomly selected to represent the new object”. This set of cells is not permitted to change while the object is explored. I understand the reason for this, but there is a contrast with the spatial pooler where similar objects have similar SDRs, because they share some active columns.
If you select cells randomly (in the Output layer), then you lose any reason why similar objects should share many cells in that layer.
So in the Output layer, there would be no generalization between similar objects.
By “generalization”, I mean that if you were to clamp a set of columns in the Output layer, they would recreate a union of related objects in the input layer.
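Here is a quick back-of-the-envelope check of my concern (my own sketch; the values of n and w are plausible but assumed): with n cells and w randomly chosen active cells per object, the expected overlap between two object SDRs is w²/n, no matter how similar the objects are.

```python
import random

n, w = 4096, 40                  # assumed layer size and sparsity
expected = w * w / n             # expected overlap of two random SDRs
trials = 2000
total = 0
for _ in range(trials):
    a = set(random.sample(range(n), w))   # random SDR for object A
    b = set(random.sample(range(n), w))   # random SDR for object B
    total += len(a & b)
print(f"expected {expected:.2f} shared cells, measured {total / trials:.2f}")
```

With these numbers the two representations share well under one cell on average, so nothing in the output layer reflects object similarity.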
Am I going off track here?
Thanks


#36

I haven’t really done any research on L6, so I don’t know if the input to L5 has a driving role. The paper says that L6 cells project to L5a; however, the type of L5 cell they are proposing as an output layer (the larger neurons) is essentially limited to L5b in somatosensory cortex, and somewhat in visual cortex.

That means that if L6 is the input layer, there probably must be an intermediate layer, L5a, between it and the output layer, L5b. I think that intermediate step might be required to prevent L5b from making predictions which it cannot generate. For example, if a bird flies by, L5b shouldn’t learn to predict that, since no behavior can reliably cause that.


#37

I agree with your observation. The way we choose an SDR in the output layer does not allow for generalization. This is a problem. We don’t know the resolution of this problem yet. One possible solution, which I happen to be working on this week, is that the upper layers represent specific objects (specific features at specific locations) and the lower layers represent non-specific, or shared, representations of objects. For example, it is possible that the lower layers represent the shape of a currently observed object. Two objects with similar physical shapes could share behaviors. I am not saying we have an answer yet, but the general idea that the upper layers represent specific, aka learned, objects, and the lower layers do their thing independently of whether the object has been learned or not, is feeling pretty good right now. Behaviors would be learned in the lower layers and therefore shared among objects.


#38

Indeed, they should share behavior under manipulations such as rotation (e.g., if I rotate both a cubic box and a die, I expect the sensory input, especially the visual input, to undergo the same transformations).


#39

The key to solving this problem may be in understanding how the transition between different objects should function. A couple of things stand out to me when reading the paper.

The first thing that stands out to me is, “a reset occurs when the system switches to a new object, and a different set of cells will be selected to represent the new object”. This seems very mechanical to me. If the reset is removed from the process, there would need to be some mechanism for handling unexpected feature-locations from the input layers. One scenario is that you are learning new features on the current object. The other is that you have switched to a new object. The implementation would need to handle both scenarios.

The other thing that stands out to me is, “a sparse set of cells in the output layer is randomly selected”. In this case, the columns do not appear to have any function. This eliminates one of the important functions present in temporal memory for detecting when the sequence has changed – bursting. Perhaps a different pooling strategy that first selects columns might be worth exploring. Bursting in the object layer would allow predictions to appear for other known objects which share those feature-location pairs. Transitions between known objects could then occur. It would also address learning new features on existing objects, by modifying the representation of the object as new features are encountered.

This is obviously just conjecture on my part, so I’ll work on putting together some actual implementations to explore the problem further.


#40

It doesn’t have to be too mechanical. If a column sees a completely unexpected input, and the same happens to neighboring columns, it is safe to assume the system has switched to a new object.
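That heuristic can be sketched directly (hypothetical code of mine, not from the paper; the threshold value is an assumption):

```python
# Each column flags whether its current input was predicted; a reset is
# declared only when most columns are surprised at the same time.

def should_reset(surprised_flags, threshold=0.75):
    """Declare a new object when enough columns see unexpected input at once."""
    return sum(surprised_flags) / len(surprised_flags) >= threshold

# One column touching a novel feature on a known object: no reset,
# the column can instead learn the new feature.
print(should_reset([True, False, False, False]))   # False
# Nearly all columns surprised simultaneously: assume a new object.
print(should_reset([True, True, True, False]))     # True
```

This keeps the two scenarios from post #39 apart: isolated surprise means a new feature on the current object, while widespread surprise means a transition to a different object.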