This spring we have been working on extending HTM theory to include behavior and sensory-motor inference. You can think of it this way. The Spatial Pooler and Temporal Memory are able to learn the structure in a time-varying data stream. But time-varying data is only one type of data. Most of the structure in the world is relatively static. The cortex learns the structure of static items by movement. You learn what a coffee cup is by moving your fingers over it and touching it in different locations and/or by fixating your eyes on different parts of the cup.
What we have been doing over the past few months is extending the concepts used in the temporal memory to perform learning and inference via movement. The two forms of inference are similar. With temporal memory, the order in which sensations occur is linear and unchanging, A leads to B which leads to C etc. With sensory-motor inference the order of sensations is not fixed but determined by your own behaviors. Sensory-motor inference has some additional complications over temporal memory but the resulting predictive model is similar. We believe that every section of cortex is learning and inferring both time-varying patterns and sensory-motor patterns. Our current belief is that layer 4 cells are performing sensory-motor inference and layer 3 cells are learning sequences. The two modes of inference share the same spatial pooled representation. Sensory-motor inference requires converting sensor locations, which are in body coordinates, into locations on an object, which are in “object coordinates”. The evidence suggests this is happening in layer 6 which is the primary input to layer 4.
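As a rough illustration of that difference, here is a toy Python sketch. It is not Numenta's implementation and involves no SDRs or neurons: the object names, locations and features are invented, and plain set filtering stands in for inference. It only shows that when each sensation is paired with a location on the object, the order of the touches stops mattering.

```python
# Toy objects: each maps a location on the object to the feature sensed there.
# All names, locations and features are made up for illustration.
OBJECTS = {
    "cup": {"rim": "curved edge", "side": "smooth", "handle": "loop"},
    "can": {"rim": "curved edge", "side": "smooth", "top": "flat"},
    "box": {"corner": "sharp edge", "side": "smooth", "top": "flat"},
}


def infer(sensations, objects=OBJECTS):
    """Return the objects consistent with every (location, feature) pair."""
    candidates = set(objects)
    for location, feature in sensations:
        # Keep only objects that have this feature at this location.
        candidates = {name for name in candidates
                      if objects[name].get(location) == feature}
    return candidates


touches = [("side", "smooth"), ("rim", "curved edge"), ("handle", "loop")]
print(infer(touches))                  # {'cup'}
print(infer(list(reversed(touches))))  # {'cup'}, same answer in reverse order
```

A sequence memory, by contrast, would treat the reversed list as a different pattern, because there the order itself is the thing being learned.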
We have made what I believe are some significant advances in understanding how this all works. We haven’t talked about it publicly as it is very much a work in progress and we haven’t yet written it up in any digestible format. A few weeks ago I did a talk on the whiteboard at Numenta reviewing the new ideas for our employees. Subutai recorded it on his laptop. We decided to post this video. If you watch it please forgive us for the poor quality, it is just an ad-hoc recording of a discussion we had in the office. However, we looked at it and felt that the quality was good enough that some of you would prefer to see it rather than wait for a more formal write-up. We will be happy to try to answer any questions you have about this at next Tuesday’s office hour that Matt is setting up.
p.s. Some of you may be familiar with work we did with sensory-motor inference two years ago. We ran into various scaling issues with that approach. The new approach addresses those issues and introduces some powerful new concepts.
Thank you very much for (once again) having such an open and responsive approach. I can see Jeff is very excited, which lets me know that powerful additions to the “Theory” are burgeoning.
I enjoyed watching this very much, and one of my questions that the video does not really answer concerns the assertion about the columns (mini-columns) learning/acquiring models of the world. One thing I felt I didn’t have adequate information to understand was whether the columns “contain” this model information without the use of basal/lateral connections? …that the “retention” does not come about through the “connected state”?
So my question is, in what form is this model retained by the mini-columns? Where does this model live (within the mini-columns) / How is it stored?
I think this is a big step forward. It’s important to note, and usually unstated when you’re speaking, that what you’re describing is not just a possible process using an unknown mechanism, but a concrete use of known properties of a particular, well-understood mechanism.
I have a couple of questions.
The word “column” has multiple meanings in neuroscience and HTM, so for this topic let’s pretend that “column” only means the Cortical Columns (CCs) as you use them in the talk: 1mm x 1mm patches of neocortex which have some spatial topological relationship with the input space.
You describe the modular structure of a region in terms of CCs which each learn a model of their world. Do you think that each CC largely includes its own multilayer circuit? If so, then your comments about each CC “voting” about the object identity (via L2/3) would also extend to CCs “voting” about feedback to lower layers (via L6) and “voting” for movement (via L5).
Should your CCs correspond exactly to the barrels in barrel cortex, or is a barrel a larger structure involving a number of CCs?
You say that the calculation of the transformations could all be done in L4. This may be true - we know that a single-layer network can, with enough units, do almost anything - but is it not more likely that this processing is mostly happening in L6? It seems plausible that L4 is specialising on integrating sensory information with received spatial location information, and L6 is integrating L5 outputs, inputs from “where” pathways, and feedback, and producing CC-appropriate inputs about object-space location to L4.
You mention the presaccadic “shift of receptive field” which has been observed in visual cortex. I believe this concept has been misnamed, because it suggests that a fixed set of neurons can shift its receptive field in sensor space. What actually happens is that a CC is predicting that a certain stimulus is going to suddenly appear in its fixed receptive field because the intended movement will shift object space in a certain way. Is this your understanding too?
In that vein, it seems plausible that your object space can be represented or proxied by an integrative, relative motor space (relative to some point on the object). This would be much easier to compute and use in neocortex than an abstract object space.
I think you’re correct that a huge amount of knowledge is contained in single CCs and in a single region, but efficiency requirements suggest that hierarchy and network interactions at the region scale are crucial in neocortex, especially when the sensory input is as large as in vision or touch. There is evidence that separate regions concern themselves with representing egocentric spatial location, apparent size, orientation, surface characteristics, object identity, and so on.
This idea of sensorimotor inference appears to be happening at all scales in slightly different ways. Dendrite segments in each neuron integrate temporally-related predictive signals. The HTM minicolumns in a layer attempt to integrate their feedforward inputs with lateral and top-down signals and predict their future activity. The multilayer module in a CC does the same for a whole chunk of its sensory world. A region of CCs participates in a hierarchy/network of regions to form a coherent model of the world based on its modality. And a network of regions forms a coherent model of the world by combining spatial, temporal, causal, behavioural and logical structure as modelled by its regions and their subnetworks.
@cogmission Jeff is referring to Cortical Columns, which are large-scale (1 x 1 mm) patches of cortex in a region. Each one might contain a few thousand minicolumns (HTM columns) of a few dozen neurons. Blame the 1960s neuroscientists for using the one word to mean two very different things.
@fergalbyrne Thank you for that clarification! That is a very interesting (and unfortunate) nomenclature boo boo - though I believe I still need some ironing out of the form of this “model memory”. My question still stands:
How/In what form - does this model get retained? (I think I would also like to hear from Jeff or Subutai or another Numenta representative with more of an “insider’s” perspective?)
I remember Jeff specifically stating that they believe this transformation to be occurring (mostly?) in Layer 6, so yes your observation was confirmed in my mind. I’ll see if I can find the specific video index where he said this…
EDIT: I’m re-watching the video, I did see where Jeff said this was occurring in layers 3/4 like you said… But I also remember Jeff saying some of this transformation was occurring in Layer 6 - so I’m confused. I’ll try to find the specific segment I’m referring to.
LATER EDIT: @fergalbyrne Ok… The only thing I can find is the explanation of Layer 4 inputs Jeff gives at index 31:40, where he describes the percentage of input stimuli coming in from various source locations.
@fergalbyrne Oh! right here! It was in Jeff’s introduction in the thread above!
Is there currently a method for deriving that X transform? The biological version I guess is learned. Have you guys come up with something to do the same? The thread seemed to hint at it.
I pushed Jeff to post this before he traveled this weekend in order to have this video available for the community to watch over the weekend before we have the Office Hour Tuesday.
Unfortunately, that means he’s offline for most of the weekend, so I don’t expect him to be able to respond to your questions until Monday. You can also join the Office Hour Tuesday to converse about this (hopefully we have the room and time to address all the questions). I’ll make sure that the most popular questions and conversations posted here are addressed at the Office Hour. So be sure to click the (like) button on the questions you want addressed.
I’d like to thoroughly support the fact that this was posted now rather than when the cleanest possible version was available. I’ll happily watch the future cleanups too, but the incremental learning of this has huge value too.
Thanks for saying that. We were really unsure about posting it. As you can tell there are many details that are not worked out yet, but in the end, this is all part of our open research philosophy.
BioSpaun, like its predecessor Spaun, is a neural simulation, not an AI/Machine Learning system. As such, it attempts to model neurons, networks and circuits in a much more biologically accurate way, as opposed to HTM, which attempts to describe a functionally equivalent implementation of a portion of the brain (specifically, the neocortex) algorithmically.
In other words, HTM is an algorithmic implementation of what HTM researchers believe is going on in the neocortex. It is not a direct simulation of the biology itself.
On the other hand, Spaun (and BioSpaun) are functional abstractions themselves to a great extent - not nearly as biologically realistic as, say, the Blue Brain Project (BBP), which attempts to model down to the ion channel or even molecular level in some cases. In short, all three of these systems are very different in their implementation as well as their goals. The BBP goal is to model the brain as precisely as possible, the HTM goal is to implement intelligence in a way analogous to how the brain implements intelligence, and Spaun is somewhere in between - leveraging computational neuroscience to try to understand the brain but at a higher level of abstraction than the BBP.
I would second that question.
How do you extract the transform information/operation, and how do you modify the SDR coming from the Spatial Pooler so that it looks the same (transform-free) to the TM?
I am back from traveling. I will try to answer a few questions prior to our office hours tomorrow. I will work from Fergal’s list of questions.
We haven’t yet settled on the language to use for the new theoretical ideas and in the recording I didn’t define my terms carefully. A mini-column is about 100 neurons in a very skinny column that spans all layers of a region. They are about 30µm wide and 2.5mm tall. Mini-columns are a physical reality and we have proposed a function for them in the spatial pooler and temporal memory algorithms. The output of the SP is a set of active mini-columns. The new theory I describe in the video does not change anything about mini-columns. We assume they are still there and performing in the same fashion. What is new is we are modeling L4 and L3 cells in each mini-column, whereas the TM models just L3 cells. L3 cells get almost all their input from other L3 cells (which is what we need for sequence memory). L4 cells get 65% of their input from L6a, some from the equivalent “where” region, some from L2, a few from other L4 cells.
The TM requires a set of mini-columns to work. We typically use 2048 mini-columns, if you go much below that then some of the properties of SDRs start to fail. A couple of thousand mini-columns is the smallest amount of cortex that can actually perform TM. This is roughly equivalent to a patch of cortex 1.5mm x 1.5mm. We didn’t call this a “column” or anything else.
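One way to get a feel for why the count matters is to compute how likely two unrelated SDRs are to look like a match by chance. The sketch below uses the standard counting argument (a hypergeometric tail). The 2% sparsity and the "half the active bits" match threshold are my assumptions for illustration, not numbers from the post above.

```python
from math import comb


def false_match_probability(n, w, theta):
    """Chance that a random SDR with w of n bits active shares at least
    theta active bits with one fixed SDR of the same size and sparsity."""
    total = comb(n, w)
    # Count the random SDRs that overlap the fixed one in exactly b bits,
    # for every b from the threshold up to a perfect match.
    matches = sum(comb(w, b) * comb(n - w, w - b) for b in range(theta, w + 1))
    return matches / total


for n in (2048, 512, 128):
    w = n // 50               # assumed sparsity of about 2%
    theta = max(1, w // 2)    # assumed threshold: half of the active bits
    print(n, w, theta, false_match_probability(n, w, theta))
```

Under these assumptions the chance of an accidental match is vanishingly small at 2048 and climbs steeply as the number of mini-columns shrinks, which is one of the SDR properties that degrades with smaller patches.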
In the new theory we are sticking with the same basic dimensions, just adding more layers. I have been thinking of touch by imagining a small patch of sensors on the tips of each finger. Each patch would feed into a 2048 mini-column patch of cortex. I am pretending there are no other sensors on the hand. This is a simplification, but I believe it keeps the important attributes without throwing away anything essential.
So we now have multiple 2048 mini-column patches of cortex, one for each finger tip. We need a way to refer to them. In the recorded talk I just referred to them as “columns” but we may need a better term. These columns are roughly equivalent to barrel columns in rat.
The important attributes of the “column” are that it receives a bundle of sensory input bits and that all these bits are run through a spatial pooler. There is no “topology” within the column, or put another way, all the mini-columns in the column are trying to inhibit all the other mini-columns. This makes it much easier to understand and model, yet allows us to build systems with multiple columns.
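"No topology" can be read as global inhibition: every mini-column competes with every other one, and the winners are simply those with the most overlap with the input. The following is a minimal sketch of that selection step only. It leaves out learning and boosting, and the input size, connection count and the 40 winners are assumptions chosen for illustration.

```python
import random


def active_minicolumns(input_bits, connections, num_active):
    """Global inhibition: pick the num_active mini-columns with most overlap.

    input_bits  -- set of indices of active input bits
    connections -- connections[i] is the set of input bits that
                   mini-column i is connected to
    """
    overlaps = [len(conn & input_bits) for conn in connections]
    # Rank all mini-columns together; there are no local neighbourhoods.
    # Ties are broken by index so the result is deterministic.
    ranked = sorted(range(len(connections)), key=lambda i: (-overlaps[i], i))
    return set(ranked[:num_active])


rng = random.Random(0)
NUM_INPUTS, NUM_MINICOLUMNS, NUM_ACTIVE = 256, 2048, 40  # 256 and 40 are assumed
connections = [set(rng.sample(range(NUM_INPUTS), 32))
               for _ in range(NUM_MINICOLUMNS)]
input_bits = set(rng.sample(range(NUM_INPUTS), 20))

winners = active_minicolumns(input_bits, connections, NUM_ACTIVE)
print(len(winners))  # 40
```

With topology, the ranking would instead be done separately inside each local neighbourhood, which is what makes the single-column case simpler to reason about.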
Yes. We are trying to understand a 1.5mm x 1.5mm patch of cortex with all layers. That is the goal. The hope is all the important functions of a cortical region are represented in this small patch. It is well known that cells in some layers send axons longer distances within their layer. This occurs in layers 3a, 2, 5, 6a, and 6b. The idea we are pursuing is that the representations in these layers can be a union of possible values and that the “inter-column” projections are a way for multiple columns to reach a consensus on the correct value.
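The consensus idea can be sketched with plain sets. This is only an analogy: set intersection stands in for whatever the inter-column projections actually do, and the object names are invented.

```python
def vote(column_candidates):
    """Return the objects that every column still considers possible.

    Each entry is one column's union of candidate objects, given only
    what that column has sensed so far.
    """
    candidate_sets = [set(c) for c in column_candidates]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


# Three columns (say, three fingertips), each ambiguous on its own.
columns = [
    {"cup", "can", "bowl"},
    {"cup", "can"},
    {"cup", "box"},
]
print(vote(columns))  # {'cup'}
```

No single column has narrowed things down to one object here, but together they have, without any of them needing another movement.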
I am sorry if I was confusing on this matter. The model assumes that L4 cells receive a “location” representation on their basal dendrites. Our current best guess is that this location SDR is coming from L6a (again L6a is 65% of the input to L4). L6a and L6b are massively interconnected to the equivalent layers in the equivalent region in the “where” pathway. The basic idea is cells in the “where” pathway represent a location in body or sensor space, this gets sent to L6a in the “what” region (the region we are concerned with), L6a converts this body-centric representation into an object-centric representation. Similarly, a location in object-centric representation is passed to L6b which is sent back to the where pathway. Somewhere along the way it gets converted to body-coordinates.
If you think about what it takes to move your fingers and to predict the next input you realize that the brain has to continually convert between body-centric and object-centric coordinates. This need is well known in the robotics community; all we are doing is bringing it to cortical theory and trying to understand the biological mechanisms.
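For readers who have not met the robotics version of this, here is a minimal 2D sketch. It is the textbook rigid transform, not a claim about how L6a or L6b do it, and it assumes the object's position and orientation in body coordinates are already known, which is exactly the hard part for cortex. All names are hypothetical.

```python
from math import cos, sin, pi


def body_to_object(point, origin, angle):
    """Express a body-centric 2D point in the object's own frame.

    origin -- where the object's reference point sits, in body coordinates
    angle  -- how far the object's axes are rotated from the body's (radians)
    """
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = cos(angle), sin(angle)
    # Undo the object's rotation (multiply by the transposed rotation matrix).
    return (c * dx + s * dy, -s * dx + c * dy)


def object_to_body(point, origin, angle):
    """Inverse of body_to_object: rotate into body axes, then translate."""
    c, s = cos(angle), sin(angle)
    return (origin[0] + c * point[0] - s * point[1],
            origin[1] + s * point[0] + c * point[1])


# A fingertip at (10, 7) in body space, touching an object whose reference
# point is at (10, 5) and which is rotated a quarter turn.
on_object = body_to_object((10, 7), (10, 5), pi / 2)
back_in_body = object_to_body(on_object, (10, 5), pi / 2)
print(on_object, back_in_body)
```

The two directions correspond loosely to the two conversions described above: sensing needs body to object, and planning a movement to a chosen spot on the object needs object to body.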
There is a lot we don’t understand about the location coordinate transformation. In one possible implementation it requires a transformation of the sensory input as well as a transformation of the location representation. If that is true, I propose it is happening in L4 itself. Over the past few weeks I read some papers that suggest we learn objects in body-centric coordinates but then we mentally “rotate” our models to fit the sensory input. This is more than I can write here. We can leave it as transformations have to occur continually and rapidly yet there is a lot we don’t understand about it.
Yes, that is my understanding. This new theory that a column is actually a model of entire objects provides a simple explanation for what has been a mysterious phenomenon.
I didn’t understand this question.
I know. The specificity people report is somewhat contrary to the very notion of common cortical function. When I think about this specificity I don’t think it is wrong, but I suspect it is misleading. For example, V4 is often associated with color processing. But input to V4 had to come through V1 and V2, the data was there all along and was also processed in V1 and V2. It is not possible that V4 processes color and V1 and V2 don’t. The biggest problem with most of these studies is it is hard to find cells that reliably fire in response to a stimulus. To get around this problem the animal is often anesthetized, and/or the stimulus is made very unnatural. The simplest example is that cells in V1 behave completely differently when an animal is awake and looking at natural stimuli than when not awake and exposed to gratings.
I am looking forward to more discussions on this topic.
That is what I was saying … but this means that whatever comes from the Spatial Pooler has to be transformed … and exactly what the transformation is gets decided outside of the TM.
My question is: how do you extract the transformation operation, so that you can apply it before the SDR enters the TM?
stream ==+------> SP ===> Apply T ==> TM
         |-> extract. T ------^
On nomenclature, perhaps “macrocolumn” or CC (Cortical Column) would be the right name for the 1-1.5mm squared patches? Cortical Column is the neuroscience name used by Hinton in his work on Capsules (his name for CCs) here/PDF which are functionally equivalent at least in his theory. Macrocolumn is a quite common name for the barrel-sized columns, and is what Rod Rinkus uses (he calls them MACs) in his SDR-based Sparsey system (see this post from last weekend).
On my question about coordinate systems, it seems likely that cortex uses distances and directions which are related to the motor actions needed to navigate (palpate or saccade) the object, rather than distances and directions in external units which relate to the intrinsic dimensions of the object itself. For example, a 42" TV at 10 feet would have the same saccading “size” as a 21" TV at 5 feet - the visual system treats both as the same. Similarly, experiments with reversing glasses show that we can very quickly learn to redefine up and down in terms of saccading outcomes.
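The TV example is easy to check with the usual visual-angle formula, angle = 2·atan(size / (2·distance)). This short sketch treats the quoted screen size as the extent being scanned and converts feet to inches. The function name is mine.

```python
from math import atan, degrees


def visual_angle_deg(size, distance):
    """Angle subtended at the eye by an extent `size` seen from `distance`.
    Both arguments must be in the same units."""
    return degrees(2 * atan(size / (2 * distance)))


big_far = visual_angle_deg(42, 10 * 12)    # 42" TV at 10 feet
small_near = visual_angle_deg(21, 5 * 12)  # 21" TV at 5 feet
print(big_far, small_near)
```

Both calls give the same angle, a little under 20 degrees, because only the ratio of size to distance enters the formula. That is the sense in which the saccade needed to cross either screen is the same.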
These motor-defined coordinates are used to perform navigation over the object, so they are a kind of object coordinate system, which is relative to some reference point on the object (e.g. the centre of a TV screen, the central axis of a pen). The “where” pathway will also represent the egocentric position and orientation of the reference point, allowing us to navigate from one object to another, remember where we put an object, reach for one, and so on. [Edit: and a few years ago, I broke my leg stepping down onto a rock because my new glasses made me underestimate the depth by less than 1cm].
On specificity of regions in cortex, the idea does not contradict the generality and commonality of cortical function. What appears to happen especially in visual cortex is that V1 and V2 are huge generalists, extracting as much structure as they can from a very wide data bus. After that, genetic and developmental programs seem to differentiate the kinds of data which flow between regions. This appears (according to the talk below) to be a combination of differential projection and differential synaptic preferences. Since the same differentiation persists within and across species, it must have a genetic component, which is likely reinforced by experience and pruning during development.
It is likely that V4 is better (per neuron) at processing colour, and its output has better colour-related information, than V1 or V2. Similarly, MT is better (per neuron) at motion-related processing. Both these areas feed back to V1 and V2 and no doubt help them with their own processing of colour, motion etc. We can imagine a very early visual cortex which only had V1/V2, and then more specialised areas were added in later as the organism evolved.
Even within V1 itself, there are two predominant designs in mammals: the pinwheel style of bunches of minicolumns which are each sensitive to an orientation, a colour contrast, or a motion (as found in most higher primates), and the salt-and-pepper style where such bundles cannot be detected (as found in rodents). This may relate to the much larger hierarchies in mammals which feed back highly specialised signals to primary cortical regions.
I am sharing this image to confirm the quote, and to make the obvious point that layers of the CC are interconnected with other remote CCs (via white matter axons) and with the limbic and brain stem areas; this is all part of feedforward and feedback to each CC. To me, THIS IS the hierarchy of the brain, which connects sensory-motor and controls the attention to each of the different parts. In all that I am seeing in this thread, the H in HTM seems to be focused on the same CC. What am I missing?
@lilacntulip you’re not missing anything, it’s just a bit more complex than your diagram suggests. Every layer communicates with every other layer in both directions within a CC. Every CC connects with neighbouring CCs by layer and with connections from one layer to another, and there are more long range inter-CC connections in regions. Every region is connected to dozens of other regions, primarily in some stereotypical way seen across individuals in a species, often across species, but also in small but significant amounts, with other regions based on an individual’s genetics and experience.