"On Intelligence" vs recent developments: What's puzzling me (and some thoughts about grid emergence)

Hi everyone,
I’m a newcomer here, having read “On Intelligence” about a week ago, and I’m trying to keep up to date with your more recent research.

On Intelligence was a great read, and I thought it gave me a clear view of the organisation of the neocortex. With the current HTM models and the papers about columns and allocentric representations, however, although I was excited at first, I now feel a bit disoriented, having lost sight of the big picture somewhere along the way.

Now, I understand that the hierarchical stuff between regions got somewhat relegated to the background while you focus on understanding the full capabilities of the (macro)column and the intricate connections between layers. And having each and every patch of cortex, down to the primary sensory (and motor?) levels, able to get a notion of complex objects is mind-blowing. However, this “shift in perspective” comes at a cost, if I understand correctly: capacity limitation problems all over the place. Powerful, granted, but a few hundred possible stable representations per macrocolumn now feels like it would be somewhat cramped to account for our mind’s capacity, especially if the same few very familiar objects are each repeated across most columns of primary regions… I did come across a poster yesterday which in fact hinted at a more feature-focused storage, where full objects would more likely be combinations of features (a coffee cup being a cylinder plus a handle, things like that…), but I’m struggling to find more details about that idea in your fully detailed papers or even popularization videos.

Detailing the feature-oriented insights you have would maybe also allow me to reconcile the idea that SDRs “shall” overlap more when encoding semantically close, ehm… “things”, with the fact that, according to the paper about a virtual hand palpating a virtual object, the current implementation of the simulation does not seem to do that.

Also, I have trouble reconciling the current simulation of sequence memory with what I think our memory abilities should be (a set of beliefs mostly developed while reading “On Intelligence”). Okay, the simulated layers of a current HTM implementation remember sequences. Of notes, say, if the input SDR were to represent notes. But there is nothing like a pitch-invariant form here. In fact, even if the network’s output layers are able to form a somewhat more stable signal than what was input to them, there is nothing like the scale-agnostic, translation-agnostic “invariants” I imagined were at the core of the matter while reading the book. Having learned “Somewhere over the Rainbow” starting with a flat, in the current HTM model, all other notes in the sequence are remembered in the context of that starting flat, but were we to input an F# as a starting point, all guesses from the network would be off.

I understand this could be a simple “okay, we’ll tackle this later” issue. But to my mind there is something more fundamental at work here. Unless you have perfect pitch, you cannot even dream of remembering songs in this fashion. Our ability to remember the “relative” form comes either preferentially, or before the ability to recognize an absolute version, which only a few very trained (or talented?) music enthusiasts have. And I do not believe this is a matter of encoding the raw audio signal in a relative form instead of an absolute one. It seems very likely, from the little I know about ear cells, that the most basic and direct information they can deliver is an absolute (range of) pitch.

So I was going to post about this yesterday, because it was puzzling me, and then I had a possible insight. Automagically recognizing pitch “differences” given an absolute pitch signal requires a quite specific information transform. I was thus wondering if basal dendritic trees on pyramidal neurons could represent transforms such as {(A and B and C) or (D and E and F) or (G and H and I)} (which I realize they can, from the little I’ve read about it, by having a different dendritic segment for each parenthesized expression and a synapse for each letter). That would give them the ability to respond well to a fixed spacing between two absolute values in a given subset of the range. The next cell could specialize in recognizing the same spacing across another subset, the cell after that in recognizing another spacing… until finally there are enough of these cells in the output layer to cover the whole absolute range with all possible deltas.

However, consistently reaching such encodings of constant deltas while letting a multi-segment version of the current HTM model learn on its own seems unlikely. But then I got this déjà-vu feeling, between this problem and the question of how “grid cell” mechanisms could be consistently generated. I don’t really know any details about this, other than that there seems to be something similar at stake. Since your view on allocentric location signals now requires that a mechanism akin to grid cells is taking place all over the, ehm… place, and since my eyebrow-frowning about pitch invariance could be solved by having such spacing-between-values encodings be more commonplace than if left to chance alone in the current HTM implementation, I’m wondering if… I don’t know… a mechanism akin to what regulates sparsity and per-column activity in the spatial pooler would make reaching such stable “delta recognizers” more likely. A mechanism from which grid-cell-like transforms would spontaneously “emerge” all over the place.
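To make the {(A and B and C) or (D and E and F)…} idea concrete, here is a minimal toy sketch (my own illustration, not Numenta code): each dendritic segment is an AND over a couple of absolute-pitch inputs, the segments are OR-ed together, and by wiring each segment to a pair of inputs a fixed distance apart, the cell becomes a “delta recognizer” over a sub-range. All names and numbers are made up for the example.

```python
# Hypothetical sketch: a cell whose dendritic segments each implement
# an AND over a few inputs, OR-ed together, so the cell responds to a
# fixed spacing ("delta") between two active absolute-pitch inputs
# anywhere within a sub-range of pitches.

class DeltaCell:
    def __init__(self, delta, pitch_range, threshold=2):
        self.threshold = threshold
        # One segment per absolute anchor pitch p: the pair {p, p + delta}.
        self.segments = [{p, p + delta} for p in pitch_range]

    def active(self, active_pitches):
        # The cell fires if ANY segment has >= threshold synapses onto
        # currently active inputs (the OR over ANDs).
        return any(len(seg & active_pitches) >= self.threshold
                   for seg in self.segments)

# A cell tuned to a spacing of 4 semitones over pitches 40..59:
cell = DeltaCell(delta=4, pitch_range=range(40, 60))
print(cell.active({45, 49}))   # spacing of 4 anywhere in range -> True
print(cell.active({45, 50}))   # wrong spacing -> False
```

A population of such cells, one per (delta, sub-range) pair, would tile the whole absolute range with all deltas, which is exactly the encoding the paragraph above wonders about.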

Just wanted to share that problem-similarity feeling. I don’t really know what to make of it from where I am, but maybe this could help somehow.

Now that I’m at it, another thing bothering me about sequences is the inability of the current model to account for sub-sequence recognition. If I were a poet, I would very likely play with sub-sequences of words. If I were good at algebra, I would quite likely want to factor out similar subsequences of symbols. As an engineer I’d likely want to abstract away, or design copy devices for, any recognized subsequence of any stuff. And if I were more skilled in music, I believe I could have fun with subsequences of notes as well. Yet nothing in the current HTM implementation of sequence memory allows any of the simulated layers to extract a subsequence of notes from its context, and that context spans all the way back to the first note. Do you believe this is a question to be tackled along with attention mechanisms? Or is there something more fundamental to it as well?

Guillaume Mirey


I share your concerns. You riff a bit on hearing cells having to recognize all sounds; I’ll go you one better with vision cells.

I personally side with object recognition in the H of HTM.


Let’s not forget that the “primary mandate” is to faithfully reproduce the biological mechanisms which actually exist - first. Then after we achieve cognitive capability, we can optimize or improve on nature’s version?

In my mind, the instant we inject anything that is biologically implausible (such as improving or altering where feature recognition happens), we decrease by orders of magnitude the probability that we will eventually arrive at a cognitively capable algorithm. We go down the already trouble-ridden path of conventional, artificially fabricated techniques - and will subsequently hit “glass ceilings” where we just can’t go any further, because of synthetic mechanisms that don’t exist in nature.

I agree with Jeff Hawkins that the shortest and most guaranteed path is to reproduce a working example rather than out-think it (at least initially).


Well, the whole approach of “let’s understand intelligence first” rings nice to my ear, of course; otherwise I wouldn’t have been reading and watching all things Jeff Hawkins or Numenta related, trying to digest all of this since I stumbled upon your endeavor.

Sorry if I haven’t been able to write things clearly, but I’m not suggesting injecting any overtly artificial construct into the model in any way. Quite the contrary: I’m a little lost regarding how the latest publications would come to explain what I intuit of my inner workings, a feeling I never experienced while reading JH’s book initially.
(I’m not saying the current HTM inference or sequence memory model is incorrect either, only that I’m having trouble making sense of it. I’m pretty much a layman here.)

best regards,


Great conversation. Guillaume, thank you for raising lots of good points. Here are a few things to think about.

Capacity: We did a lot of testing of capacity using the sequence memory. A network of 32K cells (one layer, one column) was able to learn close to 1M (million) transitions before starting to fail. We were happy with how big the capacity was. Note that the sequence memory does not have a “pooling layer”, that is, a layer that is stable over the sequence. It is also important to know that the active cell states of the sequence memory are unique to both the sequence and the location in the sequence. You could classify each “note” in the sequence if you wanted to, because each note’s representation is unique to that sequence.
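The “unique to both the sequence and the location” property can be illustrated with a toy sketch (this is just an illustration of the property, not the actual HTM algorithm, which realizes it with sparse cell states rather than explicit histories):

```python
# Toy illustration: give each element a state that depends on its
# whole preceding context, so the same note in two different
# sequences gets two distinct internal states.

def context_states(sequence):
    states = []
    context = ()
    for note in sequence:
        states.append((note, context))  # state = note + full history
        context = context + (note,)
    return states

s1 = context_states("ABCD")
s2 = context_states("XBCD")
# 'C' appears in both sequences, but its states differ because the
# contexts differ, so each occurrence is classifiable on its own:
print(s1[2])  # ('C', ('A', 'B'))
print(s2[2])  # ('C', ('X', 'B'))
```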

In the 2017 paper on sensory-motor inference we added a second layer, for pooling. This layer was the limiting factor in how many objects the system could recognize, about 500 in the tests we showed. We now realize we got something wrong in that paper. We were using location representations that were not unique to the object. That is, you could have the same location on multiple objects. By studying grid cells we now know that locations will be unique to both the object and the location on the object, similar to the sequence memory. The models we are working on today are based on grid-cell-like representations of location, and they don’t need a pooling layer. Thus it should be easy for a column to learn 1M location/object pairs. You could think of this as 10K objects with 100 features each.
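The back-of-envelope behind those last two figures, spelled out (the numbers are the ones quoted above, nothing more):

```python
# Capacity arithmetic from the paragraph above: if a column can store
# about 1M location/object pairs, and each object is sampled at about
# 100 locations (features), the per-column object capacity follows.
pairs = 1_000_000          # location/object pairs per column
features_per_object = 100  # sensed locations per object
objects = pairs // features_per_object
print(objects)  # -> 10000 objects per column under these assumptions
```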

Composite Objects: Another thing we are working on is how a single column learns objects as being a composition of other objects. E.g. my coffee cup has a Numenta logo on it. To learn the cup I don’t need to relearn what the logo looks like, I just have to learn the logo once and then associate the previously learned logo at a location and orientation on the cup. We believe this is also being done with grid cell-like tricks, all within the same column. Composite structures help with capacity too. We don’t have to learn the sensation at every location. A new object is a composition of sub-objects. BTW, I would appreciate if you could share the paper you mentioned regarding compositionality.
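A data-structure sketch of that composition idea (my own hypothetical rendering of the description above, not Numenta’s model): an object stored as a list of previously learned sub-objects, each at a relative location and orientation, so a reused part like the logo is learned only once.

```python
# Hypothetical sketch: objects as compositions of sub-objects at
# relative poses, instead of raw sensations at every location.

logo = {"name": "numenta_logo"}   # learned once, reused everywhere
cylinder = {"name": "cylinder"}
handle = {"name": "handle"}

coffee_cup = [
    # (sub-object, relative location, relative orientation in degrees)
    (cylinder, (0, 0, 0), 0),
    (handle,   (5, 0, 2), 90),
    (logo,     (2, 3, 4), 0),
]

# Learning a new mug with the same logo reuses the logo model; no
# per-location relearning of the logo's sensations is needed:
travel_mug = [(cylinder, (0, 0, 0), 0), (logo, (1, 2, 3), 0)]

shared = ({part["name"] for part, _, _ in coffee_cup}
          & {part["name"] for part, _, _ in travel_mug})
print(shared)  # sub-objects shared by both compositions
```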

Generalization: I agree that the current sequence memory and the published sensory-motor object memory systems do not generalize. In fact the only thing that takes advantage of SDR generalization properties is the spatial pooler. This has bothered us for some time. We now believe that the compositionality mechanism is a major source of generalization. Let’s say an object is a composition of ten sub-objects (the sub-objects on a cup could be simple, such as a circular lip or a handle, or complex, like a logo). We believe that layer 5 forms a union of the sub-objects which includes their relative positions. As you attend to the different features of the cup, this union in layer 5 will slowly change, reflecting the most recently attended sub-objects. Now suppose you are looking at a novel object. You don’t recognize the novel object, so you attend to smaller and smaller parts of it until you recognize its sub-objects. This will build up a union of the sub-objects and their relationships in L5. You now have a mechanism, in L5, to see that two different objects share some sub-objects and their relative positions. We are working on the detailed mechanisms for how this works, but I believe this is the key to generalization. It allows you to see two objects as having similar structure, and would also allow you to infer behaviors of the novel object.
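The union-in-L5 idea can be caricatured in a few lines (a hedged sketch of the mechanism described above, with made-up parts and positions; the real proposal involves SDR unions, not Python sets):

```python
# Sketch: attending to an object's parts builds up a union of
# (sub-object, relative position) pairs; overlap between two such
# unions then measures structural similarity between objects.

def attend(obj_parts):
    union = set()
    for part, rel_pos in obj_parts:
        union |= {(part, rel_pos)}   # union grows as parts are attended
    return union

cup   = attend([("lip", (0, 5)), ("handle", (3, 2)), ("logo", (1, 3))])
novel = attend([("lip", (0, 5)), ("handle", (3, 2)), ("spout", (4, 4))])

overlap = cup & novel   # shared sub-objects at shared relative positions
print(len(overlap) / len(cup | novel))  # Jaccard-like similarity: 0.5
```

Two objects sharing sub-objects and relative positions end up with overlapping unions, which is what lets the novel object inherit expectations from the familiar one.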

The issue of melody pitch invariance is a topic for another day. I will just point out one datum. The evidence suggests we are born with perfect pitch and learn pitch invariance. Also, there is evidence that when we learn a new song we learn it in its original key (we are much more likely to sing the song in that key). However, if we hear the song in a different key, we don’t notice the change.

Hierarchy: This is another big topic. We are not proposing that hierarchy does not exist, nor that hierarchy doesn’t play a role in compositional structure. The main point we are proposing is that all columns are learning complete objects, which is contrary to current opinion. Many columns will learn models of the same object, but not all columns will learn all objects. Also, we now have a basis for explaining many of the long-range connections in the cortex that don’t fit the classic hierarchical model.


Wow, thank you Jeff for your time and such a reply.

What you say about capacity clarifies a lot of things. Roughly 10K objects per square millimeter of cortex, along with a composition mechanism, I can indeed grasp as being more than enough to operate on a virtually unlimited space of representations.

There seems to be a misunderstanding about the poster I was referring to, since you’re credited on it along with Marcus Lewis. I was only regretting that I could not find any mention of such feature composition in more detail, other than on that single top-left slide. Now I also understand that composition is an area of active research for your team. As someone once wrote on another thread, if I may have a say in the matter, please do not wait too long for your insights to be polished before developing the topic :wink:

I may be able to get an intuitive feel for generalization based on composition mechanisms, thanks for sharing that. I may have to ponder on that stuff for some time. This is dense. Thanks again.

Oh… I was totally unaware of that. That seems… counterintuitive, but I believe this is enough in itself to sweep away most of my concerns on the matter. Still, this raises the question of how it could be learned, which somehow loops back to the pervasive delta-recognizing mechanism I was mentioning. I wish I understood more about grid cells at this point, to get a more precise feeling of whether the two issues are related.

Not once did I think you were! I acknowledge your focus is on something else.

Many thanks!


I’m interjecting, but it’s worth noting that the ability to watch/observe conversations and discussions like this is one of the reasons I appreciate the online community here, and Numenta in general.

Looking forward to more L5 research, as I think many of us are excited about the possibility of understanding how our brain keeps track of cross-frame object knowledge (thinking of the older, symbolic Lisp systems of yesteryear), allowing a generalization. Maybe that generalization would lead to the G in AGI (crossing fingers).


Good day everyone.
I realize my motivation level today is a bit low, procrastinating on reading those papers about grid cells and roaming here and there on the forum without a clear goal. I guess I’m still in limbo about allocentric locations, waiting to see whatever Numenta comes up with.

Time to get into action… My trade is software engineering, yet I’m far from fluent in Python… sorry. Something I could do, however, is try to implement some version of an HTM layer in, say, C++. I know it’s been done, but getting my hands into the guts of some code of my own seems a good way to get a concrete grasp on the matter. I also have a few specific implementation ideas I wanted to try.

I was thus wondering: do any of you see interest in any of these approaches?

  • Minimizing the memory footprint of dendritic segments:
    • one “main” address per segment, leaving only a small relative index to the presynaptic cell to be encoded on each synapse.
    • a very tiny (like… 3 bits => 8 states) version of the synapse permanence value, using stochastic increments/decrements instead of float values.
  • An obviously topological layout (since dendritic segments following the proposition above are anchored to a subset of the possible inputs), with the following optional twists:
    • either semi-continuous (by “semi” I mean that things like inhibition are still somewhat based on fixed tiles to reduce the performance cost, but segments themselves can overlap those tiles), or
    • possibly fully continuous, implementing inhibition mechanisms with… a heatmap, maybe? instead of sorting.
    • and… a hexagonal lattice for minicolumns and, by extension, columns. This brings some complexity and would impact performance somewhat… and I know the current hype around hexes is mostly due to grid cells, which are not meant to relate to the topology of the cortex itself, but rather to locations in the environment… However, would it be plain stupid? There may be something interesting happening here, at least when we envision those columns as tightly packed small cylinders, as in Matt’s videos of the neocortical “napkin”… we could see an analogy. There could also be a real physical analogy if the inhibition chemistry/connectivity involved broadly circular patterns.
Besides, I’ve always loved hexes :blush:
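For what it’s worth, the 3-bit stochastic permanence from my first bullet could look like this (a rough sketch of my own idea, in Python only for readability; class and constant names are made up):

```python
import random

# Sketch of the low-precision permanence idea: a 3-bit value (0..7)
# updated with a stochastic increment/decrement in place of small
# float deltas. A float delta of 0.05 on a 0..1 scale becomes
# "move one 3-bit step with probability 0.05 * 7", so the expected
# movement matches the float update while storing only 3 bits.

class TinyPermanence:
    LEVELS = 7        # 3 bits -> values 0..7
    CONNECTED = 4     # threshold for a connected synapse

    def __init__(self, value=3, rng=None):
        self.value = value
        self.rng = rng or random.Random()

    def update(self, float_delta):
        step_prob = abs(float_delta) * self.LEVELS
        if self.rng.random() < step_prob:
            step = 1 if float_delta > 0 else -1
            # Clamp to the representable 3-bit range.
            self.value = min(self.LEVELS, max(0, self.value + step))

    def connected(self):
        return self.value >= self.CONNECTED

syn = TinyPermanence(value=3, rng=random.Random(42))
for _ in range(100):
    syn.update(0.05)    # repeated small reinforcement
print(syn.connected())  # saturates high: True
```

Individual updates are coarse, but unbiased on average, which is the usual argument for stochastic rounding in low-precision storage.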

I’m not promising any result any time soon, mind you… just wanted to check first if any of these ideas sounded nice to someone (other than myself).

You don’t have to put any “extra” effort into forming grid cells. Lateral mutual excitation is sufficient. Inhibition (like your fixed tile suggestion) can be helpful.
I describe this in great detail here:

If you carry on doing your own thing in C++ you might want to look at this:


Awesome stuff, thanks, Bitking!

In light of this, my few implementation ideas do not seem so novel anymore.
A hex lattice could still help in fast-approximating a fixed range for axons(?)/dendrites without computing a Euclidean distance… hmm… although this is not such a big deal, as it is mostly an init-phase or very-low-frequency computation. Oh well…
[If circular ranges are used during inhibition, however, this may be a powerful optimization.]
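For the record, the distance-without-square-root point holds: on a hex grid in standard axial coordinates, the cell-to-cell distance is a few integer operations, so a “circular” inhibition radius is just a hex-distance test. A minimal sketch (the well-known axial-coordinate formula, nothing specific to HTM):

```python
# Hex-grid distance in axial coordinates (q, r): integer-only, no
# square root, so range tests during inhibition stay cheap.

def hex_distance(a, b):
    dq = a[0] - b[0]
    dr = a[1] - b[1]
    # Equivalent to the cube-coordinate max-distance formula.
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

print(hex_distance((0, 0), (2, -1)))  # -> 2
print(hex_distance((0, 0), (3, 3)))   # -> 6
```

An inhibition pass could then keep, for each winner, only neighbors with `hex_distance(winner, cell) <= radius`.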

I’m not finished reading through all this, but I can see it’s quite on the spot with my current reflection. Thanks again.
