Is input prediction the actual purpose of a column?

A TM at time t is simply trained to predict its input at time t+1.
From an ML perspective, this is very restrictive.
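As a toy illustration of how narrow that training signal is, here is a minimal sketch (hypothetical code, not htm.core): the learner's only supervision is the next input itself.

```python
# Minimal sketch of "predict your own input at t+1" (illustrative, not htm.core).
class NextInputPredictor:
    """Learns P(input[t+1] | input[t]) from a stream - and nothing else."""
    def __init__(self):
        self.transitions = {}  # prev_symbol -> {next_symbol: count}

    def step(self, prev, current):
        # The training signal is just the next input itself.
        counts = self.transitions.setdefault(prev, {})
        counts[current] = counts.get(current, 0) + 1

    def predict(self, current):
        counts = self.transitions.get(current)
        if not counts:
            return None
        return max(counts, key=counts.get)

tm = NextInputPredictor()
stream = list("ABCABCABC")
for prev, cur in zip(stream, stream[1:]):
    tm.step(prev, cur)

print(tm.predict("A"))  # learned: A is always followed by B
```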

Let's take reinforcement learning (RL), which anyone aiming to replicate biology should not ignore: RL attempts to produce agents that learn/optimize their behavior within their environment, just like biology does.

For this purpose the TM's past-dependent prediction abilities have great potential.

In RL the agent receives an observation as input and must (learn to) respond with an optimal action, one that maximizes future reward.

The three - observations, actions, rewards - are semantically different. There are various strategies in RL; the problem is that most of them do not care about explicitly predicting the input at t+1. They either try to predict a "good" action directly, or search for an optimal action by estimating the values of various actions in the context of the current observation.

Only in some models does the algorithm predict the next input (observation) from the current observation and a potential action, which is later valued.
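A hedged sketch of the value-based style described above (the names and the toy value table are illustrative, not from any specific RL library) - note that nothing here predicts the next observation:

```python
# Illustrative sketch: value-based action selection, no input prediction involved.
def value_based_action(q_table, observation, actions):
    """Pick the action with the highest estimated value for this observation."""
    return max(actions, key=lambda a: q_table.get((observation, a), 0.0))

# Toy learned values: in observation "s0" the agent has learned "right" pays off.
q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
best = value_based_action(q, "s0", ["left", "right"])
print(best)  # -> right
```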

The only way one can use a TM for RL is to feed it either:

  • (observation, useless prev action)
    to recommend
    (next observation, next action)
  • (observation, potential action, past action value)
    to predict:
    (ignored observation, step-ahead action, next value)
  • In two steps:
    (observation, action) → (next observation, future action)
    (next observation, past value) → (next next observation, value)

Bold above marks what is either needed as input or desired as output, and italics mark either I-wish-I-could-ignore inputs or have-to-ignore outputs.

Because a faithful TM must obey the biological model: it predicts only its inputs, and it must have an input signal for every one of its output signals.

The usual response to this problem is: "so what, put everything in as both input and output and there you have it - you can do RL with a nice, biologically faithful, square TM".

But the problem with that is that the observation vector can be orders of magnitude larger than the action vector or the single scalar value.

The majority of columns are then wasted (in computation and memory) predicting something the agent doesn't actually need.
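A back-of-envelope sketch of that waste (all the widths below are illustrative assumptions, not measurements of any real encoder):

```python
# Illustrative arithmetic: if a square TM must predict its whole input,
# most columns end up serving the observation, not the action or value.
obs_bits = 1024      # assumed encoded observation width
action_bits = 32     # assumed encoded action width
value_bits = 1       # a single scalar value

total = obs_bits + action_bits + value_bits
wasted_fraction = obs_bits / total
print(f"{wasted_fraction:.0%} of columns predict the observation")
```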

Algorithmically/computationally there is no actual constraint against some topology other than the strict rectangle of rows & columns.

When one asks "Why isn't HTM more popular in ML/AI?" - here's one good reason. It is resource-demanding for faithfulness reasons instead of algorithmic ones.

And now the big question mark in the title:

What if the actual purpose of a column is not to predict its own future input?

Imagine you are god, or nature, trying to build a brain with wires and electric signals (aka neurons). For that you figured out columns, which take several inputs as cues in order to predict:

  • “I think I’ll see a cup!”

Why would it need a following reply stating:

  • “Yes you now see a cup!”

Does that make sense?

And there is a simple reason for that: when your mind is built exclusively of electric wires, you have to physically wire the reinforcing signal back to whoever made the correct assessment.

The actual dialogue might be:

  • “I think I see a cup!”
  • “good boy!”

TL;DR: what we assume is input in biological columns could be just reinforcing (learning) feedback, simply because the column needs one in order to improve its prediction.
A high five for taking a good shot.

In software we don't need that - we can directly update synapses based on the correct/incorrect predictions of the corresponding neurons/cells/columns.
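A minimal sketch of that software shortcut (the threshold, increments, and function names are illustrative assumptions, not HTM's actual learning parameters):

```python
# Illustrative sketch: update permanences directly from whether the
# prediction was confirmed - no physically wired feedback signal needed.
CONNECT_THRESHOLD = 0.5  # assumed permanence needed for a connected synapse

def update_synapses(permanences, presynaptic_active, prediction_correct,
                    inc=0.1, dec=0.05):
    """Strengthen synapses from active presynaptic cells if the prediction
    held; weaken them if it did not. Permanences stay clamped to [0, 1]."""
    delta = inc if prediction_correct else -dec
    return [
        min(1.0, max(0.0, p + delta)) if active else p
        for p, active in zip(permanences, presynaptic_active)
    ]

perms = [0.45, 0.45, 0.45]
perms = update_synapses(perms, [True, True, False], prediction_correct=True)
print(perms)  # the two active synapses are pushed past the connection threshold
```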


Agreed with your conclusion, except that I'm not sure you're right in alluding that predicting the next observation is inefficient. I'm thinking that the brain uses that training mechanism to create world models, which it can then use to simulate multiple future scenarios (sets of sequences of observations) to take the best course of action. It's sort of like training an autoencoder in DL. I'm not sure whether creating world models is the most efficient thing for the brain to do (or whether world models have much use in the brain), or whether there's another way to create world models without doing something like an autoencoder, but I'm guessing that that's roughly what the brain is doing, and therefore it's at least somewhat efficient.



Here are a few clarifications.

  1. My title and follow-up are inadvertently exclusive. I meant an OR, not an XOR, between the two. Since the anatomy of the column is identical whether it predicts its next input or a correlation somewhere else across the cortex, the most likely case is that some columns predict the next-step input while others are used in a more… introspective feedback. Like a generic component suitable for multiple uses.
    It should read "Is input prediction the only purpose of a column?"
  2. The world-model hypothesis above. If we look at actual implementations, we see that the purpose of WMs is to shrink input data. That's what the encoding half of the autoencoder is used for - to shrink a 10k or 1M pixel image to a manageable vector of 250 or 1000 "features".
    Why? Because both network size and the time needed to figure out correlations in an input frame increase quadratically with the size of that frame.
    And that's why I'm pretty sure that having an "autoencoder" that not only cannot shrink its input but also merely anticipates it with… what? 5 ms of anticipation makes very little practical sense. You'd need… 100 TMs stacked on top of each other for a meager 0.5 seconds of anticipation of the future?
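For what it's worth, that 100-TM ballpark is just this arithmetic (taking the 5 ms per level as the assumed step, which is the figure quoted above, not a measured constant):

```python
# Illustrative arithmetic behind the "100 TMs stacked" ballpark.
step_ms = 5                          # assumed anticipation gained per TM level
target_ms = 500                      # half a second of lookahead
levels_needed = target_ms // step_ms
print(levels_needed)                 # -> 100
```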

If you want to stick to the standard TM purpose, you have to hypothesize either:

  • some extra component apart from TM that does the compression
  • some hard-to-imagine way of wiring/stacking TMs that somehow resolves the shrinking.

While simply dropping a single assumption about the minicolumn's purpose opens it up to a whole lot more possibilities without changing its anatomy. And those possibilities are much more attractive conceptually to a whole lot of ML/AI researchers/designers/inventors. Including the master inventor of all, evolution itself.


In a hierarchical scheme there is no difference: higher-level column input is always from lower levels of the cortex, mediated by higher-order relays in the thalamus.

Compression here would be through competition among higher-level columns; there should be fewer of them than on lower levels. So the nodes in the "autoencoder" here are columns rather than neurons.
Then there is temporal compression: one higher-level column is likely to correspond to a succession of previously extinguished lower-level columns.
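A loose sketch of that temporal compression (the fixed-size chunking below is an illustrative stand-in for whatever the cortex actually does):

```python
# Illustrative sketch: one higher-level "column" stands for a whole
# succession of lower-level winners, shrinking the stream in time.
def temporally_compress(lower_level_winners, chunk=3):
    """Collapse each run of `chunk` lower-level columns into one
    higher-level token."""
    higher = []
    for i in range(0, len(lower_level_winners), chunk):
        higher.append(tuple(lower_level_winners[i:i + chunk]))
    return higher

lower = ["c1", "c2", "c3", "c1", "c2", "c3"]
print(temporally_compress(lower))  # 6 lower-level events -> 2 higher-level ones
```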

The laziest inventor imaginable.


In the animal brain, the physical sense organ presents sensory input to the brain in a compressed format. The retina and optic nerve, the tympanum and auditory nerve, the skin and the spinal cord have evolved over millions of years to make good use of available bandwidth and latency. The cortex builds on what went before.

A TM-like model that deals in SDRs cannot expect to receive those directly from sense organs without some kind of intermediate processing, surely? I think your ‘extra component’ is inevitable.



Look at it from a time-propagation perspective: how LTP and LTD effects work at various firing frequencies, and the effect these shorter or longer intervals have on the preceding and post synapses in a network. Higher frequencies change what wires together, as they can potentially bridge sequential firing points in a network to within LTP range, or into a new synapse. I believe (and am testing) that this is part of the way in which hierarchical concepts are formed (part of what sleep is for). The current methods do not necessarily embed a relative time into the network - i.e. the effects of signal-decay interaction with differing firing sequences and frequencies. Imagine what the parallel memory-recall effects of REM would have on the signals and network relative to the LTP effect. At this point I go back to your quote: "hard to imagine".

The real key, I believe, is that our biological internal sequence is not always temporally aligned to the external sequence of events. Our internal representation of "now" is not just the current sensory state; it includes replay, which is itself compressed and more parallel than the original temporal stream of sensory input (and internal feedback loops). Where and how does the replay interact with new sensory input? We need replay, otherwise a few hundred ms later all the signals have decayed and we have forgotten our current state.

Verbal and written language relies on temporal shifting. Think of how words in a sentence are understood, and the effect the last word can have in creating a thought recursion about what passed previously, you idiot. :slight_smile:


I like that. Yeah, I'll try this next, doh, this next, doh, this next, doh… Maybe if Darwin had come a bit later he would have called it the Homer development cycle.


I mistook your post's meaning. I don't really know the answer, but my personal understanding is that if columns are predominantly working in a hierarchical manner, then only some of them would predict the (encoded/processed, observed/raw) inputs, while others would predict the intermediate/hidden inputs they receive from other columns in the hierarchy. The subcortical structures are mainly there to facilitate such predictions, I think.

And that's why I'm pretty sure that having an "autoencoder" that not only cannot shrink its input but also merely anticipates it with… what? 5 ms of anticipation makes very little practical sense. You'd need… 100 TMs stacked on top of each other for a meager 0.5 seconds of anticipation of the future?

I share bkaz's thoughts above on the shrinking part. Very well summarized. So I'm assuming it can shrink inputs over time (sequential input) and space (e.g. images), for example. It may have a fixed SDR describing an observation across some short span of time, space, or abstraction. I'm not sure if this is feasible using the 'standard TM purpose' you mentioned, so we may be thinking about different types of TM with different features. Just wondering how you came across (what I assume is a ballpark figure of) 100 TMs stacked together?


Am I late to the discussion? But is it understood that predictions made from the perspective of the individual need to include those of people in the vicinity? So the agent will make all its predictions and observations from the perspective of other people as well.
Which is where the internal monologue derives from: predicting what other people will do (including speech acts) from their perspective, and learning a model of reality from it. That's why accents are local, in the sense that you will speak like an American if raised among American people.
This would demystify the internal monologue as just another model that is being predicted.


In fact, part of my work deals with a GAN-like system where the agent treats its own behaviour as fake and other humanoids/us as real.

It then tries to fool a discriminator into believing it is real, by modifying its actions in the direction of being more humanoid.

In my system I form a grammar from activations, and if the discriminator fires according to the rules of that grammar, then the input frames from the agent's vision would have shown human behaviour. While the agent is performing actions by itself, the discriminator is trained to induce random activations.

During the robot training phase, the robot must act by itself and satisfy the rules of the grammar, as that is what is rewarded by the global reward (I found a way to do this).

Additionally, the agent is fed a binary vector into its network showing the activations of the discriminator (assuming a spiking model).

What this does is: not only does the agent have feedback in the form of the reward signal, but the particular activations of the discriminator contain information on exactly what the agent has to change in order to solve the reward-maximisation problem. This is written in the activations with the grammar, and grounded in the reward.

If this is input as well, it can help guide the agent towards human-like behaviour.

My hypothesis is that seeing other people's viewpoints, while it sees those other people from its own viewpoint, somehow induces grid and place cells.


I just read my last post and it's unclear.

There is a discriminator that doesn't output real or fake in the traditional way.

It consists of a spiking network. After training the discriminator, the pattern of spikes fires according to the grammar for frames showing humanoid behaviour, while firing randomly when shown frames of the robot in action.

I will say what that grammar is. Nodes are assigned frequencies according to the layout of a musical keyboard; activations that produce the most consonance are rewarded while training on the real humans, and activations that produce the most dissonance are rewarded when training on the robot's actions.

Then with this system we train the robot by passing its actions, in a visual format as frames, to the spiking network just as before - but this time training the robot to maximise the consonance in the spiking network.

Doing this can only happen if it acts more and more like the real humans.

We know that different articulations of the grammar in the spiking network may produce similar consonance levels, but they occur at different points, or after being shown different activities. After training the discriminator, the activations would contain information related to how far short the agent falls of getting the maximum reward.

If we input the discriminator's activations to the agent as well (as a binary vector), it will have access to this information too, almost like it's being told what to modulate in a grounded language - grounded in the reward.
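A very loose sketch of such a consonance score (the keyboard-frequency assignment and the interval scoring below are crude stand-ins of my own; the poster's actual measure is not specified):

```python
# Illustrative sketch: assign nodes pitches as on a keyboard, then score a
# set of active nodes by whether their pairwise intervals are consonant.
SEMITONE = 2 ** (1 / 12)

def node_frequency(node_index, base=440.0):
    """Map a node index to a pitch, as if laid out on a musical keyboard."""
    return base * SEMITONE ** node_index

def consonance(active_nodes):
    """Count consonant pairs (unison, octave, fifth, fourth, major third)
    for, and all other intervals against."""
    consonant_semitones = {0, 12, 7, 5, 4}
    score = 0
    for i, a in enumerate(active_nodes):
        for b in active_nodes[i + 1:]:
            interval = abs(a - b) % 12
            score += 1 if interval in consonant_semitones else -1
    return score

print(consonance([0, 7, 12]))  # root, fifth, octave: fully consonant
print(consonance([0, 1, 6]))   # minor second and tritone: net dissonant
```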


Higher-order cortical regions can receive direct sensory input, so rather than only predicting intermediate/hidden inputs, it could be a mix of those inputs with more raw inputs.

Higher-order regions don't necessarily receive an exact duplicate of the raw inputs which the primary cortical regions receive, though. The input can come from different kinds of sensory receptors, and/or be organized more diffusely, with larger receptive fields.

Maybe part of the solution is the object layer. It doesn’t need to see all parts of the object to represent the full object. In that sense, it can predict very far into the future, because it’s predicting parts of the object.

Also, temporal memory and the object layer work quite similarly, so perhaps merging them somehow could allow long-term sequence prediction. I think @dmac talked about merging them somewhere.

Another thought. The object layer is linked to attention (it represents one object at a time, I guess), which is linked to RL.

Replay is an interesting way to frame feedback loops, as opposed to just global oscillations.

There's also phase precession in the hippocampus/EC system: while walking through overlapping place fields, on each theta oscillatory cycle, the sequence in which the animal entered those overlapping place fields replays.


I see. Just wondering whether all cortical columns receive direct/lightly-processed sensory inputs, or whether some cortical columns receive input only from other columns, in place of any sort of sensory input? If all columns receive direct/lightly-processed sensory inputs, then the 'height' of any hierarchy would be quite shallow, since I don't think low-level inputs and highly processed information mix that well.


Thanks all for the feedback. The one from Bkaz (or presence?) clicked, with the hint that the SpatialPooler is capable of clustering - it can compress a relatively large input frame into a smaller SDR, and each bit in the SP's output SDR potentially represents a different… pattern of several bits in its input frame.
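A minimal sketch of that SP-as-compressor reading (the sizes, random potential pools, and k-winners rule below are illustrative assumptions, not htm.core's actual implementation):

```python
# Illustrative sketch: a fixed random projection plus k-winners-take-all
# maps a large input frame to a small SDR; each output bit responds to a
# pattern of several input bits.
import random

random.seed(0)
INPUT_BITS, OUTPUT_BITS, ACTIVE_K = 256, 32, 4

# Each output "column" samples a fixed random subset of the input frame.
potential = [random.sample(range(INPUT_BITS), 16) for _ in range(OUTPUT_BITS)]

def spatial_pool(input_sdr):
    """Return the k columns whose sampled input bits overlap the frame most."""
    overlaps = [sum(input_sdr[i] for i in pool) for pool in potential]
    winners = sorted(range(OUTPUT_BITS), key=lambda c: overlaps[c])[-ACTIVE_K:]
    return sorted(winners)

frame = [1 if i < 64 else 0 for i in range(INPUT_BITS)]  # a 256-bit input frame
sdr = spatial_pool(frame)
print(len(sdr))  # compressed to a 4-of-32 SDR
```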

Still, since architecturally what is now deemed in TM as "input" IS what is learned to predict,
I don't see any reason why some columns should not get their input from the "real input" while other columns get theirs from somebody else's (e.g. some other TM block's or "full column's") expectation.

And so the parallel signal from the top becomes some desired thing from the task: "I want you to look for whatever predicts this thing."

Makes much more sense to me.

Take a look at the Forward-Forward idea - it is an alternative to backpropagation proposed by Hinton.
Instead of backpropagating a desired "label" from top to bottom, it inputs both labels and inputs, and the network is adjusted from bottom to top in such a way that when input and label are paired correctly, the next level's activations are higher than when the same input is paired with an incorrect label.

So in certain contexts (like NO backpropagation) it actually makes sense.
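A toy, single-layer sketch of the Forward-Forward recipe as described above (dimensions, learning rate, and the training loop are illustrative choices of mine, not Hinton's implementation):

```python
# Illustrative Forward-Forward sketch: feed (input, label) in together,
# locally raise "goodness" (sum of squared activations) for correct
# pairings and lower it for incorrect ones. No backpropagation.
import random

random.seed(1)
IN, HID, LR = 8, 4, 0.05  # 4 input bits + 4 label bits; assumed sizes
W = [[random.uniform(0.0, 0.5) for _ in range(IN)] for _ in range(HID)]

def forward(pair):
    """One layer of ReLU units over the concatenated (input, label) vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, pair))) for row in W]

def goodness(pair):
    """Goodness = sum of squared activations."""
    return sum(a * a for a in forward(pair))

def ff_update(pair, positive):
    """Local, layer-wise update: push goodness up for real pairings,
    down for fake ones."""
    acts = forward(pair)
    sign = 1.0 if positive else -1.0
    for h, a in enumerate(acts):
        if a > 0.0:
            for i, v in enumerate(pair):
                W[h][i] += sign * LR * a * v

x = [1, 0, 1, 0]                  # toy input, 4 bits
good_pair = x + [1, 0, 0, 0]      # input paired with its correct label
bad_pair = x + [0, 0, 0, 1]       # same input, wrong label

for _ in range(50):
    ff_update(good_pair, positive=True)
    ff_update(bad_pair, positive=False)

print(goodness(good_pair) > goodness(bad_pair))
```

After training, the correct pairing scores higher goodness than the incorrect one, which is the signal the next level up would consume.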


There’s probably not a rule that all cortical regions receive direct sensory input. Especially in primates, which have a much deeper hierarchy than rats/mice.

The first level of the hierarchy and higher levels have very similar circuits, so it sort of needs to be able to mix. The signals might not arrive at exactly the same time, though - maybe the sensory signal is mostly just for attention, and the slightly later signal from cortex fills in the details.