Experimenting with stacking Spatial Poolers

Beyond lateral connections among Hex Grid minicolumns / cells, do you see any other feedback which could possibly be seen as a reward for correct categorisation of an input? HTM SP, through its use of k-winners-take-all, SDR representations / output, and a simple KNN classifier, is capable of learning MNIST, for example, without any direct feedback. But I’ll be honest: getting the same degree of classification accuracy seems a lot harder through my current Hex Grid approach, given the restrictions on spacing and scale.
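For readers unfamiliar with the pipeline being referenced, the "k-winners-take-all pooling followed by a simple KNN classifier" idea can be sketched in plain NumPy. All sizes, the random connection matrix, and the toy two-class inputs below are illustrative assumptions, not Numenta's SP implementation or actual MNIST data:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_COLS, K = 64, 128, 8  # illustrative sizes, not HTM defaults

# Random binary connection matrix standing in for learned SP synapses.
W = (rng.random((N_COLS, N_IN)) < 0.3).astype(float)

def pool(x):
    """k-winners-take-all: the K columns with the highest overlap become active."""
    overlap = W @ x
    sdr = np.zeros(N_COLS)
    sdr[np.argsort(overlap)[-K:]] = 1.0
    return sdr

def knn_predict(sdr, train_sdrs, train_labels):
    """1-NN on SDR overlap: the most similar stored SDR votes for the label."""
    overlaps = train_sdrs @ sdr
    return train_labels[int(np.argmax(overlaps))]

def sample(cls):
    """Toy two-class input: active bits in the left vs right half of the space."""
    x = np.zeros(N_IN)
    x[rng.choice(N_IN // 2, size=6, replace=False) + cls * (N_IN // 2)] = 1.0
    return x

train = [(sample(c), c) for c in [0, 1] * 20]
train_sdrs = np.array([pool(x) for x, _ in train])
train_labels = np.array([c for _, c in train])

pred = knn_predict(pool(sample(0)), train_sdrs, train_labels)
```

No error signal ever reaches the pooler; the only "learning" the classifier does is memorising training SDRs, which is the sense in which the original pipeline works without direct feedback.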

On a side note, do you see Hex Grid formation as largely a result of attractor-network-based self-organisation? If so, what process in your view drives this, or is it simply proximal input and lateral input + inhibition?


As far as MNIST performance goes - that is with a single layer. I think you need an eyeball to look at different parts of the image, and V1 through Vx to do feature extraction. You will need a sub-cortex to point the eyeball. Don’t forget the variable spacing as you move radially from the center of the fovea.

Rewards question? Why - yes I do!
The prediction of the next step in a sequence is a local feedback.

The map to map connections are another form of feedback.

And yes, hex-grids are attractor networks, just as you describe; simply proximal input and lateral input + inhibition. Why make it more complicated than necessary?

I will add that I see L2/3 as strictly sparsity- and grid-forming and L5 as strictly predictive, with L4 as a traffic cop between the layers and the thalamus. This varies sharply from Numenta canon. While we are at it - I see L6 as predictive in the feedback direction, as described in the “Three visual streams” paper.

The interaction with the thalamus adds considerably to this model but this is enough for this post.

I have nothing against it at all; it’s just proving difficult to figure out a process / structure / schema that can form unique hex grid representations tied to specific inputs while also generalizing well. Getting a fixed hex grid using something akin to HTM SP (but in this case more temporal) is doable, but I often find there is difficulty in differentiating, let’s say, the two classes ‘3’ and ‘8’, given their spatial overlap.

I originally thought some kind of reward approach during training, to penalise / reward the developing hex grid into the correct category, might work, but it felt contrived / forced and less of a self-organizing approach.

I really do think that the deep learning people are showing the way here.
You need a bunch of very simple well-defined operations that build to a solution.

In this case, one of the joys of lateral connections is that varying the relations between the lateral connections and the inhibition gives you both Gabor filters and hex-grids. Two layers.

You can do more with a V2 layer with regard to feature extraction. Three layers.

You know that the moving eye gives you a very powerful version of a kernel; centering the digit is a very common MNIST tool. (You can think of this as a layer.)
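The centering step mentioned here is simple enough to sketch: shift the image so its intensity center of mass lands on the middle pixel. This is a generic illustration of the common preprocessing trick, not any particular library's routine:

```python
import numpy as np

def center_image(img):
    """Shift a 2-D array so its intensity center of mass sits on the middle pixel."""
    total = img.sum()
    if total == 0:
        return img
    rows, cols = np.indices(img.shape)
    r_cm = (rows * img).sum() / total   # row of the center of mass
    c_cm = (cols * img).sum() / total   # column of the center of mass
    r_shift = int(round(img.shape[0] / 2 - r_cm))
    c_shift = int(round(img.shape[1] / 2 - c_cm))
    return np.roll(img, (r_shift, c_shift), axis=(0, 1))

# A "digit" stuck in the top-left corner gets moved toward the middle.
img = np.zeros((28, 28))
img[2:6, 2:6] = 1.0
centered = center_image(img)
```

Doing this once per fixation is a crude stand-in for what a moving eyeball buys you for free.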

The variable spatial scaling built into the retina is yet another powerful tool. (Yup - layer)

(Thank you unix way!)

Trying to do everything with a single structure is very limited; as you pointed out it ends up being very kludgy.

I found the relevant discussion about stacked SPs - Network of SpatialPoolers

I find the concept of stacked SPs troubling. How would that actually look in the biology?

If you compare the reach of the dendrites from the cells that make up a mini-column, it is about the same as or smaller than the area of a winning mini-column in the pooling competition. The cell in this mini-column has won the fight, and its output axon will be the one that projects to the next map to say “we decided that what I have learned is the best match for the pattern that we all sensed. Everyone close to me must stay silent.”

So what is being sent to the next layer/map is spatially sparse, to the degree that a column in the next map sees, at best, a few sparse points of activity. This does not sound like the cluster of axonal activity within a 40-micron stretch that is thought to be required to activate a dendrite.

Here is my concern in a pictorial depiction; assume for this picture that there is some pattern projected to the map on the right and it has formed the winning spatial pools shown. You can see that the projections are very sparse in relation to the size of a single SDR on a dendrite. How will these combine to form a new SDR in the target map?

If you allow that only one axon/dendrite pair is needed to fire a cell, and the SP layers are stacked one-to-one (topographically aligned), then I don’t see how this is more than a bucket brigade at best. I don’t see any possible mechanism for generalization.

Now - if the target map has two or more maps projected to it things could be very different …


Very informative, thanks, especially the biology part. I do not know the answers to the questions, though.

At least for me, I don’t see this exercise (stacked SPs) as being about generalizing patterns or detection. I see this as more of another way of extracting features, similar to a CNN’s learned kernels. In this case, the learned kernel is the whole stack of SPs and the features are the outputs. IOW, I think the output can be used to encode inputs. The next step that is interesting to me is to find/discover/search these species/instances of stacked SPs and somehow network them to form a stronger feature extractor. Easier said than done, of course.

I’m curious about this alternative configuration, which to my best knowledge is more biologically plausible. This is the reason why I asked (above) whether the output of the SP is the set of cells or the single columns. I cannot see the former being implemented in the source code above, nor can I see whether they were topologically aligned.

I agree with your concern, although I don’t entirely follow your argument. It’s very clear to me from the relevant paper that the SP takes encoded sensor bit patterns as its input, not SDRs. This is shown in the experimental findings and the proposed algorithm, and confirmed by the model development.

The HTM spatial pooler is designed to achieve a set of computational properties that support further downstream computations with SDRs. These properties include (1) preserving topology of the input space by mapping similar inputs to similar outputs, (2) continuously adapting to changing statistics of the input stream, (3) forming fixed sparsity representations, (4) being robust to noise, and (5) being fault tolerant. As an integral component of HTM, the outputs of the SP can be easily recognized by downstream neurons and contribute to improved performance in an end-to-end HTM system.
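Two of the quoted properties, fixed sparsity (3) and mapping similar inputs to similar outputs (1), are easy to check on a toy overlap-and-top-k pooler. This sketch omits learning, boosting, and potential pools, so it only illustrates the properties, not the full SP algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
N_IN, N_COLS, K = 100, 200, 10  # illustrative sizes

W = (rng.random((N_COLS, N_IN)) < 0.5).astype(float)

def pool(x):
    """Overlap with every column, keep the K best: sparsity is fixed at K/N_COLS."""
    sdr = np.zeros(N_COLS)
    sdr[np.argsort(W @ x)[-K:]] = 1.0
    return sdr

x = (rng.random(N_IN) < 0.2).astype(float)

# A slightly perturbed copy of x (two active bits dropped) vs an unrelated input.
x_noisy = x.copy()
x_noisy[rng.choice(np.flatnonzero(x), size=2, replace=False)] = 0.0
x_other = (rng.random(N_IN) < 0.2).astype(float)

same = pool(x) @ pool(x_noisy)   # SDR overlap with the perturbed input
diff = pool(x) @ pool(x_other)   # SDR overlap with the unrelated input
```

The output always has exactly K active columns regardless of the input, and the perturbed input lands on an SDR that overlaps the original far more than an unrelated input does.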

I agree absolutely with the need for a mechanism to combine or stack SDRs, but the SP is not it. A full HTM theory needs (at least) this further component, as well as models to deal with location and timing.

@david.pfx @Bitking

It seems this post is already past the idea of stacking spatial poolers, and that it has already been demonstrated with interesting outputs (graphs above). I appreciate the suggestions/recommendations, such as using another specialized component to recognize higher-order patterns and combining SDRs as inputs to the next layer/map. Could you also please share your insights, if there are any, as to why the stacked spatial poolers have these interesting outputs? As I understand it, the OP is seeking some explanation about the demonstration as well. I’m also interested in other interpretations/analyses. Thanks.

Starting with a little basic neurobiology refresher:

“Pyramidal neuron axons send many “recurrent collateral” branches laterally to neighboring areas of the cortex, the basis for both lateral inhibition and lateral excitation.”

The corticocortical axons shown above are projections to other maps in the cortex. This is how one map talks to another. In the population of neurons in the cortex, the projected map-to-map axon reaches the new map and turns upward to go through all layers, possibly making contact with proximal dendrites, to end up in layer 1 among the apical dendrites. See the corticocortical fiber shown in blue here.

Note the green basket cell in the image above. It is mostly ignored in portrayals by Numenta, mentioned in passing as one of the inhibitory interneurons to be simulated by k-winners-take-all. The extent shown below is roughly what would be excited by the lateral axonal projections from this cell - in a circular pattern around the cell.

Another aspect that seems to be described but seldom portrayed is how utterly dense all this is, and the relative scale of an SDR to this whole structure. Let’s start by adding the basket cells:

If a lateral axon stimulates a basket cell as it passes through, the basket cell goes off like a bomb, sending inhibitory signals to all the pyramidal cells within its reach. I did not put in pyramidal cells from other columns, as the picture would be solid red vertical lines. Note that since these lateral axons start below the basket cells but reach up into them from a relatively far distance, a cell affects many surrounding cells but does not turn itself off. This gives the classic center-on/surround-off that is so widely documented. All the cells receiving some stimulus from axons are firing and trying to suppress their neighbors, all at the same time. Only the cells with the strongest input win this competition and stay active; the rest are shut down. Note the large spatial extent of lateral axonal projections: about 3 millimeters maximum. Compare this to the much smaller reach of a dendrite, where the SDR is formed. All of an SDR is formed on a single dendrite; very small in relation to the finished spatial pooler action.
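The center-on/surround-off competition described above can be sketched as a purely local winners rule on a 1-D strip of columns: a cell stays active only if nothing within the reach of its (basket-cell-mediated) inhibition beats it. The strip length and inhibition radius here are made up for illustration:

```python
import numpy as np

def local_winners(activity, radius):
    """A cell stays active only if it has the highest input within +/- radius."""
    n = len(activity)
    active = np.zeros(n, dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        active[i] = activity[i] == activity[lo:hi].max()
    return active

rng = np.random.default_rng(2)
activity = rng.random(40)          # stand-in for proximal input strength
winners = local_winners(activity, radius=5)
```

A side effect worth noticing: two winners can never be closer than radius + 1 columns, so mutual suppression alone enforces the regular spacing that the grid discussion above depends on.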

@david.pfx Note that this is not just the input from the senses - this is in all cortex, in all maps. HTM is the universal computation of the entire cortex.


This is incorrect. It can process any type of binary input, whether sensory input or output from other parts of the brain. It can be an SDR or a dense array (although SDRs are better), as long as it contains consistently encoded semantic information. A primary strength of Spatial Pooling is that it does not know anything about the structure of its input. It learns that over time.


@rhyolight I have been classifying the SP as a dimension reduction algorithm. But this behavior seems to indicate that it is also a clustering algorithm. What do you think?


Sure. It can take many different “meaning vectors” encoded into one space and distribute their joint meaning across another space.



Let me share my thoughts.

as Dimensionality Reduction algorithm
The SP can intuitively be considered a dimensionality reduction algorithm; however, it is agnostic to the dimensions and structure of the input, and hence is unsupervised, similar to an autoencoder. The SP uses a metaheuristic algorithm rather than gradient descent, so it is not searching for something; it is simply reorganizing itself.

as Classifier, Clustering algorithm
In most use cases, the SP is used as a classifier. However, quite agnostically, the SP is really just encoding the inputs as a result of the algorithm (see “as Dimensionality Reduction” above) - it doesn’t care/know about the meaning of the inputs. Because it encodes inputs, and the output set (number of columns) is usually forced to be smaller than the input size, it reuses encodings for inputs that are semantically similar. Hence, the results are groups/columns instead of individual encodings. But note that these groupings have no meaning at all. Users like us perceive them as classifications by putting labels on them using different algorithms (e.g. softmax). Therefore, the SP is simply clustering by definition, because it groups inputs; classification is only realized when another algorithm interprets its groupings.
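The division of labour described above - the SP only groups, a separate step attaches meaning - can be made concrete. In this sketch a toy pooler does the grouping, and a majority vote per column stands in for the "different algorithm" that attaches labels (all sizes and the voting scheme are illustrative assumptions):

```python
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(3)
N_IN, N_COLS, K = 60, 120, 6
W = (rng.random((N_COLS, N_IN)) < 0.4).astype(float)

def pool(x):
    """Overlap-and-top-k pooling: groups inputs without knowing their meaning."""
    sdr = np.zeros(N_COLS)
    sdr[np.argsort(W @ x)[-K:]] = 1.0
    return sdr

def sample(cls):
    """Two toy classes occupying disjoint halves of the input space."""
    x = np.zeros(N_IN)
    x[rng.choice(N_IN // 2, 5, replace=False) + cls * (N_IN // 2)] = 1.0
    return x

# Labelling happens outside the pooler: each active column votes for the
# class it was most often active for during training.
votes = defaultdict(Counter)
for cls in [0, 1] * 30:
    for col in np.flatnonzero(pool(sample(cls))):
        votes[col][cls] += 1
labels = {col: c.most_common(1)[0][0] for col, c in votes.items()}

def classify(x):
    tally = Counter(labels[c] for c in np.flatnonzero(pool(x)) if c in labels)
    return tally.most_common(1)[0][0] if tally else None
```

Delete the `labels` dictionary and the pooler still runs unchanged - which is the point: the classification lives entirely in the interpreting step.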

Encoder, Feature Extractor
Same as with dimensionality reduction above, it is an unsupervised encoder. Encoding involves capturing core features in a reduced space (e.g. compression); this is also why it is a feature extractor: due to space constraints, it is forced to extract the bits that matter most.

when Stacked together
When stacked together think of a stacked encoder.

It is counterintuitive to think about generalization here, because encoders don’t necessarily generalize in the ML sense of the word. In my personal opinion, seeking generalization is a double-edged sword: when the algorithm is a DNN it is OK, it works for now, but otherwise it is fiction.

In the DL world, the stacked SP is similar to a convolution kernel. Why? The kernels in a CNN (there are many, by the way) extract features and are learned in an unsupervised manner; the stacked SP is learned by SP training (unsupervised), and when tested it encodes/extracts features. The stacked SP (at least in this example) strongly stabilizes its outputs/groupings; hence it is a better encoder, intuitively much more similar to an autoencoder, and hence not good for classification tasks.
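The "whole stack as one encoder" view amounts to composing the same pooling step with itself, each layer re-encoding the previous layer's SDR. A two-layer sketch with fixed random connections (no learning, sizes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def make_pooler(n_in, n_cols, k, rng):
    """Return a fixed random overlap-and-top-k pooling function."""
    W = (rng.random((n_cols, n_in)) < 0.3).astype(float)
    def pool(x):
        sdr = np.zeros(n_cols)
        sdr[np.argsort(W @ x)[-k:]] = 1.0
        return sdr
    return pool

layer1 = make_pooler(n_in=128, n_cols=256, k=10, rng=rng)
layer2 = make_pooler(n_in=256, n_cols=256, k=10, rng=rng)

x = (rng.random(128) < 0.1).astype(float)
encoding = layer2(layer1(x))   # the whole stack acts as one encoder
```

Whatever the first layer emits, the second layer re-imposes the same fixed sparsity, which is one simple reason a stack's outputs look more stable than a single layer's.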


I’m curious, are you suggesting that the active columns as SDRs in HTM SP output are not biologically plausible?

AFAIK, the kernels in CNN are learned by gradient descent.
How is that unsupervised?
Am I missing something?

You are correct. My statement was counterintuitive; sorry, as an ESL speaker I tend to sound naive when I use it. The main business of the CNN is to learn the “best” features by GD, and I think this is also the basic goal of a supervised ANN. There is ground truth for what is A or B (e.g. a cat or a dog) at the start and end of supervised learning (e.g. a CNN); however, there is no ground truth (fully-labeled example) for a kernel. The kernels are learned along the way; this is what I meant by unsupervised. Depending on perspective, most CNN users really don’t care about the kernels but care more about the set of extracted features, which are the ones used in the next fully-connected layer - these features are learned with supervision because they are compared to the ground truth. I would say it’s probably both supervised and unsupervised, from a lower-level perspective.


Connecting to every node in the array is not biologically accurate. I have whined about this in the past, but it is defended as being close enough that training fixes the discrepancy.

Also - the models that are being built are not being connected to sensors where the topology is that important.

The HTM models in use now are small enough that the divergence is not critical. There are some topology settings that would become important if the models became larger.

Even then, the deviation from the linear nature of a dendrite may also be missing some important part of the biology. I could see that two mini-columns that have dendrites passing each other in opposite directions could have some useful mutually reinforcing behavior.


Thanks so much for your explanations. Most of the time when I read them I feel like I’ve been shrunk and suddenly dragged into a dark room where voices can be heard. I can only listen and try to open my eyes as wide as I can, so I can see what I’ve hopefully understood. I think the dark room is neurobiology, and I’m far from that context. I appreciate it though; don’t get me wrong, it’s fun to play with our imaginations.

As far as I can tell, there is not yet any HTM equivalent for this reinforcing behavior? Do you think this could be achieved by another “specialized” HTM component, or is it just a matter of rearranging the existing HTM components in some form? The results shown above by the OP also show that rearranging these HTM components may lead to some interesting discoveries.