I’d like to disagree with this conclusion, because a well-trained SP will result in a good classifier. A good classifier can distinguish patterns from sets of inputs, hence it will also have a stable output (e.g. active columns). The output of a well-trained SP (e.g. SP1) retains only the features that matter the most, hence it will restrict SP2’s input domain; consequently, SPN will restrict SPN+k’s input domain. It is similar to how function composition works, where the outputs of these functions have smaller domains/sets - f(g(h(i(j(k(ℝ)))))): k reduces the output k(ℝ), and so on, leaving f with a smaller input domain. I think the experiment and the opposite of the expectation make sense, at least to me.
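The composition analogy can be sketched in code. This is a toy illustration only, not an actual SP: each hypothetical `layer` keeps a fixed fraction of its strongest inputs, so each stage hands the next a smaller domain.

```python
# Toy sketch of the domain-restriction analogy (not an actual SP):
# each "layer" keeps only the strongest features of its input,
# so the next layer sees a smaller, more stable input domain.

def make_layer(keep_fraction):
    """Return a function that keeps the top `keep_fraction` of its input values."""
    def layer(values):
        k = max(1, int(len(values) * keep_fraction))
        return sorted(values, reverse=True)[:k]
    return layer

# A stack of "layers", analogous to f(g(h(...)))
stack = [make_layer(0.5) for _ in range(4)]

inputs = list(range(100))  # stand-in for the raw input domain
out = inputs
for layer in stack:
    out = layer(out)

print(len(inputs), "->", len(out))  # each layer halves the surviving domain
```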
Someone just linked me to your post when I asked about stacking layers of spatial poolers and temporal memories. I didn’t think to stack JUST spatial poolers, but from the results on random floats, it looks rather promising.
Now, I’m still very new to this field of research, but I have a basic understanding of the mechanics of HTM components, so I’ll give my best guess as to what’s happening.
What I think is happening is that while one layer is able to form a good, albeit “shallow”, SDR of the random values, a whopping EIGHT SPs all work together to encode a structure (could I call it a three-dimensional structure?) that’s much, much deeper than a single layer. In my opinion, this could yield more detailed and accurate representations, and thus better predictions and outputs. What the overlap graphs tell me (and again, still new here!) is that the SP stack was able to narrow down an allocation of columns specifically for these values, and with the greater representational accuracy, is much more clearly able to “predict” the value 0.5. I don’t know if predict is the right term here, but it’s clear that the more stacked SPs there are, the less overlap there is per given value.
Of course, I do worry about the classic ML problem of overfitting data, which may or may not be happening here. What happens when you feed it other values, like 0.1 or 0.7? Do you still get a narrow square, or curve, in the overlap graphs?
The role of the SP is to recognise static patterns in encoded input data. Input is sensor data, output is SDR.
The role of the TM is to recognise sequences of SDR patterns over time. Input is a sequence of SDRs.
Neither of these algorithms would be expected to do a good job at recognising higher-order static patterns. Rather than stacking SPs, I would expect a new component, specialised for the purpose. It would take two or more SDRs as input, each derived from a different input source, and recognise patterns of input across multiple modalities. You might get that result by just concatenating SDRs and feeding them to an SP, but it probably isn’t the optimal solution.
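The concatenation idea can be sketched as follows. This assumes SDRs are represented as sets of active bit indices, and it uses a made-up overlap score in place of a real SP column - all names here are illustrative, not part of any HTM library:

```python
import random

random.seed(42)

def random_sdr(n_bits, n_active):
    """A toy SDR: a set of active bit indices out of n_bits."""
    return set(random.sample(range(n_bits), n_active))

# Two SDRs from different "modalities" (e.g. shape features and location)
N = 1024
sdr_shape = random_sdr(N, 20)
sdr_location = random_sdr(N, 20)

# Naive combination: concatenate into one 2048-bit input space by
# shifting the second SDR's indices past the first SDR's bit range.
combined = sdr_shape | {i + N for i in sdr_location}

# A toy pooler column with a random potential pool; its response is
# simply the overlap between its pool and the combined input.
def column_overlap(potential_pool, input_bits):
    return len(potential_pool & input_bits)

column = random_sdr(2 * N, 100)
print("combined bits:", len(combined), "overlap:", column_overlap(column, combined))
```

Whether an SP trained on such concatenated input actually learns useful cross-modal patterns is exactly the open question; the sketch only shows the mechanics of feeding two SDRs to one pooler.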
Image processing is a case in point. One SP might recognise features like lines or angles, another might recognise locations or displacements, both need to be combined to recognise an object. Numenta has done some work on this, but not much published that I can see.
If we think of the stacked SP in this case as an encoder that learns encodings, rather than learning for prediction/classification, then I think we can get rid of the classical overfitting problem.
Another way of looking at this is that the stacked SPs are an instance of a problem solver in a particular problem space. Now, if it is an instance, then there must be some other instances out there that need to be discovered, hence the meaning of “overfitting” in mainstream ML becomes irrelevant: in theory, generalization here can be done by consensus of these instances.
What does it mean to “recognize a pattern?”
Who or what looks at this recognized pattern?
I get that you have some pattern of bits in an SDR, but how does that do anything useful? It is just as arbitrary and meaningless as the pattern of bits in the original image. Further, for image recognition, forget the image segmentation techniques you may be familiar with, as none of them work with saccades.
If you think about that, you may see some of the reasons that I see pattern recognition a little differently than you do. As a thought provoker - I see the visual stream as staying distributed all the way to the association regions. There is image processing, but it stays local. Combine that with the hex-grids I have described elsewhere and you will be on the path.
As I see it - the palimpsest of images ends up in the association regions as a stable hex-grid, refined and reinforced by each successive wave of saccade data. Sort of a game of 20 questions - but with scraps of images.
Hmm, interesting. So if a ‘Hex Grid’ is to be used as some form of stable representation of a current scene/image (composed of several saccades), how do you see its construction in terms of time and space (or a hierarchy of small windows combined as you move up)?
With a flash of a single number like ‘3’, I can envision how a hex grid may form from this single instance into a stable representation, or if this image is simply repeated. I can even see how you could split the ‘3’ into a grid of smaller regions which each focus on a smaller area, settle into a single ‘Hex Grid’ each, and are combined later.
But creating a stable ‘Hex Grid’ in a layer of cells/columns where you are constantly switching up the input as saccades go on seems highly difficult, does it not? Even if it is biologically correct.
Assuming that we have the stable representation forming in the association area, each node in the grid is following its own sequence as the saccades progress.
Consider one of the most basic and highly tuned human skills: face recognition. The saccade scan pattern is highly stereotyped for any given individual.
First we get the corner of an eye, and lots of mini-columns try to match the static pattern (proximal inputs). Many apical dendrites will also try to predict something.
After the first saccade to the next feature, the sequential predictions in the next presentation will not match as many features, and with lateral voting many possibilities will be eliminated. Each saccade will strengthen the confirming sequences and eliminate the sequences that don’t match.
What you should end up with is a stable hex grid that means John and not Mary.
Similar objects will have similar hex-grid patterns and different objects will have very different hex-grid patterns. The patterns that are similar should have the same basic grid shape with the bits around the edges of the pattern differing to set the object apart.
I think that things like number and letter recognition involve multiple maps working together and the explanation gets too long to go through in a post like this - but the basic principle is the same; consider that speech in humans involves several maps in both Broca and Wernicke’s areas over and above simple object recognition.
Going the other way - towards the senses - as Jeff Hawkins has mentioned before, there is an increase in connections as you go from V1 forward. What I think is going on is processing to split out and enhance features and present a rich cocktail of features to select from (edges and such), so that the hex-grids have the most contrast to use to form a pattern.
The feedback connections from the hex-grid act as a filter to focus this feature extraction; perception is active recall of learned features.
Beyond lateral connections among Hex Grid minicolumns/cells, do you see any other feedback which could possibly be seen as a reward for correct categorisation of an input? The HTM SP - through use of k-winner-take-all, SDR representations/output, and a simple KNN classifier - is capable of learning MNIST, for example, without any direct feedback. But I’ll be honest: getting the same degree of classification accuracy seems a lot harder through my current Hex Grid approach, given the restrictions on spacing and scale.
On a side note, do you see Hex Grid formation as largely a result of attractor-network-based self-organisation? If so, what process in your view drives this, or is it simply proximal input and lateral input + inhibition?
As far as MNIST performance goes - that is with a single layer. I think you need an eyeball to look at different parts of the image and V1 through x to do feature extraction. You will need a sub-cortex to point the eyeball. Don’t forget the variable spacing as you move radially from the center of the fovea.
Rewards question? Why - yes I do!
The prediction of the next step in a sequence is a local feedback.
The map to map connections are another form of feedback.
And yes, hex-grids are attractor networks, just as you describe; simply proximal input and lateral input + inhibition. Why make it more complicated than necessary?
I will add that I see L2/3 as strictly sparsity and grid forming, and L5 as strictly predictive, with L4 as a traffic cop between the layers and the thalamus. This varies sharply from Numenta canon. While we are at it - I see L6 as predictive in the feedback direction, as described in the “Three visual streams” paper.
The interaction with the thalamus adds considerably to this model but this is enough for this post.
I have nothing against it at all, it’s just proving difficult to figure out a process/structure/schema etc. that can essentially form unique hex grid representations tied to specific input that also generalizes well. Getting a fixed hex grid using something akin to the HTM SP (but in this case more temporal) is doable, but I often find there is difficulty in differentiating, let’s say, the two classes ‘3’ and ‘8’, given their spatial overlap.
I originally thought some kind of reward approach during training to penalise / reward the developing hex grid into the correct category might work but it felt contrived / forced and less of a self-organizing approach.
I find the concept of stacked SPs troubling. How would that actually look in the biology?
If you compare the reach of the dendrites from the cells that make up a mini-column, it is about the same as or smaller than the area of a winning mini-column in the pooling competition. This cell in this mini-column has won the fight, and its output axon will be the one that projects to the next map to say “we decided that what I have learned is the best match to the pattern that we all sensed.” Everyone close to me must stay silent.
So what is being sent to the next layer/map is spatially sparse to the degree that the column in the next map sees at best - a few sparse points of activity. This does not sound like the cluster of axonal activity within a 40 micron stretch needed to activate a dendrite as is thought to be required.
Here is my concern in a pictorial depiction; assume for this picture that there is some pattern projected to the map on the right and it has formed the winning spatial pools shown. You can see that the projections are very sparse in relation to the size of a single SDR on a dendrite. How will these combine to form a new SDR in the target map?
If you allow that only one axon/dendrite pair is needed to fire a cell, and the SP layers are stacked one-to-one (topographically aligned), then I don’t see how this is more than a bucket brigade at best. I don’t see any possible mechanism for generalization.
Now - if the target map has two or more maps projected to it things could be very different …
Very informative, thanks - especially the biology part. I do not know the answers to the questions, though.
At least for me, I don’t see this exercise (stacked SPs) as being for generalizing patterns or detection. I see it more as another way of extracting features, similar to a CNN’s learned kernels. In this case, the learned kernel is the whole stack of SPs and the features are the outputs. IOW, I think the output can be used to encode inputs. The next step that is interesting to me is to find/discover/search for these species/instances of stacked SPs and somehow network them to form a stronger feature extractor. Easier said than done, of course.
I’m curious about this alternative configuration, which, to my best knowledge, is more biologically plausible. That is the reason why I asked (above) whether the output of the SP is the set of cells or the single columns. I cannot see the former being implemented in the source code above, nor whether they are topologically aligned.
I agree with your concern, although I don’t entirely follow your argument. It’s very clear to me from the relevant paper here that the SP takes encoded sensor bit patterns as its input and not SDRs. This is shown in the experimental findings, proposed algorithm and confirmed by model development.
The HTM spatial pooler is designed to achieve a set of computational properties that support further downstream computations with SDRs. These properties include (1) preserving topology of the input space by mapping similar inputs to similar outputs, (2) continuously adapting to changing statistics of the input stream, (3) forming fixed sparsity representations, (4) being robust to noise, and (5) being fault tolerant. As an integral component of HTM, the outputs of the SP can be easily recognized by downstream neurons and contribute to improved performance in an end-to-end HTM system.
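Two of these properties - fixed sparsity (3) and mapping similar inputs to similar outputs (1) - can be illustrated with a toy k-winners-take-all pooler. This is a simplified sketch, not the actual SP algorithm (no learning, no boosting), and all sizes are arbitrary:

```python
import random

random.seed(0)

N_IN, N_COLS, K = 256, 128, 8  # input bits, columns, winners (fixed sparsity)

# Each column samples a fixed random subset ("potential pool") of the input.
pools = [set(random.sample(range(N_IN), 32)) for _ in range(N_COLS)]

def pool(input_bits):
    """k-winners-take-all over column overlaps -> fixed-sparsity output."""
    scores = [(len(p & input_bits), c) for c, p in enumerate(pools)]
    scores.sort(reverse=True)
    return {c for _, c in scores[:K]}

a = set(random.sample(range(N_IN), 40))                     # some input
b = set(list(a)[:36]) | set(random.sample(range(N_IN), 4))  # ~90% like a
c = set(random.sample(range(N_IN), 40))                     # unrelated

out_a, out_b, out_c = pool(a), pool(b), pool(c)
print("fixed sparsity:", len(out_a), len(out_b), len(out_c))
print("winner overlap a~b:", len(out_a & out_b), " a~c:", len(out_a & out_c))
```

The output always has exactly K active columns regardless of input density, and inputs sharing most of their bits tend to share most of their winning columns, while unrelated inputs rarely do.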
I agree absolutely with the need for a mechanism to combine or stack SDRs, but the SP is not it. A full HTM theory needs (at least) this further component, as well as models to deal with location and timing.
It seems this post is already past the idea of stacking spatial poolers, and that it has already been demonstrated with interesting outputs (graphs above). I appreciate the suggestions/recommendations, such as using another specialized component to recognize higher-order patterns and combining SDRs as inputs to the next layer/map. Could you also please share your insights, if there are any, as to why the stacked spatial poolers have these interesting outputs? As I understand it, the OP is seeking some explanation about the demonstration as well. I’m also interested in other interpretations/analyses. Thanks.
The corticocortical axons shown above are projections to other maps in the cortex. This is how one map talks to another. In the population of neurons in the cortex the projected map-to-map axon reaches the new map and turns upward to go through all layers, possibly making contact with proximal dendrites to end up in layer 1, the dendrites of the apical layer. See the corticocortical fiber in blue shown here.
Note the green basket cell in the image above. It is mostly ignored in portrayals by Numenta, mentioned in passing as inhibitory interneurons to be simulated by k-winner. The extent shown below is roughly what would be excited by the lateral axonal projections from this cell - in a circular pattern around the cell.
Another aspect that seems to be described but seldom portrayed is how utterly dense all this is, and the relative scale of an SDR to this whole structure. Let’s start with adding the basket cells:
If a lateral axon stimulates a basket cell as it passes through, it goes off like a bomb, sending inhibitory signals to all the pyramidal cells within its reach. I did not put in pyramidal cells from other columns, as the picture would be solid red vertical lines. Note that since these lateral axons start below the basket cells but reach up into them from a relatively far distance, a cell affects many other surrounding cells but does not turn itself off. This gives the classic center-on/surround-off that is so widely documented. All the cells receiving some stimulus from axons are firing and trying to suppress their neighbors all at the same time. Only the cells with the strongest input win this competition and stay active; the rest are shut down. Note the large spatial extent of the lateral axonal projections: about 3 millimeters maximum. Compare this to the much smaller reach of a dendrite, where the SDR is formed. All of an SDR is formed on a single dendrite - very small in relation to the finished spatial pooler action.
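The center-on/surround-off competition can be sketched as a toy 1-D model, where a cell stays active only if no neighbour within the inhibition radius has a stronger feed-forward drive. This illustrates the local-winner idea only; it is not a model of real basket-cell dynamics, and the sizes are arbitrary:

```python
import random

random.seed(1)

RADIUS = 3  # reach of the basket-cell-like lateral inhibition, in cell units

# Feed-forward drive for a 1-D row of cells (stand-in for proximal input)
drive = [random.random() for _ in range(30)]

def local_winners(drive, radius):
    """Keep a cell active only if it has the strongest drive within its
    inhibition neighbourhood: the center-on / surround-off competition."""
    winners = []
    for i, d in enumerate(drive):
        lo, hi = max(0, i - radius), min(len(drive), i + radius + 1)
        if d == max(drive[lo:hi]):
            winners.append(i)
    return winners

active = local_winners(drive, RADIUS)
print("active cells:", active)
```

The surviving cells end up spaced farther apart than the inhibition radius, which is the same mechanism that would space out winners in a 2-D sheet.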
@david.pfx Note that this is not just the input from the senses - this is in all cortex, in all maps. HTM is the universal computation of the entire cortex.
This is incorrect. It can process any type of binary input, whether it be sensory input or output from other parts of the brain. It can be an SDR or a dense array (although SDRs are better), as long as it contains consistently encoded semantic information. A primary strength of Spatial Pooling is that it does not know anything about the structure of its input. It learns that over time.