Stable-Enforced Pooling

gmirey · June 14, 2018, 9:47am

I’ve been toying with an idea lately. It came from thinking at the crossroads of several discussions: with @Paul_Lamb about sub-sequences; with @Bitking about my proposal of supervised learning by hierarchy and/or teacher effects (which also touches questions raised by @sebjwallace, and the ‘everything is a Kitty’ idea from @MaxLee); and finally the catch-22 of unsupervised learning in a hierarchy, that is, “which part to train?” (given also that training higher levels first seems to provide better results… but then, how?).

Now I feel there could be a scheme which takes each of those concerns into account. Let me try to express it.

Prologue

gmirey:

A higher-level abstraction is somehow stable. That’s the whole point of it being higher. The relation of higher levels to lower levels is twofold:

They provide a context in which they demand the entire cooperation of the level below. They force it to reach an understanding (correct predictions) given their presence.

They’re more stable, but not infinitely so. If the grunts somehow rebel or outright refuse to reach the understanding that was asked of them, the higher levels will eventually be overthrown.

So, the question of “which level actually learns from a surprise” boils down to “which is the level just below the despot who stays in power” (and also, the despot himself will learn to progressively integrate incoming lower-level patterns as further support for his legitimacy, as long as he stays in power).
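One possible reading of that rule, as a minimal sketch. The per-level `confidences`, the scalar `surprise`, and the rule that a level is overthrown when the surprise exceeds its confidence are all assumptions made for illustration; nothing here is an established HTM mechanism.

```python
def who_learns(confidences, surprise):
    """Find the despot and the learner for one surprising input.

    confidences: hypothetical per-level confidence values, ordered bottom to top.
    surprise: hypothetical strength of the prediction failure, on the same scale.
    Returns (despot, learner) as level indices; either may be None.
    """
    for level, confidence in enumerate(confidences):
        if confidence >= surprise:
            # This level stays in power; the level just below it must adapt.
            learner = level - 1 if level > 0 else None
            return level, learner
    # Assumption: if every level is overthrown, the top level relearns.
    return None, len(confidences) - 1


print(who_learns([0.2, 0.5, 0.9], surprise=0.6))  # (2, 1)
print(who_learns([0.2, 0.5, 0.9], surprise=0.3))  # (1, 0)
```

A stronger surprise climbs further up before meeting a level confident enough to resist, so the learning lands higher in the hierarchy.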

Now for maybe a more reasonable analogy and some of my rationale:

The extent to which a lower level is coerced to keep learning from its input and ambient context alone depends on how confident the next-higher level is in its own understanding (delusions welcome).

If you listen to a new song… assume that a given area (or layer?) holds a stable representation for the whole song, and that this area (or layer) is self-confident (you know from other parts of your brain that you didn’t switch tracks… it is the same song… which gives the higher-level “same song” representation the legitimacy to stay in place). Then, for the lower area actively “listening to” representations of subsets of it (say, notes… but any subdivision scheme would do), the job is to try to get a clue (towards perfect prediction) from the remaining info… since the high-level context doesn’t change, doesn’t want to change, and keeps insisting that it won’t change… however loudly we tell “higher up” that this song is new and that we’re quite puzzled by that input. So… in the end, the lower level will have been coerced into wiring its distal dendrites to whatever remaining info was left, in order to predict correctly next time (assuming accurate prediction is the core job, and NMDA spiking its tool). It is thus likely to have learned to recognize a temporal sequence of notes, in the context of that song.
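The song scenario can be caricatured with a toy transition memory keyed by the stable context. This is only a sketch under stated assumptions: the class `ContextualSequenceMemory` and its dictionary of transitions are hypothetical stand-ins for distal-dendrite learning, not HTM Temporal Memory.

```python
from collections import defaultdict


class ContextualSequenceMemory:
    """Toy lower level whose predictions depend on a fixed higher-level context."""

    def __init__(self):
        # (context, previous element) -> set of elements seen next
        self.transitions = defaultdict(set)

    def observe(self, context, sequence):
        """Feed one sequence; return how many elements were correctly predicted."""
        hits = 0
        prev = None
        for item in sequence:
            key = (context, prev)
            if item in self.transitions[key]:
                hits += 1
            else:
                # Surprise: the context refuses to change, so this level learns.
                self.transitions[key].add(item)
            prev = item
        return hits


memory = ContextualSequenceMemory()
notes = ["C", "E", "G", "E"]
first_listen = memory.observe("song_a", notes)   # nothing predicted yet
second_listen = memory.observe("song_a", notes)  # every note predicted
other_song = memory.observe("song_b", notes)     # same notes, new context
print(first_listen, second_listen, other_song)   # 0 4 0
```

Because the key includes the context, the same notes heard under a different “song” representation are surprising again, which is the sense in which the sequence is learned in the context of that song.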

If you loosen the “supervision” of the scenario a bit, then the first area that has no good reason to think otherwise becomes despotic… Hence “it moves, it’s smaller than me, it’s cute”: that’s a kitty, even if we grownups would rather have called the thing a hedgehog. That works for both recognition and learning. Those spiky visual clues would indeed start to wire towards kitty too. Now, if we inject some form of doubt back into the scheme (and here we may loop back to teaching and supervised learning), induced by mommy wording something, and… she certainly sounds sure of herself… and she’s quite insistent… and that doesn’t sound like “kitty” at all, then you may dislodge the despotic “kitty” SDR in the higher area, and start sorting cats from dogs.

Now, I don’t know whether that idea is really worth the wear on my keyboard from typing it… but I realize that for this to work at all, we’d first need to enforce that high-level stability somewhat. The HTM spatial pooler may do some of that already by allowing only the few most excited cells to actually fire… but I guess we’d need something a little more self-sustaining than that… while staying at the same overview level.
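To make the contrast concrete, here is a minimal sketch of k-winners selection with an optional bonus for cells that were already active. The function `k_winners` and its `stickiness` parameter are hypothetical, introduced only to illustrate one way a pooler could sustain itself; this is not the NuPIC Spatial Pooler API.

```python
def k_winners(overlaps, k, prev_active=(), stickiness=0.0):
    """Return the indices of the k most excited cells as a set.

    prev_active and stickiness are hypothetical additions: cells that fired
    on the previous step get a bonus, so the winning set resists change.
    """
    scores = [
        overlap + (stickiness if i in prev_active else 0.0)
        for i, overlap in enumerate(overlaps)
    ]
    # Sort by descending score; break ties by lower index for determinism.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


first = k_winners([5, 4, 1, 0], k=2)
plain = k_winners([3, 4, 5, 0], k=2)
sticky = k_winners([3, 4, 5, 0], k=2, prev_active=first, stickiness=3.0)
print(first, plain, sticky)
```

With `stickiness=0.0` the winners simply follow the input; with a bonus, the previous winners hold on until the input disagrees strongly enough.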

Stable-Enforced Pooling ?

Then comes Calvin and his grid resonance thing… proposing that small cortical patches would interlock at somewhat fixed intervals (about half a millimeter on a side).

I’m not proposing to simulate the chaotic attractor effect and complex inhibition mechanisms leading to this… nor even discussing whether he’s right or wrong from a neuroscience point of view… rather to salvage this idea and ‘assume that they do’ for our model purposes, so that a ‘stable-enforced-pooler’ is made of tiles, each tile strongly pushing its neighbors to settle down to a same-minded tile.
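As a rough sketch of what “tiles pushing their neighbors to settle” could mean computationally, the following toy puts tiles on a ring and lets each adopt the local majority. The ring topology, the single label per tile, and the synchronous majority rule are all assumptions; this does not model Calvin's resonance or inhibition dynamics.

```python
from collections import Counter


def settle(tiles, max_steps=20):
    """Synchronously update a ring of tiles until nothing changes.

    Each tile looks at itself and its two neighbours and adopts a label
    held by at least two of the three; otherwise it keeps its own label.
    max_steps bounds the loop, since synchronous updates can oscillate.
    """
    tiles = list(tiles)
    n = len(tiles)
    for _ in range(max_steps):
        updated = []
        for i in range(n):
            votes = Counter([tiles[i - 1], tiles[i], tiles[(i + 1) % n]])
            label, count = votes.most_common(1)[0]
            updated.append(label if count >= 2 else tiles[i])
        if updated == tiles:
            break
        tiles = updated
    return tiles


print(settle(["A", "A", "B", "A", "A", "A"]))
print(settle(["A", "A", "A", "B", "B", "B"]))
```

A lone dissenting tile is absorbed by its neighbors, while two large blocs can coexist, so this rule gives local agreement rather than guaranteed global consensus.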

I don’t have the specifics of this, mind you. But I can already see that there is potential for some computer-side optimizations. And if the big-picture concept of despotic areas in a hierarchy turns out half as good as my idea of it, we may have:

A nice model for HTM-based hierarchies
Which supports online learning in a fully unsupervised way
Which allows any amount of supervision (or, more loosely, teacher effect) to be thrown into the mix
Which could have an internal drive to solve the sequence-or-subsequences questions on its own

Afterthoughts

Note that, underlying all this, I also have a personal preference for viewing distal dendrites as potential sniffers for any contextual info they can access… so it’s not necessarily the HTM canon of sampling t-1 for TM purposes. They may sample t-1, they may sample other stuff (such as the soon-to-be-published location signal? BTW, couldn’t location also stand for ‘index in apical context’ where sequences are concerned?), so that all this is solved in a homogeneous manner, not reliant upon the minicolumn thing… I don’t know if this is why I see those things as I see them, or if it’s not too relevant to the matter at hand. Well… Your call, guys.

Also, please consider reading @Bitking’s most excellent introduction on the subject of Calvin & Grids.

Paul_Lamb · June 14, 2018, 12:33pm

You have convinced me that grids could be an excellent approach to pooling. I think I’ll spend some time working out a grid-based pooling algorithm.

Bitking · June 14, 2018, 12:44pm

This gets to one of the major issues in training a hierarchy: how does the training distribute between the layers?
If you start from the bottom up there is no direction to work toward - no guidance. The usual hope is that the training will somehow accumulate until there is some small error and then “spill over” to the higher levels.

If you take the top level as one end of a chain there is a target or attractor to shape the exploration of search space.

At the risk of muddying the waters: That top-level guidance could start in the crude guidance afforded by the subcortical structures (the lizard brain) which are grossly correct from evolution.
As the agent learns its body by being pushed by the lizard brain, the ground truth of the universe it finds itself in pushes on the “other” end of the various hierarchies, so there is a dual force driving the exploration in the search space of solutions (feedforward and feedback).

Somewhere in the middle of this hierarchy is the formation of grids/Calvin tiles that act to quantize the representation and propagate it laterally across the maps that form the hubs of the cortical lobes. This takes the highly local action of individual SDRs (really - in the brain an SDR can only be formed and expressed across the reach of a single dendritic arbor) and joins them into larger mutually reinforcing patterns.

If I am reading you correctly, you are saying that this large stable state is an attractor that serves as the boss during unsupervised learning, acting as a teacher. It is flexible and can be trained to a new state if the world does not agree, but in the interim it forms its best guess and imposes it on the lower-level maps as training material. This is what is compared to the ground truth of the perceived universe.

gmirey · June 14, 2018, 1:00pm

Yes, I guess what I fuzzily perceive is that our messy perception networks are in large part “feedforward”, but only up to a point. Yes, we’re all aware here that feedback carries a large amount of info, but we don’t quite know what to make of it other than envisioning that it “helps in recognition too”. But the thing is that feedback connections do not appear to benevolently “help understanding”. They’re also quite coercive, because they have reasons to be more stable.

I see this as potentially untying the knot of “how does the training distribute between the layers” when we’re following a pure online-learning scheme as HTM does.
Note that reaching a stable layer (that… resists and echoes back?) is amenable both to fully unsupervised training (those stable layers will tend to form by themselves) and to supervised training (the model imposes a stable input somewhere close to the high areas, or imposes the stable SDR on the highest layer right away).

I’d also put some of your lizardy prewirings in the “supervised” side of the training approach. Why not.

sebjwallace · June 14, 2018, 4:07pm

Stacking SPs should automatically produce stability as you go up. This is the key idea behind dimensionality reduction. If you had an 8-feature space at the top, then the whole huge amount of input data would be reduced and stabilized to those 8 features.

How about using something like clustering? It essentially does the same as the SP. The representation space is determined by the number of cells in a layer. As you go up the hierarchy you reduce the number of cells, which reduces the dimensionality of the data and therefore increases representation stability. The features will naturally fill the hierarchy this way.

sunguralikaan · June 14, 2018, 4:13pm

I have been tinkering with this for some time. The more I read about grid cells and this forum, the more it feels like grids must play a part in pooling. However, I cannot get past some probably basic problems. I’ll just try to brainstorm here.

The grid structure would help a smaller-scale pattern (a local grid patch) to enforce its own pattern on the larger patch, right? If this is the case, there is obviously a lot of redundancy on the larger patch. If pooling is partially achieved by this redundancy, isn’t the number of possible pooling patterns greatly reduced? The pattern has to repeat itself, and if we apply this to an HTM layer (minicolumns make up the grid with local inhibition), the capacity suffers a lot. Does this mean a grid pooling solution has to be above the level of an HTM layer? In other words, in a grid pooling solution, are the points on this grid HTM minicolumns or HTM layers (columns)? If the answer is minicolumns, then the capacity suffers greatly; if the answer is layers, then this pooling solution cannot exist inside an HTM layer. Am I even making sense?

gmirey · June 14, 2018, 4:27pm

Yes

I had no more than brushed against such concerns when I wrote this. But my first playing with the idea with a pen and a sheet of paper, drawing SDRs and inputs/outputs, was indeed pointing towards a hierarchical “module” where the grid-pooler cell population sits above a messier processor (HTM-based, although crunching indifferently upon input, context, and the state of the grid pooler itself…).
I got to this for a number of reasons, not even considering the interesting point you raise here… but maybe this will indeed add to the clues.
I’ll try to come up with some diagrams to refine all this.

Now a first answer about representation size (but I guess @Bitking can provide more insight):
Grid capacity allows for some amount of fuzziness too. Interlock arises from phenomena akin to wave interference patterns, driven by the length of axons and the strength of inhibition… so there is room for some modulation pushing up the capacity a bit.
But quite generally you’re right, that could be a concern.

sunguralikaan · June 14, 2018, 4:29pm

This does not happen with the competition rules in place, especially if you let it run for a while to take full advantage of the available capacity. Actually the inverse happens because of this competition, at least in my experiments where I stack 3-4 on top of each other. I guess the parameters may be tweaked to achieve this, but my intuition is that as long as there is competition there will be specialization to inputs. Did you have the opposite observation? I am curious because I actually expected what you wrote to happen, for the reasons you pointed out.

Bitking · June 14, 2018, 4:32pm

Close.
The patterns do not exactly repeat.
This is one of my quibbles with the Calvin presentation. He shows an army of identical banana tiles as the response to a banana.

This is very misleading- each tile is responding to the part of a larger pattern that each tile shares.

Local to a cell the SDRs formed can only reach as far as the longest dendrite. This spans about a 16 column diameter. For the HTM models where we can sample anywhere in the input space this is not grossly inaccurate as our models hardly ever cover a larger space.
In the cortex there has to be a way to combine this relatively small spatial span to cover larger areas. Calvin tiles do this in a most elegant way - adjacent tiles vote on the bits of a common pattern in a mutually reinforcing way, and this allows a much lower-confidence match with the ground truth to be recognized.
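The effect of neighbor voting on the recognition threshold can be sketched with integer bit counts. Everything here is an assumption made for illustration: the function `recognised_tiles`, the row layout, and the parameters `solo_threshold`, `vote_min`, and `vote_bonus` are hypothetical and not taken from Calvin or from HTM code.

```python
def recognised_tiles(overlaps, solo_threshold=8, vote_min=5, vote_bonus=2):
    """Decide, per tile in a row, whether the shared pattern is recognised.

    overlaps: number of matching bits each tile sees for its own part.
    A neighbour with at least vote_min matching bits casts a vote, and
    each vote adds vote_bonus to the tile's own evidence.
    """
    n = len(overlaps)
    result = []
    for i, own in enumerate(overlaps):
        votes = sum(
            1
            for j in (i - 1, i + 1)
            if 0 <= j < n and overlaps[j] >= vote_min
        )
        result.append(own + vote_bonus * votes >= solo_threshold)
    return result


print(recognised_tiles([6]))        # a lone weak match is rejected
print(recognised_tiles([6, 6, 6]))  # the same weak match, supported by neighbours
print(recognised_tiles([6, 2, 6]))  # a gap in the middle removes the support
```

Each tile sees a different part of the pattern and none reaches the solo threshold alone; it is only the mutual support that carries them over it.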

What the Calvin diagrams should show is the part of the pattern that the tile recognizes, with each one being different.

This is also a comment on clustering - with this spreading of activation, a tile that has learned a pattern can influence an adjacent tile that has not quite learned it to recognize it and learn it. This is slightly different from independent islands of recognition and dimension reduction.

sebjwallace · June 14, 2018, 4:44pm

Neurithmic have built a hierarchy. They have stability as described. They describe their rules for competition in their paper.

http://www.sparsey.com/
http://www.dtic.mil/dtic/tr/fulltext/u2/1006958.pdf (it will take a while to load)

sunguralikaan · June 14, 2018, 4:48pm

For implementation purposes, if I try to boil this down to the “on” tiles of this grid, then from what I understand the pattern is not the same throughout.

So the global pattern is actually a spatial pattern that changes slowly throughout the grid space. This slow or minor change is caused by the interplay dynamics of adjacent tiles. This sounds a lot like local inhibition with overlapping neighborhoods. There seems to be a missing ingredient on my part. How is the larger pattern enforced?

Sorry for the confusion, I was referring to stacking actual HTM Spatial Poolers.

Bitking · June 14, 2018, 4:50pm

See my just-prior post - the patterns are not an exact repeat. Each is a response to both the local pattern and what the neighboring tiles are seeing.

In this case the SDRs are the letters and the tile is a word.

We know from basic HTM theory that the SDRs in a column are capable of recognizing a ridiculously large number of patterns.

We know from the work of Moser et al. that the phase, orientation, and scaling vary, so the tiles overlap each other. This is the basic word that is formed by these SDRs.

Combining a huge alphabet of SDRs with a large number of tiles leads me to believe that capacity is not going to be an issue.
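To put a rough number on the size of that alphabet, one can count the distinct SDRs with `w` active bits out of `n` cells, which is the binomial coefficient C(n, w). The sizes used below (2048 cells with 40 active, and a half-sized population at the same sparsity) are assumed for illustration only; the helper `sdr_capacity` is hypothetical.

```python
import math


def sdr_capacity(n, w):
    """Number of distinct SDRs with exactly w active bits out of n cells."""
    return math.comb(n, w)


full = sdr_capacity(2048, 40)  # assumed population size and sparsity
half = sdr_capacity(1024, 20)  # half the cells, same 2% sparsity

print(f"2048 cells, 40 active: about 10^{len(str(full)) - 1} patterns")
print(f"1024 cells, 20 active: about 10^{len(str(half)) - 1} patterns")
```

This counts only the representable patterns in a single population. How many of them can be reliably learned, and how tiles combine, are separate questions that this count does not settle.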

sebjwallace · June 14, 2018, 4:51pm

As far as I know they work on the same principles. If HTM SPs don’t do this then they should as it is a fundamental requirement that Jeff defined in On Intelligence.

sunguralikaan · June 14, 2018, 4:56pm

Just to clear things up even further: if we are talking about stacking HTM SPs that have lower minicolumn counts as we go up, then this would obviously happen because the representation capacity drops. It is kind of forced. However, if same-sized SPs are stacked on top of each other, higher-level layers do not become more stable, due to competition. If you meant the former in the first place, my bad.

Bitking · June 14, 2018, 4:57pm

One other minor detail, I put the grid-forming cells in Layer 2/3.

The HTM temporal sensing cells are deeper in the stack.

I see the two mechanisms as working synergistically.

My current take on how this works is that the maps with these mechanisms are passing data back & forth through both direct cortico-cortical (CC) connections and interaction with the cortico-thalamo-cortical (CTC) connections. The CC connections can be thought of as direct data and the CTC connections are more tonic.

I see that there is a bigger world than the basic HTM model.

The L2/3 cells in the earlier perceptual levels may just be used to enforce sparsity. The early brain maps used histological differences to decide what the brain areas are, and these differences are large enough that you can easily see them with low-power microscopes. It would not surprise me at all if the ratio between the dendrite arbors and inhibitory cells varied from one area to the next, so that some can form grids and some just have cleanly defined edges to the sensed object. Since they don’t have to resonate to form tiles, these L2/3 layers could run at much higher speeds, at beta or lower-frequency gamma waves. Grid-forming has not been observed in the early stages of perception streams; only around the hubs of each lobe.

So to answer an earlier question - yes, grids would be formed at the higher level.

sebjwallace · June 14, 2018, 5:11pm

Yup, I meant the former, as I was explaining in my original post.

I wonder if there is a reason not to decrease the number of columns as you go up. Perhaps to do concurrent contraction and expansion of feature space? My guess would be that expansion would do the same as autoencoders, in that features synthesize/interpolate going up.

gmirey · June 14, 2018, 5:12pm

I’ll need to give your view some more thought. Thanks.

One thing already, though: Calvin-like grids could give you some other benefits “for free”, such as auto-solving the “finger voting” scenario of the current HTM SMI paper.

sebjwallace · June 14, 2018, 5:20pm

Are Calvin grids from the Cerebral Code book? I’ve been meaning to read it. Actually I’ve also been meaning to read @Bitking’s HTM Columns Hexagonal Grids as that looks pretty cool from a skim through.

From what I understand, Calvin focuses on recurrent collaterals. There have been others who suggest that these oscillation dynamics breed a multitude of functions ‘for free’. It is almost as if cells within an oscillatory group are polymorphic, in that they are the same cells but take part in many different functions. So it is not like ‘that cell deals with sensory’, or ‘that cell works with motor’, or ‘those cells are making decisions’, but rather they are all doing those things to varying degrees using the same sort of dynamics.

I really recommend the books Rhythms Of The Brain (Buzsaki) and Cerebral Cortex (Rolls) in which they both express this sort of concept.

sunguralikaan · June 14, 2018, 5:25pm

It helped to further distinguish activations from one another via the SP. At one point I needed clear-cut motor commands for the HTM agent I worked on, which took layer activation as input.

sebjwallace · June 14, 2018, 5:27pm

Ah of course, yes. I almost forgot SPs/regions combine lower-level regions together. There should still be reduction though?
