I’ve been toying with an idea lately. It came from thinking at the crossroads of several discussions: with @Paul_Lamb about sub-sequences; with @Bitking about my proposal of supervised learning by hierarchy and/or teacher effects (which also crosses questions raised by @sebjwallace, and the ‘everything is a Kitty’ idea from @MaxLee); and finally the catch-22 of unsupervised learning in a hierarchy, that is, “what part to train?” (given also that training higher levels first seems to give better results… but then, how?).
Now I feel there could be a scheme which takes each of those concerns into account. Let me try to express it.
Now, I don’t know whether that idea is really worth the wear of my keyboard, typing it… but I realize that for this to work anyway, we’d first need to enforce that high-level stability somewhat. The HTM spatial pooler may do some of that already by allowing only the few ‘most’ excited cells to actually fire… but I guess we’d need something a little more self-sustaining than that… while staying at the same overview level.
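To make the "only the few most excited cells fire" step concrete, here is a minimal sketch of k-winners-take-all selection. It assumes global inhibition and leaves out boosting, learning and everything else a real spatial pooler does; the function name `k_winners` and the example scores are made up for illustration and are not taken from any HTM library.

```python
import numpy as np

def k_winners(overlap, k):
    """Return a binary vector with 1s at the k largest overlap scores.

    Toy stand-in for the inhibition step only (global inhibition assumed).
    """
    overlap = np.asarray(overlap, dtype=float)
    # Sort by descending overlap; a stable sort keeps ties in index order.
    winners = np.argsort(-overlap, kind="stable")[:k]
    active = np.zeros(overlap.shape[0], dtype=np.uint8)
    active[winners] = 1
    return active

# Hypothetical overlap scores for six cells; only the two strongest fire.
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
active = k_winners(scores, k=2)
```

Nothing in this selection is self-sustaining: the winners are recomputed from scratch for every input, which is the gap the rest of the post is concerned with.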
Stable-Enforced Pooling ?
Then comes Calvin and his grid resonance thing… proposing that small cortical patches would interlock at somewhat fixed intervals (about half-millimeter on a side).
I’m not proposing to simulate the chaotic attractor effect and complex inhibition mechanisms leading to this… nor even discussing whether he’s right or wrong from a neuroscience point of view… rather to salvage this idea and ‘assume that they do’ for our model purposes, so that a ‘stable-enforced-pooler’ is made of tiles, each tile strongly pushing its neighbors to settle down to a same-minded tile.
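One very crude way to picture "each tile pushing its neighbors to settle down" is a consensus dynamic in which every tile repeatedly blends its state with the average of its four neighbors. This is only a sketch under loud assumptions: real-valued tile states, a wrap-around square grid and linear averaging are all invented here for illustration, and none of it is Calvin's mechanism or an existing HTM component.

```python
import numpy as np

def settle(states, pull=0.5, steps=50):
    """Let tiles pull each other toward a shared state.

    states: (rows, cols, features) array, one feature vector per tile.
    pull:   how strongly a tile follows its neighbours (assumed parameter).
    """
    states = np.array(states, dtype=float)
    for _ in range(steps):
        # Average of the four neighbours on a wrap-around (torus) grid.
        neighbours = (
            np.roll(states, 1, axis=0) + np.roll(states, -1, axis=0)
            + np.roll(states, 1, axis=1) + np.roll(states, -1, axis=1)
        ) / 4.0
        states = (1.0 - pull) * states + pull * neighbours
    return states

def disagreement(states):
    """Largest per-feature spread across tiles (0 means all tiles agree)."""
    return float(states.std(axis=(0, 1)).max())

rng = np.random.default_rng(0)
tiles = rng.random((4, 4, 8))   # 4x4 tiles, 8 hypothetical features each
settled = settle(tiles)
```

With these settings the spread across tiles shrinks toward zero, so the patch ends up "same-minded". A linear blend like this simply converges to the average; it has none of the attractor or inhibition dynamics, which is exactly the part being assumed away here.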
I don’t have the specifics of this, mind you. But I can see already that there is potential for some computer-side optimizations. And if the big-picture concept of despotic areas in a hierarchy turns out half-as-good as my idea of it, we may have :
Some nice model for HTM-based hierarchies
Which supports online learning in a fully unsupervised way
Which allows any amount of supervision (or, more loosely, teacher effect) thrown into the mix.
Which could have an internal drive to solve on its own the sequence-or-subsequences questions
Note that underlying all this, I also have a personal preference for viewing distal dendrites as potential sniffers for any contextual info they can access… so it’s not necessarily the HTM canon of sampling t-1 for TM purposes. They may sample t-1, they may sample other stuff (such as the soon-to-be-published location signal? BTW, couldn’t location also stand for ‘index in apical-context’ when concerned about sequences?) so that all this is solved in a homogeneous manner, not reliant upon the minicolumn thing… I don’t know if this is why I see those things as I see them, or if that’s not too relevant to the matter at hand. Well… Your call, guys.
Also, please consider reading @Bitking’s most excellent introduction on the subject of Calvin & Grids.
This gets to one of the major issues in training a hierarchy: how does the training distribute between the layers?
If you start from the bottom up there is no direction to work toward - no guidance. The usual hope is that the training will somehow accumulate until there is some small error and then “spill over” to the higher levels.
If you take the top level as one end of a chain there is a target or attractor to shape the exploration of search space.
At the risk of muddying the waters: That top-level guidance could start in the crude guidance afforded by the subcortical structures (the lizard brain) which are grossly correct from evolution.
As the agent learns its body by being pushed by the lizard brain, the ground truth of the universe it finds itself in pushes on the “other” end of the various hierarchies, so there is a dual force (feedforward and feedback) driving the exploration of the search space of solutions.
Somewhere in the middle of this hierarchy is the formation of grids/Calvin tiles that act to quantize the representation and propagate it laterally across the maps that form the hubs of the cortical lobes. This takes the highly local action of individual SDRs (really - in the brain an SDR can only be formed and expressed across the reach of a single dendritic arbor) and joins them into larger mutually reinforcing patterns.
If I am reading you correctly, you are saying that this large stable state forms an attractor, the boss, during unsupervised learning, and so acts as a teacher. It is flexible and can be trained to a new state if the world does not agree, but in the interim it forms its best guess and imposes it on the lower-level maps as training material. This is what is compared to the ground truth of the perceived universe.
Yes, I guess what I fuzzily perceive is that our messy perception networks are for a large part “feedforward”, but only up to a point. Yes, we’re all aware here that feedback connections carry a large amount of info, but we don’t quite know what to make of them other than envisioning that “they help in recognition too”. But the thing is that they do not appear to benevolently “help understanding”. They’re also quite coercive, because they have reasons to be more stable.
I see this as potentially untying the knot of “how does the training distribute between the layers” when we’re following a pure online-learning scheme as HTM does.
Noting that reaching a stable layer (that… resists and echoes back ?), is both amenable to full-unsupervised (those stable layers will tend to form by themselves) and supervised training modes (model imposes a stable input somewhere close to high areas, or right-away impose the stable SDR to the highest layer).
I’d also put some of your lizardy prewirings in the “supervised” side of the training approach. Why not.
Stacking SPs should automatically produce stability as you go up. This is the key idea behind dimensionality reduction. If you had an 8-feature space at the top, then the huge amount of input data would be reduced and stabilized to those 8 features.
How about using something like clustering? Essentially doing the same as SP. The representation space is determined by the number of cells in a layer. As you go up the hierarchy you reduce the number of cells so therefore reduce the dimensionality of the data so therefore increase representation stability. The features will naturally fill the hierarchy this way.
I have been tinkering about this for some time. The more I read about the grid cells and this forum, it feels like there has to be a part that grids play in pooling. However, I cannot get past some probably basic problems. I’ll just try to brainstorm here.
The grid structure would help a smaller scale pattern (a local grid patch) to enforce its own pattern to the larger patch right? If this is the case, there is obviously a lot of redundancy on the larger patch. If pooling is partially achieved by this redundancy, isn’t the number of possible pooling patterns greatly reduced? The pattern has to repeat itself and if we apply this to an HTM layer (minicolumns make up the grid with local inhibition) the capacity suffers a lot. Does this mean a grid pooling solution has to be above the level of HTM layer? In other words, in a grid pooling solution, are the points on this grid HTM minicolumns or HTM layers (columns)? If the answer is minicolumns then the capacity suffers greatly, if the answer is layers then this pooling solution cannot exist inside an HTM layer. Am I even making sense?
I had barely brushed against such concerns when I wrote this. But my first playing with the idea with a pen and a sheet of paper, drawing SDRs and inputs/outputs, was indeed pointing towards a hierarchical “module” where the grid-pooler cell population sits above a messier processor (HTM-based, although crunching indifferently upon input, context, and the state of the grid pooler itself…)
Got to this for a number of reasons, not even considering the interesting point you raise here… but maybe this will indeed add to the clues.
I’ll try to come with some diagrams to try and refine all this.
Now some first answer about representation size (but I guess @Bitking can provide more insight)
Grid capacity allows for some amount of fuzziness too. Interlock arises from phenomena akin to wave interference patterns, driven by the length of axons and the strength of inhibition… so there is room for some modulation pushing up the capacity a bit.
But quite generally you’re right, that could be a concern.
This does not happen with the competition rules in place, especially if you let it run for a while to take full advantage of the capacity available. Actually the inverse happens because of this competition, at least in my experiments where I stack 3-4 on top of each other. I guess parameters may be tweaked to achieve this, but my intuition is that as long as there is competition there will be specialization to inputs. Did you have the opposite observation? I am curious because I actually expected what you wrote to happen, for the reasons you pointed out.
The patterns do not exactly repeat.
This is one of my quibbles with the Calvin presentation. He shows an army of identical banana tiles as the response to a banana.
This is very misleading: each tile is responding to its own part of a larger pattern that the tiles share.
Local to a cell the SDRs formed can only reach as far as the longest dendrite. This spans about a 16 column diameter. For the HTM models where we can sample anywhere in the input space this is not grossly inaccurate as our models hardly ever cover a larger space.
In the cortex there has to be a way to combine this relatively small spatial span to cover larger areas. Calvin tiles do this in a most elegant way - adjacent tiles vote on the bits of a common pattern in a mutually reinforcing way, and this allows a much lower confidence match with the ground truth to be recognized.
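A toy way to see why mutual reinforcement lets a lower-confidence match through: give each tile a local match score and let it add a share of its neighbors' average score before thresholding. Everything here is an assumption made for illustration (a 1-D row of tiles, the `gain` and `threshold` values, the scores themselves); it is not a model of the actual cortical dynamics.

```python
def with_lateral_support(scores, gain=0.5):
    """Add to each tile's score a share of its adjacent tiles' mean score."""
    boosted = []
    for i, own in enumerate(scores):
        # Adjacent tiles in a 1-D row (edge tiles have a single neighbour).
        neighbours = scores[max(0, i - 1):i] + scores[i + 1:i + 2]
        support = sum(neighbours) / len(neighbours) if neighbours else 0.0
        boosted.append(own + gain * support)
    return boosted

def recognised(scores, threshold=0.6):
    """A tile 'recognises' its part of the pattern at or above threshold."""
    return [s >= threshold for s in scores]

# Every tile matches its own part only weakly, but they all agree.
weak_but_consistent = [0.45, 0.5, 0.4, 0.5, 0.45]
# One tile matches moderately while its neighbours see almost nothing.
lone_tile = [0.1, 0.5, 0.1]
```

Taken alone, no tile in `weak_but_consistent` reaches the threshold, but with lateral support all of them do, whereas the middle tile in `lone_tile` stays below it because its neighbors give it almost nothing to lean on.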
What the Calvin diagrams should show is the part of the pattern that the tile recognizes, with each one being different.
This is also a comment on clustering - with this spreading of activation a learned pattern can influence an adjacent tile that has not quite learned a pattern to recognize it and learn it. This is slightly different than independent islands of recognition and dimension reduction.
For implementation purposes, if I try to boil this to the “on” tiles of this grid, then the pattern is not the same throughout from what I understand.
So the global pattern is actually a spatial pattern that changes slowly throughout the grid space. This slow or minor change is caused by the interplay dynamics of adjacent tiles. This sounds a lot like local inhibition with overlapping neighborhoods. There seems to be a missing ingredient on my part. How is the larger pattern enforced?
Sorry for the confusion, I was referring to stacking actual HTM Spatial Poolers.
Just to clear things up even further: if we are talking about stacking HTM SPs that have lower minicolumn counts as we go up, then this would obviously happen because the representation capacity drops. It is kind of forced. However, if same-sized SPs are stacked on top of each other, the higher-level layers do not become more stable, due to competition. If you meant the former in the first place, my bad.
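To put rough numbers on "the representation capacity drops", one can count how many distinct SDRs a layer of n minicolumns can express with k of them active, which is the binomial coefficient C(n, k). The layer sizes and the 2% sparsity below are assumed values chosen for illustration, not figures from this thread.

```python
import math

def sdr_capacity(n_columns, sparsity=0.02):
    """Number of distinct SDRs with round(n * sparsity) active columns."""
    k = max(1, round(n_columns * sparsity))
    return math.comb(n_columns, k)

# Hypothetical stack that halves the minicolumn count at each level.
layer_sizes = [2048, 1024, 512, 256]
capacities = [sdr_capacity(n) for n in layer_sizes]

for n, c in zip(layer_sizes, capacities):
    # Print the order of magnitude rather than the full integer.
    print(f"{n:5d} columns -> about 10^{len(str(c)) - 1} possible SDRs")
```

The count falls steeply at each level of such a stack, which is the "kind of forced" stability described above; a stack of same-sized layers keeps the same count at every level.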
One other minor detail, I put the grid-forming cells in Layer 2/3.
The HTM temporal sensing cells are deeper in the stack.
I see the two mechanisms as working synergistically.
My current take on how this works is that the maps with these mechanisms are passing data back & forth through both direct cortico-cortical (CC) connections and the cortico-thalamo-cortical (CTC) connections. The CC connections can be thought of as direct data and the CTC connections are more tonic.
I see that there is a bigger world than the basic HTM model.
The L2/3 cells in the earlier perceptual levels may just be used to enforce sparsity. The early brain maps used histological differences to decide what the brain areas are, and these differences are large enough that you can easily see them with low-power microscopes. It would not surprise me at all if the ratio between dendritic arbors and inhibitory cells varies from one area to the next, so that some areas can form grids and some just have cleanly defined edges to the sensed object. Since they don’t have to resonate to form tiles, these L2/3 layers could run at much higher speeds, at beta or lower-frequency gamma waves. Grid-forming has not been observed in the early stages of perception streams; only around the hubs of each lobe.
So to answer an earlier question - yes, grids would be formed at the higher level.
Yup, I meant the former, as I was explaining in my original post.
I wonder if there is a reason not to decrease the number of columns as you go up. Perhaps to do concurrent contraction and expansion of feature space? My guess would be that expansion would do the same as autoencoders, in that features synthesize/interpolate going up.
Are Calvin grids from the Cerebral Code book? I’ve been meaning to read it. Actually I’ve also been meaning to read @Bitking’s HTM Columns Hexagonal Grids as that looks pretty cool from a skim through.
From what I understand, Calvin focuses on recurrent collaterals. There have been others who suggest that these oscillation dynamics breed a multitude of functions ‘for free’. It is almost as if cells within an oscillatory group are polymorphic, in that they are the same cells but take part in many different functions. So it is not like ‘that cell deals with sensory’, or ‘that cell works with motor’, or ‘those cells are making decisions’, but rather they are all doing those things to varying degrees using the same sort of dynamics.
I really recommend the books Rhythms Of The Brain (Buzsaki) and Cerebral Cortex (Rolls) in which they both express this sort of concept.