Stable-Enforced Pooling

hierarchy
spatial-pooling
grid
calvin
unsupervised-learnin

#22

Yes.

Another thought: I think there is more to what I am after than just Calvin grids. I saw them as a means (and it seemed like a pretty good one at that) to achieve stability somewhere. But the core idea is mostly the coercion derived from it.

Pardon my brainstorming here :

So the griddy ‘boss’ is in essence situated on a layer towering over the messy bits, as hinted by @sunguralikaan. The coercion it provides lies in deciding that his dominion shall learn without a revolution, because the boss layer output is itself part of what “the messy bits” process. Such that… when it refuses to change, they need to work out a “meaning” (learn towards prediction) from other clues, without any change in its own signal… thus increasing the chance that they’ll indeed wire on pertinent t-1 channels if processing a temporal sequence. All such context shall be indifferently available for distal wiring.

In turn, it sends basal input to the higher-module “messy bits”, and sends apical feedback to the lower-module “boss”.


#23

I’m throwing this out there, might muddy the waters, but it was inspired by bits of what I’m reading here; ignore it if it doesn’t make sense :slight_smile: .

In the hierarchy we seem to be converging on here, which is semi-stable from the top down (with occasional variances in the lower level), might creativity and abstract thought then be a top-down, pyramid-like activation that propagates down the connected structures, activating a higher-than-normal number of those interconnected neurons, which in turn send their signals out?

Going bottom to top, encoders (whatever their biological equivalents might be, plus middle-men organs such as the lizard brain) send signals into the lower layers of the cortical column, which has its own hierarchy (which we’re teasing together, slowly). The minicolumns are mapped to more than one spot in the input space, naive to whatever that input might be. This works its way up to the higher levels, etc.

My understanding (which may be flawed) is that a spatial pool is a many-to-one mapping, with multiple points in the input space (or input spaces) potentially activating a single minicolumn… what if randomly reversing this signal from the top down led to semi-random outputs (i.e. encoder output, rather than input), which then feed back or feed laterally into neighboring columns? It seems like this scheme, which is also biologically plausible, would lead to random, abstract connections between different processing regions, helping with imagination or theory building.

Just a thought. Take it, leave it, poke at it. I just love this conversation that I see here.


#24

It worked more like this: when the lower layer minicolumn activation changed by 25%, the higher level minicolumn activation changed by 90%. Of course these percentages change with permanence parameters. I guess it is still a reduction in some sense, but the higher level layer amplified the lower layer change.


#25

About capacity concerns.

Let’s say we consider a 2D patch of 2,000 cells, sparsely filled at 2% (40 of them active).
SDR maths, if we were to pick those 40 at random, account for a staggeringly large number of encoding possibilities.

Now if we add constraints relative to a grid:
Each triangle of roughly 50 cells has only one ‘position’ possibility. That consideration already greatly reduces the number of distinct encodable concepts here. But… it’s still somewhat huge (50 to the 40th power).
Now indeed if each of those enforces its neighbors to align to the same position, the total number for the whole 2000-cell patch gets down to a ridiculous ‘50’.
But that’s not considering grid rotation. All those are conceivable. On a given 50-unit discrete equilateral triangle, the number of distinct rotation schemes (when corner order is irrelevant) is at least 10. That raises our alphabet to a more satisfying ‘500’ already.
(More satisfying ? Yes it is not a staggering number… but it’s getting somewhat close to what I understand seems ‘satisfying’ from Numenta’s point of view for the number of concepts that a similar-sized patch processing SMI can play with).
Now, if we introduce back some amount of tolerance from perfect-geometrical grids, then per tile I wouldn’t be surprised that a rule-of-thumb ~7 different positions could at the very least qualify for that same grid. That would bring the number of concepts that our 2000-sized area plays with towards 500x(7^40), and we’re thus back towards the ‘staggeringly huge’ end of the matter.
And that’s considering that the griddy sheet is perfectly 2D. If there is some amount of depth to the grid-forming stuff, you can assume each unit in depth is to be multiplied to the value exponentiated above. Eg for a 4x2000 sheet, 4-units deep, your capacity is likely 500x(28^40)… [Edited… sorry for my poor math here]
(…Except we’d also be modifying the originally given sparsity considering this. Please note, BTW, that HTM-proposed 2% sparsity, giving rise to the “1 active cell per 50-cells triangle” figure, seems quite consistent with biological support for a ~0.5mm spaced Calvin-grid overlay on a 50µm-spaced cell-lattice… although there could be some debate on the 50µm spacing side of the matter).
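To make the orders of magnitude above easier to compare, here is a minimal sketch that simply recomputes the figures quoted in this post. The function name and dictionary keys are made up for illustration; the numbers (2,000 cells, 40 active, 50-cell tiles, 10 rotations, ~7 tolerated positions, depth 4) are taken as given from the text, not derived.

```python
import math

def capacity_estimates(cells=2000, active=40, tile=50,
                       rotations=10, tolerance=7, depth=4):
    """Recompute the rough capacity figures quoted in the post."""
    return {
        # 40 active cells picked freely among 2,000
        "unconstrained": math.comb(cells, active),
        # one active cell per 50-cell tile, tiles independent
        "one_per_tile": tile ** active,
        # every tile forced to the same position as its neighbours
        "aligned_grid": tile,
        # ... times the distinct grid rotations
        "with_rotation": tile * rotations,
        # ... times ~7 tolerated positions per tile
        "with_tolerance": tile * rotations * tolerance ** active,
        # ... on a sheet 4 units deep (7 * 4 = 28 choices per tile)
        "with_depth": tile * rotations * (tolerance * depth) ** active,
    }

for name, value in capacity_estimates().items():
    print(f"{name:>15}: about 10^{math.log10(value):.1f}")
```

Printing the base-10 logarithm keeps the ‘staggeringly huge’ cases readable next to the 50 and 500 cases.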

Also, at some point, reasoning about the capacity of large patches also has to take into account that grid-forming is a general drive for the layer, but is not ‘ensured’ either… there is still potential for misalignment at the edge of two really disagreeing “bosses”.
On large-scale models, that usage is indeed more biologically relevant if most such “large patches” are neighboring each other, as they would on a flattened view of neocortex.


#26

Alright, so not only do we achieve pooling via grid formation, but also a grid-cell-like encoding that could overcome the capacity drop. I have a grid cell implementation embedded in the agent at the moment. It is not currently working at the level I hoped. I am having a tough time anchoring it to sensory input because my agent can also teleport. So I will keep on poking here with some potential difficulties for the sake of brainstorming.

The other obvious difficulty I see here is how do we path integrate feature or object representations to make use of the encoding (varying phase, orientation, scaling)? The navigational path integration of grid cells is done via speed cells, head direction cells and sensory modulation. At the very superficial level, it is doable with just a displacement vector, which is what I have done, just like Numenta’s Superficial2DLocationModule function in their location module. How would anyone approach path integration of objects/features? Would there be path integration unique to each object? (which seems like how Numenta is tackling the problem currently) Or would this grid represent all the objects in a single path integration where tiles represent objects? I guess both are the same thing in different scales. After all, objects are composed of objects.
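For readers unfamiliar with the displacement-vector approach, here is a minimal sketch of what path integrating a single grid module could look like. It is not Numenta's implementation and not the agent's code: the function `path_integrate`, its parameters, and the simplification to a square (rather than hexagonal) tiling of phases are all assumptions made for illustration.

```python
import math

def path_integrate(phase, displacement, scale=1.0, orientation=0.0):
    """Advance a 2D grid-module phase by a displacement vector.

    phase        -- (u, v), each in [0, 1): position within one grid period
    displacement -- (dx, dy) movement in world units
    scale        -- grid period in world units (assumed parameter)
    orientation  -- grid rotation in radians (assumed parameter)
    """
    dx, dy = displacement
    # Rotate the displacement into the module's own frame.
    c, s = math.cos(-orientation), math.sin(-orientation)
    rx, ry = c * dx - s * dy, s * dx + c * dy
    # Express it in units of the grid period and wrap around.
    return ((phase[0] + rx / scale) % 1.0, (phase[1] + ry / scale) % 1.0)

start = (0.3, 0.6)
moved = path_integrate(start, (0.7, -0.4), scale=2.0, orientation=0.5)
back = path_integrate(moved, (-0.7, 0.4), scale=2.0, orientation=0.5)
print(moved, back)
```

Moving by a displacement and then by its opposite should bring the phase back to where it started, which is the property that makes anchoring to sensory input (and teleporting) the hard part rather than the integration itself.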

Moving on to the actual path integration, what is the information that corresponds to speed, head direction or sensory modulation in this case? Also, can we calculate a “displacement vector” for path integrating features? I know from their location module, Numenta’s answer would be to obtain allocentric positions of features relative to the object. If we assume that this is true, then we cannot separate this pooling solution from positional information. What is essentially pooled becomes a location or a bunch of locations, not features. Can there be anything else that gives us some sort of a displacement vector other than allocentric locations? Is it possible to path integrate without some sort of a location? Also, I cannot imagine how this solution would extend to objects of objects at this point in time.


#27

@Bitking sees a relation between ‘grid cells’ and ‘cell grids’. At the moment I don’t. So I’ll let him discuss this ^^.

@sebjwallace, I’ll keep on educating myself about the matter you raise here (thanks for the links, btw). At present I fear I don’t have the prerequisites to give the well-informed answer it deserves… an answer which maybe algebra-fluent thinkers, or PhD-level and actual experimenters such as @sunguralikaan, could indeed give.

If that does not bother you, in the meantime, though, we may discuss some of my intuitions leading to this:

  • Existing strategies for full, bottom-up, unsupervised training, of multilayer networks reportedly show poor results. Early layers crunch upon stuff without a good objective and produce very suboptimal categorization. In turn, late layers suffer from that and produce poor abstractions.
  • It is posited that having, as fast as possible, useful abstractions forming on the higher levels, is critical to the success of the network… (In my view this turns the approach towards ‘supervised learning’ once the higher level is ‘useful’… and supervised learning models are indeed kinda successful at some stuff… so it wouldn’t be too surprising)… But this is obviously a catch-22 problem.
  • Some models [1] were even able to demonstrate this by de-facto avoiding this catch 22, employing methods that they claim have some biological relevance… but I’m reluctant to embrace that view. So I was trying to reconcile this with HTM and our fully local understanding of ‘prediction’ as a particular cell state.
  • In my initial draft to solve it back to HTM proposals, then, to learn from a surprise (misprediction), knowing which layer to ‘correct’, we needed to be able to actively distinguish and signal between two cases:
    • “things that could have been predicted given the known context if we had known better”, ie there was a perceivable info potential that was not sampled at all. In which case we grow a segment towards it and will hopefully know better next time.
    • “things that seem to make no sense at all”, ie we need to find a model for it asap. In which case we need to be very loud to drive higher level learning.
  • The above is, to my eyes, a way to fast-bubble up learning “urges” towards higher-level understandings first… However, it simply did not seem to be directly implementable, biologically, by “signalling” those distinct cases of misprediction… at least not given my current knowledge.
  • But thinking about it another way, stability and coercion could, in essence, lead to the same bubble effect. Either there is a revolution and the higher level has to learn, or there is none and the lower level is on its own to solve it.
  • So to even believe in my own idea, I had to come up with a somewhat more “stable” representation between layers than those used by “Existing strategies for full, bottom-up, unsupervised training” which “reportedly show poor results”.
  • Crossing thoughts with concerns of sup / unsup learning, Kitties, door handles, cup handles, sequences and subsequences… this kinda rings a bell somewhere, for me.

Now. Some layman intuitions about the difference in stability between the grid and your proposal that reduced capacity is a sufficient condition to an effective increase in stability.

  • The griddy layer, imho… would have stripped out some (maybe a lot) of the potential ‘semantic overlaps’ one could expect (and hope) of SDRs formed by vanilla pooling.
  • So… It necessarily enforces some further stability, since the grid layer’s understanding of the world, on any subject, departs from lots of shades of grey, to a more black-or-white encoding.
  • How is that a good thing ? I don’t know, but I think it also points towards the reported “back-and-forth” nature of most of our perceptions, when dealing with uncertainty.

When looking at the following picture:
WireCube

At any given time, your visual areas allegedly interpret this as either ‘a cube seen from above’ or ‘a cube seen from below’.
Not at all… ‘yeah, well… a little bit of both’

If you’re ever able to conceive that it is ‘a little bit of both’ (or even that it is in essence not a cube at all, but simply a bunch of lines drawn out there)… I would propose this is worked out by some much more complex system in your brain, such as frontal cortex and working memory interactions.

[1] Deep Predictive Learning: A Comprehensive Model of Three Visual Streams
https://arxiv.org/abs/1709.04654
(originally pointed out by @jakebruce some months ago, if he were to drop by)


#28

I have thought about what I’d consider a first experiment… When you have had time to think about this, please tell me if you’re onto this same approach, or if I’m totally off…


Recap of HTM, mk 2016:
HTM_2016


Then first HTM-based trial for adding a grid pooler above this :


I’ve also tried to think about a bigger picture… Probably very wrong, but…
First draft for a full cortical patch:


#30

Nice drawings :facepunch:


#31

These are great diagrams!

I am guessing the outputs of supervisors are distally connected to target layers. The simple architecture above makes it seem like they should be distal, but they are painted green on the one below. From what I understand, black is output, blue is distal, orange is proximal, green is proximal, but I am not sure about dashed orange and dashed green.

I would like to ask something to understand it better and hopefully challenge it for a better one:

Specifically for the simple hierarchy (the first one), what differentiates a supervisor being this
[image]
from this
[image]

From my perspective:
The rotation would be the same even if there was any. I am guessing this is 2D, so no other dimension. There is a single supervisor that cannot be in conflict with another. The only thing left is there could be errors in this alignment in the first supervisor which would make it different than only one tile. Curious about your interpretation.

Edit: On second thought, ‘on’ bits would not end up in the same place for every tile if it got rotated, right? At least given some period.

Edit2: Can a supervisor change its orientation and scale on the fly? Can it have 30 degrees for a certain input and 40 degrees for another? Can that happen naturally with inhibition rules?


#32

On last schema, Solid Black are abstractions of long range axons. Solid Blue is what I’m almost sure should be internally distal indeed. Solid Orange is input from either lower layer or thalamus.
Dashed is stuff I’m less sure about (and color scheme for these ain’t really representative of anything, sorry)

HTM distinguishes between proximal and distal. Which are indeed two different channels from the neuron model point of view, however I’m not quite sure whether they’re always distinguishable as a potential signal. I’d guess each afferent axon available to proximal location is also available to distal dendrites, sniffing for it, belonging to cells on same layer… but those connections are not, to my knowledge, part of what HTM 2016 was a model for anyway.

Rightfully nothing, from the point of view of a dendritic segment which would be interested in the output from that single cell…
… all the way to “The difference between a successful NMDA spike and none at all”, from the point of view of a dendritic segment which would be synaptically wired to this one, and a specific one from next tile, and another, and… all this at a specific orientation.

A grid which would resonate, calvin-like, does not seem to vary much in scale. aka ~0.5 mm due to mean axonal length and surround inhibition.
In my mind any pair of (modulo_position, rotation) is a different instance of a supervisor (ie, the ~500 estimate in the post about capacity… in the ‘pure’ geometrical case, but tolerance is quite important in my view). Switching from any of these instances to any other requires some fair amount of distinction in input (which is the whole point of the ‘stability’ idea) but is of course perfectly possible.

Unless you specifically speak about this very coarse picture showing a ‘grid’ only 2 cells apart for illustration purposes, my idea of it is of course more like what the post about capacity tells it is (about 10 cells on a side, and all latitude here for at least 10 distinct rotations)


#33

Some visualizations about these rotated grid concerns, as I guess this would help a lot…

GridRotation
On the image above, the green colored cells are those forming a pure, 10-spaced, lattice-aligned grid relative to the one colored in orange.
But even when using the lattice to approximate circle concerns and help with tiling somewhat, you see there is some room for other rotations. In blue is one which is rotated 2/9 relative to the lattice alignment (while keeping its ‘pure’ 10-spaced status… that is, not considering ‘tolerance’ effect, see below)


Recap of some of the underlying stuff driving grid formation:


You see above that whatever the position or orientation, the ‘scale’ here would not change much from one grid instance to another… but you see that there is also potential for some under-tolerance modulation of any specific tile.
Also keep in mind that what is represented above is the grid-forming potential. I assume the intra-grid neighboring connection will also be driven by ‘learning’ : ie. synaptically reinforce to actual known instances.


Some illustration of the kind of variations available:
GridWithTolerance

Note that I haven’t given much thought to what would be realistic tolerance values… I’m just confident that there is some tolerance. So, the width of 3 cells in the rings above is kinda arbitrary. Experiments welcome.
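As a starting point for such experiments, here is a minimal sketch that lists which cells fall inside a tolerance ring around a given cell. It assumes a hexagonal cell lattice of unit pitch addressed in axial coordinates, which may not match the lattice used in the pictures; `ring_cells` and its parameters are hypothetical names.

```python
def ring_cells(spacing=10, tolerance=1):
    """Cells of a hexagonal lattice whose distance to the origin cell lies
    within `tolerance` of `spacing`, in axial coordinates (a, b)."""
    lo = (spacing - tolerance) ** 2
    hi = (spacing + tolerance) ** 2
    reach = 2 * (spacing + tolerance)  # generous bound on |a| and |b|
    cells = []
    for a in range(-reach, reach + 1):
        for b in range(-reach, reach + 1):
            # Squared Euclidean distance on a unit-pitch hexagonal lattice
            if lo <= a * a + a * b + b * b <= hi:
                cells.append((a, b))
    return cells

for tol in (0, 1, 2):
    print(f"tolerance {tol}: {len(ring_cells(10, tol))} candidate cells")
```

Under this assumed lattice, a tolerance of 0 leaves only the six lattice-aligned neighbours at exactly 10 cells, while a ring 3 cells wide (tolerance 1) admits many more candidate positions, and therefore more possible orientations.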


#34

On reflection, the grids here seem to reflect more of a 1% fill rate. Something must have been wrongly wired among my own neurons.

Aiming at a perfect 2% fill rate falling on an integer spacing could be a 2-deep sheet, and a 5-spaced grid.
It makes for a 225-sized alphabet on a larger-patch instance, each with a binary variation at the tile level, already when kept ‘pure’.

Another close call for the 2% would be a 2D, 7-spaced grid, giving a 343-alphabet on a larger-patch instance, but requiring an under-tolerance variation scheme for any modulation at the tile level. Possibly more room for tolerance width than the 5-spaced scheme though.
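The fill rates mentioned in this post can be checked with a small sketch. It assumes one active cell per grid node and that a grid of period s repeats every s × s lattice positions (a rhombic unit cell), each position holding `depth` cells; `fill_rate` is a hypothetical helper, and the alphabet sizes (225, 343) are not derived here.

```python
def fill_rate(spacing, depth=1):
    """Fraction of active cells, assuming one active cell per grid node."""
    # One node per spacing x spacing lattice positions, `depth` cells each.
    return 1.0 / (spacing * spacing * depth)

for spacing, depth in [(10, 1), (5, 2), (7, 1)]:
    print(f"{spacing}-spaced, {depth}-deep: {fill_rate(spacing, depth):.2%}")
```

Under these assumptions the 10-spaced 2D grid comes out at 1.00%, the 5-spaced 2-deep sheet at 2.00%, and the 7-spaced 2D grid at 2.04%.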


#35

Just a thought:
Why I think taking tolerance into account is important, as compared to simple tile-level variations (which multi-depth models could bring already), is that it allows that some of the tile-to-tile variations will finally translate to possible matches between distinct orientations on a larger scale.

It’s this balance between stability and flexibility which gives the thing its appeal, I guess.

However it makes some of my initial plans for computer-side tiled-abstractions a no-go. I need to think further about that.


#36

Still not an expert at this, but maybe part of an answer.
An MD5 hash may have far fewer bits than its input. It is thus more stable in that sense. However, it is also designed in a way ensuring its instability to tiny input variations. So, both concepts (size reduction and stability) do not necessarily go hand in hand, even though you may argue a smaller encoding is necessarily ‘more stable’.
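The instability to tiny input variations is easy to observe with Python's standard `hashlib`. This is only a quick illustration of the analogy; the helper name and the sample input are arbitrary.

```python
import hashlib

def md5_bit_difference(a: bytes, b: bytes) -> int:
    """Number of differing bits between the MD5 digests of a and b."""
    da = int.from_bytes(hashlib.md5(a).digest(), "big")
    db = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(da ^ db).count("1")

base = b"calvin grid"
# Flip a single bit of the first input byte.
flipped = bytes([base[0] ^ 0x01]) + base[1:]
print(md5_bit_difference(base, flipped), "of 128 output bits differ")
```

A single flipped input bit typically changes a large share of the 128 output bits, which is the opposite of what one would want from a stability-enforcing pooler.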

“Stable with respect to what” is thus one of the questions. I’m guessing a calvin-grid encodes its input in a way that, like MD5, tries to take all of the bits into account, yet, unlike MD5, is driven towards resistance to small variations.
Also, “Stable with respect to previous state” is one of the things a calvin-grid would bring to the table. For better or worse.

[Edit] And from my current understanding of an HTM spatial pooler + Boosting, it will tend to ‘stretch’ the output image to span the entire available codomain, given current experienced input. Quite different from what I expect of a grid.


#37

MD5 is designed to do something entirely different to what the brain is trying to do. I would argue that comparing size reduction and stability between the two is problematic. Unless of course I’m not getting your analogy.

Upon more careful consideration recently, I agree that the number of cells per pooling layer does not have to be reduced for increased stability. The variability of patterns needs to be reduced, but the number of cells does not.

Given an input space of 16 cells with a density of 0.25 (4 active cells), there is a variability of 1820 input patterns. If you want to increase stability you need to decrease the pattern variability in the output cells. This parameter depends on the data (like setting K for K-means). In this case if I wanted to classify vertical, horizontal, diagonal lines and dots then the number of pooling/output cells I’ll need will be something like 25 (25 features / pattern variations). This will be 156% of the number of input cells but 1.37% of the pattern variability. So in this case the number of cells has increased in the pooling layer while achieving a tremendous amount of stability/generalisation (noise reduction).
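The arithmetic behind these figures can be checked in a few lines. The variable names are only for illustration; the values 16, 4 and 25 are the ones used in the example above.

```python
import math

input_cells, active_cells, pooling_cells = 16, 4, 25

# Number of distinct ways to choose 4 active cells among 16
input_patterns = math.comb(input_cells, active_cells)
cell_ratio = pooling_cells / input_cells        # output cells vs input cells
pattern_ratio = pooling_cells / input_patterns  # output patterns vs input patterns

print(input_patterns)          # 1820
print(f"{cell_ratio:.0%}")     # 156%
print(f"{pattern_ratio:.2%}")  # 1.37%
```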


#38

seems sensible, right ?

HTM spatial pooling does not. Calvin Grids arguably do ?


#39

I really don’t know much about Calvin cells, so I shouldn’t have really joined this thread. But when it comes to classic unsupervised learning this is the function of pooling. In theory (as described in my previous post) you can have more columns in the spatial pooling layer than the number of cells in the input space but still gain a massive level of generalization. Of course if you pool the output of a pooler you’ll get greater generalization. From where I’m standing generalization and stabilization are synonymous.


#40

Brainstorming here… everyone’s welcome. I’m glad you did.

We certainly don’t have to discuss all this if you don’t want to, though. But I guess you could bring an interesting point of view.

We both seem to agree that there exist pooling strategies which allow “more columns in a layer than the number of cells in the input space but still gain a massive level of…” stability.

As for whether stability = generalization, we could indeed envision that any morphism from any finite discrete space to a smaller one (whether smaller by available size, or smaller by de-facto K) necessarily has non-injective points and thus, in a sense, provides ‘generalization’. But when discussing brains, we’d be better advised to consider only the subset of those that produce semantically-pertinent generalizations (that could be both the reason why I pointed to MD5 as a counter-example, and the reason why you think it was ill-advised in that discussion…)

HTM spatial pooler in particular, seems to behave nicely with semantics, but does not seem to enforce much stability, unless you reduce output size as you discussed with @sunguralikaan at the start of the thread.

It may be that ‘if it does not, it should’ as you pointed out… I don’t know. I’m not sure. The idea behind it all seems to achieve a representation of a given density, preserving semantics, while giving each output point ‘a chance’.

Even if it did increase stability, well… I haven’t experimented with all this myself, but… unsupervised training of a hierarchy of layers has reportedly proven difficult. I guess we need to find other ways to ‘solve’ that problem, or discover ways in which the brain does.


#41

I’ve been working with temporal pooling myself lately (non-HTM). I’m close to stacking it in a hierarchy. The pooling itself is pretty simple and very similar to spatial pooling in that cells compete to represent features.

I might take some time out of that to build a simple hierarchical spatial pooler (stripped down basics, no columns, boosting, etc.) to demonstrate this stuff.