I’ve implemented a Spatial Pooler (4096 minicolumns) → Temporal Memory pipeline and am using it to predict what it will see next while moving a 128x128 kernel over a 600x700 image. The result is that it fills my RAM (about 30 GB) with an absurd number of dendrites. That makes sense, because there is an absurd number of possible inputs even on just one image, but surely this isn’t right, because it wouldn’t scale to the millions of images we humans are able to understand.
My information flow is like this:
Pixels → ganglion cells → spatial pooler + movement SDR, fed to the Temporal Memory algorithm.
I’m hoping someone can tell me if this is expected behaviour or not. Maybe I need another spatial pooler on top of the first one? Or maybe there are algorithms outside of the Numenta papers/BAMI that solve this?
Encoding of the ganglion cells: a cell is 1 if it is within a top-% activation in its local area, and the result of that is fed to the SP.
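For readers following along, here is a minimal sketch of that kind of local top-% encoding. It assumes a grayscale image stored as a list of rows and non-overlapping square local areas; `block` and `top_fraction` are hypothetical parameter names, and the actual implementation in this thread may well use overlapping neighbourhoods or a different winner rule.

```python
def encode_ganglion(image, block=4, top_fraction=0.25):
    """Return a binary grid: 1 where a pixel is among the strongest
    `top_fraction` of activations inside its local block, else 0.

    `block` and `top_fraction` are illustrative, not from the post.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Collect (value, row, col) for every pixel in this local area.
            cells = [
                (image[r][c], r, c)
                for r in range(top, min(top + block, height))
                for c in range(left, min(left + block, width))
            ]
            # Keep at least one winner per local area.
            k = max(1, int(len(cells) * top_fraction))
            # Strongest activation first; ties broken by position.
            cells.sort(key=lambda t: (-t[0], t[1], t[2]))
            for _, r, c in cells[:k]:
                out[r][c] = 1
    return out
```

Because each local area contributes a fixed number of winners, the output has a fixed sparsity regardless of overall brightness, which is what the SP input needs.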
Help is appreciated!
I can’t speak directly to your problem, but what I can say is this:
In the TM algorithm when a mini-column fails its prediction, then it creates a new dendrite and populates it with synapses so that next time the predictions should work correctly. However if there is a bug in your implementation that prevents the dendrites from working correctly, then the TM algorithm will create an endless number of dendrites (and of course that won’t help if there really is a bug in the implementation).
Something to check is if any of the mini-columns are correctly predicting their activity.
If the anomaly is stuck at 100% always, then you probably have a bug in your program.
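As a rough way to run that check, the raw anomaly score is usually computed as the fraction of currently active mini-columns that were not predicted on the previous step. The sketch below assumes you can get both sets as column indices from your own implementation; the function and argument names are illustrative.

```python
def anomaly_score(active_columns, predicted_columns):
    """Fraction of active mini-columns that were NOT predicted.

    0.0 means every active column was predicted; 1.0 means none were.
    """
    active = set(active_columns)
    if not active:
        return 0.0  # nothing active, so nothing to be surprised about
    unpredicted = active - set(predicted_columns)
    return len(unpredicted) / len(active)
```

Logging this value per time step, together with the total segment count, should show whether the score ever drops below 1.0 on revisited positions or whether segments are being created without ever matching.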
I hope this helps
It does eventually start making correct predictions once it starts treading familiar ground, which requires both a familiar input AND a familiar motor SDR. But given how many unique SDRs you can create from just moving a kernel over a fairly small image, and the number of different directions AND speeds, there are tons of different possible next patterns, each of which gets its own dendrite segment per active minicolumn.
The destinations on the image are picked randomly and the kernel slowly lerps its way over there.
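To put a number on that, here is a small sketch that counts the distinct kernel positions and simulates the random-destination lerp to see how many unique (position, motion) pairs show up. It assumes pixel-granular positions and a fixed number of lerp steps per destination; `steps`, `n_targets` and `seed` are made-up parameters, not taken from the implementation described above.

```python
import random

def kernel_positions(image_w, image_h, kernel):
    """Number of distinct top-left pixel positions for a square kernel."""
    return (image_w - kernel + 1) * (image_h - kernel + 1)

def lerp_path(start, goal, steps):
    """Integer positions from start to goal in `steps` equal increments."""
    (x0, y0), (x1, y1) = start, goal
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]

def count_unique_transitions(image_w, image_h, kernel, n_targets, steps, seed=0):
    """Count distinct (position, motion) pairs seen while lerping
    between randomly chosen destinations."""
    rng = random.Random(seed)
    max_x, max_y = image_w - kernel, image_h - kernel
    pos = (0, 0)
    seen = set()
    for _ in range(n_targets):
        goal = (rng.randint(0, max_x), rng.randint(0, max_y))
        for nxt in lerp_path(pos, goal, steps):
            motion = (nxt[0] - pos[0], nxt[1] - pos[1])
            seen.add((pos, motion))
            pos = nxt
    return len(seen)
```

For a 128x128 kernel on a 600x700 image, `kernel_positions(600, 700, 128)` gives 473 × 573 = 271,029 positions before motion direction and speed are even considered, so with random destinations almost every transition is new to the TM.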
I thought about this too for a while. We’re nothing like a CNN that scans all possible tiny patches/kernels at every time step.
We use a fovea, which behaves like an animal: it focuses on a variable-size area, then moves on to a different area, not necessarily of the same size. The fovea’s “attention” motion is 3D: left-right, up-down, zoom in-out.
Now, how can this help?
First, the amount of data gathered is much smaller (or sparser, if you like); the gorilla-in-the-room experiment shows we mostly see only what we’re looking for. So it might help with limiting memory usage.
But how would it work in practice with HTM? Well, it would need to predict an optimal patch/motion/patch/motion/patch… sequence.
That would require some form of RL which rewards “good motions” and “good predictions”. And that kind of sucks.
For simpler tasks like MNIST where all samples have a decent, common position/size you might go with predefined motions - e.g. all streams trained on same sequence of 7-10 patches, like start from a general image, then zoom in center, move left, right, etc…
Here’s another weird idea which might work well on MNIST:
First, train it to predict an endless loop pattern - p1, m1, p2, m2, … p7, m7, p1 …
That’s much like toddlers staring indefinitely at the same ceiling or toy.
Second, for each class (0 to 9) use a different motion pattern in training (learning mode). All ‘2’-s have the same m1, m2, … but ‘3’-s will have a different loop.
And, hopefully, when it sees a new image, it will converge on the motion pattern that corresponds to the image’s class, because that loop (allegedly) will best predict the following step. For that (inference mode) you’ll need a predicted-motion decoder that actually moves the foveal patch into its predicted position (x, y, zoom).
An sklearn linear regression with 3 scalar outputs might make a decent motion decoder.
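A minimal sketch of such a decoder, under some assumptions: it uses NumPy’s least-squares solver instead of sklearn to keep dependencies small (the same kind of ordinary least-squares fit that `sklearn.linear_model.LinearRegression` performs), the input is the binary predicted SDR, and the three targets are x, y and zoom. The function names are hypothetical.

```python
import numpy as np

def fit_motion_decoder(sdrs, motions):
    """Least-squares fit from binary SDRs to (x, y, zoom) targets.

    sdrs:    (n_samples, n_bits) array of 0/1
    motions: (n_samples, 3) array of x, y, zoom
    Returns weights of shape (n_bits + 1, 3); the last row is the bias.
    """
    sdrs = np.asarray(sdrs, dtype=float)
    # Append a constant column so the fit includes an intercept.
    X = np.hstack([sdrs, np.ones((len(sdrs), 1))])
    weights, *_ = np.linalg.lstsq(X, np.asarray(motions, dtype=float), rcond=None)
    return weights

def decode_motion(weights, sdr):
    """Map one SDR to a predicted (x, y, zoom) vector."""
    x = np.append(np.asarray(sdr, dtype=float), 1.0)
    return x @ weights
```

In inference mode the decoded (x, y, zoom) would be used to place the next patch; whether a linear map is expressive enough for real motor SDRs is an open question here.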
I think this explains why the TM is constantly surprised and has to form so many dendrites before starting to predict anything.
The TM learns sequential patterns, but if the sequence of movements is generated randomly there are no patterns, right?
In this case I’d think the TM only ever gets anomaly scores < 1.0 by eventually learning to basically predict everything from everything (or moving toward that).
I may be missing something, but I think the TM doesn’t make sense if the sequential transitions are random.
Thanks for your responses. I’ve thought a lot about this situation, and I guess my question really is this: do we expect every possible spot a pattern could occupy on a retina to have at least 1 set of synapses to represent it? For example, imagine our inputs are pixels (on or off):
You could imagine moving this circle to dozens of different positions in this image space. I imagine that maybe L2/3 cells could have a dendrite segment connected to every single one of these possible combinations, and stay stable for a circle existing anywhere in the image. Is it a reasonable assumption that every single one of these combinations would have a segment attached to it, or do you guys think maybe some other algorithm is going on?
Have I reinvented [re-implemented] a wheel ™?
Of course, “destinations on the image” should not be random. Choose any consistent algo to choose points of interest.
Have you heard the “what / where pathway” hypothesis?
Under this hypothesis the visual cortex has two pathways for processing these two types of information. The “what” pathway is largely invariant to shifting, scaling, and rotating. The “where” pathway determines the location of the object.
This is more high level than the question I’m asking. Granted, I could just be asking the wrong questions, but I like to understand the building blocks that form these higher level processes.
@Bullbash sorry I don’t understand what you’re saying here.
On the topic of how the “where” pathway learns locations, and how the “what” pathway makes itself invariant to small amounts of shifting/scaling/rotating, I made this video explaining my favorite hypothesis. This video explains some of the building blocks, but I think you’ll find that it does not explain enough to make a real & useful system.
I think the brain learns the equivalences between e.g. an “O” in the middle of the image and (an “O” in the right + fovea motion towards it).
So every static image is in fact a navigable territory in which what is perceived at a given moment combines both raw visual data + coordinates within that territory.
And fovea movement over a static image provides an arbitrarily large data set useful in learning structure of images (and eventually space and objects within it) in general.
@cezar_t below explained it better: an image is navigable territory… I just implemented it.
I think one way to avoid the combinatorial explosion of storing every possible transition between motor command and sensor translation is to utilize something akin to landmark or canonical features.
Essentially, your encoder layer learns to represent some number of canonical features commonly encountered in the environment; either in isolation, or in a manner that permits some form of feature superposition (combination). Thereafter, as the meta-feature and/or object layers (higher layers in the hierarchy) learn the layout of extended objects, they can direct the sensor patch(es) to saccade from one feature to another. Even if the directed saccade is not perfect, the mismatch can be used by something akin to the frontal-eye field to fine-tune the saccade to align the incoming sensor data with the stored encoding of that feature.
Think of this as acting in a manner analogous to how our eyes automatically align to produce a converged pair of images on our foveas. The mismatch in alignment is used to drive and coordinate the final adjustments.
Another way to think about it is in relation to auto-encoders. Auto-encoders will attempt to make corrections to an input that is similar to, but not exactly like a stored pattern. If we include the position/orientation of the sensor as part of the representation that needs to be corrected, then it may be most efficient to allow the sensor to translate to the closest match before trying to learn or adapt to any part of the input that doesn’t quite match with the stored pattern.
And this brings me to my final point. You don’t want to have to relearn a pattern just because it’s a little bit out of alignment with your expectations (or previously stored pattern), but you do want to be able to detect and learn from any anomalies or deviations from the previously learned pattern. A short burst of muscle commands should bring the sensor image into the best possible alignment prior to subtracting out the expected pattern. The remaining residual signal, if any, can then be isolated and either learned (if the residual is persistent or highly correlated with other context signals) or ignored (if the residual is transient or indistinguishable from noise).
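A toy sketch of that align-then-subtract idea, purely for illustration: patterns are sets of active (row, col) bits, the “muscle command” is a small integer shift found by brute-force search, and the residual is whatever does not match the stored pattern after alignment. All names and the `max_shift` search window are assumptions, not part of any existing implementation.

```python
def shift(bits, dx, dy):
    """Translate a set of (row, col) active bits by dx columns, dy rows."""
    return {(r + dy, c + dx) for r, c in bits}

def best_alignment(observed, stored, max_shift=2):
    """Try small sensor shifts; return the (dx, dy) with the largest
    overlap with the stored pattern. (0, 0) wins ties."""
    best = (0, 0)
    best_overlap = len(observed & stored)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            overlap = len(shift(observed, dx, dy) & stored)
            if overlap > best_overlap:
                best, best_overlap = (dx, dy), overlap
    return best

def residual_after_alignment(observed, stored, max_shift=2):
    """Align first, then subtract the expected pattern.

    Returns (shift, unexpected_bits, missing_bits).
    """
    dx, dy = best_alignment(observed, stored, max_shift)
    aligned = shift(observed, dx, dy)
    return (dx, dy), aligned - stored, stored - aligned
```

Only the residual sets would then be candidates for learning, so a slightly misaligned but otherwise familiar pattern costs no new segments.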
That’s good, pretty close to how I think of it. It particularly emphasises how sensory and motor sides cooperate in building models of the outside world, if collections of patterns equate to models.
The eye sees a moving patch against the sky; brain interprets it as a moving object, with or without detail. Eye motors track the object, other motors act to intercept or evade, then to catch or defend.
At the lower levels sensory and motor have to be closely coupled, at higher levels cortex columns and SDRs plan more complex movements.
Is this just a thought bubble, or is there more to read elsewhere?
Thanks for this response! Very much the kind of answer I’m looking for. Does this idea you propose oppose the biology/neuroscience at all, or does it make sense that the cortex could be doing something like this to optimize number of learned patterns?
This issue of dealing with the combination of motor behavior and sensory inputs reminds me of Numenta’s locations paper - where there are both sensory and location signals. As I recall they have HTM regions dedicated to each which modulate each other, since location information is certainly relevant in predicting sensory inputs and vice-versa. Just thinking it may have relevance from the theory.
The ideas I described above are consistent with my current understanding of (and hypothesizing about) different cortical and sub-cortical structures, as well as various numerical and machine-learning algorithms. I’ve spent quite a bit of time discussing these ideas with @bitking, @falco, and @Socradeez, to make sure that I’m not completely off-base. So far, they’ve not raised any red-flags. Implementation is still on my to-do list.