This is a little early but it is available now so I wanted you to get it from us, not Twitter.
Yay! I’ve been looking forward to reading this since seeing Jeff Hawkins speak at Johns Hopkins APL last month. Thanks for the link!
Nice, some weekend reading
Lots of discussion on Reddit ML about this. I haven’t read any of it yet. I just can’t bring myself to go there.
I see Reddit as more of a howling mob. I think you are wise to hang back.
I got some stuff for twitter, but I can’t post here at work. I do get the tweet notifications.
One or two do have a good answer that I would like to offer.
The rest - meh.
I’m finding a much nicer conversation in general taking place here:
Some folks just want to insult for the sake of it. The anonymity of the internet probably helps that and I’ve arrived at the point where I just move on when I encounter it. Another common thing: “I don’t get what you’re saying, so you must be stupid.”
It’s a good example of how we humans project our understanding on to others, and an issue we’re likely to face when creating any AGI in the future as well.
I particularly like the way this theory explains the where and what pathways.
I have now read the paper and the supporting complementary article. I have also watched HTM School ep. 15. To try to solidify my understanding I will try to describe a setup and ask questions relating to category inclusion and category separation.
I have been playing around with the MNIST data set since it has become something of a gold standard. It has 10 categories with roughly 6000 instances per category in the training data, so there are roughly 6000 instances of, for example, the digit “3”. My experiment has been guided by the previous paper regarding object detection and I use a similar metric to see how well the network detects the categories.
In my experiment, I’ve chosen to, in some sense, simulate what a patch of the retina (minus all the extra computations that occur there) would report back when getting to see a sequence of four slightly overlapping parts of every image that is used for training. I do this since I’ve seen claims relating to saccades and I assume it maps well to sensing different parts of an object with a finger.
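To make the patch scheme concrete, here is a minimal sketch of cutting four slightly overlapping patches out of one 28x28 image. The patch size (16) and stride (12) are assumptions chosen only for illustration, as is the function name `overlapping_patches`; the actual sizes used in the experiment are not stated above.

```python
def overlapping_patches(image, patch=16, stride=12):
    """Return the grid of square patches cut from a square image.

    `image` is a list of rows. With a 28x28 image, patch=16 and stride=12
    give four patches that overlap their neighbours by 4 pixels.
    """
    size = len(image)
    offsets = range(0, size - patch + 1, stride)
    return [
        [row[left:left + patch] for row in image[top:top + patch]]
        for top in offsets
        for left in offsets
    ]


# Stand-in for one 28x28 MNIST digit: each pixel stores its own (row, col).
image = [[(r, c) for c in range(28)] for r in range(28)]
patches = overlapping_patches(image)
print(len(patches), len(patches[0]), len(patches[0][0]))
```

The patches come out in the order top-left, top-right, bottom-left, bottom-right, so the index of a patch can double as the location label for that exposure.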
I train a spatial pooler on the patches and its output is fed into a temporal pooler. The temporal pooler is also fed with apical data from two layers that I, informally, call “location” and “category” layers. If I were to map this to actual cortical layers I’d say that I have proximal input from the SP that is fed to layer 4. Layer 4, in turn, gets apical information from layer 6 (location) and layer 2/3 (category). Layer 2/3 is fed the state from layer 4 during training (as described in the earlier paper) and is trained on its own state to strengthen the category patterns and make the inference stronger when being tested. Locations and categories are randomly generated SDRs with 4 for location and 10 for category.
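As a rough illustration of the randomly generated location and category SDRs, something like the following would do. The SDR width (1024 bits) and the number of active bits (20) are assumptions for the sketch, not the values from my setup, and `random_sdr` is a hypothetical helper name.

```python
import random

SDR_SIZE = 1024      # assumed number of bits per SDR
ACTIVE_BITS = 20     # assumed number of active bits (about 2% sparsity)


def random_sdr(rng, size=SDR_SIZE, active=ACTIVE_BITS):
    """Return a random SDR as a frozenset of active bit indices."""
    return frozenset(rng.sample(range(size), active))


rng = random.Random(42)  # fixed seed so the codes are reproducible
location_sdrs = [random_sdr(rng) for _ in range(4)]
category_sdrs = [random_sdr(rng) for _ in range(10)]

# Random sparse codes rarely share many bits, which keeps the labels separable.
max_overlap = max(
    len(a & b)
    for i, a in enumerate(category_sdrs)
    for b in category_sdrs[i + 1:]
)
print("largest overlap between two category SDRs:", max_overlap)
```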
This leads into my first question:
Has the function of layer 2/3 been revised and thus removed from the theory about object detection? This paper makes no mention of it playing an active part, but I have a hard time intuiting how it would work without it.
My second question:
On what level are the patterns for object detection unique? As stated earlier, the cortical column will have been trained on 6000 instances of every category but the value of being able to recognize a specific instance is very small compared to being able to pick the right category. If the sensor (in my very simplified example above) is moved in the same way for every instance during training and testing, the displacements will be the same and add little to no value. The four locations will repeat but add much more information together with the stable pattern in the category layer that only changes when an instance from a different category is to be learnt.
During my own experiments with the setup described above, I use a column with 100 mini-columns with 32 neurons each. I use the raw MNIST data with no pre-formatting. If it is enough to uniquely pick the right category at least once during the four exposures during testing, I reach something like 45%. If I add that this must remain the same during any remaining exposures I end up around 35%.
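The two ways of scoring can be written down as follows. This is only a sketch of the scoring rule as described above: it assumes that each exposure yields a set of candidate categories, and the function names are made up for the example.

```python
def hit_at_least_once(candidates_per_exposure, label):
    """True if some exposure leaves exactly the correct category."""
    return any(c == {label} for c in candidates_per_exposure)


def hit_and_stays(candidates_per_exposure, label):
    """True if the correct category is picked uniquely and never lost again."""
    for i, c in enumerate(candidates_per_exposure):
        if c == {label}:
            remaining = candidates_per_exposure[i + 1:]
            return all(later == {label} for later in remaining)
    return False


# One hypothetical test image of a "3", seen in four exposures.
trial = [{3, 8, 9}, {3, 8}, {3}, {3, 5}]
print(hit_at_least_once(trial, 3))  # True: the third exposure is unambiguous
print(hit_and_stays(trial, 3))      # False: the fourth exposure adds a 5 again
```

The second rule can only ever pass a subset of the trials that pass the first, which is consistent with the drop from about 45% to about 35%.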
By temporal pooler, do you mean temporal memory, which recognizes places in sequences? The temporal pooler was from before the focus on objects and locations. It was meant to recognize whole sequences by pooling the sequence of inputs from the temporal memory.
It probably shouldn’t get confused about what the object is after it recognizes it. Does your layer 2/3 narrow down possible objects with each new input?
I’m not sure this is good with only one cortical column. You seem to be describing voting, where each cortical column tells the other columns what possibilities it sees so they can together narrow down the possible objects.
I don’t think so because it is mentioned briefly in the locations paper. I recall that is the output layer, which narrows down possible objects, so I don’t know the answer if L2/3 is for something else.
I don’t think the object detection system generalizes well right now, so it pretty much recognizes the instance of the object rather than the category. There will probably be ways to generalize better once more progress has been made, like similar representations for similar locations. I don’t really know if it does anything like that yet though. The current goal doesn’t seem to be accurate categorization of different instances of the same thing, let alone novel instances.
Behavior is fairly random. I don’t think the displacements are just movements of the sensor. They might be the movements between each feature, or something closer to the object, like the path integrated difference in the locations of each feature pair.
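A toy version of that last reading, where a displacement is the difference between the locations of two features, might look like this. It assumes a single grid cell module modelled as a square tile with integer coordinates and an arbitrary scale of 10; none of that comes from the paper, it is only meant to show why such a displacement does not depend on where the object happens to be.

```python
SCALE = 10  # assumed side length of one grid cell module, in arbitrary units


def wrap(location, scale=SCALE):
    """Map a 2D position onto the module's repeating tile."""
    x, y = location
    return (x % scale, y % scale)


def displacement(loc_a, loc_b, scale=SCALE):
    """Displacement from feature a to feature b, modulo the module scale."""
    return ((loc_b[0] - loc_a[0]) % scale, (loc_b[1] - loc_a[1]) % scale)


# Two features of one object, sensed with the object at two different places.
rim, handle = (2, 7), (5, 3)
shift = (14, 6)
rim_moved = wrap((rim[0] + shift[0], rim[1] + shift[1]))
handle_moved = wrap((handle[0] + shift[0], handle[1] + shift[1]))

print(displacement(rim, handle))              # (3, 6)
print(displacement(rim_moved, handle_moved))  # (3, 6), same as before the shift
```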
The locations won’t repeat if the instances of the number “3” are different. They are locations of the sensor, but of the sensor when it is touching the object. Otherwise, there’s no feature so the location of the sensor has no influence. This is probably easier to think about in terms of touch.
“3” might be too complicated to recognize in big chunks. Try reducing the size of the sensory patch, so it only sees a simple curvy line at each moment. 3 isn’t too complicated for us to recognize with a single sensory patch, I think, but HTM is incomplete and we have huge visual cortices with a bazillion specializations. Our vision is really good. Instead of people, I would think about it in terms of a rat, which has blurry vision. With blurry vision, it can’t just look at the whole object and see what it is. It needs to move around to a bunch of different locations to figure out what it is. I imagine that’s true even if it can see the whole object at once because it provides more information about the shape.
Thank you for taking the time to digest my wall of text. I had a hard time deciding on how much information was needed for the intended context.
I’m referring to the algorithm as described in HTM School with some of my own additions relating to inhibition and dendrite activity. I guess I’m a bit stuck in the nomenclature as it looked two or more years ago.
It would of course be best if the category stays stable but I’m not surprised that data of this sort can give unstable results. And yes, the training of the 2/3 layer is intended to give the same result as in the previous paper on object detection. As in, the first exposure activates as many possible object patterns as possible and for every new exposure, the state in layer 4 is biased by the state in layer 2/3. Layer 2/3 then uses this as a bias to further narrow down its own possibilities since the neurons in that layer have been trained to be activated by specific object patterns in layer 2/3.
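Stripped of all the neurons, the narrowing I am after looks like the sketch below. It uses plain Python sets instead of SDRs and a made-up table of (location, feature) pairs per digit, so it is a stand-in for the idea rather than the layer 4 and layer 2/3 mechanics.

```python
def narrow_down(exposures, known_objects):
    """Intersect the objects consistent with each (location, feature) pair."""
    candidates = set(known_objects)
    history = []
    for pair in exposures:
        consistent = {name for name, pairs in known_objects.items() if pair in pairs}
        candidates &= consistent
        history.append(set(candidates))
    return history


# Hypothetical learned objects: digit -> set of (location index, feature name).
known_objects = {
    "3": {(0, "curve"), (1, "curve"), (2, "bar"), (3, "curve")},
    "8": {(0, "curve"), (1, "curve"), (2, "loop"), (3, "curve")},
    "1": {(0, "bar"), (1, "blank"), (2, "bar"), (3, "blank")},
}

history = narrow_down([(0, "curve"), (2, "bar")], known_objects)
print(history)  # first exposure leaves "3" and "8", the second only "3"
```

With noisy input the candidate set can also become empty, which is one way the unstable results mentioned above could show up.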
My implementation follows my understanding of the previous paper on object detection. Maybe I’ve misunderstood how layer 2/3 is supposed to strengthen the internal connections between neurons in the same pattern. Either way, this training of neurons in layer 2/3 seems to help with the narrowing down of possible patterns with few exposures and my column performs worse if I remove this functionality.
Perhaps this is a result of me using a column with an order of magnitude fewer mini-columns than are typically used in Numenta research and in NuPIC. My reasoning when it comes to the number of mini-columns is that if I get 100 mini-columns to perform well enough, having an order of magnitude more of them should result in dramatic improvements. I haven’t decided on what is enough but my gut feeling is that if I can reach a stable 50% with 100 mini-columns, stepping it up to the common 2048 mini-columns would make sense.
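One way to put a number on the difference is to count how many distinct column patterns each size can represent. The active counts below (2 of 100 and 40 of 2048, roughly 2% sparsity) are assumptions for the sake of the arithmetic, and representational capacity is of course not the same thing as classification accuracy.

```python
import math


def sdr_capacity(n_columns, n_active):
    """Number of distinct patterns with exactly n_active of n_columns on."""
    return math.comb(n_columns, n_active)


small = sdr_capacity(100, 2)     # assumed: 2 active mini-columns out of 100
large = sdr_capacity(2048, 40)   # assumed: 40 active mini-columns out of 2048

print("100 mini-columns:", small, "patterns")
print("2048 mini-columns: about 10^%d patterns" % math.floor(math.log10(large)))
```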
Further, getting a small network performing well enough to solve simple problems offers more opportunities when it comes to running the algorithm on very limited hardware.
Ah, ok. Then I’ll assume I’ve overlooked something. I’ll spend some more time with the paper.
This sounds a bit unlikely. If we look at the popular coffee mug example, many different coffee mugs will, taking subsampling and SDR attributes into account, appear very similar. For example, you cannot sense colour with your fingertip, so the same model of mug in a different colour will appear identical even though the two are in one sense very different.
So, feeling a lip on the edge, a cylindrical form with a bottom and open top together with an ear starting somewhere close to the edge and terminating somewhere close to the bottom should make category detection very possible. It would, of course, be possible to get into more details with a finer sensor but I claim that moving from the category “mugs” to “mugs with texture on the outside” is a very small step.
I’d say that my results show that the ability to detect categories, even if not intended, seems to work on at least some level with a combination of location, sequence of sensory inputs and category biasing.
But, just to be clear, a network that has trained on mugs will of course not do well if you show it a cat or something from some other very different domain.
To me, this sounds like a description of what I’ve done. The sensory patch is small and projects to a small number of mini-columns. Sub-sampling removes even more of the information that is needed to properly separate a “1” from an “8” or a “4” from a “9”. Thus I let the sensor be exposed to overlapping patches (that are smaller than the training image) that offer separation of location for similar features and topological information that connects the features.
Going a bit further down the grid cell encoding rabbit hole: If the highest level of representation is at the autobiographical memory level and cognitively there is some sort of representation and relationships between objects there should be some sorts of basic operations.
This is exactly one of the issues I have been thinking about for many years. (my oldest notes on this run back over a decade)
One of the possibilities that keeps bubbling up as a strong candidate for internal representation is the tuple. It uses our internal spatial representation system and arranges our objects as (object)(relation)(object).
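In code, the (object)(relation)(object) form is just a triple, and one basic operation on it is a lookup by relation. The names below (`Triple`, `related`, and the example objects) are invented for the sketch and say nothing about how a cortex would store this.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


memory = [
    Triple("cup", "on", "table"),
    Triple("table", "in", "kitchen"),
    Triple("spoon", "in", "cup"),
]


def related(memory, relation, obj):
    """One basic operation: which subjects stand in `relation` to `obj`?"""
    return [t.subject for t in memory if t.relation == relation and t.obj == obj]


print(related(memory, "in", "cup"))  # ['spoon']
```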
BTW: It’s nice that the rest of the world is starting to converge on the internal spatial representation that seemed most likely to me over the years!
With this long, self-congratulatory introduction - tonight I bumped into a very interesting paper that explores some of these same concepts:
If you are the sort of person that uses a certain SH web page to view your papers you will be needing to use this DOI address:
Just posting an idea. This is highly likely not biologically possible.
We could make a grid cell encoder out of a capsule network. The routing mechanism works like a displacement layer. We could track where the values end up after routing and use that as the displacement.
Please read my post on Hex-grid cells. This is much simpler and more biologically plausible than using capsules.
This theory is so illuminating and beautiful, I really enjoy thinking about it and speculating further ideas. Many thanks to Numenta for sharing all these in an open and accessible way.
I have a question about “what” and “where” pathways. Let’s say I instruct another person/agent to manipulate an object and I already know the agent’s body space and behaviours well. So the task is to specify the movement in agent’s body space to get the desired location/state in the object’s space. Could it be possible that during this task, “where” region performs location computations on agent’s body space? If so, what could be the extent of spaces that “where” region compute locations on?
I will connect, this convergence is what will lead to the singularity…“We shall Ionize!”
The paper makes several specific and novel proposals regarding the neocortex, which means there are many ways the theory can be tested (either to falsify or to support it). In the posters we presented at the Society for Neuroscience conference this week we listed several testable hypotheses. Here is the poster about the new “frameworks” paper. It lists several testable predictions on the right side.
In practice it can be difficult to actually test these predictions. What is necessary, and what we do, is to sit down with experimentalists and carefully understand what their lab is capable of measuring and how that intersects the theory. This can take hours or even days just to design a potential experiment. For example, it isn’t known how capable rats are at distinguishing different objects via whisking (the active sense of moving whiskers). We predict that whisking should work on the same principles as vision and touch in humans, but we can’t ask the rat what it knows. We can’t even be certain that the whisking in the rat hasn’t evolved alternate strategies for operation. There have been recent advances in fMRI related to detecting grid cells in human neocortex. We list some of these in the same poster. fMRI might turn out to be a more fruitful experimental paradigm for testing the theory, but is limited in spatial and temporal resolution.
Bottom line is the theory makes many surprising predictions that should be testable, but it may take time to figure out how to actually test them.