I have now read the paper and the supporting complementary article, and I have also watched HTM School ep. 15. To solidify my understanding, I will describe a setup and ask questions relating to category inclusion and category separation.
I have been playing around with the MNIST data set since it has become something of a gold standard. It has 10 categories with 6,000 instances per category in the training data, so there are 6,000 instances of, for example, the category for the digit “3”. My experiment has been guided by the previous paper on object detection, and I use a similar metric to see how well the network detects the categories.
In my experiment, I’ve chosen to simulate, in some sense, what a patch of the retina (minus all the extra computations that occur there) would report back when seeing a sequence of four slightly overlapping parts of each training image. I do this because I’ve seen claims relating to saccades, and I assume it maps well to sensing different parts of an object with a finger.
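To make the patch setup concrete, here is a minimal sketch of how one might split a 28×28 MNIST image into four slightly overlapping parts. The exact patch geometry (quadrants with a few pixels of overlap) is my own assumption for illustration; the post does not fix the split.

```python
import numpy as np

def extract_patches(image, overlap=4):
    """Split a 28x28 MNIST image into four overlapping quadrant patches.

    The quadrant-plus-overlap geometry is an illustrative assumption,
    not the exact split used in the experiment.
    """
    h, w = image.shape              # 28 x 28 for raw MNIST
    half_h, half_w = h // 2, w // 2
    return [
        image[:half_h + overlap, :half_w + overlap],   # top-left
        image[:half_h + overlap, half_w - overlap:],   # top-right
        image[half_h - overlap:, :half_w + overlap],   # bottom-left
        image[half_h - overlap:, half_w - overlap:],   # bottom-right
    ]
```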
I train a spatial pooler on the patches, and its output is fed into a temporal pooler. The temporal pooler is also fed apical data from two layers that I informally call the “location” and “category” layers. If I were to map this to actual cortical layers, I’d say the proximal input from the SP feeds layer 4; layer 4, in turn, gets apical information from layer 6 (location) and layer 2/3 (category). Layer 2/3 is fed the state from layer 4 during training (as described in the earlier paper) and is trained on its own state to strengthen the category patterns and make inference more robust during testing. Locations and categories are randomly generated SDRs: four for the locations and ten for the categories.
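For reference, this is how I generate the random location and category SDRs, sketched in plain numpy. The width and sparsity (2048 bits, ~2% active) are typical HTM values and only an assumption here; the actual apical wiring into the temporal pooler is omitted.

```python
import numpy as np

def random_sdr(n=2048, active=40, rng=None):
    """Return a random binary SDR of width n with `active` on-bits.

    Width and sparsity are illustrative assumptions, not the values
    required by the setup.
    """
    rng = rng or np.random.default_rng()
    sdr = np.zeros(n, dtype=np.int8)
    sdr[rng.choice(n, size=active, replace=False)] = 1
    return sdr

rng = np.random.default_rng(42)
location_sdrs = [random_sdr(rng=rng) for _ in range(4)]    # one per patch position
category_sdrs = [random_sdr(rng=rng) for _ in range(10)]   # one per MNIST class
```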
This leads into my first question:
Has the function of layer 2/3 been revised and thus removed from the theory of object detection? This paper makes no mention of it playing an active part, but I have a hard time intuiting how the mechanism would work without it.
My second question:
At what level are the patterns for object detection unique? As stated earlier, the cortical column will have been trained on 6,000 instances of every category, but the value of recognizing a specific instance is very small compared to picking the right category. If the sensor (in my very simplified example above) is moved in the same way for every instance during training and testing, the displacements will be identical and add little to no value. The four locations will repeat, but together with the stable pattern in the category layer (which only changes when an instance from a different category is being learnt) they add much more information.
In my own experiments with the setup described above, I use a column with 100 mini-columns of 32 neurons each, on the raw MNIST data with no pre-formatting. If it is sufficient to uniquely pick the right category at least once during the four test exposures, I reach something like 45%. If I additionally require that the prediction remain the same for any remaining exposures, I end up around 35%.
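To show how I compute those two numbers, here is a sketch of the per-image scoring. The exact implementation is my own; `predictions` holds the uniquely picked category for each of the four exposures (or None when no unique category emerged).

```python
def score_image(predictions, true_label):
    """Score one test image given its per-exposure predictions.

    Returns (loose, strict):
    loose  - the true category was uniquely picked in at least one exposure,
    strict - additionally, once picked it stays correct for all remaining exposures.
    """
    hits = [i for i, p in enumerate(predictions) if p == true_label]
    loose = len(hits) > 0
    strict = loose and all(p == true_label for p in predictions[hits[0]:])
    return loose, strict

# Example: correct from exposure 2, but drifts to another category on exposure 4.
print(score_image([None, 3, 3, 7], true_label=3))   # (True, False)
```

Averaging `loose` over the test set gives the ~45% figure, and averaging `strict` gives the ~35% figure.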