Assume we use a single fingertip to touch one object multiple times. For simplicity, take a 2D case: touching a concave polygon with a foamy circle. Isn't it very difficult to generate a good location signal based only on skin mapping, movement around a complex polygon, and accumulation on the output layer? Instead, the location signal may be retrieved from memory of previously touched polygons. Each touch is first transformed through such memory, and then the final location signal is accumulated. In my opinion, the location signal itself must somehow be accumulated too, not only the final output over time. Also, a weak signal must be able to drastically change the location signal ("rotate" it, "scale" it, re-use previously wrongly classified features).
Here is an example explaining this problem: we have two concave polygons, (A) and (B), with very different shapes. But there is a small area on polygon (B), let's say 5%, which is an exact copy of part of polygon (A)'s shape. Assume we touch polygon (B) 1000 times only in this 5% area; the output layer will accumulate some outputs. Because the touched area is an exact copy, it is not possible to distinguish between (A) and (B). Then let's do one final single touch in the remaining 95% area. Because (A) and (B) have very different shapes, the output layer must somehow switch to one object or the other, but it seems almost impossible, because we have only one single "weak" touch against the previous 1000 touches + locations. How is this problem solved in the current model? One explanation may be that there are a lot of columns, so this final touch still gets through, but I still have the feeling it would not work so well, because the previous 1000 signals had this advantage too.
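To make my worry concrete, here is a toy sketch (just my own framing of the concern, not the actual output-layer mechanism) of a pure vote-accumulation scheme, where the decisive final touch wins by only a single vote:

```python
from collections import Counter

votes = Counter()

# 1000 touches inside the 5% of polygon B that exactly copies polygon A:
# every such touch is consistent with both objects, so both accumulate a vote.
for _ in range(1000):
    votes["A"] += 1
    votes["B"] += 1

# One final touch in the remaining 95% of B, consistent only with B.
votes["B"] += 1

print(votes.most_common())  # [('B', 1001), ('A', 1000)] - a one-vote margin
```

A one-vote margin after 1001 touches is exactly what makes me doubt that pure accumulation is enough.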
The thousands of touches in the past resolve to an object model of accumulated perceptions at locations. Your library of objects is constructed of these perceptions at locations, put together in the same "map", where each location can store a perception and each location is topologically associated with the rest of the map.
If you reach into a dark box and touch something with one finger, you have very little information, but you have a perception (meaning an instantaneous collection of spatial sensory input) associated with a sensory patch. You can immediately use this to filter your object library, because the sensory perception acts as a partial attractor, an incomplete neural code that could resolve to multiple attractor objects in memory. If your finger touched something soft, the resulting union of attractors only contains objects with soft perceptions, no matter their allocentric location on the object, and no matter the egocentric location of the object with respect to you.
As you either move your sensor or add more sensors, the attractors filter more and more. Because the grid cell map can path integrate, perceptions are associated with their locations, so you can rule out large numbers of objects as you move your sensors across them, feeling out their shapes.
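Here is a minimal sketch of that filtering idea (toy objects, locations, and features of my own invention, not Numenta's code): each object is stored as a map from location to feature, movements are path-integrated into a location relative to the first contact, and every new (location, feature) pair prunes the candidate set.

```python
# Each object is a "map" from a location on the object to the feature sensed there.
# The objects, locations, and features below are invented for illustration.
OBJECT_LIBRARY = {
    "sponge": {(0, 0): "soft", (1, 0): "soft",   (0, 1): "soft"},
    "mug":    {(0, 0): "hard", (1, 0): "hard",   (0, 1): "curved-rim"},
    "ball":   {(0, 0): "soft", (1, 0): "curved", (0, 1): "curved"},
}

def matches(obj_map, touches):
    """Can the path of touches be anchored somewhere on obj_map so that every
    sensed feature matches what the object stores at the path-integrated location?"""
    for anchor_x, anchor_y in obj_map:      # try every possible first-contact location
        x, y = anchor_x, anchor_y
        consistent = True
        for (dx, dy), feature in touches:
            x, y = x + dx, y + dy           # path integration of the movement
            if obj_map.get((x, y)) != feature:
                consistent = False
                break
        if consistent:
            return True
    return False

def recognize(touches):
    """Prune the candidate set after every touch; the first movement is (0, 0)."""
    candidates = set(OBJECT_LIBRARY)
    for step in range(1, len(touches) + 1):
        candidates = {name for name in candidates
                      if matches(OBJECT_LIBRARY[name], touches[:step])}
        print(f"after touch {step}: {sorted(candidates)}")
    return candidates

# Touch something soft, then move one step and feel a curved surface.
recognize([((0, 0), "soft"), ((1, 0), "curved")])
# after touch 1: ['ball', 'sponge']
# after touch 2: ['ball']
```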
Yes, but I think this model can't explain how wrongly classified objects could be quickly recognized after a "final single touch" of the sensor. I do not think the library of objects is perfectly organized for all possible cases. If this library is built similarly to a self-organizing map, there will still be imperfect areas where similar objects are placed far apart. Maybe in some cases the first thousand sensor inputs wrongly filter out some object, and then, surprisingly, the "final sensor touch" results in instant recognition. With the current model this would require drastic changes to the integrated/accumulated data, and the output layer must somehow revive outputs that were filtered out before.
I can't understand how the 1000-versus-1-final-touch problem would be solved using the current model. For example, take a rectangle with rounded edges and a circle. We may move our sensors 1000 times over a rounded edge, with the output layer saying it is a circle; then one touch on a straight side, and the output must flip to rectangle. It seems trivial when using only simple senses and a few objects, but when adding all the "multi-dimensional" senses, like texture, thermal properties, mechanical properties, and sounds, plus a large library of objects, there may not be enough neurons to build a good enough library and recognition for all cases.
I am probably wrong, because current software neural networks do pretty good work, so there is a proof of concept. But because the location signal is so important, and cortical columns have only a few neurons, I am thinking there may be something else here, like some tricky mathematical "rotation" of output layer data and reuse of it with new "weak" sensory data, or a separate accumulation of sensor data.
Can you rephrase this as a real-world problem? If points are picked randomly, the chance of that is 0.05 ** 1000 (in Python notation), which is… much smaller than… anything in the universe.
Even with 50% similar regions and only 100 "touches" of each polygon, the chance of always hitting the similar regions is 0.5 ** 200 - negligible, with 60 zeroes after the decimal point.
At 50% similarity and 10 touches on each, the chance of getting it wrong is about one in a million.
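For the record, the arithmetic is easy to check directly (assuming independent, uniformly random touch locations):

```python
import math

# Chance of 1000 random touches all landing in a shared region covering 5% of the object:
print(0.05 ** 1000)             # underflows to 0.0 in double precision
print(1000 * math.log10(0.05))  # ~-1301, i.e. the true value is about 1e-1301

# 50% shared region, 100 touches on each of the two polygons (200 total):
print(0.5 ** 200)               # ~6.2e-61: 60 zeroes after the decimal point

# 50% shared region, 10 touches on each (20 total):
print(0.5 ** 20)                # ~9.5e-07: roughly one in a million
```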
Another small clarification: in the current model, how many columns are required for a single object? In other words, in Numenta's "cup" example, how many columns work together to recognize the cup?
One column can represent a cup. This is a "cortical column" or "hypercolumn", not a minicolumn. If there are more columns, lateral voting lets them resolve objects faster.
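A minimal sketch of what lateral voting buys you (toy candidate sets of my own, not the actual network): each column keeps its own set of objects consistent with what its sensor has felt, and the columns converge by agreeing on the intersection.

```python
# Each cortical column, sensing a different patch of the object, maintains its
# own set of objects consistent with its input so far. Names are invented.
column_candidates = [
    {"cup", "bowl", "vase"},    # column under the index finger
    {"cup", "vase"},            # column under the middle finger
    {"cup", "bowl"},            # column under the thumb
]

# Lateral voting: the object all columns can agree on.
consensus = set.intersection(*column_candidates)
print(consensus)   # {'cup'} - resolved in one sensation instead of several movements
```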
Let me try to answer the question that I think is at the heart of what the OP was asking: how is it that we can so quickly disambiguate two very similar objects in relatively few touches? I am slightly restating the question because I don't think we would actually touch an object 1000 times in a way that provides no additional evidence for disambiguating between A and B. If the first few touches reveal features that are common to both A and B, then our next few touches would most likely be directed towards features that are unique to A or B.
Regardless of whether or not the motor commands are being directed to disambiguate objects as quickly as possible, the result is that the spatial pooler and temporal memory will produce SDRs that are consistent with both A and B until a unique feature is detected. The temporal memory will be making predictions to expect features from either A or B.
Now, if you were stopped prematurely or otherwise prevented from touching a unique feature, and asked which object you think it was, then your answer would probably depend on your prior experiences and biases with these objects. If you have encountered A more often than B, or some feature of B makes it less likely to be observed than A, then you would probably answer A. This response has less to do with the directly sensed objects than with your own mental associations of each object with the specific contexts in which they may be encountered.
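A minimal sketch of that behaviour (toy features and priors of my own, not the actual spatial pooler / temporal memory): the candidate set never shrinks below {A, B} while touches land on shared features, one unique feature resolves it instantly, and if we stop early a prior breaks the tie.

```python
# Features each object exhibits; "notch" stands in for the feature unique to B.
# Object names, features, and prior values are invented for illustration.
FEATURES = {"A": {"rounded", "smooth"},
            "B": {"rounded", "smooth", "notch"}}
PRIOR = {"A": 0.7, "B": 0.3}          # A has been encountered more often in the past

def recognize(touch_sequence):
    candidates = set(FEATURES)
    for feature in touch_sequence:
        candidates = {obj for obj in candidates if feature in FEATURES[obj]}
        if len(candidates) == 1:
            return candidates.pop()    # a unique feature resolves it immediately
    # Stopped before a unique feature was touched: fall back on prior bias.
    return max(candidates, key=PRIOR.get)

print(recognize(["rounded"] * 1000))              # 'A' (still ambiguous; the prior decides)
print(recognize(["rounded"] * 1000 + ["notch"]))  # 'B' (one decisive touch settles it)
```

The 1000 shared-feature touches never committed the model to A; they only failed to rule B out, which is why a single unique touch can settle the matter without having to "outvote" anything.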