Do learned reference frames in GridCellNet lack generalization capability?

Hi guys, I’m a master’s student at Bielefeld University and this is my first post. Happy to be here :slight_smile:

I am currently trying to develop a deep understanding of the very interesting paper “Grid Cell Path Integration For Movement-Based Visual Object Recognition”. I think it’s one of the most recent implementations of object recognition with HTM that incorporates location signals. But the implementation seems to lack generalization capability: every training sample is treated as a completely new object, so the resulting location signals are very different even for objects that are very similar, for example instances of the same object class.

I made a slide which illustrates my current understanding of the training procedure:

Training procedure in words:

  1. First Location of a new Object:
    1.1. the sparse location signal is randomly initialized, establishing a new reference frame with a low probability of overlap with previously seen samples
    1.2. because the reference frame is new, not enough synapses are active to cause a predictive activation in the sensory layer (some synapses might be active because of a small overlap with previous location signals, but it should not be enough to cross the predictive threshold). Therefore one winner neuron is chosen randomly in the columns activated by the new sensory input
    1.3. “During learning, the location layer does not update in response to sensory input;” (quoted from the paper introducing location-signal-based learning), so in contrast to inference, the location signal is not updated based on “sensory feedback” → therefore the reference frame is not updated
    1.4. reciprocal synapses are formed between the random location signal and the random winner

  2. Subsequent Locations:
    2.1. Update the location signal with the movement vector
    2.2. As long as this position has not been seen in this reference frame before, there will again not be enough active synapses to cause a predictive activation in the sensory layer, so again a winner neuron is chosen randomly in the activated columns
    2.3. no feedback to location layer from sensory layer during learning
    2.4. grow reciprocal synapses between (movement updated but still random) location signal and random winner neurons
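
To make the steps above easier to check, here is a minimal toy sketch of the procedure as I understand it. It is not the paper's code: all sizes, the integer-shift stand-in for path integration, the choice of one random winner per active column, and the names `random_location`, `move`, `predicted_cells` and `learn_object` are my own assumptions for illustration. Only the location → sensory direction of the reciprocal synapses is stored, to keep it short.

```python
import random

# All sizes are hypothetical toy values, not taken from the paper.
N_LOC_CELLS = 1000         # cells in the location layer
LOC_ACTIVE = 20            # active cells in one sparse location code
CELLS_PER_COLUMN = 16      # sensory cells per minicolumn
PREDICTIVE_THRESHOLD = 10  # active synapses needed to predict a sensory cell


def random_location(rng):
    """Step 1.1: a fresh random sparse location code (a new reference frame)."""
    return frozenset(rng.sample(range(N_LOC_CELLS), LOC_ACTIVE))


def move(location, movement):
    """Step 2.1: toy stand-in for path integration (shift every active cell)."""
    return frozenset((cell + movement) % N_LOC_CELLS for cell in location)


def predicted_cells(segments, location, active_columns):
    """Sensory cells in active columns with a segment matching the location."""
    return {
        (col, idx)
        for (col, idx), segs in segments.items()
        if col in active_columns
        and any(len(seg & location) >= PREDICTIVE_THRESHOLD for seg in segs)
    }


def learn_object(sensations, segments, rng):
    """Learn one training sample given as a list of (movement, active_columns).

    `segments` maps a sensory cell (column, index) to the location codes it
    has grown synapses to; it is shared across all learned samples.
    """
    location = random_location(rng)                      # 1.1
    learned = []
    for movement, active_columns in sensations:
        location = move(location, movement)              # 2.1 (0 for the first)
        predicted = predicted_cells(segments, location, active_columns)
        winners = set()
        for col in active_columns:
            in_col = [cell for cell in predicted if cell[0] == col]
            if in_col:
                winners.update(in_col)
            else:                                        # 1.2 / 2.2
                winners.add((col, rng.randrange(CELLS_PER_COLUMN)))
        # 1.3 / 2.3: the sensory layer never changes `location` while learning.
        for cell in winners:                             # 1.4 / 2.4
            segments.setdefault(cell, []).append(location)
        learned.append((location, frozenset(winners)))
    return learned


if __name__ == "__main__":
    rng = random.Random(0)
    segments = {}
    # Two training samples with identical features and movements.
    sample = [(0, {1, 5, 9}), (3, {2, 5, 7}), (4, {1, 8, 9})]
    first = learn_object(sample, segments, rng)
    second = learn_object(sample, segments, rng)
    for (loc_a, win_a), (loc_b, win_b) in zip(first, second):
        print("location overlap:", len(loc_a & loc_b),
              "| shared winner cells:", len(win_a & win_b))
```

In this sketch the two samples are identical, yet nothing ties their location codes or winner cells together, which is exactly the behaviour I am asking about.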

This leads to a new and independent reference frame, with its own set of synapses, for every training sample, even for samples with very similar visual features. Therefore multiple instances of the same object class would result in completely independent representations. → no generalization happens

If this is correct, it implies that after seeing many instances of the same object class, the similar visual features would invoke many or all of the previous reference frames. These would not be disambiguated during subsequent movements, since they all predict similar visual features. This would lead to a saturated representation instead of a sparse one.
I think such properties are not desirable. Do you agree? Or is there an error in my understanding?
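
To put a rough number on the saturation concern, here is a small back-of-the-envelope sketch. It assumes (my assumption, not something from the paper) that one shared feature recalls the location codes of all previously learned instances at once and that each code is an independent random draw; the sizes and the function name `invoked_location_density` are hypothetical. Under these assumptions the expected fraction of active location cells after m instances is 1 - (1 - 20/1000)^m.

```python
import random

# Hypothetical toy sizes, not taken from the paper.
N_LOC_CELLS = 1000  # cells in the location layer
LOC_ACTIVE = 20     # active cells per independently learned location code


def invoked_location_density(n_instances, rng):
    """Fraction of location cells active when a single shared feature recalls
    the union of n_instances independent random location codes."""
    union = set()
    for _ in range(n_instances):
        union.update(rng.sample(range(N_LOC_CELLS), LOC_ACTIVE))
    return len(union) / N_LOC_CELLS


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 10, 50, 200):
        expected = 1 - (1 - LOC_ACTIVE / N_LOC_CELLS) ** n
        print(n, invoked_location_density(n, rng), round(expected, 3))
```

If the recalled frames really are never disambiguated, the union approaches full activation as the number of instances grows, which is what I mean by a saturated representation.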

Looking forward to your answers :slight_smile: