Yes, you are on the right track, but the problem is deeper than scale or translation invariance.
Our brains learn the structure of thousands of objects in the world. My coffee cup is one such object. When I touch my coffee cup, my cortex is constantly predicting what I will feel on my fingers as I move them and grasp the cup in different locations. It is easy to experience this. Close your eyes while touching a familiar object, then imagine moving one finger, and you can anticipate what that finger will feel after the movement is made. This tells you that you have a model of the object that includes what features exist at different locations on the object. When you move your finger, the brain knows what feature will be at the new location.
You can touch an object with different fingers, different hands, the back of your hands, your nose, etc., and still make predictions about what you will feel. This tells us that the model of the object is not specific to any particular part of your sensory space, nor is it tied to any specific location or orientation of the object relative to the body.
So the problem is: we learn the structure of objects in one location using one set of sensors, but we can apply that knowledge to different sensors and different locations/orientations. This basic problem has been known to roboticists for many years. It is fairly easy to define an object's features in a Cartesian coordinate frame that is relative to the object, and to define a sensory organ's location in another Cartesian coordinate frame that is relative to the body. If you know the location and orientation of the object in body coordinates, you can do the math to find the location of the sensory organ in the object's coordinate frame.
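To make the robotics version concrete, here is a minimal sketch of that math in Python, using 2-D homogeneous transforms. It is only an illustration of the classical technique, not a claim about how the cortex does it; all the names and numbers are made up.

```python
import numpy as np

def pose_matrix(x, y, theta):
    """Homogeneous transform for a 2-D pose (position x, y; rotation theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0,  0, 1]])

# Pose of the object in body coordinates (assumed known, e.g. from vision).
T_body_from_object = pose_matrix(0.30, 0.10, np.pi / 4)

# Location of a fingertip in body coordinates (known from proprioception),
# written as a homogeneous point.
finger_in_body = np.array([0.32, 0.15, 1.0])

# Inverting the object's pose maps body coordinates into object coordinates;
# this is "the math" referred to above.
T_object_from_body = np.linalg.inv(T_body_from_object)
finger_in_object = T_object_from_body @ finger_in_body

print(finger_in_object[:2])  # where the fingertip is on the object
```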
The cortex has to be doing some version of this, and it has to do it constantly, every time you move any part of your body. It also has to be fast. In our current thinking, each small section of each cortical area (a "cortical column") has to make this transformation somewhat independently of the other columns. For example, each small part of your hand has to calculate where it is relative to an object somewhat independently of the other parts of your hand. This is what we were referring to as the XFORM. The XFORM converts a location in body space into a location in object space, and vice versa. It is extremely unlikely that the cortex is doing this XFORM using mathematical techniques.
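Continuing the illustrative sketch above (again, an illustration of the classical math, not a model of the cortical mechanism), the per-column idea would look like each sensor patch applying the XFORM to its own location, with no patch depending on the others' results, and the same transform running in reverse to map a predicted feature back into body space. The patch names and coordinates here are hypothetical.

```python
import numpy as np

def pose_matrix(x, y, theta):
    # Same helper as in the previous sketch.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0, 0, 1]])

T_body_from_object = pose_matrix(0.30, 0.10, np.pi / 4)  # object pose in body frame
T_object_from_body = np.linalg.inv(T_body_from_object)   # body -> object

# Hypothetical body-frame locations for several independent sensor patches.
patches = {
    "index_tip":  np.array([0.32, 0.15, 1.0]),
    "thumb_tip":  np.array([0.28, 0.08, 1.0]),
    "middle_tip": np.array([0.35, 0.14, 1.0]),
}

# Each patch performs its own XFORM; none needs the others' results.
on_object = {name: T_object_from_body @ p for name, p in patches.items()}

# And vice versa: a feature location on the object maps back into body
# coordinates with the forward transform, which is what would let a model
# predict what a sensor will feel after a movement.
feature_on_object = np.array([0.02, 0.05, 1.0])
feature_in_body = T_body_from_object @ feature_on_object
```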
We didn't start by trying to understand the "what" and "where" pathways, but in hindsight they are almost certainly part of the brain's solution to the XFORM problem. "Where" pathways form representations in body space and "what" pathways form representations in object space. "What" and "where" regions are connected via long-range connections in Layer 6. Layer 6 projects to Layer 4, and these connections comprise 65-75% of the synapses in L4. Our current hypothesis is that the XFORM is occurring in L6.
There is a lot of data in the neuroscience literature that gives clues to how this is happening, and some of the data is contradictory. We are working our way through various hypotheses.