There are several papers that discuss the What and Where pathways. The distinction was first observed in vision. Region MT is the first region in the Where pathway of vision whereas V1 is the first What region. The What/Where distinction was later found to exist in other modalities. I don’t recall the names of the regions, but a quick search in Google Scholar will likely turn up papers that detail them. We recently read a paper at Numenta about the evidence for Where paths in audition and touch. I don’t recall the name or authors; maybe someone else can provide a link?
Regarding hierarchy, there is a lot that we don’t understand, so please keep that in mind.
- Going up the hierarchy the representations are driven by broader areas of input. That is generally accepted.
- A cell or column can only process the information from the part of the input it receives. This makes sense, and most scientists accept the idea that a small area of V1 just processes its limited input and passes the processed version up the hierarchy to be “integrated” by a higher region.
- I used to think that was the end of the story, but I now believe it is more interesting than that. Although a small area of V1 can only receive input from a small area of the retina, that small area of V1 can actually learn and model entire visual objects. Each small area, a cortical column (CC), uses its model to predict what it will see based on knowledge of where on the object it is fixated. If a CC could talk it might say “I know what a pen is. I can only see a small part of a pen at a time, but as long as I know where on the pen I am currently fixated (location) I can predict what I should be sensing.”
- Most scientists think that a CC in V1 processes its current input and passes it up. What we are saying is that a CC in V1 integrates inputs over time to understand objects that are much larger than it can sense at any moment.
- There are limits to what any CC can learn (both memory capacity and learning time). So a CC in V1 can probably only understand objects and structures that span a subset of the entire visual field, but it would still be modeling structure much bigger than you would expect from the small part of the retina that feeds it.
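
To make the "CC integrates over time" idea concrete, here is a toy sketch in Python. It is only an illustration of the argument above, not Numenta's implementation and not a model of real neurons. Everything in it is an assumption made up for the example: the `ToyColumn` class, the grid-style locations, and the feature names. The column stores each object as a map from location on the object to the feature sensed there, predicts what it should sense at a known location, and narrows down which object it is looking at across successive fixations.

```python
from typing import Dict, Optional, Set, Tuple

Location = Tuple[int, int]  # hypothetical location on the object, not on the retina


class ToyColumn:
    """Hypothetical stand-in for one cortical column (CC)."""

    def __init__(self) -> None:
        # object name -> {location on object: feature sensed at that location}
        self.models: Dict[str, Dict[Location, str]] = {}
        # objects still consistent with everything sensed since the last reset
        self.candidates: Set[str] = set()

    def learn(self, name: str, location: Location, feature: str) -> None:
        """Store one (location, feature) pair for a named object."""
        self.models.setdefault(name, {})[location] = feature

    def reset(self) -> None:
        """Start recognizing a new object: every learned object is a candidate."""
        self.candidates = set(self.models)

    def predict(self, name: str, location: Location) -> Optional[str]:
        """What the column expects to sense at this location on this object."""
        return self.models[name].get(location)

    def sense(self, location: Location, feature: str) -> Set[str]:
        """Integrate one fixation: keep only objects that predicted this input."""
        self.candidates = {
            name
            for name in self.candidates
            if self.models[name].get(location) == feature
        }
        return self.candidates


column = ToyColumn()

# Made-up training data: two objects that share one feature at one location.
for loc, feat in {(0, 0): "tip", (0, 1): "barrel", (0, 2): "clip"}.items():
    column.learn("pen", loc, feat)
for loc, feat in {(0, 0): "eraser", (0, 1): "barrel", (0, 2): "point"}.items():
    column.learn("pencil", loc, feat)

column.reset()
after_first = set(column.sense((0, 1), "barrel"))  # ambiguous: pen and pencil both fit
after_second = set(column.sense((0, 2), "clip"))   # a second fixation leaves only the pen
prediction = column.predict("pen", (0, 0))         # expected input at another location
```

The point of the sketch is that the column never sees more than one small patch at a time, yet after a couple of fixations it has settled on the whole object and can predict what it would sense elsewhere on it. The capacity limit mentioned above would correspond to a cap on how many objects or locations `models` can hold, which this toy version leaves out.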