I was reading a paper that is relevant to this topic.
The authors argue that the separation of the layers supports both prediction and feedback on how well that prediction succeeded. They cite considerable research in support of their model.
The paper goes on to build an intriguing system in which perception error and the related learning are distributed along the what and where streams using local processing. Both the bottom-up and top-down information flows are described, with an excellent exposition of how it all fits together. This model is more in line with plausible biological function than anything coming out of the deep learning camp.
The part relevant to this discussion starts on page 4, in the left-hand column: “How are the prediction and actual outcome separately represented, and how is the timing of the prediction and outcome coordinated & organized?” [1]
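To make that question concrete, here is a toy sketch of one way a prediction and its outcome can be represented separately in time and compared locally. It is not the paper's model: the linear predictor, the delta-rule update, and names such as `train_predictor`, `lr`, and `epochs` are all my own assumptions for illustration. Each step has a prediction phase followed by an outcome phase, and the difference between the two phases is the only learning signal.

```python
import random

def train_predictor(sequence, epochs=200, lr=0.1, seed=0):
    """Toy two-phase predictive learner (illustrative assumption, not the paper's model).

    Learns to predict the next input pattern from the current one.
    """
    rng = random.Random(seed)
    n = len(sequence[0])
    # weights[i][j]: connection from current-input unit j to prediction unit i
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]
    errors = []
    for _ in range(epochs):
        total = 0.0
        for current, outcome in zip(sequence, sequence[1:]):
            # Phase 1 (prediction): compute the guess for the next input.
            prediction = [sum(w * x for w, x in zip(row, current)) for row in weights]
            # Phase 2 (outcome): the actual next input arrives.
            # The learning signal is the difference between the two phases.
            delta = [o - p for o, p in zip(outcome, prediction)]
            for i in range(n):
                for j in range(n):
                    # Local update: uses only the sending unit's activity
                    # and the receiving unit's own phase difference.
                    weights[i][j] += lr * delta[i] * current[j]
            total += sum(d * d for d in delta)
        errors.append(total)
    return weights, errors

# Usage: a repeating cycle of three one-hot patterns.
cycle = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
weights, errors = train_predictor(cycle)
print(errors[0], errors[-1])  # summed squared prediction error, first vs. last epoch
```

The point of the sketch is only that no global error signal is needed: each connection changes based on activity available at its own two ends, once the prediction and the outcome have been separated into successive phases.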
[1] Deep Predictive Learning: A Comprehensive Model of Three Visual Streams
https://arxiv.org/abs/1709.04654