The sensory streams are joined in the association maps. There the SDRs that combine the senses can be formed.
Note that, to my way of understanding, an object is a basket of features, both spatial and temporal.
It makes a great deal of sense that this basket of features can include more than one sensory modality.
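As a rough illustration of that idea, here is a minimal sketch in Python. It assumes the common simplification of an SDR as a set of active bit indices, and it gives each modality its own bit range so the combined "basket" in the association layer is just a union; the width constant and function names are hypothetical, not from any particular implementation.

```python
# Illustrative sketch: an object as a "basket of features" spanning modalities.
# SDRs are modeled here as sets of active bit indices (a common simplification).

MODALITY_WIDTH = 1024  # hypothetical number of bits per sensory stream

def combine_modalities(sdrs):
    """Join per-modality SDRs into one association-layer SDR.

    Each modality keeps its own bit range (an offset per slot), so touch
    bits can never collide with vision bits; the union of the shifted
    sets is the combined multi-modal representation.
    """
    combined = set()
    for slot, sdr in enumerate(sdrs):
        offset = slot * MODALITY_WIDTH
        combined |= {offset + bit for bit in sdr}
    return combined

vision = {3, 97, 402}   # toy active bits from the visual stream
touch = {15, 311}       # toy active bits from the tactile stream
basket = combine_modalities([vision, touch])
# vision bits stay as-is; touch bits land in the 1024+ range
```

Because each modality is confined to its own range, the basket preserves which stream each feature came from while still being a single representation that the association maps could operate on.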
The quality processing stays in the stream where it is sensed. The micro-parsing that extracts levels of spatial and temporal information stays within that stream and presents the result as the basket of features to be associated.
The counter-flowing streams help prime and parse the forward streams; this is a key part of prediction.
The model that is formed is the armature that is compared against the incoming sensory stream. Any difference registers as novelty and triggers learning.