This is a little more speculative, but at a high level: first, I believe we need a layer that represents the object/concept currently being attended to. This layer (which Numenta has labeled the Object layer in a couple of their papers) would bias the lower layers. Conceptually, this biasing would work something like what I walked through in this post. Of course, that was just stepping through a toy example with a simple TM layer, so the idea would need to be fleshed out and applied to more complicated SMI-related layers (e.g. reference frames, motor outputs, etc.).
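To make the biasing idea a bit more concrete, here is a minimal sketch of how an attended object might bias a simple TM-like layer. Everything in it is an assumption for illustration: the `activate` function, the `apical_bias` table, and the object names are hypothetical, and cells are plain `(minicolumn, cell)` tuples rather than anything from an actual HTM library. The rule is the usual TM one: in an active minicolumn, biased cells win; if none are biased, the minicolumn bursts.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (minicolumn index, cell index within that minicolumn)

CELLS_PER_COLUMN = 4  # arbitrary toy size


def activate(active_columns: Set[int], biased_cells: Set[Cell]) -> Set[Cell]:
    """Choose active cells: biased cells win their minicolumn, otherwise it bursts."""
    active: Set[Cell] = set()
    for col in active_columns:
        winners = {cell for cell in biased_cells if cell[0] == col}
        if winners:
            active |= winners  # the bias narrows the minicolumn to the expected cells
        else:
            active |= {(col, i) for i in range(CELLS_PER_COLUMN)}  # burst
    return active


# Hypothetical learned feedback: the lower-layer cells each object depolarizes.
apical_bias: Dict[str, Set[Cell]] = {
    "mug": {(0, 1), (2, 3)},
    "bowl": {(0, 2), (5, 0)},
}

attended = "mug"  # the object currently represented in the Object layer
cells = activate({0, 2, 7}, apical_bias[attended])
```

Here minicolumns 0 and 2 resolve to the cells that "mug" biases, while minicolumn 7 bursts because the attended object says nothing about it.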
That Object layer would need to implement some form of Temporal Pooling. I think the “right” implementation would involve hex grids, but one could play around with simpler implementations (there are some floating around the forum) or forgo learning in this layer initially (as Numenta did in a couple of their papers) while focusing on other pieces of the CC circuit.
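As one example of the "simpler implementations" end of the spectrum, here is a rough sketch of a toy pooler. It is not hex grids and not any particular forum implementation. The `ToyPooler` class, its parameters, and the choice of a random stable code per object are all my assumptions for illustration: during learning, one fixed set of Object-layer cells stays active across the whole input sequence and learns every input it sees.

```python
import random
from typing import Dict, FrozenSet, List, Set


class ToyPooler:
    """Toy temporal pooler: one stable set of cells per learned object."""

    def __init__(self, n_cells: int = 64, n_active: int = 8, seed: int = 0):
        self.rng = random.Random(seed)
        self.n_cells = n_cells
        self.n_active = n_active
        # For each Object-layer cell, the lower-layer input bits it has learned.
        self.inputs_for: Dict[int, Set[int]] = {c: set() for c in range(n_cells)}

    def learn_object(self, sensations: List[Set[int]]) -> FrozenSet[int]:
        """Pick a random stable code and associate every sensation with it."""
        code = frozenset(self.rng.sample(range(self.n_cells), self.n_active))
        for sensation in sensations:
            for cell in code:
                self.inputs_for[cell] |= sensation
        return code

    def infer(self, sensation: Set[int]) -> FrozenSet[int]:
        """Return up to n_active cells with the highest nonzero input overlap."""
        overlap = {c: len(self.inputs_for[c] & sensation) for c in range(self.n_cells)}
        ranked = sorted(overlap, key=lambda c: (-overlap[c], c))
        return frozenset(c for c in ranked[: self.n_active] if overlap[c] > 0)


pooler = ToyPooler()
mug = pooler.learn_object([{1, 2, 3}, {4, 5, 6}])
bowl = pooler.learn_object([{7, 8, 9}, {10, 11, 12}])
```

After learning, `pooler.infer({1, 2, 3})` and `pooler.infer({4, 5, 6})` both return the `mug` code, which is the "pooling" part: different inputs from the same object map to one stable representation.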
The Object layer is also where voting would occur. Whether that happens through competing hex grids or, more simply, by using the TM learning algorithm between CCs in this layer, the idea would be to set the layer up so that it gets a driving input from the lower layers in the same CC and a biasing input from the Object layers in other CCs.
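The driving-versus-biasing split can be sketched in a few lines. This is a toy under stated assumptions, not hex grids and not the TM learning algorithm: the `vote` function is hypothetical, and objects are plain string labels instead of SDRs. Each CC can only represent objects its own lower layers support (driving input), and among those it keeps the ones that the most other CCs currently represent (biasing input).

```python
from typing import List, Set


def vote(driven: List[Set[str]], rounds: int = 3) -> List[Set[str]]:
    """Toy voting between CCs.

    driven[i] holds the objects consistent with column i's own driving input.
    Lateral bias can only choose among those candidates; it never adds new ones.
    """
    active = [set(d) for d in driven]
    for _ in range(rounds):
        updated = []
        for i, candidates in enumerate(driven):
            # Lateral support: how many *other* columns currently represent each candidate.
            support = {
                obj: sum(obj in active[j] for j in range(len(active)) if j != i)
                for obj in candidates
            }
            best = max(support.values(), default=0)
            updated.append({obj for obj in candidates if support[obj] == best})
        active = updated
    return active


# Three columns, each ambiguous on its own, agreeing only on "mug".
result = vote([{"mug", "bowl"}, {"mug", "can"}, {"mug", "bowl", "can"}])
```

In this example every column settles on `{"mug"}` after the first round, since "mug" is the only candidate that all of the columns support.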