I will leave it to @rhyolight to answer questions about how Numenta thinks grid cells and layers work.
As for your nibbling at the edges of object representation (questions 4 & 7): we were toying with that same question in another thread. When you are thinking about an apple vs. an orange on the table, you have to sort out where the properties are being represented. I see this as more a question of hierarchy than a layer thing; the H of HTM. The WHAT stream processes the cluster of features that make up an object; the WHERE stream deals with its location. These fragments are combined at about the level of the temporal lobe into an experience.
I refer you to this post on the WHAT and WHERE streams, which is perhaps more detailed than the current question warrants:
This is how I visualize the various maps and streams working together - a bit cheesy but very visual: