I believe you will need the distal context (i.e. where the features are located on the object) in many cases, but that could depend on how similar the objects are or how specific you want your identification to be. The distal context becomes important when you need to distinguish objects that have different arrangements and/or counts of the same set of features.
Consider, for example, some 3D objects like "cube", "pyramid", and "octahedron". All of these contain common features like corners, edges, and sides, and all would probably be represented by the same columns. To distinguish between them, you need to know either the positions of the features or their counts. You can get both of these pieces of information from the distal context. On the other hand, if you just need to identify that it is a "3D shape" (i.e. a class of objects), or if returning "it is either a cube, pyramid, or octahedron" is sufficient, then identification by columns alone should work.
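To make the cube/pyramid/octahedron point concrete, here is a rough sketch in plain Python. It is not HTM code: the `OBJECTS` table and both function names are made up for illustration, and I am assuming a square-based pyramid. Matching on feature types alone stands in for "columns without distal context", and matching on counts stands in for what the distal context adds.

```python
from collections import Counter

# Hypothetical toy data: how many of each feature each solid has.
OBJECTS = {
    "cube": Counter(corner=8, edge=12, side=6),
    "pyramid": Counter(corner=5, edge=8, side=5),  # assumption: square base
    "octahedron": Counter(corner=6, edge=12, side=8),
}

def candidates_by_feature_types(sensed):
    """Match on which feature types were sensed, ignoring counts and positions."""
    sensed_types = set(sensed)
    return sorted(name for name, feats in OBJECTS.items()
                  if sensed_types <= set(feats))

def candidates_by_counts(sensed):
    """Match on the exact number of each feature type."""
    return sorted(name for name, feats in OBJECTS.items() if feats == sensed)

sensed = Counter(corner=6, edge=12, side=8)
print(candidates_by_feature_types(sensed))  # all three shapes remain candidates
print(candidates_by_counts(sensed))         # only the octahedron matches
```

Counts are the simpler of the two signals to show here; feature positions would separate the shapes in the same way.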
I believe you will also need this in most cases, but it depends on how unique the features and/or their positions are. The pooling layer will also potentially have a more complete concept of the object (in case every column below hasn't sensed every feature on the object), so it is more likely to give an accurate prediction. If you look at only a single input, and that feature+location is common to many objects, you could identify a list of potential objects but not any specific one. Another way to approach this would be to look at both the active cells and the predictive cells (since predictive cells will be driven in part by the pooling layer), which would also get you to a specific object. At first this could still identify multiple potential objects, but once a few features had been sensed, it would lock onto a specific one. Active cells in the pooling layer are still probably a better option, though.
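The "lock in after a few features" behavior can be sketched the same way. Again, this is only a toy illustration in plain Python, not how the pooling layer is actually implemented: the objects, the feature labels, the coordinates, and the `narrow` function are all hypothetical. Each object is a set of feature+location pairs, and each new sensation removes the objects that do not contain it.

```python
# Hypothetical toy objects, each a set of (feature, location) pairs.
OBJECTS = {
    "object_1": {("A", (0, 0)), ("B", (1, 0)), ("C", (0, 1))},
    "object_2": {("A", (0, 0)), ("B", (1, 0)), ("A", (0, 1))},
    "object_3": {("A", (0, 0)), ("C", (1, 0)), ("C", (0, 1))},
}

def narrow(sensations):
    """Return the list of remaining candidate objects after each sensation."""
    candidates = set(OBJECTS)
    history = []
    for sensation in sensations:
        # Keep only the objects that contain this feature at this location.
        candidates = {name for name in candidates if sensation in OBJECTS[name]}
        history.append(sorted(candidates))
    return history

steps = narrow([("A", (0, 0)), ("B", (1, 0)), ("C", (0, 1))])
for i, remaining in enumerate(steps, start=1):
    print(f"after sensation {i}: {remaining}")
```

The first sensation is shared by every object, so all three stay in play. The second rules out one, and the third leaves a single candidate.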