I’ll try to answer your questions.
If you haven’t seen it, this post might be useful: Preliminary details about new theory work on sensory-motor inference
It depends on the details, like how many inputs each SP cell has. If needed, you can always add another step before the SP to make the input sparser. For example, numbers are usually encoded in bins rather than as raw binary. As I understand it, values from 0 to 10 could go in one bin, 11 to 20 in another, and so on, with one set of bins for each field in the input (e.g. temperature, humidity, and wind).
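To illustrate the binning idea, here’s a minimal toy encoder (my own sketch; real NuPIC encoders such as ScalarEncoder also give nearby values overlapping bits, which this omits). Each field gets its own set of bins, and the per-field encodings are concatenated into one binary input for the SP.

```python
def encode_scalar(value, min_val, max_val, num_bins):
    # One-hot bin encoding: 0-10 lands in one bin, 11-20 in the next, etc.
    bin_width = (max_val - min_val) / num_bins
    index = min(int((value - min_val) / bin_width), num_bins - 1)
    return [1 if i == index else 0 for i in range(num_bins)]

# One set of bins per field (temperature, humidity, wind).
sample = {"temperature": 72.0, "humidity": 40.0, "wind": 12.0}
encoding = []
for field, value in sample.items():
    encoding += encode_scalar(value, min_val=0.0, max_val=100.0, num_bins=10)

print(sum(encoding), len(encoding))  # 3 active bits out of 30 -> sparse
```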
In the temporal memory, usually 2000. In the demonstrations I’ve seen for object recognition, I think in the ballpark of 500. It depends on the type of data, though. I used around 250 columns in a temporal memory that predicted the next letter in a small set of words, and it made some mistakes, probably because I didn’t use sequence resets.
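For what it’s worth, here’s roughly what that word experiment looks like in code. This is a hedged sketch: the TemporalMemory parameter and method names are as I remember them from NuPIC’s `nupic.algorithms.temporal_memory`, and the letter encoder is a made-up toy, so treat it as illustrative rather than exact.

```python
import random

from nupic.algorithms.temporal_memory import TemporalMemory

def encode_letter(letter, num_columns=2048, num_active=40):
    # Toy letter encoder, just for this sketch: deterministically
    # pick a fixed set of active column indices for each letter.
    rng = random.Random(letter)
    return sorted(rng.sample(range(num_columns), num_active))

tm = TemporalMemory(
    columnDimensions=(2048,),  # the usual ~2000 columns; I got away with ~250
    cellsPerColumn=32,
)

for word in ["cat", "car", "cab"]:
    for letter in word:
        tm.compute(encode_letter(letter), learn=True)
    tm.reset()  # the sequence reset I skipped; without it, word boundaries blur together
```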
In the brain, it’s a lot harder to say; I can only guess for the rodent whisker sense. Very roughly, there are about 150 minicolumns per CC in one sublayer of L5. One study estimates ~1000 cells per CC in that sublayer, so I doubt there are more than ~300 minicolumns.
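To spell out the arithmetic behind that bound (the cells-per-minicolumn figure here is my own assumption, not from the study): if each minicolumn contributes somewhere around 3 to 7 cells to that sublayer, then

```latex
\frac{\sim 1000 \text{ cells}}{3\text{--}7 \text{ cells per minicolumn}} \approx 150\text{--}330 \text{ minicolumns}
```

which is consistent with the ~150 estimate and makes anything much past ~300 unlikely.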
All cortical columns have cells which project to other levels, so it’s a subset of cells within each CC that projects, not a subset of CCs. These connections often don’t fit the traditional hierarchical scheme, where each level projects to the next level up and the next level down. But since they are less numerous than connections which do fit that scheme, many assume the basic rigid hierarchy still works. Jeff Hawkins is offering an explanation for those sparse projections.
Those projections which defy the conventional rigid hierarchy make sense because the point is to narrow down the possible objects. For example, if S2 has narrowed down the possible objects to A, B, and C by touch, and V2 has narrowed down the possible objects to C, D, and E by vision, then they can communicate to realize the only possibility is object C.
The same idea applies to communication between different levels in the hierarchy. If different levels in the hierarchy narrow down the object differently, they can communicate to narrow down the object further.
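To make that concrete, here’s a toy sketch in plain Python (the region names and object sets are just the example above, not real data): each region keeps a union of candidate objects, and exchanging them amounts to a set intersection.

```python
# Each region maintains a union (here, a set) of objects still
# consistent with its own sensory evidence.
s2_candidates = {"A", "B", "C"}  # narrowed down by touch
v2_candidates = {"C", "D", "E"}  # narrowed down by vision

# Exchanging representations lets each region drop objects the other
# has already ruled out; what survives is what both agree on.
shared = s2_candidates & v2_candidates
print(shared)  # {'C'} -> the only remaining possibility
```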
He mentions these connections are sparse. Normally, people assume the traditional rigid hierarchy still works because sparse connections are considered less impactful.
Here's my understanding of why these sparse connections are sufficient for object disambiguation.
Even if S2 knows the exact object, the signal might be hard for V2 to interpret because the input is sparse. Each input from S2 is a bit ambiguous on its own, and since each V2 cell receives only a small number of those inputs, no single cell can integrate enough of them to remove the ambiguity. But each V2 cell can narrow down the possibilities slightly based on its own ambiguous inputs, and then the cells can compare to narrow down the possible objects further. This lets them interpret a sparse input unambiguously, even though no individual V2 cell receives enough inputs from S2 to infer the object on its own.
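Here’s a toy simulation of that last point (my own illustration, not Numenta code; the object features, receptive-field sizes, and cell counts are all made up): each “V2 cell” samples only a few of S2’s output bits, keeps every object consistent with the little it saw, and the intersection across cells pins down the object even though no single cell could.

```python
import random

random.seed(0)

# Each object is a set of feature bits that S2 might send.
objects = {name: set(random.sample(range(200), 20)) for name in "ABCDE"}
active_bits = objects["C"]  # the sparse signal S2 actually sends

# Each V2 cell receives inputs from only a small random subset of S2 axons.
cells = [set(random.sample(range(200), 15)) for _ in range(50)]

candidates = set(objects)
for receptive_field in cells:
    seen = receptive_field & active_bits  # the few active inputs this cell gets
    # The cell keeps every object consistent with what little it saw...
    cell_candidates = {name for name, bits in objects.items() if seen <= bits}
    # ...and the cells compare notes by intersecting their candidate lists.
    candidates &= cell_candidates

print(candidates)  # almost surely {'C'}: together the cells disambiguate
```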