It sounds like @Rodi’s problem is about recognizing that there is a C, an A, and a T in the first place (versus how to combine them). The difficulty is that every word has a different physical size (for example, an image of the word “CAT” has a different width than one of the word “TACO”).
One naive way you could solve this is to give each possible letter position its own dedicated set of minicolumns. Each of these “slots” would need to be trained to recognize each possible letter. The benefit is that every word would get a resulting SDR from the SP process.
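Here is a rough sketch of the slot idea, just to make it concrete. It is not a trained spatial pooler: `letter_columns` is a hypothetical stand-in that assigns a fixed sparse set of minicolumns to each (slot, letter) pair, and the sizes (`NUM_SLOTS`, `COLUMNS_PER_SLOT`, `ACTIVE_PER_SLOT`) are made-up values for illustration.

```python
import random

NUM_SLOTS = 8            # hypothetical maximum word length
COLUMNS_PER_SLOT = 64    # hypothetical minicolumns dedicated to each slot
ACTIVE_PER_SLOT = 4      # hypothetical active columns per recognized letter


def letter_columns(slot, letter):
    """Stand-in for a trained slot: a fixed sparse set of columns per (slot, letter)."""
    rng = random.Random(f"{slot}:{letter}")  # string seed keeps the mapping repeatable
    local = rng.sample(range(COLUMNS_PER_SLOT), ACTIVE_PER_SLOT)
    offset = slot * COLUMNS_PER_SLOT         # each slot owns its own column range
    return {offset + c for c in local}


def encode_word(word):
    """Word SDR as the union of the active columns of every occupied slot."""
    if len(word) > NUM_SLOTS:
        raise ValueError("word is longer than the number of slots")
    active = set()
    for slot, letter in enumerate(word.upper()):
        active |= letter_columns(slot, letter)
    return active


cat = encode_word("CAT")
car = encode_word("CAR")
print(len(cat), len(cat & car))  # CAT and CAR share the columns for slots 0 and 1
```

Because each slot owns a separate range of columns, words that share a letter in the same position share active columns, while the same letter in a different position does not overlap at all.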
There are a couple of drawbacks. One is that it assumes the letters are all roughly the same width. This of course limits the number of fonts that could be learned. Another problem is that it wouldn’t be the most efficient solution. Some of the minicolumns would be used much more frequently than others. If you needed to support every word in the English language, you’d need 45 slots. The 45th slot would only be used to recognize the letter “s” in a single word (pneumonoultramicroscopicsilicovolcanoconiosis).
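To see how uneven the slot usage gets, you can count how many words touch each slot. This is only a toy illustration: the three-word vocabulary and the `slot_usage` helper are made up for the example, not taken from any real dataset or library.

```python
from collections import Counter


def slot_usage(words):
    """Count how many words place a letter in each slot (0-based position)."""
    usage = Counter()
    for word in words:
        for slot in range(len(word)):
            usage[slot] += 1
    return usage


words = ["cat", "taco", "pneumonoultramicroscopicsilicovolcanoconiosis"]
usage = slot_usage(words)
print(usage[0], usage[3], usage[44])  # first slot, fourth slot, 45th slot
```

With a realistic vocabulary the early slots would be hit by nearly every word, while the columns reserved for the last slots would sit idle almost all the time.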
I would solve this problem a bit differently. I think implementing a simplified form of saccades would be a better approach. @sebjwallace suggested an approach that could be adapted to work here (see this post). The lower-resolution views could be used to determine where the breaks between letters are located, then higher-resolution views could be cropped before sending to the spatial pooler. Then you’d need to pool all the letters together into one representation. One possible way to do that might be the variation of the temporal memory algorithm suggested for pooling variations of a face, described in this paper.
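As a minimal sketch of the "find the breaks at low resolution, crop at high resolution" step: the code below assumes binary images stored as lists of rows, and assumes letters are separated by at least one blank column in the low-resolution view (which won't hold for cursive or tightly kerned fonts). The function names and the `scale` parameter are my own, and this is not necessarily how @sebjwallace's approach works.

```python
def find_letter_spans(image):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    width = len(image[0])
    # A column has ink if any row has a nonzero pixel in that column.
    ink = [any(row[x] for row in image) for x in range(width)]
    spans = []
    start = None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                    # a letter begins here
        elif not has_ink and start is not None:
            spans.append((start, x))     # a blank column ends the letter
            start = None
    if start is not None:
        spans.append((start, width))     # letter runs to the right edge
    return spans


def crop_letters(image, spans, scale=1):
    """Crop each span from an image that is `scale` times wider than the one
    the spans were found in."""
    return [[row[s * scale:e * scale] for row in image] for s, e in spans]


low_res = [
    [1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
]
spans = find_letter_spans(low_res)
print(spans)
```

Each cropped patch could then be sent to the spatial pooler on its own, which leaves the pooling of the per-letter outputs into one word representation as the remaining step.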