Hi @naina, welcome!
Would it be fair to say your data is not temporal in nature? If so, are you using only the spatial pooler SDR as opposed to the sequence memory SDR? Sequence memory representations are unlikely to be useful for computer-vision-style single-image classification where there is no meaningful time dimension.
If the data is not temporal in nature, HTM may not be quite the right fit for this problem, because HTM is about modelling temporal sequences.
Even assuming a sequential version of this problem, such as video analysis, the encoding also strikes me as an issue. I'm assuming you're encoding the pixels according to the following section of the paper:
"8. Encoding Multiple Values
Some applications require multiple values to be encoded for a single HTM model. The separate values can be encoded on their own and then concatenated to form the combined encoding."
If you have 32x32x3 feature vectors where each feature is encoded by 1000 bits, that's 32 x 32 x 3 x 1000 = 3,072,000 input bits, right? If so, that is probably far too large an input space for learning to classify high-level categories like dogs and cats, unless you have millions of training samples.
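To make the size concrete, here is a minimal sketch of what I'm guessing your setup looks like. The `encode_scalar` function is a toy stand-in I wrote for illustration, not the encoder from the paper or from any HTM library, and the 21 active bits per value is purely my assumption.

```python
import numpy as np

def encode_scalar(value, min_val=0, max_val=255, n_bits=1000, n_active=21):
    """Toy scalar encoder (illustrative only): a contiguous run of
    n_active ON bits whose start position tracks the value."""
    n_buckets = n_bits - n_active + 1
    frac = (value - min_val) / (max_val - min_val)
    start = int(round(frac * (n_buckets - 1)))
    sdr = np.zeros(n_bits, dtype=np.uint8)
    sdr[start:start + n_active] = 1
    return sdr

def encode_image(image, n_bits=1000, n_active=21):
    """Encode every pixel channel separately, then concatenate the results."""
    return np.concatenate(
        [encode_scalar(v, n_bits=n_bits, n_active=n_active) for v in image.ravel()]
    )

# Random stand-in for one 32x32 RGB image with 8-bit channels.
image = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3))
encoding = encode_image(image)

print(encoding.size)        # 3072000
print(int(encoding.sum()))  # 64512 ON bits (3072 values x 21 bits each)
```

So every single image becomes a vector of about three million bits, and that is before the spatial pooler has seen anything.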
My work involves images, and I've found the encoding to be the most important step. The problem with images is that they're so high-dimensional that the system would need an enormous amount of training data to learn anything useful from raw pixels. So I usually encode images by preprocessing them with a standard sparse coding mechanism, like a bank of Gabor filters.
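As a rough illustration of the idea, and not my actual pipeline, here is a small NumPy-only sketch of a Gabor filter bank that turns a grayscale image into a binary SDR. All the parameter values (kernel size, stride, number of orientations, 5% sparsity) and the top-k binarization step are assumptions picked for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size=7, theta=0.0, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Real-valued Gabor kernel: a cosine grating under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero mean, so flat regions give no response

def gabor_sdr(gray, n_orientations=4, size=7, stride=4, sparsity=0.05):
    """Filter the image at several orientations and keep only the strongest
    responses as ON bits (top-k binarization, an assumption of this sketch)."""
    windows = sliding_window_view(gray, (size, size))[::stride, ::stride]
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    responses = np.stack([
        np.abs(np.einsum("ijkl,kl->ij", windows, gabor_kernel(size, t)))
        for t in thetas
    ])
    flat = responses.ravel()
    n_active = max(1, int(sparsity * flat.size))
    sdr = np.zeros(flat.size, dtype=np.uint8)
    sdr[np.argpartition(flat, -n_active)[-n_active:]] = 1
    return sdr

# Random stand-in for a 32x32 grayscale image.
gray = np.random.default_rng(0).random((32, 32))
sdr = gabor_sdr(gray)
print(sdr.size, int(sdr.sum()))  # 196 9
```

With these made-up settings a 32x32 image becomes a 196-bit SDR with 9 active bits, and each bit stands for an oriented edge at some location rather than for a raw pixel intensity. In practice you would tune the filter sizes and sparsity to your data.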