Let me share my thoughts.
as Dimensionality Reduction algorithm
Intuitively, the SP can be considered a dimensionality reduction algorithm. However, it is agnostic to the dimensions and structure of the input, and hence unsupervised, similar to an autoencoder. The SP uses a metaheuristic algorithm rather than gradient descent, so it is not searching for a target; it is simply reorganizing itself.
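To make the intuition concrete, here is a minimal, hypothetical sketch of the idea (not Numenta's actual SP implementation, and with no learning step): random connections, overlap scoring, and k-winners-take-all map any input onto a smaller, fixed-sparsity code, regardless of what the input bits mean.

```python
import numpy as np

rng = np.random.default_rng(42)

INPUT_SIZE = 100  # input bits; the encoder does not care what they mean
NUM_COLS = 30     # output columns, fewer than the input size
K = 6             # active columns per encoding (k-winners-take-all)

# Each column is randomly connected to ~30% of the input bits (a toy stand-in
# for the SP's potential synapses; real SP connections are learned).
connections = rng.random((NUM_COLS, INPUT_SIZE)) < 0.3

def encode(input_bits):
    """Map an input bit vector to the K columns with the highest overlap."""
    overlaps = connections @ input_bits    # overlap score per column
    sdr = np.zeros(NUM_COLS, dtype=int)
    sdr[np.argsort(overlaps)[-K:]] = 1     # keep only the top-K columns
    return sdr

x = (rng.random(INPUT_SIZE) < 0.2).astype(int)  # a sparse random input
code = encode(x)
```

Whatever the input, the output always has exactly K active columns out of NUM_COLS, which is the dimensionality reduction in action.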
as Classifier, Clustering algorithm
In most use cases, the SP is used as a classifier. However, the SP is really just encoding its inputs as a result of the algorithm (see Dimensionality Reduction above); it does not care about, or know, the meaning of those inputs. Because it encodes inputs, and the output set (the number of columns) is usually forced to be smaller than the input size, it reuses encodings for inputs that are semantically similar. Hence, the result is groups of columns rather than individual encodings. But note that these groupings have no meaning on their own. Users like us perceive them as classifications by attaching labels to them with other algorithms (e.g. a softmax layer). Therefore, the SP is clustering by definition, because it groups inputs; classification is only realized when another algorithm interprets its groupings.
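A hypothetical sketch of that last point: the encoder's groupings mean nothing by themselves, and "classification" appears only when a separate step attaches labels to them. The toy k-winners encoder below stands in for a trained SP; the labeler is deliberately trivial (a lookup table rather than a softmax) to keep the separation of roles visible.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_COLS, K = 100, 30, 6
connections = rng.random((N_COLS, N_IN)) < 0.3  # toy stand-in for SP synapses

def encode(x):
    """Toy SP-style encoding: k-winners-take-all over column overlaps."""
    sdr = np.zeros(N_COLS, dtype=int)
    sdr[np.argsort(connections @ x)[-K:]] = 1
    return sdr

# The encoder's groupings carry no meaning; a separate interpreter labels them.
labels = {}

def attach_label(x, name):
    labels[tuple(encode(x))] = name  # label the *grouping*, not the raw input

def interpret(x):
    return labels.get(tuple(encode(x)), "unknown")

cat = (rng.random(N_IN) < 0.2).astype(int)
attach_label(cat, "cat")
```

Any other input that happens to land on the same grouping would receive the same label, which is exactly the "reused encodings for similar inputs" behavior described above.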
Encoder, Feature Extractor
As with Dimensionality Reduction above, the SP is an unsupervised encoder. Encoding means representing the core features in a reduced space (i.e. compression). This is also why it is a feature extractor: the space constraint forces it to extract the bits that matter most.
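A back-of-the-envelope way to see why the space constraint forces extraction: there are vastly fewer available codes than possible inputs, so the encoder must drop less informative bits and reuse codes. The sizes below are illustrative, not from any particular SP configuration.

```python
import math

INPUT_BITS = 100  # illustrative input width
NUM_COLS = 30     # output columns
K = 6             # active columns per code

possible_inputs = 2 ** INPUT_BITS         # every distinct input pattern
available_codes = math.comb(NUM_COLS, K)  # distinct "6 of 30" encodings

# Far fewer codes than inputs: many inputs must share a code, so only the
# most informative bits of the input can survive the encoding.
print(available_codes, "codes for", possible_inputs, "possible inputs")
```

This is the pigeonhole argument behind both the feature extraction here and the grouping behavior described in the clustering section.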
when Stacked together
When SPs are stacked together, think of a stacked encoder.
It is counterintuitive to think about generalization here, because encoders don't necessarily generalize in the ML sense of the word. In my personal opinion, seeking generalization is a double-edged sword: when the algorithm is a DNN, fine, it works for now; otherwise it is fiction.
In the DL world, the stacked SP is similar to a convolution kernel. Why? The kernels in a CNN (there are many of them, by the way) are learned feature extractors; the stacked SP is likewise trained (unsupervised, via SP learning), and at test time it encodes/extracts features. The stacked SP strongly stabilizes its outputs/groupings (at least in this example), so it is a better encoder and intuitively much closer to an autoencoder; for the same reason, it is not well suited to classification tasks.
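The stacking idea can be sketched as one toy k-winners stage feeding the next, each narrowing the representation further (again a hypothetical sketch, not the real SP learning rule):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_stage(n_in, n_cols, k):
    """One toy pooling stage: random connections + k-winners-take-all."""
    conn = rng.random((n_cols, n_in)) < 0.3
    def encode(x):
        sdr = np.zeros(n_cols, dtype=int)
        sdr[np.argsort(conn @ x)[-k:]] = 1
        return sdr
    return encode

stage1 = make_stage(100, 40, 8)  # reads the raw input
stage2 = make_stage(40, 20, 4)   # reads the first stage's SDR, like a stacked encoder

x = (rng.random(100) < 0.2).astype(int)
code = stage2(stage1(x))         # each stage compresses the previous one's output
```

Each stage only ever sees the fixed-sparsity output of the stage below it, which is why the stacked arrangement behaves like layers of an encoder rather than like a classifier.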