The “encoder” is taking the place of the sensory organ in a biological system. The encoders we’ve creating are extremely simple compared to your cochlea and retina. Quick correction: Encoders do not have to produce sparse arrays, but their output must have semantic meaning. Spatial Pooling will extract meaning from sparse or dense input. For example, if someone shines a flashlight in your face, you might receive denser representations (more input activations) into the cortex from the retinas. The cortex (SP) will normalize this to a stable rate of about 2%. These sparse activations still have meaning, and cortical columns use them to build models of reality, or reference frames. The sparsity is super important, because is allows them to store a lot of objects and compare and contrast them efficiently.
The Spatial Pooling algorithm produces a stable sparsity. You can change this by setting a value in our models. It could produce dense representations, but sparse seems to be crucial to the whole system working. Input to the SP can be direct sensory encodings, but they must be binary arrays. They can also have topology. The SP can be configured with topology and local inhibition.