I have my own take on the objective functions for the spatial pooler. I’ve been studying sparse representation theory for the past few months and I think it may have some relevance for formalizing the spatial pooler models and algorithms.
Sparse representation theory was primarily developed for signal and image processing applications as a means of reducing the storage and transmission requirements for very dense, high-dimensional data being generated by sensors. The fundamental assumption underlying the theory is that there exists a well-defined structure to the sensed data that can be exploited and that the dense data collected by the sensor is highly redundant.
To put this in the context of a concrete example, image you have an NxN image patch being generated by a CCD array in a camera. If you look at all of the possible images that could be produced by N^2 pixels, with up to 24-bits of information per pixel. The vast majority of those potential image patches would appear to us as random noise. Since the real world usage of this camera is likely not recording random noise, that means that there is only a very small percentage of that potential input space that is ever going to be sensed by that image patch. In other words, it should be possible to project the dense signal in NxNx2^24 space down onto a much lower dimensional manifold, within which the real world image patches are most likely to be found.
If this lower dimensional space is modeled properly, then a whole host of image processing tasks become much more practical. For example, noisy or blurry images are likely to be a very short distance from this manifold, and reconstructing the image from a series of learned filters can produce dramatically improved image quality.
So, that’s the theoretical basis. What’s the implementation?
Assume that there exists a dictionary of learned sensor patterns (filters, convolutions, etc.), and that this dictionary forms a basis set that completely covers the space of input patterns likely to be sensed by the device. (How this dictionary is learned in an active area of research.) Then it should be possible to decompose the input sensor data into a linear combination of these filters. e.g.
z ~= y = Ax
where z is the original sensed data, y is the reconstructed approximation, x is a vector of coefficients, and A is the dictionary of all learned filters. The number of filters in A may actually be very large, thus the dimension of x is also very large. However, according to the theory, it should be possible to approximate nearly all real sensed data patterns by using only a very small number of these filters. In other words, x is going to be a very sparse vector.
So, you asked about the objective function. There are a couple of approaches that can be used to obtain an optimal representation. To produce a representation that meets a prescribed error tolerance, epsilon:
min{A, x[i]} sum( ||x[i]||_0, i=1…M ) subject to || z[i] - Ax[i] ||_2 < epsilon
The ||x||_0 is the zero-norm (simply a count of the non-zero terms). So this basically reads: Find the optimal dictionary A, and set of representations x[i] that minimizes the number of non-zero coefficients in each x[i] while also satisfying the specified error tolerance constraint for each of the M data samples.
Another option allows more flexibility in the error tolerance in exchange for a bound on the sparsity of x; let’s say only allow k non-zero terms in each approximation. Then the objective looks like this:
min{A, x[i]} sum( || z[i] - Ax[i] ||_2^2, i=1…M) subject to ||x[i]||_0 <= k
This reads like: Optimize A and x such that we obtain the lowest L2 error over the data set while only allowing k or fewer non-zero coefficients in each representation x[i].