When you preprocess your data such that the mean is zero, you are essentially translating the origin of the space to the mean value from all of the samples. In that case, your principal component analysis is revealing the higher-order moments in the variation of the data about this mean. The dominant eigenvalue/eigenvector among these moments is likely going to be proportional to the gradient of the hyperplane that best fits the data (e.g. in the least-squares sense for L2-regularization).

If you then subtract this moment from the data, and perform the same analysis on the residuals, then you can extract the next moment in the data using the same technique. At each step you are projecting each of the data points onto a specific axis in the data space (the one with the greatest variation in the data values).

By subtracting each successive component (in the vector sense: a[i+1] = a[i] - <a[i],e[i]> e[i]) from the data points and repeating the analysis on the residual vectors to select the next most dominant eigenvector, you are effectively doing a Gram-Schmidt orthogonal decomposition of the data space. (See section on Sanger’s Rule in the Olshausen’s paper.)

EDIT: The above statements are generally applicable to most linear transformations between vector spaces. However, I am still parsing through the details of applying PCA to the transformation of a correlation matrix (into a decorrelated diagonal matrix) to see if the above intuition still applies.

EDIT 2: I believe that some of this intuition still holds, however the matrix is not performing a linear transform from one space to another. Rather it is quantifying the pairwise similarity between all of the input vectors.