What could be SP and TM's objective functions?

Hi, I’m wondering as to what could be SP and TM’s objective functions?

I know they are both doing a self-organizing type of learning but what are they really optimizing and how they are implicitly achieving the optimization?

I have a crude idea about this but it would be great to know other ideas as well.

4 Likes

I’m very much not a math person, so don’t have any objective functions to offer. But FWIW here are a couple more crude ideas/ intuitions that may help in the discussion:

For TM, I would say the algorithm is intended to minimize the difference between the number of predicted active cells and the number of minicolumns * target density.

IMO, SP is a bit trickier to phrase in terms of minimizing or maximizing something. I believe the goal of the algorithm is to fix sparsity while preserving the same ratios of semantics in the output as are in the input.

Perhaps one could argue that SP is not really separable, and that SP and TM together are a function whose goal is the one I mentioned for TM.

3 Likes

:point_up:

They evolved together within the same cell populations.

1 Like

No worries, and thanks for sharing your ideas. Math is costly without a good intuition anyway.

Could you please explain the logic behind this?

Sure, the output of SP represents active minicolumns. The number of active minicolumns that it outputs is equal to the total number of minicolumns in the TM layer multiplied by the target density (a.k.a “sparsity”, though arguably that is the opposite of “density”…) The goal of the TM algorithm is for each of these active minicolumns to have exactly one predicted active cell in every timestep (meaning the algorithm has effectively modeled its environment and is correctly predicting what will happen next in every timestep).

1 Like

Thanks for elaborating.

So how does the TM minimizes this difference?

By remembering every input in the context of everything that came before it. This of course makes an assumption that there are temporal patterns in its environment which repeat and are predictable.

I see. This seems to me like the high-level emergent behavior.

Would you know some other mechanism in the TM that’s much more low-level, something that can be quantified at the same time encourage the minimization of the difference you’ve just elaborated? It would be interrsting to get an insight about the relation of the low-level operations to their emerging high-level counterparts that we perceive as doing some task (e.g. prediction).

I have my own take on the objective functions for the spatial pooler. I’ve been studying sparse representation theory for the past few months and I think it may have some relevance for formalizing the spatial pooler models and algorithms.

Sparse representation theory was primarily developed for signal and image processing applications as a means of reducing the storage and transmission requirements for very dense, high-dimensional data being generated by sensors. The fundamental assumption underlying the theory is that there exists a well-defined structure to the sensed data that can be exploited and that the dense data collected by the sensor is highly redundant.

To put this in the context of a concrete example, image you have an NxN image patch being generated by a CCD array in a camera. If you look at all of the possible images that could be produced by N^2 pixels, with up to 24-bits of information per pixel. The vast majority of those potential image patches would appear to us as random noise. Since the real world usage of this camera is likely not recording random noise, that means that there is only a very small percentage of that potential input space that is ever going to be sensed by that image patch. In other words, it should be possible to project the dense signal in NxNx2^24 space down onto a much lower dimensional manifold, within which the real world image patches are most likely to be found.

If this lower dimensional space is modeled properly, then a whole host of image processing tasks become much more practical. For example, noisy or blurry images are likely to be a very short distance from this manifold, and reconstructing the image from a series of learned filters can produce dramatically improved image quality.

So, that’s the theoretical basis. What’s the implementation?

Assume that there exists a dictionary of learned sensor patterns (filters, convolutions, etc.), and that this dictionary forms a basis set that completely covers the space of input patterns likely to be sensed by the device. (How this dictionary is learned in an active area of research.) Then it should be possible to decompose the input sensor data into a linear combination of these filters. e.g.

z ~= y = Ax

where z is the original sensed data, y is the reconstructed approximation, x is a vector of coefficients, and A is the dictionary of all learned filters. The number of filters in A may actually be very large, thus the dimension of x is also very large. However, according to the theory, it should be possible to approximate nearly all real sensed data patterns by using only a very small number of these filters. In other words, x is going to be a very sparse vector.

So, you asked about the objective function. There are a couple of approaches that can be used to obtain an optimal representation. To produce a representation that meets a prescribed error tolerance, epsilon:

min{A, x[i]} sum( ||x[i]||_0, i=1…M ) subject to || z[i] - Ax[i] ||_2 < epsilon

The ||x||_0 is the zero-norm (simply a count of the non-zero terms). So this basically reads: Find the optimal dictionary A, and set of representations x[i] that minimizes the number of non-zero coefficients in each x[i] while also satisfying the specified error tolerance constraint for each of the M data samples.

Another option allows more flexibility in the error tolerance in exchange for a bound on the sparsity of x; let’s say only allow k non-zero terms in each approximation. Then the objective looks like this:

min{A, x[i]} sum( || z[i] - Ax[i] ||_2^2, i=1…M) subject to ||x[i]||_0 <= k

This reads like: Optimize A and x such that we obtain the lowest L2 error over the data set while only allowing k or fewer non-zero coefficients in each representation x[i].

3 Likes

Other than BAMI, no sorry.

1 Like

Thank you for sharing, this is also a very interesting insight!

With the SP I think this would equate to the following;

A

The algorithm (filter) and the configuration of the SP. The configuration is constant by default hence in the default SP’s case, it does not search for A’s.

x

The coefficients or roughly the combination of synapse permanence values.

y

Simply a state of an SP (Ax) which changes per iteration.

However, a major requirement for ML is Generalization. Hence y is actually a combination of previous y's so far which would also mean that Ax is a combination of Ax's and the operations involved can either penalize or reward the coefficients. This would also mean that the SP will not be able to maintain an optimized representation for all y's because a y could be disjoint to another y. If these disjoint inputs exist it would be catastrophic to the SP - I believe this is the cause of forgetting inputs. This goes back to one of my biggest questions in ML especially in DL, as to why do we force Generalization so much even though the model is at the mercy of the input’s disjointness (another topic to discuss).

I also believe that this objective function (OF) is related to the SP’s unknown OF. Things to note though that the SP is implicitly satisfying the constraints (unless some can show that it is explicit), therefore the constraint satisfaction here (subject to) is yet to be proven to be a constraint where the SP is aware of. Otherwise, without this or a similar OF, it is easier to see that the SP does not converge to an “optimal” solution.

These are of course again just my interpretation of the SP using the theory/equation mentioned above.

So, the connection that I have drawn between sparse representation theory (SRT) and HTM theory is at the Spatial Pooler (SP). To me it seems obvious that the connectivity of the synapses over the receptive field are encoding the filters in the dictionary (the columns of A).

In SRT, the values in x are real scalar values, so that the resulting linear combination of filters in A is an approximation to the observed sensory data. In this case, the columns of A are some input feature such as a texture or oriented edge.

In the SP, the values in x would correspond to the proximal dendrites on the output SDR neurons, and the columns of A would be the connectivity matrix between the input nodes and the output nodes (i.e. synapses with permanence values above the connection threshold). In essence, each column of A is a list of connected synapses on a proximal dendrite.

If we were to apply this interpretation to the SP, then the objective function could be expressed as something along the lines of:

Find the matrix A such that all input vectors can be fully reproduced as a combination of at most k columns of A. (Thus ||x||_0 <= k)

Strict adherence to the “fully reproduced” criteria may turn out to require a very large number of columns of A. So, it may be necessary to relax this constraint to simply “approximate to within some tolerable number of flipped bits”. This might also improve the robustness of the SP to small amounts of errors in the input encoding.

min{A,x} sum( || z[i] ^ (A x[i]) ||_0, i=1…M) subject to || x[i] ||_0 <= k.

or

min{A,x} sum(|| x[i] ||_0 , i=1…M) subject to || z[i] ^ (A x[i]) ||_0 < epsilon.

where ^ is the exclusive OR operation.

1 Like

I believe there can be multiple ways to interpret this which all depend on background knowledge. Anyway to me it’s the SP algorithm (a non-linear function) that’s encoding the filters by learning it from the inputs. The combination of these filters in turn encode the output SDR. I said the A was constant because it’s its output (filters) that change, however taking note of these learned filters is very important.

Thanks for this succint problem statement. What do you think about the computational complexity of this problem? It looks to me that it is intractable at some point because columns can be mutually exclusive and increasing k might cause overfitting.

Following up to the previous paragraph, I’m no ML guru, but what if there is no solution or if the observed sensory data so far can’t be expressed in linear combination of the filters? One reason for this is that one filter may cancel the application of another filter due to exclusivity in input data/bits. Maybe the problem can be relaxed to have an acceptable receptive field size so that to potentially minimize exclusive input bits, I think this is the also the reason why topology works.

In the learning phase, the dictionary is not constant. There are two steps to the Matching Pursuit / Sparse Hebbian Learning strategy.

  1. Given the current state of the dictionary, use matching pursuit to obtain a sparse coding for each encoded input signal.
  2. Once you have the sparse representations, use sparse Hebbian learning to tune the filters to better approximate the input signal.

Repeat until the filters in the dictionary converge.

It is a very computationally complex problem. Effectively NP complete with respect to the dictionary and representations. However, there have been a number of fairly slick algorithms that have been developed to cut through some of the more formal aspects of this search. In particular, greedy algorithms, such as Matching Pursuit, can be used to selectively pick off one parameter at time, and subtract it’s contribution to the signal to obtain a residual signal. This signal is then run through the same algorithm to obtain the next most significant coefficient. Rinse and repeat until the remaining residual is effectively random Gaussian noise, or some other stopping criteria is met (e.g. maximum number of parameters obtained for a desired sparseness).

This paper extends this approach further to include a boosting-like modification (a nonlinear gain control) to ensure that every output node has a roughly uniform chance of being made active during the matching pursuit encoding algorithm.

In that case, then the set of basis filters is not yet sufficient to cover the whole space of potential inputs. Either learning was stopped too early, or the system is being exposed to entirely new novel inputs. In either case, the remedy simply is to start learning again. Assuming you have a sufficiently large pool of filters in the dictionary, the ones that most closely match the new novel input will begin to adjust to provide a better representation.

1 Like

Thank you for mentioning Matching Pursuit, this helped me understand more your point-of-view.

My thoughts too.

This is I believe is a classic process in ML, retry learning/training until performance is acceptable. Did it somehow cross your mind that there’s a possibility or even could be a reality that only a limited set of inputs can be represented/generalized at a certain state of the model? In other words, given an input set A with size N a model can only generalize (e.g. through linear combination of filters) k number of inputs, where k < N? Another analogy of this is mutual exclusivity of events in probability theory because some inputs can’t occur at the same time, it is almost impossible to represent these inputs for example as linear combinations, because the exclusivity may cause the coefficients to cancel. Please don’t get me wrong and sorry for almost getting out of topic. I’m very curious about why IMHO in ML people do not quite recognize that only a limited set of inputs can be represented/generailized at a certain state of the model and I think you are probably one of the best members here to provide some insights on this.

In the case of the SP, column synapses can be mutually exclusive hence they may cause some learned patterns to be forgotten due to new inputs that are much more preferrenced. This could also affect the said filters to prefer the latest inputs and potentially ignore the other ones (exclusive to the latest inputs).

That is exactly the model that I am proposing (and has been proposed by others). I think you may be under some misconception about how a generative linear model (GLM) works. Given a complex spatial or temporal pattern or signal that is at least in someway non-random, it is always possible to approximate the non-random part of that signal by decomposing it into a linear combination of less complex signals (unless or until that signal is simple enough to be represented exactly). This is in essence what techniques like linear regression, polynomial fitting, wavelets, and Fourier analysis are doing.

The actual representation of the signal depends on your specific choice of model. The choice of model is critical for generating efficient and accurate representations. However, there are a number of general purpose techniques out there that can do a fairly good job of approximating nearly any non-random signal. (The random part can also be modeled, typically by using some form of statistical analysis.)

Now, I’ve been talking about representing the instantaneous (or over some suitably small interval) state of some sensory input by generating a model of its salient features. This is what the spatial pooler is designed to do. If your concern is more about representing state transitions, then that’s another topic all together. That is, in fact, what the temporal memory is designed to do.

1 Like

Please note that the SP is dealing with online learning. A fitted curve may require a certain number of epochs to adapt to novel input signals. Worst comes to worst when the next set of input signals are disjoint/mutually exclusive to the preceding signals where the SP has currently “stabilized” with, therefore those linear combinations may become irrelevant and thrashing between them can occur.

Nope thanks.

Thanks again for your insights.

This is a good point, and to the extent that an untrained system may possess basis functions (filters) that do not yet fully span the space of expected signals, there will be a fair amount of error in the early encodings. However, the whole point of training the filters is that they learn from the error signal. The residual itself is used to adapt the filters using Sparse Hebbian learning.

A[j] ← A[j] + eta x[j] ( z - (A x) )

This reads, if the j th component is active for a given input pattern z, then adapt the j th filter using the residual to nudge it in the direction of the remaining error signal for that input pattern. The eta parameter given above controls the learning rate and thus can be used to mitigate potential thrashing between competing filter states.

1 Like

I agree about the filters. Honestly though, at least for me it is not easy task to reduce the SP algorithm to this equation which involves error propagation. The SP algorithm’s low-level rules don’t explicitly use residuals or errors, so maybe this equation kind of simulates part of the SP algorithm.

I’m glad the “competing filter states” is mentioned here as this is where previously most of my questions revolve around. IMO for the SP’s case, these filters (if can be explicitly reduced from the SP algorithm) are bound to their transience due to the fact that there are competing states and the input stream can be stochastic.

I agree that the SP algorithm as it is currently formulated, does not entirely fit into this model. I offer it up as a potential candidate for a more rigorous mathematical description of what the SP is ultimately trying to accomplish. If the SP can be reframed into a compatible numerical model, then there is some fairly mature analysis in the existing literature which can be brought to bear on this problem.

If implemented correctly, the filters should begin to acquire the properties of the statistically most relevant features observed in the input data. The transient features would then be modeled as permutations layered onto these base filters, perhaps by using other filters if the transient features themselves have a discernible structure.

1 Like