Mapping the hyperparameter space of classification using SP

Again, this is one of my stupid engineering ideas. In the past, HTM used swarming for hyperparameter optimization. But I can find no data on what the hyperparameter space actually looks like. Is it like a neural network's: non-continuous and full of local minima? Or is it smooth, so we could in fact use gradient ascent for hyperparameter optimization?

So I decided to map out a simple 2D parameter space (boost strength and # of bits) to see the surface for myself. The setup has two free variables: the boost strength and the number of bits in the SP's output. We measure classification accuracy at each point to build a 2D map of the hyperparameter space.
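A minimal sketch of that kind of sweep (the `evaluate` callback is a stand-in for the actual SP-plus-classifier pipeline, which isn't shown here):

```python
import itertools

def sweep(boost_strengths, bit_counts, evaluate):
    """Evaluate every (boost strength, # of bits) pair on a grid.

    `evaluate(boost, bits)` is a placeholder for training an SP with those
    parameters and returning the resulting classification accuracy.
    """
    return {(b, n): evaluate(b, n)
            for b, n in itertools.product(boost_strengths, bit_counts)}

# Example with a dummy evaluator standing in for the real pipeline:
grid = sweep([0.0, 0.1, 1.0], [64, 128, 256],
             lambda b, n: 0.4 + 0.001 * n - 0.01 * b)
best = max(grid, key=grid.get)   # grid point with the highest accuracy
```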

(The classifier scores 40% accuracy when operating directly on the images without an SP. So it is possible to get worse performance with bad parameters.)

The parameter space isn't exactly smooth, but (from experience) it looks like momentum-enabled gradient optimization could work. It also seems we get the best performance at boost strength ≈ 0.1, which happens to be 1/num_classes in this case. I'm not saying the relation is definite.

The same plot from another angle:

If we look at a slice of the map at boost factor = 0, the plot can be approximated quite well by a linear function. But I don't expect the trend to continue this way indefinitely.

We can find the general trend (rather than the details) by smoothing the plot a bit. Basically, the more bits, the better, but you should set the boost factor to 0.1.
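The smoothing here is just a centered moving average; a sketch:

```python
def smooth(values, window=5):
    """Centered moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```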

This is the source code for the plot. Run it in ROOT and the plot will show up.

And the original data.

Let me know what other analysis I can do with this data. I have no idea.

Other findings

For some reason or another, Zen 2 CPUs perform extremely well for HTM. My Zen 2 machine, though running at a lower RAM speed, is overall 40% faster than a Zen 1.


Thanks @marty1885, I’ve been curious about this for years!


I always use this algorithm for optimization:
https://www.cs.bham.ac.uk/~jer/papers/ctsgray.pdf
It uses a simple mutation operator to produce a child vector from a parent vector. Then see if the child vector is better than the parent vector.

The mutation for parameters constrained to lie between -1 and 1 is a random plus or minus 2·exp(-p·rnd()), where rnd() returns a uniform random number between 0 and 1 and p is the precision of the mutation. You add such a mutation to each element of the parent vector to get the child vector. You have some choices about what to do if a child element goes beyond the -1 to 1 range; I choose to return it to the corresponding parent value.
What’s the use of all that?
1/ It’s simple.
2/ It works well.
3/ The mutations are scale free. A mutation of 1 is equally as likely as one of 0.0000001. You are going to get a mutation at any scale needed relatively soon.
4/ A parameter can radically change its value in a single shot. -1 can be mutated to 1 in one step. This allows a much better chance of stepping out of a local minimum compared to crawling around the cost landscape with consistently tiny changes to parameters.
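If I understand the scheme correctly, a sketch of the mutation plus the greedy parent/child loop would look something like this (the `precision` value and the cost function are illustrative choices, not numbers from the paper):

```python
import math
import random

def mutate(parent, precision=25.0, lo=-1.0, hi=1.0):
    """One child vector via the scale-free mutation described above.

    Each element gets a random +/- 2*exp(-precision * rnd()) step; any
    element that leaves [lo, hi] reverts to the parent's value.
    """
    child = []
    for x in parent:
        step = random.choice((-1.0, 1.0)) * 2.0 * math.exp(-precision * random.random())
        y = x + step
        child.append(y if lo <= y <= hi else x)
    return child

def optimize(cost, dim, iterations=1000):
    """Keep the child only when it is no worse than the parent."""
    parent = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    best = cost(parent)
    for _ in range(iterations):
        child = mutate(parent)
        c = cost(child)
        if c <= best:
            parent, best = child, c
    return parent, best
```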

This looks like some sort of genetic/annealing based method. Interesting.

As a side note, here is the potential space for the boost “value” of each column with such a boost “factor” of 0.1
[plot: BoostingOneTenth]

probably worth skipping the call to an exponential function, by the look of it :stuck_out_tongue:


Not really. The boost factor is a scalar applied to the average activation frequency to generate the boosting. Given a cell's value x, the average activation frequency a, the target activation frequency t, and the boost factor b (how aggressive the boosting should be), the boosted value x' is x' = x * exp((t - a) * b). Assuming t = 0.15 and b = 2.5 gives the function f(x) = exp((0.15 - x) * 2.5), which we can plot as:
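In code, that rule is a one-liner (same symbols as above):

```python
import math

def boost(x, avg_freq, target_freq=0.15, strength=2.5):
    """x' = x * exp((t - a) * b).

    Columns active less often than the target get scaled up;
    columns active more often than the target get scaled down.
    """
    return x * math.exp((target_freq - avg_freq) * strength)
```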

If you use Subutai's 8-bit values you can do this with a lookup table for blinding speed. For that matter, if you need better resolution, a megabyte table is not unreasonable with today's hardware.
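A sketch of that lookup-table idea, quantizing (t - a) over [-1, 1] at 8-bit resolution (the table size and boost factor here are illustrative):

```python
import math

STRENGTH = 2.5      # boost factor b (assumed, matching the example above)
SIZE = 256          # 8-bit resolution
LO, HI = -1.0, 1.0  # full range of (t - a)

# Precompute exp((t - a) * b) for every quantized value of (t - a).
TABLE = [math.exp((LO + (HI - LO) * i / (SIZE - 1)) * STRENGTH)
         for i in range(SIZE)]

def boost_factor(delta):
    """Table lookup replacing the math.exp call; delta = t - a."""
    i = round((delta - LO) / (HI - LO) * (SIZE - 1))
    return TABLE[min(max(i, 0), SIZE - 1)]
```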

Yup.
But I was plotting for your proposed b = 0.1

“How aggressive the boosting should be” is for all intents and purposes a config value, only “variable” during research-and-tweak rounds to pin down an ideally behaving function.
Now… you seem to have proposed arguments for a sweet spot of 0.1 for it, so if we consider that we’re done with the “research and tweak rounds” in that regard, we can look at the resulting function as if b were constant.

Now we see that exp((t-a)*b), evaluated over (t-a) in [-1…1] with b fixed to 0.1, looks like my plot :slight_smile:
I personally have implemented the resulting boosting function as linear.
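For a boost factor this small the linear version is a very good fit: a quick check of exp(z) against its linear approximation 1 + z over the whole activation-frequency range (assuming t = 0.15, b = 0.1):

```python
import math

t, b = 0.15, 0.1
worst = 0.0
for a in [i / 100 for i in range(101)]:   # activation frequency in [0, 1]
    z = (t - a) * b                       # |z| <= 0.085 for these parameters
    worst = max(worst, abs(math.exp(z) - (1.0 + z)))
# the worst-case gap between exp and linear stays well under 1%
```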


Stupid me. I see. :stuck_out_tongue:

For MNIST, yes. (The experiment was MNIST classification). But I’m not sure about anomaly detection and other kinds of classifications.

I do not recommend using boosting with MNIST, because the input space is not statistically uniform.

But the evidence shows a tiny boost can increase the performance. :smile:


I see your point! Honestly, I would agree if you got high classification accuracy. My recommendation is to first achieve state-of-the-art performance, or at least something close (I believe in HTM you can reach 94–96% on MNIST), then see whether boosting can improve the performance further.

The 98% accuracy is achieved using an SDRClassifier, which is a softmax regressor (i.e. a single-layer neural network with softmax). So… kinda yes and no. The plot I showed uses the older CLAClassifier, which is more biologically consistent.

@marty1885 currently we reach 96%; 98% is an excellent result.

@marty1885, have you got the 98%? Any reference?

Thanks,

With the biologically consistent classifier? Wow, I must be doing something wrong.

Toying with different boost functions and strengths atm, trying to investigate what SP does.

What I’m getting out of it is that the activation of similar columns for similar input from one epoch to the next can get really scrambled with too aggressive a boost function. Thus we’d expect the subsequent TM to have a much harder time remembering anything.

SP method + boosting method & level + learning strength + input signal distribution + wait for how long (I’ve seen stuff strangely stabilizing over… really long periods at times) + which to prefer between output stability and “every cell shall carry equal information” = some quite high “hyper” parameter space to explore indeed… too high for me at least :frowning:
