Mapping the hyperparameter space of classification using SP

Again, this is one of my stupid engineering ideas. In the past, HTM used swarming for hyperparameter optimization. But I can find no data on what the hyperparameter space actually looks like. Is it like a neural network's: non-continuous and full of local minima? Or is it smooth, so we could in fact use gradient ascent for hyperparameter optimization?

So I decided to map out a simple 2D parameter space (boost strength and # of bits) to see the surface for myself. The setup has two free variables: the boost strength and the number of bits in the SP's output. We measure classification accuracy at each point to build a 2D map of the hyperparameter space.
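A minimal sketch of that kind of sweep (the `evaluate` callback is a stand-in for the actual SP-plus-classifier pipeline, which isn't shown here):

```python
import itertools

def sweep(boost_strengths, bit_counts, evaluate):
    """Evaluate every (boost strength, # of bits) pair on a grid.

    `evaluate(boost, bits)` is a placeholder for training an SP with those
    parameters and returning the resulting classification accuracy.
    """
    return {(b, n): evaluate(b, n)
            for b, n in itertools.product(boost_strengths, bit_counts)}

# Example with a dummy evaluator standing in for the real pipeline:
grid = sweep([0.0, 0.1, 1.0], [64, 128, 256],
             lambda b, n: 0.4 + 0.001 * n - 0.01 * b)
best = max(grid, key=grid.get)   # grid point with the highest accuracy
```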

(The classifier scores 40% accuracy when operating directly on the images without an SP. So it is possible to get worse performance with bad parameters.)

The parameter space isn't exactly smooth, but (from experience) it looks like momentum-enabled gradient optimization could work. It also seems we get the best performance at boost strength ≈ 0.1, which happens to be 1/num_classes in this case. I'm not saying the relation is definite.

The same plot from another angle:

If we look at a slice of the map at boost factor = 0, the plot can be approximated quite well by a linear function. But I don't expect the trend to continue this way indefinitely.

We can find the general trend (rather than the details) by smoothing the plot a bit. Basically, the more bits, the better, but you should set the boost factor to 0.1.
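The smoothing here is just a centered moving average; a sketch:

```python
def smooth(values, window=5):
    """Centered moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```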

This is the source code for the plot. Run it in ROOT and the plot will show up.

And the original data.

Let me know what other analysis I can do with this data. I have no idea.

Other findings

For some reason or another, Zen 2 CPUs perform extremely well for HTM. My Zen 2 machine, though running at a lower RAM speed, is overall 40% faster than a Zen 1.


Thanks @marty1885, I’ve been curious about this for years!


I always use this algorithm for optimization:
https://www.cs.bham.ac.uk/~jer/papers/ctsgray.pdf
It uses a simple mutation operator to produce a child vector from a parent vector. Then see if the child vector is better than the parent vector.

The mutation for parameters constrained to lie between -1 and 1 is a random plus or minus 2·exp(-p·rnd()), where rnd() returns a uniform random number between 0 and 1 and p is the precision of the mutation. You add such a mutation to each element of the parent vector to get the child vector. You have some choices about what to do if a child element goes beyond the -1 to 1 range; I choose to return it to the corresponding parent value.
What’s the use of all that?
1/ It’s simple.
2/ It works well.
3/ The mutations are scale free. A mutation of 1 is equally as likely as one of 0.0000001. You are going to get a mutation at any scale needed relatively soon.
4/ A parameter can radically change its value in a single shot. -1 can be mutated to 1 in one step. This allows a much better chance of stepping out of a local minimum compared to crawling around the cost landscape with consistently tiny changes to parameters.
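If I understand the scheme correctly, a sketch of the mutation plus the greedy parent/child loop would look something like this (the `precision` value and the cost function are illustrative choices, not numbers from the paper):

```python
import math
import random

def mutate(parent, precision=25.0, lo=-1.0, hi=1.0):
    """One child vector via the scale-free mutation described above.

    Each element gets a random +/- 2*exp(-precision * rnd()) step; any
    element that leaves [lo, hi] reverts to the parent's value.
    """
    child = []
    for x in parent:
        step = random.choice((-1.0, 1.0)) * 2.0 * math.exp(-precision * random.random())
        y = x + step
        child.append(y if lo <= y <= hi else x)
    return child

def optimize(cost, dim, iterations=1000):
    """Keep the child only when it is no worse than the parent."""
    parent = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    best = cost(parent)
    for _ in range(iterations):
        child = mutate(parent)
        c = cost(child)
        if c <= best:
            parent, best = child, c
    return parent, best
```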

This looks like some sort of genetic/annealing based method. Interesting.

As a side note, here is the potential space for the boost “value” of each column with such a boost “factor” of 0.1
[plot: BoostingOneTenth]

probably worth skipping the call to an exponential function, by the look of it :stuck_out_tongue:


Not really. The boost factor is a scalar applied to the average activation frequency to generate the boosting. Given a cell's value x, the average activation frequency a, the target activation frequency t, and the boost factor b (how aggressive the boosting should be), the boosted value x' is x' = x * exp((t - a) * b). Assuming t = 0.15 and b = 2.5 gives the function f(x) = exp((0.15 - x) * 2.5), which we can plot as:
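In code, that rule is a one-liner (same symbols as above):

```python
import math

def boost(x, avg_freq, target_freq=0.15, strength=2.5):
    """x' = x * exp((t - a) * b).

    Columns active less often than the target get scaled up;
    columns active more often than the target get scaled down.
    """
    return x * math.exp((target_freq - avg_freq) * strength)
```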

If you use Subutai's 8-bit values you can do this with a lookup table for blinding speed. For that matter, if you need better resolution, a megabyte table is not unreasonable with today's hardware.
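A sketch of that lookup-table idea, quantizing (t - a) over [-1, 1] at 8-bit resolution (the table size and boost factor here are illustrative):

```python
import math

STRENGTH = 2.5      # boost factor b (assumed, matching the example above)
SIZE = 256          # 8-bit resolution
LO, HI = -1.0, 1.0  # full range of (t - a)

# Precompute exp((t - a) * b) for every quantized value of (t - a).
TABLE = [math.exp((LO + (HI - LO) * i / (SIZE - 1)) * STRENGTH)
         for i in range(SIZE)]

def boost_factor(delta):
    """Table lookup replacing the math.exp call; delta = t - a."""
    i = round((delta - LO) / (HI - LO) * (SIZE - 1))
    return TABLE[min(max(i, 0), SIZE - 1)]
```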

Yup.
But I was plotting for your proposed b = 0.1

“How aggressive the boosting should be” is for all intents and purposes a config value, only “variable” during research-and-tweak rounds to pin down an ideally behaving function.
Now… you seem to have proposed arguments for a sweet spot of 0.1 for it, so if we consider that we’re done with the “research and tweak rounds” in that regard, we can look at the resulting function as if b were constant.

Now we see that exp((t-a)*b), evaluated over (t-a) in [-1…1] with b fixed to 0.1, looks like my plot :slight_smile:
I personally have implemented the resulting boosting function as linear.
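For a boost factor this small the linear version is a very good fit: a quick check of exp(z) against its linear approximation 1 + z over the whole activation-frequency range (assuming t = 0.15, b = 0.1):

```python
import math

t, b = 0.15, 0.1
worst = 0.0
for a in [i / 100 for i in range(101)]:   # activation frequency in [0, 1]
    z = (t - a) * b                       # |z| <= 0.085 for these parameters
    worst = max(worst, abs(math.exp(z) - (1.0 + z)))
# the worst-case gap between exp and linear stays well under 1%
```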


Stupid me. I see. :stuck_out_tongue:

For MNIST, yes. (The experiment was MNIST classification). But I’m not sure about anomaly detection and other kinds of classifications.

I do not recommend using boosting with MNIST, because the input space is not statistically uniform.

But the evidence shows a tiny boost can increase the performance. :smile:


I see your point! Honestly, I would agree if you got high classification accuracy. My recommendation is to first achieve state-of-the-art performance, or at least something close (I believe in HTM you can reach 94–96% on MNIST), then see whether boosting can improve the performance further.

The 98% accuracy is achieved using an SDRClassifier, which is a softmax regressor (i.e. a single-layer neural network with softmax). So… kinda yes and no. The plot I showed uses the older CLAClassifier, which is more biologically consistent.

@marty1885 currently we reach 96%; 98% is an excellent result.

@marty1885, have you got the 98%? Any reference?

Thanks,

With the biologically consistent classifier? Wow, I must be doing something wrong.

Toying with different boost functions and strengths atm, trying to investigate what SP does.

What I’m getting out of it is that the activation of similar columns for similar input from one epoch to the next can get really scrambled with too aggressive a boost function. Thus we’d expect the subsequent TM to have a much harder time remembering anything.

SP method + boosting method & level + learning strength + input signal distribution + wait for how long (I’ve seen stuff strangely stabilizing over… really long periods at times) + which to prefer between output stability and “every cell shall carry equal information” = some quite high “hyper” parameter space to explore indeed… too high for me at least :frowning:
