No influence of learning based on the permanence of proximal connections

spatial-pooling

#1

Is it possible that SP learning based on the permanence of proximal connections doesn’t improve results for some kinds of data?
Perhaps there is a way to fix it, for example by changing the sparsity of the input data or using a bigger SP?


#2

It’s possible. Could you say more? Do you have this problem with a current dataset you’re working on?


#3

My input is 120 elements long with a sparsity (or it would be better to say density :slight_smile:) of more than 50%. I tried different values of the permanence threshold, and anywhere in the range 10-90% it doesn’t make my results better (they are the same or a bit worse).


#4

Are the encodings consistently the same sparsity?


#5

No, the sparsity depends on the exact pattern. But it’s never less than about 35%.


#6

That is a problem. You want the encoder to produce a consistent sparsity. CC @scott.


#7

A little variation isn’t a big deal but why such dense representations?


#8

I believed it was the role of the SP to normalize sparsity :-/
Why is it important? Only to make this learning mechanism work?


#9

Well, because that’s an encoder which works for me. I tried several approaches, and this one showed the best results. I could perhaps renormalize this input to make it artificially sparse, but I think that would lose some of its semantics. And this range of sparsity actually reflects semantics too.
So, I’m not sure - should I try to make it work, or ignore it and optimize other parts? :-/


#10

I noticed something similar in cortical.io word fingerprints – words that appear in more contexts have more dense representations than words that appear in fewer contexts. In the case of their word fingerprints, though, it isn’t actually the density which is reflecting semantics. It is just an easier way to encode semantics.

As a simple example, suppose I want to encode an input “Diagonal Up/Left” to be semantically similar to two other inputs “Up” and “Left”. An easy way to encode this is to just make it a union of the other two inputs. This is a trivial case, but basically the same thing that is happening with the word fingerprints. Words that appear in many contexts end up being encoded more densely as a result. I don’t know if this is the case with your data set, but thought I would point it out.

If you think about the properties of SDRs, remember that the same semantics can be encoded after a loss of bits. What matters when encoding semantics is the ratio of bits that represent the semantic meanings being encoded. So rather than encoding “Diagonal Up/Left” as a union of “Up” and “Left”, instead it could be encoded with a random 50% of “Up” bits and a random 50% of “Left” bits, thus maintaining a fixed sparsity while still encoding the same semantics.
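A minimal sketch of that subsampling idea, treating each SDR as a set of active bit indices. The sizes (400 bits, 20 active) and the helper name `subsample_union` are made up for illustration and are not from any HTM library:

```python
import random

def subsample_union(sdr_a, sdr_b, rng):
    """Combine two SDRs without the density growth of a plain union.

    Takes a random half of each input's active bits, so the result has
    at most about as many active bits as one input, not the sum of both.
    """
    half_a = rng.sample(sorted(sdr_a), len(sdr_a) // 2)
    half_b = rng.sample(sorted(sdr_b), len(sdr_b) // 2)
    return set(half_a) | set(half_b)

rng = random.Random(42)
n_bits, n_active = 400, 20  # illustrative sizes only

up = set(rng.sample(range(n_bits), n_active))
left = set(rng.sample(range(n_bits), n_active))

union = up | left                        # denser than either input
mixed = subsample_union(up, left, rng)   # stays at or below n_active bits

print("union bits:", len(union), "mixed bits:", len(mixed))
print("shared with Up:", len(mixed & up), "shared with Left:", len(mixed & left))
```

The mixed encoding still shares at least half of its bits with each parent, so it overlaps with both “Up” and “Left” while keeping the number of active bits roughly fixed.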

I’m not an expert, but I’m pretty sure dense representations will have a negative impact on the SP process. The denser the representations, the more input bits a column becomes specialized on. This could lead to a higher frequency of false positives. In the case of my above example, a “Diagonal Up/Left” representation containing a union of “Up” and “Left” might end up being represented by nearly identical columns to “Up” if that input happened to be better trained than “Left”. I don’t have any mathematics to back this claim up though, so take it with a grain of salt :slight_smile:


#11

Thank you for your thoughts @Paul_Lamb, they make sense.
BTW, taking your example: with a visual representation of “Diagonal Up/Left” you get that result automatically, since the diagonal will contain a small number of both up and left dots.
But speaking more generally - don’t you think that high density for a word with a lot of associations is part of its representation, showing its more complicated meaning structure?
Isn’t it better to find a way around this learning issue than to lose some semantics as the alternative?


#12

Think of one note played very loudly. This sound wave contains very little information, but because the volume is high it perturbs many more neurons in the cochlea.

A chord played softly, but containing multiple harmonious notes, contains more information, however it will not agitate as many of the cochlea’s input neurons.

The cochlea plays a role here normalizing this data before it even gets to the SP, even in representing the harmonies and dissonance in the signal to the cortex.


#13

I’ve had some success by training autoencoders to represent arbitrary data (of arbitrary density) as sparse vectors, and then feeding that in as input to HTM, which may be an option for you. I described the approach here:


#14

What you are saying is that one note shouldn’t be louder than the others - that makes sense.
Nevertheless, even in your chord example - let’s say you need one octave only. To encode the chord, you will have three active positions out of twelve - that’s 25% sparsity. Would you somehow decrease this sparsity to a lower number? And how can you do it without losing some semantics?


#15

That’s interesting, but to do it you are using a traditional fully connected NN with sigmoid calculations, which is computationally intensive, and it’s unlikely that there is a similar process in the neocortex. Thus, I believe there should be a simpler solution to this problem.
Plus, I’m still not sure that high density, and a broad range of it across different patterns, is really a problem.


#16

The sigmoid and the output layer are only necessary during training. At test time you can compute only the linear activations in the hidden layer and take (for example) the top 10% as your sparse vector.
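Roughly, the test-time step looks like the sketch below. The random `weights` and zero `bias` are placeholders standing in for a trained hidden layer, the hidden size of 500 is an arbitrary choice, and `encode_sparse` is just an illustrative name, not my actual code. Since the sigmoid is monotonic, ranking the linear activations picks the same winners.

```python
import random

def encode_sparse(x, weights, bias, active_fraction=0.10):
    """Binary code with the top `active_fraction` of hidden units switched on."""
    # Linear pre-activations only; no sigmoid and no output layer.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    k = max(1, round(active_fraction * len(pre)))
    # Indices of the k largest pre-activations.
    winners = sorted(range(len(pre)), key=pre.__getitem__, reverse=True)[:k]
    code = [0] * len(pre)
    for i in winners:
        code[i] = 1
    return code

rng = random.Random(0)
n_inputs, n_hidden = 120, 500  # 120 as in the input above; 500 is assumed
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
           for _ in range(n_hidden)]  # placeholder for trained weights
bias = [0.0] * n_hidden

# Two inputs of different density (about 50% and about 80% active).
dense_input = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n_inputs)]
denser_input = [1.0 if rng.random() < 0.8 else 0.0 for _ in range(n_inputs)]

print(sum(encode_sparse(dense_input, weights, bias)))
print(sum(encode_sparse(denser_input, weights, bias)))
```

Whatever the input density, the output always has the same number of active units (50 of 500 here), which is the consistent sparsity the SP wants.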

As for computation time, I think you’ll find that modern machine learning architecture is much more highly optimized than pretty much any other code outside of games and operating systems. Computation time will not be a problem for you. Plus it’s a tiny hidden layer; my GPU runs the encoder at more than ten thousand encoding steps per second.

Regarding biological plausibility, sure. If you need to justify it to yourself for aesthetic reasons, you can view this training process as something similar to the evolutionary process that determined our sensory encoding, or as one of the many early-development self-organization procedures that wire up our brains before and after birth.

It’s important to recognize that any learning system (biology included) is only as good as the encoding of its input, so it’s always helpful to meet the network halfway.


#17

It’s a very interesting analogy, I’ll think more about it.
Nevertheless, I doubt we should split training and testing in this way. The agent ideally should be able to continue training during “testing”. That’s unusual for a traditional NN, but it’s how HTM works in general, and how our neocortex works too.
You are right about the importance of encoders, but let’s not forget that the neocortex has enough plasticity to, for instance, support vision from chemical receptors of the tongue or receptors in the skin. And it needs only days, not generations, to do it.
I don’t care what basis is used to create efficient AI; I just see that everything else is (for now) very far in principle from how the neocortex is organized.


#18

Since training is cheap, you can continue to train in parallel on a rolling buffer of the most recent N timesteps; that’s how I do it. In any case, it may help transform your data into something the spatial pooler can handle more easily.
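A rough sketch of the rolling-buffer idea follows. `RollingTrainer` and `train_fn` are hypothetical names, and for simplicity the retraining here runs synchronously every few steps; a real setup would hand it off to a background thread or process.

```python
from collections import deque

class RollingTrainer:
    """Keeps the most recent `capacity` samples and retrains on them periodically."""

    def __init__(self, train_fn, capacity=1000, retrain_every=100):
        self.buffer = deque(maxlen=capacity)  # oldest samples drop off automatically
        self.train_fn = train_fn              # hypothetical: trains the encoder on a list of samples
        self.retrain_every = retrain_every
        self.steps = 0

    def observe(self, sample):
        """Record one timestep and retrain when the interval is reached."""
        self.buffer.append(sample)
        self.steps += 1
        if self.steps % self.retrain_every == 0:
            self.train_fn(list(self.buffer))

# Toy usage: the "training" function only records how many samples it was given.
calls = []
trainer = RollingTrainer(train_fn=lambda batch: calls.append(len(batch)),
                         capacity=5, retrain_every=3)
for t in range(10):
    trainer.observe(t)

print(calls)                 # batch sizes seen at each retraining step
print(list(trainer.buffer))  # only the most recent samples remain
```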


#19

I’m still trying to understand three things:

  1. Why should the sparsity of the encoder’s output be low?

  2. Why shouldn’t the range of sparsity be broad?

  3. Why can’t I use one more SP to transform my encoder output to an SDR with fixed sparsity?


#20

I think part of the answer is that, since sparsity is low in biology and it works well, there is probably a reason behind it.

The input of SP is either sensory organs or other cortical regions, both of which have sparse activations. In vision you only perceive the edges, in hearing you only perceive the frequency peaks.
For example, if you are presented with pseudorandom noise (visual or auditory), you will have a hard time recognizing any pattern, because the sensory activations will become dense and you will feel overwhelmed (while a computer would excel at this by doing convolution with the known pseudorandom kernel).

Now if you ask why it works in biology, then I don’t know, but my guess is that the SP is generally discarding the input information that exceeds the threshold of output information sparsity. Of course, this process of information selection is useful, but if the input is not sparse enough, the SP will have a hard time figuring out which of the bits are relevant and which are not. By imposing an encoder sparsity restriction, you make sure early on in the process that you are encoding the input in an efficient way.