The art of SDRs

sdrs

#1

The discussion over here got me started on one of my HTM rants. It’s possible that I’ve aired variations of this on the mailing list before: It’s both one of my biggest problems with HTM and yet not exactly a problem with HTM.

Input to real neocortex is always heavily preprocessed. We’ve been talking about color but the story’s similar for all sensory input: there’s circuitry between the sensory periphery and cortex that filters and shapes the activity. This generally looks like a statistical cleanup of the sensory input, filtering out lots of information that’s unlikely to be behaviorally relevant and narrowing in on Barlow’s “suspicious coincidences.”

The resulting input to cortex isn’t just sparse and distributed, but whitened (sort of) and heavily filtered to account for the natural statistics of the stimulus space. Is it reasonable to assume that a model of cortical processing could perform well with input where the only statistical constraint is sparseness? (I suspect not.)

It might be that HTM can do pretty well on some problems with naive SDR input, but I can’t shake the suspicion that clever preprocessing would at least make the system more performant and maybe allow it to solve new classes of problem.

I don’t exactly have a question here; it’s really just a rant. I don’t have a specific plan of action, either. I would like to hear if anyone’s thought hard about the statistical aspects of SDRs, either in general or for a specific domain.


#2

I assume you are not talking about encoders, but something more generally applicable to all sensory input? We’ve discussed at length in nupic.audio the fact that the cochlea is extremely complex. There is so much going on in your ear to produce the impulses going into the cortex that it makes the job of creating a real “sound encoder” super difficult. All the encoders we use today are extremely simple compared to the preprocessing of neural inputs happening in our senses before they get to the cortex.


#3

I don’t think there’s going to be a general solution: the point, really, is that different pattern spaces have different statistical structure. For some domains, we have some ideas about what might be useful transformations (vision: RBFs, Gabor wavelets).

Even worse, I think that this preprocessing is mostly evolved rather than learned. Yes, all the wiring is done by learning rules maybe guided by some intrinsic activity (retinal waves or whatever) but the rules were put in place by an evolutionary process. I don’t expect that we could vat-grow a retina and have it learn a representation appropriate to market prices or EEG waveforms or whatever our non-visual domain of interest is.


#4

…and yet yes, I guess that’s what I’m looking for. Maybe just a list of heuristics, a common wisdom for building SDRs, that makes some account for statistical structure, knowing that for any particular domain there’s probably a better hand- (or natural-selection-) tuned encoder.


#5

Hi Kevin,

I agree with you. Sparsity by itself is not sufficient. It turns out that HTMs actually work best when the input is whitened/de-correlated - this can be shown through the SDR math [1,2]. There’s a bunch of evidence that the cortex is always decorrelating inputs. There are some other constraints that come out of the theory. For example, we want high dimensionality and a sufficiently high number of ON bits. So 2 bits ON out of 100 is nowhere near as good as 40 bits out of 2000, even though they both have 2% sparsity. In addition we want individual bits in the SDR to have some sort of semantics (e.g. a bit represents an edge at a particular orientation and location). This is similar to what you mentioned regarding filtering. All these together provide some additional constraints over and above sparsity.
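
To put rough numbers on the 2-of-100 versus 40-of-2000 comparison, here is a minimal sketch. It assumes a stored pattern and a random SDR with the same number of ON bits, drawn uniformly at random (i.e. uncorrelated), and a match threshold of half the ON bits. The function name and the threshold are illustrative choices, not taken from the papers; the count itself is a standard hypergeometric tail.

```python
from math import comb

def false_match_probability(n, a, s, theta):
    """P(overlap >= theta) between a fixed set of s bits and a random
    SDR with a ON bits out of n (uniformly random, i.e. uncorrelated)."""
    matches = sum(comb(s, b) * comb(n - s, a - b)
                  for b in range(theta, min(s, a) + 1))
    return matches / comb(n, a)

# Both cases are 2% sparse; theta = half the ON bits is an assumed threshold.
small = false_match_probability(n=100, a=2, s=2, theta=1)
large = false_match_probability(n=2000, a=40, s=40, theta=20)
print(f"2 of 100:   {small:.3g}")
print(f"40 of 2000: {large:.3g}")
```

Under these assumptions the small case gives roughly a 4% chance of a false match, while the large one is smaller by many orders of magnitude, even though the sparsity is identical.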

As you mentioned, our brain has evolved some very specialized encoders for different sensory modalities such as the cochlea and the retina. The better the encoder, the better the performance of the HTM system [3]. Inside an HTM, the spatial pooling function is one component that is supposed to help maintain good SDRs as bits are transmitted from layer to layer.

This is an evolving area as we improve our understanding of how SDRs are used to perform various functions within HTMs. Your points are all well taken.

–Subutai

[1] http://arxiv.org/abs/1601.00720
[2] http://arxiv.org/abs/1503.07469
[3] http://arxiv.org/abs/1602.05925


#6

Subutai,

Thanks so much for your detailed and patient reply. I should have checked to see what you’ve been writing lately before letting the rant fly. Very much looking forward to reading the three papers cited.


#7

No problem - SDRs are definitely subtle and tricky to understand fully. I don’t claim to understand every aspect - there is definitely an art to them!

Also, Matt’s videos on SDRs are really good if you haven’t watched them - he covers the material in the paper in a more fun way than I ever could :)

–Subutai


#8

This topic was raised on the ML before (was it you, karchie?)

From what I can see, the human body has general inputs (sight, sound, touch, smell, taste, sense of balance (orientation)) and additionally these inputs are reinforced in a loop to further refine their significance and ultimate information.

Since we don’t have hierarchy in HTMs yet, and therefore no reinforcement loop (yet), I can see where specific encoders are needed in order to handle the initial amount of preprocessing.

But could it be that once we have the ability to assemble hierarchies and process sensorimotor information in a refined reinforcement loop, that we could make similar very generalized encoders - and then let the HTM Networks use hierarchy to extract and insert more useful meaning as information ascends the hierarchy? Thus obviating the need for the development of specific encoders. Kind of like the advantage anticipated by embodiment of intelligence?

Eventually (if the above assumptions make sense), we could just re-use the lower portions of the HTM Network - share those across many different newly developed networks because those “lower levels” could function as sensor-preprocessors?

–David


#9

@cogmission probably me, yes. And the lack of H in HTM is my other big complaint–though as a practical matter, I think it’s entirely sensible to focus for now on what problems can be solved by the existing architecture. (My guess is that multilayers with feedback is going to be a Big Damn Hard Problem–but as I so often say around HTM, I very much hope to be proven wrong.)

As for whether additional layers and feedback can replace specialized preprocessing, well, maybe? I might be more certain if I had a stronger mental model for what sort of a device this is, for what kind of problems it is and isn’t good at. Looking at primates, I note that we still do an awful lot of our early sensory processing subcortically; but then it seems easier for evolution to add new stuff than to get rid of old, so maybe that’s historical accident rather than a guiding principle.


#10

Some of my playing around at Haskins Laboratories in New Haven was aimed at
creating pre-processing for speech recognition and speech understanding.
For starters, don’t try to copy peripheral sensory systems. And don’t think
you have to provide one input stream: for example, you could send up
something like a sound spectrograph AND the waveform between, say, 40 and 300
hertz. Extract the fundamental pitch and send that up separately (if
that’s not cheating too much). For now, I think we’re just trying to give
the auditory cortex something interesting to play with, which could be
temporally locked to visual inputs so that we can start developing perceptual
fusion functions, such as lip reading.

Google the McGurk effect. I’m quoting Wikipedia now: "The McGurk effect is a
perceptual phenomenon that demonstrates an interaction between hearing and
vision in speech perception. The illusion occurs when the auditory
component of one sound is paired with a visual component of another sound,
leading to the perception of a third sound."

I’m assuming that we’re not trying to do auditory localization yet.

Can we talk sometime? I know you’re busy. I’m working on “Incredibly
Flexible.”

I’d also love to talk to Jeff, but I’ve just noticed that he’s listed as
the principal author on the introductory chapter of the Biological and
Machine Intelligence document. Your system came up with it as a “Your topic
is similar to” hit from my new topic suggestion, "Roots of HTM theory in
scientific psychology (in process)". I’ll ask again if I feel I need to talk
to him directly anytime soon.

I’m planning to come and see you guys in a couple of months (2 or 3, I
don’t know. We will see how it goes.)

Yay!


#11

Terry, this post really doesn’t relate much to the subject, which is sensory processing into streaming SDRs. Please keep the conversation relevant to the topic. Also, Jeff gets a lot of requests like this, too many for him to address each one. He’s really busy, and I’d like to keep him focused on what he’s working on. Hope you understand.


#12

Thanks for the papers, I didn’t know them.

On page 8, Eq. 6 and 7: if I understand correctly, you should add s_i=s_j for all i,j before the equations.

In the paper, random de-correlated neural activity is assumed. I think you need to understand the semantics of the data to create an encoder that produces such SDRs. So in practice (without perfect encoders) the probability of a false match should be higher. Or did I misunderstand something?

And yes, Matt’s videos are really good!


#13

8 posts were split to a new topic: Universal Encoder


#21

Yes, it does implicitly loop over all the D_i’s in those two equations (Eq. 5). Why do we need a j? It’s comparing a random new vector A_t against all the D_i’s in S.

Yes, that is exactly correct. If the SDRs are correlated the chance of error is higher. It could be significantly worse if there is a lot of correlation. That is why it is better to be de-correlated. There is a fair amount of evidence that neural activity tends to get de-correlated in a number of ways. Good encoders are one way, but within the cortex there is pressure towards de-correlation.

BTW, that paper will be updated in arXiv by Monday with a new version that responds to some reviewer feedback. It contains some new results/simulations, and a better discussion on the correlation issue.


#22

Hmm let’s try with an example: Let M=2, |D_1|=s_1, |D_2|=s_2

P(A ∈ S) = 1-(1-P(overlap(D_1,A) ≥ θ)) (1-P(overlap(D_2,A) ≥ θ))

The overlap function depends on s_i, so to get

P(A ∈ S) = 1-(1-P(overlap(D_i,A) ≥ θ))^2

s_1 and s_2 should be the same.
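
A small sketch of the M=2 example, assuming uncorrelated random vectors so that each P(overlap(D_i,A) ≥ θ) is a hypergeometric tail. The helper names and the values of n, a and θ are made up for illustration. It computes the product form for arbitrary s_i and checks that it coincides with the power form when the s_i are equal.

```python
from math import comb, prod, isclose

def p_match(n, a, s, theta):
    """P(overlap(D, A) >= theta) for |D| = s and a random A with a ON bits out of n."""
    hits = sum(comb(s, b) * comb(n - s, a - b)
               for b in range(theta, min(s, a) + 1))
    return hits / comb(n, a)

def p_in_set(n, a, sizes, theta):
    """P(A in S) = 1 - prod_i (1 - P(overlap(D_i, A) >= theta)), one factor per s_i."""
    return 1 - prod(1 - p_match(n, a, s, theta) for s in sizes)

n, a, theta = 200, 10, 3          # illustrative values only

equal = p_in_set(n, a, [10, 10], theta)                 # s_1 = s_2
power_form = 1 - (1 - p_match(n, a, 10, theta)) ** 2    # the ^2 shortcut
unequal = p_in_set(n, a, [10, 20], theta)               # s_1 != s_2

print(isclose(equal, power_form))   # shortcut agrees when sizes are equal
print(unequal > equal)              # a larger s_2 changes the result
```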

I was thinking it’s better for learning and generalization to have correlated SDRs. Looking forward to the updated paper with the correlation discussion.


#23

In this setup (see previous paragraphs) all the segments have the same number of synapses. As such, P(overlap(D_1,A)) is the same as P(overlap(D_2,A)). |D_1| = |D_2| = s, no indices required.

Yes, this is true. The same input pattern under some distortions should lead to similar SDRs (overlapping number of bits). The degree to which this matches the stored prototype on the segment is controlled by theta. The system works best if different inputs then lead to SDRs that are uncorrelated with other different inputs. If you look at the space of all possible inputs, the SDRs should be as uncorrelated as possible. There is some discussion on this in the paper, but it is not as clear as it could be.

In general I don’t explicitly quantify generalization in this paper, and there is no specific learning mechanism discussed for creating SDRs. This paper is somewhat abstract and discusses overall capacity, but at some point generalization should also be considered for a full theory. I think something like the PAC learning framework can be applied, but I haven’t thought through how to best proceed. Good research topic if someone wants to work on it!! Perhaps you? :)

BTW, looks like arXiv has the updated paper online now.


#24

Ok, I was a little confused because of


I was thinking it’s not clear that all s_i are the same. Anyway, it’s really not so important :) It should just help others understand the formulas.

I was thinking that in SDRs every bit represents a pattern. A collection of patterns in a segment is a new pattern. The difference between SDRs should continuously depend on the correlation of the patterns they represent.

If we take numbers as an example:
Correlation:
2= 011100000
4= 000111000
No correlation:
2= 010010100
4= 100101000
And now we learn 3 as something between 2 and 4. In the correlation case I could use the shared on-bit of 2 and 4 as a starting point for the SDR of 3 (for example 000100011).
If a segment should now represent the numbers 1-5 and it has already learned its connections to 2 and 4, then 3 will automatically be part of the segment with one bit.
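
The example above can be checked with a few lines of Python. This is only a sketch: the helper names are made up, and the "segment" is modelled simply as the union of the ON bits of 2 and 4, which is an assumption about what "learned its connections" means here.

```python
def on_bits(sdr):
    """Indices of the ON bits in a bit string such as '011100000'."""
    return {i for i, bit in enumerate(sdr) if bit == "1"}

def overlap(x, y):
    """Number of ON bits shared by two bit strings."""
    return len(on_bits(x) & on_bits(y))

correlated = {2: "011100000", 4: "000111000"}
uncorrelated = {2: "010010100", 4: "100101000"}
three = "000100011"

shared_correlated = overlap(correlated[2], correlated[4])
shared_uncorrelated = overlap(uncorrelated[2], uncorrelated[4])

# Assumed model of the segment: connected to every ON bit of 2 and of 4.
segment = on_bits(correlated[2]) | on_bits(correlated[4])
three_on_segment = len(segment & on_bits(three))

print(shared_correlated, shared_uncorrelated, three_on_segment)
```

In the correlated case 2 and 4 share one bit and the proposed SDR for 3 lands on the segment with that one bit; in the uncorrelated case 2 and 4 share nothing.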

Would be interesting, maybe in some weeks. But before I still need to get better understanding of the concepts behind SDRs and HTM.


#25

Yes, this is very similar to how I see it as well. Instead of correlation, we tend to use the term “overlap”. (The word correlation has a related but different statistical meaning and the bits don’t actually have to be correlated.) What’s important in the HTM operations is that the two SDRs have a shared set of bits.

Another interesting thing is that each bit represents a pattern, as you said, but the meanings are also shared across bits and no bit should be critical. (This is the essence of distributed representations.) To use your example, an SDR in HTMs would in reality have a higher number of bits, such as:

2= 00000111111111100000000000
4= 00000000111111111100000000

Now, if you flip any single bit (or even a few bits), you still get significant overlap between 2 and 4 and you would still know they are related. This property helps with many things, including noise robustness, fault tolerance, subsampling, etc.
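
A quick way to see the robustness claim is to flip every bit of one SDR in turn and look at the worst overlap that results. The sketch below uses the two SDRs above; the helper names are hypothetical.

```python
def overlap(x, y):
    """Number of positions where both bit strings are ON."""
    return sum(p == q == "1" for p, q in zip(x, y))

def flip(sdr, i):
    """Return a copy of the bit string with position i inverted."""
    flipped = "0" if sdr[i] == "1" else "1"
    return sdr[:i] + flipped + sdr[i + 1:]

two = "00000111111111100000000000"
four = "00000000111111111100000000"

base = overlap(two, four)
# Worst case over all single-bit corruptions of the SDR for 2.
worst = min(overlap(flip(two, i), four) for i in range(len(two)))
print("overlap:", base, "worst after one flipped bit:", worst)
```

Flipping one bit can change the overlap by at most one, so the two SDRs stay clearly related, whereas in the 9-bit toy example a single flipped bit can wipe out the only shared bit.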


#26

Ok great, then everything is fine:)
(I just made the examples so small to better illustrate my question)

I was not using the term “correlation” carefully. I’m still not sure what exactly the difference is between “correlation” and “overlap” of SDRs, but I’ll leave this question for another day.

Thanks @subutai for your answers!

