Timeseries: attention to differences and importance of noise

There’s a problem in learning sequences with sections of unchanging input. I’ll discuss an existing solution and a flaw I think I found in it, and suggest some fixes. But first, some background:

High-level

Premise A: attention to changes

This premise means that we only notice when something changes. It has to do with anomaly detection in HTM, and with Jeff’s old example in his book On Intelligence: “when you enter your room, you’d immediately notice the broken vase”.


Question 1: Is this “rightful ignorance” hardcoded in HTM-level neuron regions, or is it done by a separate, more high-level attention module (attention theory)?

Premise B: Seen often, learned well

This is the main rule of statistics-based ML approaches, and of the Hebbian rule as well. If something is encountered often, we should learn it. → The more often it is seen, the better it is learned.

  • this is what synaptic permanence and the decay/punishment of wrongly active synapses are for in HTM (a minimal sketch below)
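
For concreteness, a minimal sketch of premise B as HTM applies it, with standalone hypothetical names (this is not the htm.core API): every presentation nudges permanences, so the more often a pattern is seen, the stronger its synapses get.

#include <algorithm>
#include <cstddef>
#include <vector>

using Permanence = float;

// Premise B in code: active synapses are reinforced, inactive ones decay,
// so frequency of exposure translates directly into permanence strength.
void hebbianUpdate(std::vector<Permanence> &permanences,
                   const std::vector<bool> &presynapticActive,
                   Permanence increment, Permanence decrement) {
  for (std::size_t i = 0; i < permanences.size(); ++i) {
    const Permanence update = presynapticActive[i] ? increment : -decrement;
    permanences[i] = std::clamp(permanences[i] + update, 0.0f, 1.0f);
  }
}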

Problem: Timeseries with (long) homogeneous subsequences

For example:
abcdddddd......ddddefg

State-of-the-art solutions:

Numenta HTM: None

There’s no fix for this in the original HTM: what happens is that SP will learn mostly D, and TM will reduce to sequence D->D.

HTM.core: Kropff & Treves

Implemented by @dmac in community htm.core:

* @params timeseries - Optional, default false.  If true AdaptSegment will not
* apply the same learning update to a synapse on consecutive cycles, because
* then staring at the same object for too long will mess up the synapses.
* IE Highly correlated inputs will cause the synapse permanences to saturate.
* This change allows it to work with timeseries data which moves very slowly,
* instead of the usual HTM inputs which reliably change every cycle.  See
* also (Kropff & Treves, 2007. http://dx.doi.org/10.2976/1.2793335).

(Btw, I could not find a topic here where this was discussed?)

  • we should probably rename timeseries to longSteadyInputs.

My problem with Kropff & Treves

I’ll have to restudy the paper, but I definitely have a problem with the implementation (of K&T) that we have:

void Connections::adaptSegment(const Segment segment,
                               const SDR &inputs,
                               const Permanence increment,
                               const Permanence decrement)
{
  const auto &inputArray = inputs.getDense();
  if( timeseries_ ) {
    previousUpdates_.resize( synapses_.size(), 0.0f );
    currentUpdates_.resize(  synapses_.size(), 0.0f );
    for( const auto synapse : synapsesForSegment(segment) ) {
      const SynapseData &synapseData = dataForSynapse(synapse);
      // Hebbian rule: reinforce synapses from active presynaptic cells,
      // punish the rest.
      Permanence update;
      if( inputArray[synapseData.presynapticCell] ) {
        update = increment;
      } else {
        update = -decrement;
      }
      // K&T: apply the update only if it differs from last cycle's update
      // for this synapse; a steady input therefore learns only once.
      if( update != previousUpdates_[synapse] ) {
        updateSynapsePermanence(synapse, synapseData.permanence + update);
      }
      currentUpdates_[ synapse ] = update;
    }
  }
  // ... (non-timeseries branch omitted)
}

Problem 1: K&T satisfies premise A (differences) but breaks B (statistical learning)

Take this example:

  • we only have inputs A and B
  • the occurrence ratio of A:B is 50:1

Then HTM(K&T) will ignore A 49 times out of 50, and will learn it the same (= as fast, as well, as “strong”) as B. That is wrong.
In an extreme example of 1000:1, this amounts to ignoring learning / extremely amplifying noise.
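
A toy simulation of the skip rule (my reading of the adaptSegment code above, not the real Connections class) makes the imbalance concrete:

#include <iostream>

// Feed 50 consecutive A's to one synapse, and a single B to another;
// under the K&T rule a repeated update is skipped, so the 50:1 frequency
// ratio is erased and both permanences end up identical.
int main() {
  const float increment = 0.1f;
  float permA = 0.0f, permB = 0.0f;
  float prevUpdateA = 0.0f;

  for (int t = 0; t < 50; ++t) {     // pattern A, 50 cycles in a row
    const float update = increment;  // A's synapse is active every cycle
    if (update != prevUpdateA)       // K&T: skip repeated updates
      permA += update;
    prevUpdateA = update;
  }
  permB += increment;                // pattern B, seen exactly once

  std::cout << "permA=" << permA << " permB=" << permB << "\n";  // both 0.1
}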

Problem 2: Selective learning

Inputs:

  • A: qwert xxx uiop
  • B: qweab xxx uiop
  • C: qwert zzz uiop
  • (D: 12345 678 90123)

ratios A:B:C:D = 50:50:1:50
Now learn sequences of A, B, C, D with the given probabilities, e.g. AABBAAABADDDBBBDABCAAABBBBDDDC.

The problem: we focus extremely on the triplet part (xxx/zzz/678), biased by the low probability of C, and we would not learn the much stronger discriminator between A and B (= the -ab- part in qweab).

In this case, we learn:

  • change D/anything (or anything/D, I’ll use just D/*)
  • A/B (part -rt-/-ab-) 50:50
  • (A or B)/C (part xxx/zzz), 2*50:1

=> we much more strongly prefer/focus on the uncommon element C.

Afterthought: Help me with this example. The idea of attending to the uncommon C could actually be right, but it would introduce a lot of noise. Maybe something like:

  1. detect C as “emerging player”
  2. move to short term memory
  3. see if it is actually used, or if it was just noise
  4. learn / stash

Afterthought 2: OK, my problem is that this still breaks premise B (Problem 1).

Solution for problem 2: Measure difference on the whole segment, not single synapse

The selective learning problem can be resolved if we do not focus on per-synapse update differences (current vs. previous update), but measure the difference on the whole input.
Or maybe, rather than the whole input, consider the differences over a dendrite (its synapses). This is more biological (a dendrite is a stand-alone unit; “the whole input” is not).

Details:

  • compute the update difference not per single synapse, but over all synapses on the segment
  • this will drive segments to grow on the “boundary where something happens”
  • add some noise tolerance, i.e. ignore the update not only if prev == current, but also if the prev/current change is <= 2% (a sketch follows below)
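
A sketch of this variant (standalone types and names assumed for illustration; not a drop-in patch for htm.core, and permanence clamping is omitted):

#include <cstddef>
#include <vector>

// The update difference is measured across all synapses of one dendritic
// segment; learning is applied to the whole segment or not at all, with a
// small noise tolerance on the fraction of synapses whose update changed.
struct SegmentState {
  std::vector<float> permanences;     // one entry per synapse on the segment
  std::vector<bool>  presynActive;    // current input seen by each synapse
  std::vector<float> previousUpdates; // update computed last cycle
};

void adaptSegmentWholeDiff(SegmentState &seg, float increment,
                           float decrement, float noiseTolerance = 0.02f) {
  const std::size_t n = seg.permanences.size();
  if (n == 0) return;
  std::vector<float> updates(n);
  std::size_t changed = 0;
  for (std::size_t i = 0; i < n; ++i) {
    updates[i] = seg.presynActive[i] ? increment : -decrement;
    if (updates[i] != seg.previousUpdates[i]) ++changed;
  }
  // Learn only if enough of the dendrite's input actually changed.
  if (static_cast<float>(changed) / n > noiseTolerance) {
    for (std::size_t i = 0; i < n; ++i)
      seg.permanences[i] += updates[i];
  }
  seg.previousUpdates = updates;
}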

Example:

  • this will create an ideal dendrite on the -(rt/ab)/xx- boundary.
  • and on -xx-ui- → recognizes C
  • and on -rt/ab- → recognizes A/B

Solution for problem 1: Dropout, randomness

The problem: A:B occur 50:1, but we learn them 1:1.

Side note: Dropout

I’ve recently implemented an HTM MNIST experiment with dropout:
Dropout WIP by breznak · Pull Request #535 · htm-community/htm.core · GitHub
Dropout is a commonly used technique in deep learning that prevents overfitting by intentionally adding noise to the output function (to the active columns in HTM).
Dropout is similar (the same?) to noise in the input, and HTM has proved to be very robust to noise, as well as to dropout.
On MNIST this slightly improved performance (1-2%), but this slight increase is equal to that of boosting.
PS: I’ll have to open a discussion for dropout!

Anyway, back to the problem, and a solution.
In the 49 cases where we would otherwise ignore the seen value, if we apply quite strong (5-10-20%) dropout to the active synapses, we cause a portion of previousUpdates_ to change every round (a sketch follows after the list below).

  • we change (= will learn in the next round) only a portion of the synapses → no huge changes that would lead to re-learning
  • the changes happen randomly and sparsely
  • but cumulatively, if we keep selecting from the same pool (pattern A), then we strengthen the whole representation of A (over the course of many small changes)
  • boosting probably must be disabled, because it would go crazy over the small changes and the otherwise empty inputs
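
A sketch of the combination (again a standalone toy with assumed names, not htm.core code): dropping an active synapse flips its update to -decrement for one cycle, so the update differs from the previous cycle and gets applied; over many rounds a random sparse subset of a steady pattern keeps learning.

#include <cstddef>
#include <random>
#include <vector>

struct Synapses {
  std::vector<float> permanences;
  std::vector<bool>  presynActive;
  std::vector<float> previousUpdates;
};

// Dropout on top of the K&T rule: a random fraction of active inputs is
// treated as inactive, which perturbs the previous-update memory each round.
void adaptWithDropout(Synapses &s, float increment, float decrement,
                      float dropoutRate, std::mt19937 &rng) {
  std::bernoulli_distribution drop(dropoutRate);
  for (std::size_t i = 0; i < s.permanences.size(); ++i) {
    const bool active = s.presynActive[i] && !drop(rng);
    const float update = active ? increment : -decrement;
    if (update != s.previousUpdates[i])  // K&T skip rule, as before
      s.permanences[i] += update;
    s.previousUpdates[i] = update;
  }
}

With dropoutRate around 0.05-0.2, roughly that fraction of a steady pattern’s synapses changes (and thus re-learns) each cycle.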

Two high-level ideas

Attention module

Can we remove the “time series functionality” directly from Connections, and outsource it to a new Attention layer?
Its responsibility:

  • detect new patterns, and decide whether something is a genuinely new pattern or spam/noise
  • decide when an input becomes “boring” and is ignored, and when it is learned, so as to balance the strengths of patterns according to their probabilities
  • can we use the external inputs (which we have in Connections) to also serve as “external inhibitory inputs”, directed by Attention?

Decouple serial SP->TM output

For some reason HTM is not based on spiking neural networks, which are biological. (I guess for practical computational reasons?)

  • seq: AAAABAAAA
  • this would require a hierarchical (2-level) SP/TM setup (a toy sketch of the gating follows below):
    • the first layer predicts the next letter: SP->TM->next? (A->A->A…)
    • the higher layer:
      • “ignores same inputs”
      • this is done by requiring some number of activations, which will not happen with the solution for P1: dropout.
      • this is SP’ → TM’
      • SP’ will process all inputs
      • TM’ will process much less (only the changes, A/B), so from seeing the first A, TM’ predicts B
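
A toy sketch of just the gating logic (no real SP/TM objects; the compute call is only a placeholder comment):

#include <string>
#include <vector>

// The lower layer sees every symbol; the upper layer is fed only when the
// input differs from the previous one, so "AAAABAAAA" reaches it as A,B,A.
std::vector<char> changeGate(const std::string &sequence) {
  std::vector<char> upperLayerInput;
  char prev = '\0';
  for (char c : sequence) {
    // lowerLayer.compute(c);        // SP->TM would run on every step
    if (c != prev)                   // gate: forward only on change
      upperLayerInput.push_back(c);  // SP'->TM' sees the reduced stream
    prev = c;
  }
  return upperLayerInput;
}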

Sorry for the long wall-post! The more I grasped it, the more ideas kept coming…

TL;DR:

  • we probably should not implement K&T as it is now.
  • use a refractory period: a time during which a segment cannot be active again once it has fired (from SNNs)
    • biological!
    • would probably replace the artificial potentialPct (overlap of potential pools of columns)
    • forces multiple columns/segments to cover the same “feature” input field.
    • sorry, another new idea :frowning: :smiley:
  • use modified K&T (differences on dendrites) + dropout to learn not too much, but not too little
  • attention layer with inhibitory feedback to this layer?
  • hierarchical SP/TMs to learn on long subsequences

I’m still a newbie in HTM theory, but I like the “Measure difference on the whole segment, not single synapse” idea and the use of a hierarchical setup. I believe that it’s unlikely that sequence memory in the brain works without any hierarchical structure.

Aside from modifying the HTM algorithms, I was thinking that it’s also possible to change the inputs so as to remove these sections. My idea was to group each input with the past inputs so that the unchanging section is hidden from the TM. For instance, consider the following sequence:

0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1 [or 3 * (0,1,0,0,0,1)]

It could be transformed into this:

010,100,000,001,010,101,010,100,000,001,010,101,010,100,000,001

Or, using the following mapping:

{‘000’: ‘a’, ‘001’: ‘b’, ‘010’: ‘c’, ‘011’: ‘d’, ‘100’: ‘e’, ‘101’: ‘f’, ‘110’: ‘g’, ‘111’: ‘h’}

This:

c,e,a,b,c,f,c,e,a,b,c,f,c,e,a,b

The only constraint is how long these sections are, since increasing the size of the group exponentially increases the number of possible categories. I’m surely not the first person to propose this, but I thought it was nice to share.
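
A minimal sketch of that transformation (hypothetical helper, illustrating the idea only): slide a window of the last k inputs over the sequence, so a long unchanging run collapses into a single repeated category.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> groupInputs(const std::string &bits, std::size_t k) {
  std::vector<std::string> grouped;
  for (std::size_t i = 0; i + k <= bits.size(); ++i)
    grouped.push_back(bits.substr(i, k));  // e.g. "010", "100", "000", ...
  return grouped;
}

int main() {
  // 3 * (0,1,0,0,0,1), the example sequence above
  for (const auto &g : groupInputs("010001010001010001", 3))
    std::cout << g << ' ';  // 010 100 000 001 010 101 ...
  std::cout << '\n';
}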

By the way, breznak, we have work to do in this second semester. The example above is part of my research up until now = )

HTM is only an approximation of neuron behavior and it does not copy everything.

Sometimes the missing part is unimportant. Sometimes it can be approximated, like the k-winner rule standing in for inhibitory inter-neurons.

When you have a powerful model that seems to work well it is easy to think that the biology works the same way and miss important details.

In this case: habituation. When you present the same input repetitively, the neuron’s response diminishes. If this were added into HTM, the response to ABBBBBBBC would act very differently.

Searching in Grossberg’s ART model depends on this behavior.
See section 2.3 and 4.11.3 here:
http://www.scholarpedia.org/article/Adaptive_resonance_theory
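
To illustrate with a toy model (my own minimal illustration, not Grossberg’s actual equations): the response to a repeated input decays, and a change restores the gain.

#include <iostream>
#include <string>

int main() {
  const std::string seq = "ABBBBBBBC";
  double gain = 1.0;
  char prev = '\0';
  for (char c : seq) {
    gain = (c == prev) ? gain * 0.5  // habituate on repetition
                       : 1.0;        // recover on change
    std::cout << c << ": response=" << gain << "\n";
    prev = c;
  }
  // A and C fire at full strength; the run of B's fades toward zero.
}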


In particular, I think a relevant part that might be affecting this is missing: the adaptive nature of learning and forgetting. Modulatory effects prevent the neurons from learning/forgetting pedal-to-the-metal. My wild guess is that once “the sequence” is learned, the learning rate progressively falls to zero (a toy sketch below).
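
A toy version of that guess (assumed names, illustration only): scale the permanence increment by the remaining prediction error, so a fully learned sequence stops changing.

// predictionError in [0,1]: 1 = fully unexpected, 0 = perfectly predicted.
// Once the sequence is learned, the error and thus the update approach zero.
float modulatedIncrement(float baseIncrement, float predictionError) {
  return baseIncrement * predictionError;
}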


I’ve not experienced this “reduce to sequence D->D” behavior with the TM. In fact, the algorithm has the opposite behavior in my experience (i.e. it remembers every single “d” as a different context in the sequence). Given enough iterations, the TM will correctly predict the entire sequence from beginning to end (assuming the layer is sized with sufficient capacity).

In my opinion, the solution to this problem involves at least three things which are not currently covered by the algorithm. One is cell fatigue, as others have pointed out. Another is pooling of commonly seen sequences into features which can themselves be part of a sequence (a similar concept to the composite objects that Numenta has been talking about in the current line of research). A third is timing, where instead of learning “dddddddddd…” you learn “d repeating for a while” (a sketch of that idea below).
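
A sketch of the timing idea (illustration only): run-length encode the stream, so the model can learn “d, repeated for a while” instead of hundreds of separate d->d steps.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<char, int>> runLengthEncode(const std::string &seq) {
  std::vector<std::pair<char, int>> runs;
  for (char c : seq) {
    if (!runs.empty() && runs.back().first == c)
      ++runs.back().second;   // extend the current run
    else
      runs.push_back({c, 1}); // start a new run
  }
  return runs;
}

int main() {
  for (const auto &[symbol, count] : runLengthEncode("abcddddddddefg"))
    std::cout << symbol << "x" << count << " ";  // ax1 bx1 cx1 dx8 ex1 fx1 gx1
  std::cout << "\n";
}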


I’d like to point out, in case it’s not clear, that the cited paper (Kropff & Treves, 2007) is a different paper from the one describing grid cells. The cited paper is called “Uninformative memories will prevail: the storage of correlated representations and its consequences”. Also, I simplified the equations somewhat.

I think this effect is created in the brain by NMDA receptors becoming desensitized. Here is a good review paper about NMDA:

NMDA Receptor Function and Physiological Modulation
K. Zito, University of California at Davis, Davis, CA, USA
V. Scheuss, Max-Planck-Institute for Neurobiology, Martinsried, Germany
DOI: 10.1016/B978-008045046-9.01225-0

See figure 6

The calcium-dependent desensitization (also referred to as calcium-dependent inactivation) provides a negative feedback loop by which calcium entering the cell via NMDA receptors in turn leads to the desensitization of the receptor (Figure 6), although calcium from other sources (voltage-gated calcium channels or release from intracellular stores activated by second messenger cascades) has the same effect.



“If something is encountered often at all, we should learn it.”

abcddddd......dddddefg
I argue that in this example, you’ve seen one good example of d. Since d is a single unchanging SDR, there is no information contained in the repetitions.

1 Like