Timeseries: attention to differences and importance of noise

There’s a problem in learning sequences with sections of unchanging input. I’ll discuss an existing solution and a flaw I think I found in it, and suggest some fixes. But first, some background:

High-level

Premise A: attention to changes

This premise means that we only notice when something changes. It has to do with anomaly detection in HTM, and with Jeff’s old example in his book On Intelligence: “when you enter your room, you’d immediately notice the broken vase”.


Question 1: Is this “rightful ignorance” hardcoded in HTM-level neuron regions, or is it done by a separate, more high-level attention module (attention theory)?

Premise B: Seen often, learned well

This is the main rule of statistics-based ML approaches, and of the Hebbian rule as well. If something is encountered often, we should learn it. → The more often it is seen, the better it is learned.

  • this is what synaptic permanence and the decay/punishment of wrongly active synapses are for in HTM (a minimal sketch below)
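
For concreteness, a minimal sketch of premise B as HTM applies it, with standalone hypothetical names (this is not the htm.core API): every presentation nudges permanences, so the more often a pattern is seen, the stronger its synapses get.

#include <algorithm>
#include <cstddef>
#include <vector>

using Permanence = float;

// Premise B in code: active synapses are reinforced, inactive ones decay,
// so frequency of exposure translates directly into permanence strength.
void hebbianUpdate(std::vector<Permanence> &permanences,
                   const std::vector<bool> &presynapticActive,
                   Permanence increment, Permanence decrement) {
  for (std::size_t i = 0; i < permanences.size(); ++i) {
    const Permanence update = presynapticActive[i] ? increment : -decrement;
    permanences[i] = std::clamp(permanences[i] + update, 0.0f, 1.0f);
  }
}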

Problem: Timeseries with (long) homogeneous subsequences

For example:
abcdddddd......ddddefg

State-of-the-art solutions:

Numenta HTM: None

There’s no fix for this in the original HTM: what happens is that SP will learn mostly D, and TM will reduce to sequence D->D.

HTM.core: Kropff & Treves

Implemented by @dmac in community htm.core:

* @params timeseries - Optional, default false.  If true AdaptSegment will not
* apply the same learning update to a synapse on consecutive cycles, because
* then staring at the same object for too long will mess up the synapses.
* IE Highly correlated inputs will cause the synapse permanences to saturate.
* This change allows it to work with timeseries data which moves very slowly,
* instead of the usual HTM inputs which reliably change every cycle.  See
* also (Kropff & Treves, 2007. http://dx.doi.org/10.2976/1.2793335).

(Btw, I could not find a topic here where this was discussed?)

  • we should probably rename timeseries to longSteadyInputs.

My problem with Kropff & Treves

I’ll have to restudy the paper, but I definitely have a problem with the implementation (of K&T) that we have:

void Connections::adaptSegment(const Segment segment,
                               const SDR &inputs,
                               const Permanence increment,
                               const Permanence decrement)
{
  const auto &inputArray = inputs.getDense();
  if( timeseries_ ) {
    previousUpdates_.resize( synapses_.size(), 0.0f );
    currentUpdates_.resize(  synapses_.size(), 0.0f );
    for( const auto synapse : synapsesForSegment(segment) ) {
      const SynapseData &synapseData = dataForSynapse(synapse);
      // Hebbian rule: reinforce synapses from active presynaptic cells,
      // punish the rest.
      Permanence update;
      if( inputArray[synapseData.presynapticCell] ) {
        update = increment;
      } else {
        update = -decrement;
      }
      // K&T: apply the update only if it differs from last cycle's update
      // for this synapse; a steady input therefore learns only once.
      if( update != previousUpdates_[synapse] ) {
        updateSynapsePermanence(synapse, synapseData.permanence + update);
      }
      currentUpdates_[ synapse ] = update;
    }
  }
  // ... (non-timeseries branch omitted)
}

Problem 1: K&T satisfies premise A (differences) but breaks B (statistical learning)

Take this example:

  • we only have inputs A and B
  • the occurrence ratio of A:B is 50:1

Then HTM(K&T) will ignore A 49 times out of 50, and will learn it the same (= as fast, as well, as “strong”) as B. That is wrong.
In an extreme example of 1000:1, this amounts to ignoring learning / extremely amplifying noise.
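
A toy simulation of the skip rule (my reading of the adaptSegment code above, not the real Connections class) makes the imbalance concrete:

#include <iostream>

// Feed 50 consecutive A's to one synapse, and a single B to another;
// under the K&T rule a repeated update is skipped, so the 50:1 frequency
// ratio is erased and both permanences end up identical.
int main() {
  const float increment = 0.1f;
  float permA = 0.0f, permB = 0.0f;
  float prevUpdateA = 0.0f;

  for (int t = 0; t < 50; ++t) {     // pattern A, 50 cycles in a row
    const float update = increment;  // A's synapse is active every cycle
    if (update != prevUpdateA)       // K&T: skip repeated updates
      permA += update;
    prevUpdateA = update;
  }
  permB += increment;                // pattern B, seen exactly once

  std::cout << "permA=" << permA << " permB=" << permB << "\n";  // both 0.1
}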

Problem 2: Selective learning

Inputs:

  • A: qwert xxx uiop
  • B: qweab xxx uiop
  • C: qwert zzz uiop
  • (D: 12345 678 90123)

ratios A:B:C:D = 50:50:1:50
Now learn sequences of A, B, C, D with the given probabilities, e.g. AABBAAABADDDBBBDABCAAABBBBDDDC.

The problem: we focus extremely on the triplet part (xxx/zzz/678), biased by the low probability of C, and we would not learn the much stronger discriminator between A and B (= the -ab- part in qweab).

In this case, we learn:

  • change D/anything (or anything/D, I’ll use just D/*)
  • A/B (part -rt-/-ab-) 50:50
  • (A or B)/C (part xxx/zzz), 2*50:1

=> we much more strongly prefer/focus on the uncommon element C.

Afterthought: Help me with this example. The idea of attending to the uncommon C could actually be right, but it would introduce a lot of noise. Maybe something like:

  1. detect C as “emerging player”
  2. move to short term memory
  3. see if it is actually used, or if it was just noise
  4. learn / stash

Afterthought 2: OK, my problem is that this still breaks premise B (Problem 1).

Solution for problem 2: Measure difference on the whole segment, not single synapse

The selective learning problem can be resolved if we do not focus on per-synapse update differences (current vs. previous update), but measure the difference on the whole input.
Or maybe, rather than the whole input, consider the differences over a dendrite (its synapses). This is more biological (a dendrite is a stand-alone unit; “the whole input” is not).

Details:

  • compute the update difference not per single synapse, but over all synapses on the segment
  • this will drive segments to grow on the “boundary where something happens”
  • add some noise tolerance, i.e. ignore the update not only if prev == current, but also if the prev/current change is <= 2% (a sketch follows below)
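
A sketch of this variant (standalone types and names assumed for illustration; not a drop-in patch for htm.core, and permanence clamping is omitted):

#include <cstddef>
#include <vector>

// The update difference is measured across all synapses of one dendritic
// segment; learning is applied to the whole segment or not at all, with a
// small noise tolerance on the fraction of synapses whose update changed.
struct SegmentState {
  std::vector<float> permanences;     // one entry per synapse on the segment
  std::vector<bool>  presynActive;    // current input seen by each synapse
  std::vector<float> previousUpdates; // update computed last cycle
};

void adaptSegmentWholeDiff(SegmentState &seg, float increment,
                           float decrement, float noiseTolerance = 0.02f) {
  const std::size_t n = seg.permanences.size();
  if (n == 0) return;
  std::vector<float> updates(n);
  std::size_t changed = 0;
  for (std::size_t i = 0; i < n; ++i) {
    updates[i] = seg.presynActive[i] ? increment : -decrement;
    if (updates[i] != seg.previousUpdates[i]) ++changed;
  }
  // Learn only if enough of the dendrite's input actually changed.
  if (static_cast<float>(changed) / n > noiseTolerance) {
    for (std::size_t i = 0; i < n; ++i)
      seg.permanences[i] += updates[i];
  }
  seg.previousUpdates = updates;
}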

Example:

  • this will create an ideal dendrite on the -(rt/ab)/xx- boundary.
  • and on -xx-ui- → recognizes C
  • and on -rt/ab- → recognizes A/B

Solution for problem 1: Dropout, randomness

The problem: A:B occur 50:1, but we learn them 1:1.

Side note: Dropout

I’ve recently implemented an HTM MNIST experiment with dropout:
Dropout WIP by breznak · Pull Request #535 · htm-community/htm.core · GitHub
Dropout is a commonly used technique in deep learning that prevents overfitting by intentionally adding noise to the output function (to the active columns in HTM).
Dropout is similar (the same?) to noise in the input, and HTM has proved to be very robust to noise, as well as to dropout.
On MNIST this slightly improved performance (1-2%), but this slight increase is equal to that of boosting.
PS: I’ll have to open a discussion for dropout!

Anyway, back to the problem, and a solution.
In the 49 cases where we would otherwise ignore the seen value, if we apply quite strong (5-10-20%) dropout to the active synapses, we cause a portion of previousUpdates_ to change every round (a sketch follows after the list below).

  • we change (= will learn in the next round) only a portion of the synapses → no huge changes that would lead to re-learning
  • the changes happen randomly and sparsely
  • but cumulatively, if we keep selecting from the same pool (pattern A), then we strengthen the whole representation of A (over the course of many small changes)
  • boosting probably must be disabled, because it would go crazy over the small changes and the otherwise empty inputs
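
A sketch of the combination (again a standalone toy with assumed names, not htm.core code): dropping an active synapse flips its update to -decrement for one cycle, so the update differs from the previous cycle and gets applied; over many rounds a random sparse subset of a steady pattern keeps learning.

#include <cstddef>
#include <random>
#include <vector>

struct Synapses {
  std::vector<float> permanences;
  std::vector<bool>  presynActive;
  std::vector<float> previousUpdates;
};

// Dropout on top of the K&T rule: a random fraction of active inputs is
// treated as inactive, which perturbs the previous-update memory each round.
void adaptWithDropout(Synapses &s, float increment, float decrement,
                      float dropoutRate, std::mt19937 &rng) {
  std::bernoulli_distribution drop(dropoutRate);
  for (std::size_t i = 0; i < s.permanences.size(); ++i) {
    const bool active = s.presynActive[i] && !drop(rng);
    const float update = active ? increment : -decrement;
    if (update != s.previousUpdates[i])  // K&T skip rule, as before
      s.permanences[i] += update;
    s.previousUpdates[i] = update;
  }
}

With dropoutRate around 0.05-0.2, roughly that fraction of a steady pattern’s synapses changes (and thus re-learns) each cycle.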

Two high-level ideas

Attention module

Can we remove the “time series functionality” directly from Connections, and outsource it to a new Attention layer?
Its responsibility:

  • detect new patterns, and decide whether something is a genuinely new pattern or spam/noise
  • decide when an input becomes “boring” and is ignored, and when it is learned, so as to balance the strengths of patterns according to their probabilities
  • can we use the external inputs (which we have in Connections) to also serve as “external inhibitory inputs”, directed by Attention?

Decouple serial SP->TM output

For some reason HTM is not based on spiking neural networks, which are biological. (I guess for practical computational reasons?)

  • seq: AAAABAAAA
  • this would require a hierarchical (2-level) SP/TM setup (a toy sketch of the gating follows below):
    • the first layer predicts the next letter: SP->TM->next? (A->A->A…)
    • the higher layer:
      • “ignores same inputs”
      • this is done by requiring some number of activations, which will not happen with the solution for P1: dropout.
      • this is SP’ → TM’
      • SP’ will process all inputs
      • TM’ will process much less (only the changes, A/B), so from seeing the first A, TM’ predicts B
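
A toy sketch of just the gating logic (no real SP/TM objects; the compute call is only a placeholder comment):

#include <string>
#include <vector>

// The lower layer sees every symbol; the upper layer is fed only when the
// input differs from the previous one, so "AAAABAAAA" reaches it as A,B,A.
std::vector<char> changeGate(const std::string &sequence) {
  std::vector<char> upperLayerInput;
  char prev = '\0';
  for (char c : sequence) {
    // lowerLayer.compute(c);        // SP->TM would run on every step
    if (c != prev)                   // gate: forward only on change
      upperLayerInput.push_back(c);  // SP'->TM' sees the reduced stream
    prev = c;
  }
  return upperLayerInput;
}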

Sorry for the long wall-post! The more I grasped it, the more ideas kept coming…

TL;DR:

  • we probably should not implement K&T as it is now.
  • use a refractory period: a time during which a segment cannot be active again once it has fired (from SNNs)
    • biological!
    • would probably replace the artificial potentialPct (overlap of potential pools of columns)
    • forces multiple columns/segments to cover the same “feature” input field.
    • sorry, another new idea :frowning: :smiley:
  • use modified K&T (differences on dendrites) + dropout to learn not too much, but not too little
  • attention layer with inhibitory feedback to this layer?
  • hierarchical SP/TMs to learn on long subsequences

I’m still a newbie in HTM theory, but I like the “Measure difference on the whole segment, not single synapse” idea and the use of a hierarchical setup. I believe that it’s unlikely that sequence memory in the brain works without any hierarchical structure.

Aside from modifying the HTM algorithms, I was thinking that it’s also possible to change the inputs so as to remove these sections. My idea was to group each input with the past inputs so that the unchanging section is hidden from the TM. For instance, consider the following sequence:

0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1 [or 3 * (0,1,0,0,0,1)]

It could be transformed into this:

010,100,000,001,010,101,010,100,000,001,010,101,010,100,000,001

Or, using the following mapping:

{‘000’: ‘a’, ‘001’: ‘b’, ‘010’: ‘c’, ‘011’: ‘d’, ‘100’: ‘e’, ‘101’: ‘f’, ‘110’: ‘g’, ‘111’: ‘h’}

This:

c,e,a,b,c,f,c,e,a,b,c,f,c,e,a,b

The only constraint is how long these sections are, since increasing the size of the group exponentially increases the number of possible categories. I’m surely not the first person to propose this, but I thought it was nice to share.
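
A minimal sketch of that transformation (hypothetical helper, illustrating the idea only): slide a window of the last k inputs over the sequence, so a long unchanging run collapses into a single repeated category.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> groupInputs(const std::string &bits, std::size_t k) {
  std::vector<std::string> grouped;
  for (std::size_t i = 0; i + k <= bits.size(); ++i)
    grouped.push_back(bits.substr(i, k));  // e.g. "010", "100", "000", ...
  return grouped;
}

int main() {
  // 3 * (0,1,0,0,0,1), the example sequence above
  for (const auto &g : groupInputs("010001010001010001", 3))
    std::cout << g << ' ';  // 010 100 000 001 010 101 ...
  std::cout << '\n';
}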

By the way, breznak, we have work to do in this second semester. The example above is part of my research up until now = )

HTM is only an approximation of neuron behavior and it does not copy everything.

Sometimes the missing part is unimportant. Sometimes it can be approximated, like the k-winner rule standing in for inhibitory inter-neurons.

When you have a powerful model that seems to work well it is easy to think that the biology works the same way and miss important details.

In this case: habituation. When you present the same input repetitively, the neuron’s response diminishes. If this were added into HTM, the response to ABBBBBBBC would act very differently.

Searching in Grossberg’s ART model depends on this behavior.
See section 2.3 and 4.11.3 here:
http://www.scholarpedia.org/article/Adaptive_resonance_theory
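
To illustrate with a toy model (my own minimal illustration, not Grossberg’s actual equations): the response to a repeated input decays, and a change restores the gain.

#include <iostream>
#include <string>

int main() {
  const std::string seq = "ABBBBBBBC";
  double gain = 1.0;
  char prev = '\0';
  for (char c : seq) {
    gain = (c == prev) ? gain * 0.5  // habituate on repetition
                       : 1.0;        // recover on change
    std::cout << c << ": response=" << gain << "\n";
    prev = c;
  }
  // A and C fire at full strength; the run of B's fades toward zero.
}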


In particular, I think a relevant part that might be affecting this is missing: the adaptive nature of learning and forgetting. Modulatory effects prevent the neurons from learning/forgetting pedal-to-the-metal. My wild guess is that once “the sequence” is learned, the learning rate progressively falls to zero (a toy sketch below).
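
A toy version of that guess (assumed names, illustration only): scale the permanence increment by the remaining prediction error, so a fully learned sequence stops changing.

// predictionError in [0,1]: 1 = fully unexpected, 0 = perfectly predicted.
// Once the sequence is learned, the error and thus the update approach zero.
float modulatedIncrement(float baseIncrement, float predictionError) {
  return baseIncrement * predictionError;
}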


I’ve not experienced this “reduce to sequence D->D” behavior with the TM. In fact, the algorithm has the opposite behavior in my experience (i.e. it remembers every single “d” as a different context in the sequence). Given enough iterations, the TM will correctly predict the entire sequence from beginning to end (assuming the layer is sized with sufficient capacity).

In my opinion, the solution to this problem involves at least three things which are not currently covered by the algorithm. One is cell fatigue, as others have pointed out. Another is pooling of commonly seen sequences into features which can themselves be part of a sequence (a similar concept to the composite objects that Numenta has been talking about in the current line of research). A third is timing, where instead of learning “dddddddddd…” you learn “d repeating for a while” (a sketch of that idea below).
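
A sketch of the timing idea (illustration only): run-length encode the stream, so the model can learn “d, repeated for a while” instead of hundreds of separate d->d steps.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<char, int>> runLengthEncode(const std::string &seq) {
  std::vector<std::pair<char, int>> runs;
  for (char c : seq) {
    if (!runs.empty() && runs.back().first == c)
      ++runs.back().second;   // extend the current run
    else
      runs.push_back({c, 1}); // start a new run
  }
  return runs;
}

int main() {
  for (const auto &[symbol, count] : runLengthEncode("abcddddddddefg"))
    std::cout << symbol << "x" << count << " ";  // ax1 bx1 cx1 dx8 ex1 fx1 gx1
  std::cout << "\n";
}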


I’d like to point out, in case it’s not clear, that the cited paper (Kropff & Treves, 2007) is a different paper from the one describing grid cells. The cited paper is called “Uninformative memories will prevail: the storage of correlated representations and its consequences”. Also, I simplified the equations somewhat.

I think this effect is created in the brain by NMDA receptors becoming desensitized. Here is a good review paper about NMDA:

NMDA Receptor Function and Physiological Modulation
K. Zito, University of California at Davis, Davis, CA, USA
V. Scheuss, Max-Planck-Institute for Neurobiology, Martinsried, Germany
DOI: 10.1016/B978-008045046-9.01225-0

See figure 6

The calcium-dependent desensitization (also referred to as calcium-dependent inactivation) provides a negative feedback loop by which calcium entering the cell via NMDA receptors in turn leads to the desensitization of the receptor (Figure 6), although calcium from other sources (voltage-gated calcium channels or release from intracellular stores activated by second messenger cascades) has the same effect.



“If something is encountered often at all, we should learn it.”

abcddddd......dddddefg
I argue that in this example, you’ve seen one good example of d. Since d is a single unchanging SDR, there is no information contained in the repetitions.

1 Like