There’s a problem in learning sequences that contain sections of unchanging input. I’ll discuss an existing solution, a flaw I think I found in it, and suggest some fixes. But first some background:
High-level
Premise A: attention to changes
This means we only notice when something changes. It relates to anomaly detection in HTM and Jeff’s old example in his book On Intelligence: “when you enter your room, you’d immediately notice the broken vase”.
Some anecdotal examples:
- predators do this (they largely see only movement)
- it likely also happens at a low level (hardware, retina cells?)
- a neuromorphic DVS camera I have worked with produces sparse inputs exactly this way (event cameras, event-based vision, event SLAM)
- you don’t mind/notice the tick-tock sound in your bedroom
- you ignore the sensation of the clothes you have on
Question 1: Is this “rightful ignorance” hardcoded in HTM-level neuron regions, or is it done by a separate, more high-level attention module (attention theory)?
Premise B: Seen often, learned well
This is the main rule of statistics-based ML approaches, and of the Hebbian rule as well. If something is encountered often, we should learn it. → The more often it is seen, the better it is learned.
- this is what synaptic permanence and decay/punishing wrongly active synapses are for in HTM
Problem: Timeseries with (long) homogeneous subsequences
For example:
abcdddddd......ddddefg
State-of-the-art solutions:
Numenta HTM: None
There’s no fix for this in the original HTM. What happens is that the SP will learn mostly d, and the TM will reduce to the sequence d→d.
HTM.core: Kropff & Treves
Implemented by @dmac in the community htm.core:
* @param timeseries - Optional, default false. If true AdaptSegment will not
* apply the same learning update to a synapse on consecutive cycles, because
* then staring at the same object for too long will mess up the synapses.
* IE Highly correlated inputs will cause the synapse permanences to saturate.
* This change allows it to work with timeseries data which moves very slowly,
* instead of the usual HTM inputs which reliably change every cycle. See
* also (Kropff & Treves, 2007. http://dx.doi.org/10.2976/1.2793335).
(btw, I could not find a topic here where this was discussed?)
- we should probably rename it from timeseries to longSteadyInputs.
My problem with Kropff & Treves
I’ll have to restudy the paper, but my problem is definitely with the implementation (of K&T) that we have:
void Connections::adaptSegment(const Segment segment,
                               const SDR &inputs,
                               const Permanence increment,
                               const Permanence decrement)
{
  const auto &inputArray = inputs.getDense();
  if( timeseries_ ) {
    previousUpdates_.resize( synapses_.size(), 0.0f );
    currentUpdates_.resize(  synapses_.size(), 0.0f );
    for( const auto synapse : synapsesForSegment(segment) ) {
      const SynapseData &synapseData = dataForSynapse(synapse);
      // Hebbian update: strengthen synapses from active presynaptic cells,
      // weaken the rest.
      Permanence update;
      if( inputArray[synapseData.presynapticCell] ) {
        update = increment;
      } else {
        update = -decrement;
      }
      // K&T: apply the update only if it differs from the one applied on
      // the previous cycle, i.e. only when this synapse's input changed.
      if( update != previousUpdates_[synapse] ) {
        updateSynapsePermanence(synapse, synapseData.permanence + update);
      }
      currentUpdates_[ synapse ] = update;
    }
    // (previousUpdates_ and currentUpdates_ are swapped elsewhere, once
    // per cycle.)
  }
}
Problem 1: K&T satisfies premise A (differences) but breaks B (statistical learning)
Take this example:
- we only have inputs A, B
- the occurrence ratio A:B is 50:1
Then HTM with K&T will ignore A 49 times out of every 50 and learn it the same (as fast, as well, as “strongly”) as B. That is wrong.
In an extreme example, 1000:1, this amounts to ignoring learning of the dominant pattern / extremely amplifying noise.
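To make the failure concrete, here is a minimal standalone sketch (my code, not htm.core; the stream and the counters are hypothetical) that counts how often a synapse listening to A actually receives a permanence update under the K&T skip rule:

#include <iostream>
#include <vector>

int main() {
  // Build the input stream: 50 presentations of A for every 1 of B.
  std::vector<char> stream;
  for (int rep = 0; rep < 10; ++rep) {
    for (int i = 0; i < 50; ++i) stream.push_back('A');
    stream.push_back('B');
  }

  // One "synapse" listening to A: its update is +1 when A is present,
  // -1 otherwise (increment/decrement).
  int seenA = 0, appliedToA = 0;
  int prevUpdate = 0;  // analogue of previousUpdates_
  for (const char c : stream) {
    const int update = (c == 'A') ? +1 : -1;
    if (c == 'A') ++seenA;
    if (update != prevUpdate) ++appliedToA;  // K&T: skip repeated updates
    prevUpdate = update;
  }
  // Prints: A seen 500x, updated 20x. A is updated only at A/B boundaries,
  // i.e. exactly as often as B; the 50:1 statistics are lost.
  std::cout << "A seen " << seenA << "x, updated " << appliedToA << "x\n";
}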
Problem 2: Selective learning
Inputs:
- A: qwert xxx uiop
- B: qweab xxx uiop
- C: qwert zzz uiop
- (D: 12345 678 90123)
ratios A:B:C:D = 50:50:1:50
Now learn sequences of A, B, C, D with the given probabilities, i.e. AABBAAABADDDBBBDABCAAABBBBDDDC.
Problem: we focus extremely on the triplet part xxx/zzz/678 (biased, because of the low probability of C), and we would not learn the much stronger discriminator between A and B (the -ab- part in qweab).
In this case, we learn:
- the change D/anything (or anything/D; I’ll write just D/*)
- A/B (the part -rt-/-ab-), ratio 50:50
- (A or B)/C (the part xxx/zzz), ratio 2*50:1
⇒ we focus much more strongly on the uncommon element C.
Afterthought: Help me with this example. The idea of focusing on the uncommon C could actually be right, but it would introduce a lot of noise. Maybe something like (rough sketch below):
- detect C as an “emerging player”
- move it to short-term memory
- see if it is actually used, or if it was just noise
- then learn it / stash it
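A very rough sketch of that emerging-player idea (entirely hypothetical; the class and all names are mine):

#include <string>
#include <unordered_map>

// Park new patterns in a short-term buffer; only promote a pattern to
// long-term learning once it has recurred enough times to rule out noise.
class EmergingPatternFilter {
public:
  explicit EmergingPatternFilter(int promoteAfter = 3)
      : promoteAfter_(promoteAfter) {}

  // Returns true once the pattern has recurred often enough to be trusted;
  // until then it only accumulates evidence in short-term memory.
  bool shouldLearn(const std::string &patternKey) {
    int &count = shortTerm_[patternKey];  // new patterns start at 0
    if (count < promoteAfter_) ++count;   // still gathering evidence
    // A real version would also decay counts over time, so one-off noise
    // is eventually stashed/forgotten instead of kept forever.
    return count >= promoteAfter_;
  }

private:
  int promoteAfter_;                                // sightings before learning
  std::unordered_map<std::string, int> shortTerm_;  // pattern -> sightings
};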
Afterthought 2: OK, my problem is that this still breaks premise B (see Problem 1).
Solution for problem 2: Measure difference on the whole segment, not single synapse
The selective learning can be resolved if we do not look at per-synapse update differences (current vs. previous), but measure the difference over the whole input.
Or maybe, rather than the whole input, consider the differences per dendrite (over its synapses). This is more biological (a dendrite is a stand-alone unit; “the whole input” is not).
Details (a code sketch follows the example below):
- compute the update difference not per single synapse, but over all synapses on the segment
- this will drive segments to grow on the “boundary where something happens”
- add some noise tolerance, i.e. skip learning not only if prev == current, but if the prev/current change is <= 2%
Example:
- this will create an ideal dendrite on the -(rt/ab)/xx- boundary
- and one on -xx-ui- → recognizes C
- and one on -rt/ab- → recognizes A/B
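A sketch of this segment-level check, reusing the Connections internals from the adaptSegment snippet above; the function name and the noiseTolerance parameter (e.g. 0.02) are my assumptions:

// Hypothetical variant of adaptSegment: the "did anything change?" decision
// is made once for the whole segment, with a noise tolerance, instead of
// per synapse.
void Connections::adaptSegmentSegmentwise(const Segment segment,
                                          const SDR &inputs,
                                          const Permanence increment,
                                          const Permanence decrement,
                                          const Permanence noiseTolerance)
{
  const auto &inputArray = inputs.getDense();
  previousUpdates_.resize( synapses_.size(), 0.0f );
  currentUpdates_.resize(  synapses_.size(), 0.0f );

  // Pass 1: compute this cycle's updates and count how many of them differ
  // from the previous cycle's.
  size_t changed = 0, total = 0;
  for( const auto synapse : synapsesForSegment(segment) ) {
    const SynapseData &synapseData = dataForSynapse(synapse);
    const Permanence update =
        inputArray[synapseData.presynapticCell] ? increment : -decrement;
    if( update != previousUpdates_[synapse] ) { ++changed; }
    currentUpdates_[synapse] = update;
    ++total;
  }

  // Pass 2: learn the whole segment, but only if enough of the segment
  // changed (prev/current difference above the noise tolerance, e.g. 2%).
  if( total > 0 && static_cast<float>(changed) / total > noiseTolerance ) {
    for( const auto synapse : synapsesForSegment(segment) ) {
      const SynapseData &synapseData = dataForSynapse(synapse);
      updateSynapsePermanence(synapse,
                              synapseData.permanence + currentUpdates_[synapse]);
    }
  }
}

As in the original, previousUpdates_ and currentUpdates_ would still be swapped once per cycle elsewhere.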
Solution for problem 1: Dropout, randomness
Problem: A:B occurs 50:1, but we learn it 1:1.
Side note: Dropout
I’ve recently implemented HTM MNIST experiment with dropout
Dropout WIP by breznak · Pull Request #535 · htm-community/htm.core · GitHub
Dropout is a commonly used technique in deep learning that prevents overfitting by intentionally adding noise to the output function (in HTM, to the active columns).
Dropout is similar (the same?) to noise in the input, and HTM has proved to be very robust to noise, as well as to dropout.
On MNIST this slightly improved performance (1-2%), but that slight increase is about equal to what boosting gives.
PS: I’ll have to open a discussion thread for dropout!
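For reference, dropout on an HTM layer can be as simple as randomly silencing a fraction of the active columns each cycle. A minimal standalone sketch (my code, not the PR’s):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Drop a random fraction of the active columns (the SP output) each cycle.
std::vector<uint32_t> applyDropout(std::vector<uint32_t> activeColumns,
                                   const float dropoutRate,
                                   std::mt19937 &rng) {
  std::shuffle(activeColumns.begin(), activeColumns.end(), rng);
  const auto keep =
      static_cast<size_t>(activeColumns.size() * (1.0f - dropoutRate));
  activeColumns.resize(keep);  // silence the dropped columns
  std::sort(activeColumns.begin(), activeColumns.end());
  return activeColumns;
}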
Anyways, back to the problem, and a solution.
In the 49 out of 50 cases where we would otherwise ignore the already-seen value, applying a fairly strong (5-10-20%) dropout to the active synapses will cause a portion of the previousUpdates_ to change every round (sketch below):
- we change (= will learn in the next round) only a portion of the synapses → no huge changes that would lead to re-learning
- the changes happen randomly and sparsely
- but cumulatively, if we keep selecting from the same pool (pattern A), we strengthen the whole representation of A (over the course of many small changes)
- boosting probably must be disabled, because it would go crazy over the small changes and the rest of the empty inputs
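A sketch of this, again on top of the Connections internals quoted above (my code; dropoutUpdates and dropoutRate are assumed names). It relies on the fact that zeroing a stored update makes the next identical input look “changed” for that synapse:

#include <random>

// Hypothetical helper, called once per cycle after adaptSegment: forget a
// random fraction of the stored updates so those synapses get re-learned.
void Connections::dropoutUpdates(const Segment segment,
                                 const float dropoutRate,  // e.g. 0.05-0.20
                                 std::mt19937 &rng)
{
  std::bernoulli_distribution drop(dropoutRate);
  for( const auto synapse : synapsesForSegment(segment) ) {
    if( drop(rng) ) {
      // After the per-cycle swap into previousUpdates_, the K&T check
      // "update != previousUpdates_[synapse]" fires again for this synapse,
      // so it receives one more small, sparse update.
      currentUpdates_[synapse] = 0.0f;
    }
  }
}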
Two high-level ideas
Attention module
Can we remove the “time series functionality” from Connections entirely, and outsource it to a new Attention layer?
Its responsibility (interface sketch below):
- detect new patterns and decide whether something is a genuinely new pattern or just spam/noise
- decide when an input becomes “boring” and is ignored, and when it is learned, so as to balance the strengths of patterns according to their probabilities
- can we use the external inputs (which we already have in Connections) to also serve as “external inhibitory inputs”, directed by Attention?
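Purely as a strawman, an interface for such a layer could look like this (nothing of the sort exists in htm.core; all names are my assumptions):

#include <cstdint>
#include <vector>

// Hypothetical Attention layer sitting between the SP output and learning.
class AttentionLayer {
public:
  virtual ~AttentionLayer() = default;

  // Inspect the current pattern and decide whether the downstream region
  // should learn from it this cycle (new pattern vs. spam/noise vs. boring).
  virtual bool shouldLearn(const std::vector<uint32_t> &activeColumns) = 0;

  // External inhibitory drive, e.g. routed through the external inputs
  // that Connections already supports.
  virtual void setExternalInhibition(
      const std::vector<uint32_t> &inhibitedColumns) = 0;
};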
Decouple serial SP->TM output
For some reason HTM is not based on spiking neural networks, which are the biological ones. (I guess due to practical computation resources?)
- seq: AAAABAAAA
- this would require a hierarchical (two-level) SP/TM setup (see the sketch after this list):
- the first layer predicts the next letter: SP→TM→next? (A→A→A…)
- the higher layer:
  - “ignores same inputs”
  - this is done by requiring some number of activations, which will not happen with the dropout solution for Problem 1
  - this is SP’ → TM’
  - SP’ will process all inputs
  - TM’ will process much fewer inputs (only the changes, A/B), so from seeing the first A, TM’ predicts B
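A sketch of that wiring (the glue code and the change test are my assumptions; SpatialPooler/TemporalMemory stand in for the htm.core classes):

#include <htm/algorithms/SpatialPooler.hpp>
#include <htm/algorithms/TemporalMemory.hpp>
#include <htm/types/Sdr.hpp>

using namespace htm;

// Higher layer of the hierarchy: SP' sees every input, TM' only the changes.
void runHigherLayer(SpatialPooler &sp2, TemporalMemory &tm2,
                    const SDR &columnsFromLayer1, SDR &prevActive) {
  SDR active(sp2.getColumnDimensions());
  sp2.compute(columnsFromLayer1, /*learn=*/true, active);  // SP': every input

  // TM' only steps when the pooled input actually changed, so the sequence
  // it models is "A B A ..." instead of "A A A A B A A A ...".
  if (active.getSparse() != prevActive.getSparse()) {
    tm2.compute(active, /*learn=*/true);
    prevActive.setSDR(active);
  }
}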