Review of Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments

Review by Yannic Kilcher: Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments (Review) - YouTube


Pretty good idea to open a thread on the paper :+1:

It’s nice to see Numenta actually doing experiments and competing against benchmarks, but overall I found this paper pretty mild. Some aspects of the experiments were … ‘weird’, to put it politely, and the DL baseline wasn’t an MoE, despite that being the simplest architecture and the closest match to what the authors propose, minus the sparsity.

Using the one-hot encoded vector also seems a bit off IMO: shouldn’t this be derived from the environment automatically rather than supplied during training? And then later on they ditch the idea for “prototypes”, which is an even more bizarre concept?

(This was brought up by Kilcher in the video in some form or other too, so clarification regarding those points would be much appreciated)

Overall, I think the results I was looking for just weren’t there - though I applaud the effort and find their approach very interesting. Looking forward to more papers aggressively grounding Numenta’s work :slight_smile:

P.S.: I’m still a newbie in this field; my aim is not to cause offense but simply to express curiosity regarding some inconsistencies

I believe there will be an interview with some of the authors too. That will hopefully be even more interesting.

I agree it is good to see something concrete. That the approach uses SGD gives me the impression that HTM has been abandoned.


Can someone please explain some details of how the Permuted MNIST test works?

For MNIST we have a training dataset of 60000 digits, 10 classes and a test dataset of 10000 digits.
What I get is that in Permuted MNIST with T tasks (e.g. T=10):

  • each task gets assigned a fixed random permutation of the input space (784 pixels in a digit image)
  • these T permutations are used to generate a 60000xT training dataset.
  • whatever “classifier” is used to identify digits is then trained on this “expanded” dataset.

What I don’t get is:

  • Do the output classes remain 10, or will there be 10xT classes too? The latter would mean the classifier has to identify not only the value of a scrambled input digit but also the task to which it belongs.

  • Is the classifier trained one task at a time, e.g. first on task 1, then 2, 3, …, or are the 60000xT digits all shuffled into one large dataset? I mean, the ordered case would provide a more meaningful measurement of forgetting: evaluate the classifier against the first dataset after it has been trained on 1, 2, …, T subsequent datasets

@neel_g - what’s a MoE?

MoE: majority of experts. It’s sort of like an ensemble voting mechanism. Depending on the context it could refer to a panel of humans attempting the same task, or an ensemble of classifiers.

Unless I am mistaken, the categories are the same for each task. The purpose of using one specific permutation per task is to see if the network is susceptible to forgetting the previously learned associations when those samples are no longer included in the training set. In other words, does the internal mapping from inputs to outputs drift from the original (or any previous) dataset?
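
To make the setup concrete, here is a minimal sketch of how the permuted tasks could be built. It assumes the labels stay the same 10 digit classes and that tasks are presented one after another; the function name `make_permuted_tasks` and the random stand-in data are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def make_permuted_tasks(images, num_tasks, seed=0):
    """Return one fixed pixel permutation per task plus the permuted copies.

    images: array of shape (n_samples, n_pixels), e.g. (60000, 784) for MNIST.
    """
    rng = np.random.default_rng(seed)
    n_pixels = images.shape[1]
    perms = [rng.permutation(n_pixels) for _ in range(num_tasks)]
    # Indexing the pixel axis applies the same shuffle to every image of a task.
    tasks = [images[:, p] for p in perms]
    return perms, tasks

# Stand-in data so the sketch runs without downloading MNIST (assumption:
# real use would load the 60000 x 784 training images and their labels).
data_rng = np.random.default_rng(42)
images = data_rng.random((100, 784))
labels = data_rng.integers(0, 10, size=100)

perms, tasks = make_permuted_tasks(images, num_tasks=10)

# The labels are untouched, so every task is still 10-way digit classification.
# A continual-learning run would train on tasks[0], then tasks[1], and so on,
# re-testing on the earlier tasks to measure forgetting.
for task_id, task_images in enumerate(tasks):
    print(task_id, task_images.shape, labels.shape)
```

Shuffling all of the tasks into one big dataset instead would turn this into ordinary multi-task training, where forgetting cannot be measured.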

Yep, here it is:

Big thanks to @abhi, Karan Grewal and @akash_velu. That was a great interview.

Yeah, just a tiny correction - “Mixture of Experts”, or MoE, is basically just like that, except it’s all learnt and routed implicitly rather than through explicit connections as in what Numenta did. Analogously, the context vector and gating mechanism are learnt and regulated by the backpropagated gradients themselves.
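
As a rough illustration of the idea, here is a minimal forward-pass sketch of a dense mixture of experts: a softmax gate computed from the input weights the outputs of several linear experts. The class name `TinyMoE`, the layer sizes and the linear experts are assumptions for illustration; in a real model both the gate and the experts would be trained by backprop.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyMoE:
    """Dense mixture of linear experts with an input-dependent softmax gate."""

    def __init__(self, n_in, n_out, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = rng.normal(0.0, 0.1, size=(n_experts, n_in, n_out))
        self.gate = rng.normal(0.0, 0.1, size=(n_in, n_experts))

    def forward(self, x):
        # Gate: one weight per expert for each sample, rows sum to 1.
        gate_weights = softmax(x @ self.gate)                    # (batch, experts)
        # Every expert processes every sample.
        expert_out = np.einsum("bi,eio->beo", x, self.experts)   # (batch, experts, out)
        # Output: gate-weighted sum of the expert outputs.
        y = np.einsum("be,beo->bo", gate_weights, expert_out)    # (batch, out)
        return y, gate_weights

moe = TinyMoE(n_in=8, n_out=3, n_experts=4)
x = np.random.default_rng(1).normal(size=(5, 8))
y, gate_weights = moe.forward(x)
print(y.shape, gate_weights.sum(axis=1))
```

Note that nothing here tells the gate which task an input belongs to; any task-specific routing has to emerge from training.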

My issue was that they didn’t try such an obviously comparable architecture, leaning instead on something like XdG. I feel like I sound like a reviewer :sweat_smile: but it’s a little weird all in all.

Was it? I was under the impression that the TBT in his book relies heavily on HTM :thinking: ?

It actually was not …


Seems AMD’s next-gen hardware leaked early :rofl:
I hope you aren’t a troll, because you are pasting your name literally everywhere as if you’re some crown prince I am supposed to be in awe of…

Anyways, mind elaborating more?

TBT does not go into much detail about the architecture of each “brain” and I think it relies more on the concept of reference frames than HTM. It is not clear to me that reference frames will be implemented using HTM - the lack of progress on the HTM repository and the focus on deep learning suggests a move away from HTM but that is just a guess on my part.

Seems to be similar: Multiplexing Could Give Neural Networks a Big Boost - IEEE Spectrum
Basically, multiplexing data streams is like how a neuron is supposed to multiplex dendrites?

Seems quite different to me. “It could use this combined input to carry out multiple tasks that it is trained on at the same time, such as both recognizing names and classifying sentences, notes study senior author Karthik Narasimhan”. This makes it sound like they have not actually benchmarked with multiple tasks yet.

I guess the multiplexing could allow for more “overlap” in the different tasks and might share more activity between tasks. It would also be interesting to see if sparsity improves the multiplexing approach.

I guess you could combine the multiplexing approach with the dendrites approach, for example, speeding up the training of the dendritic approach. The degree of sparsity might need to decrease to keep performance.

Something closer to the dendritic approach might have N tasks and N distinct sparse networks then train each network on one task. It would be interesting to know if this performs better or worse.

This is a somewhat related paper that raises an interesting question: [1803.03635] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Should sparsity be imposed by the architecture, as per Numenta’s approach, or should sparsity be found as an optimization of a potentially dense network? It seems likely to me that brains find an optimal sparse network, but I don’t know the biology well enough to support that claim.

I think the difference is that they are multiplexing data streams through the whole multi-layer network, while an HTM neuron is only multiplexing streams in one node of the network?
But that node has compartmentalized dendrites, so it’s effectively a 5-8 layer network itself: Single cortical neurons as deep artificial neural networks - ScienceDirect

My assumption is that they are combining multiple inputs into a single input. So instead of sending in X samples over X iterations you send in two samples at the same time i.e. X samples in X/2 iterations. The multiplexing approach is training one dense network whereas the dendrites approach can be thought of as training multiple sub-networks. The multiplexing approach should still suffer from catastrophic forgetting if they attempt continual learning.

It’s multiple inputs into multiple outputs: multiplexing and then demultiplexing.

Obviously, there is an output stage, but I was referring to what the input is i.e. it is not “data streams” going through the ANN but one “multiplexed” data stream. But I have not read the paper (just the article) so may be off on a tangent.

OK, so it’s a mixture, not a majority. Which means there can be one expert for each subdomain, in which case there is no vote; it’s the designated expert’s output that counts.

That makes sense. To address some of your (@neel_g) issues with this paper:

  1. In their test on Permuted MNIST the context selector is trained too. Only for the multi-task RL problem is the task context manually designated, and they give some reasons for doing so. All I got is that it’s quite tricky to do that (task inference) in RL.

  2. XdG man, MoE was easy to find, this one is really cryptic. Context-dependent gating. Who would have thought of that? Numenta actually has a previous discussion on a relevant paper. XdG is even cited in the Numenta paper, which states that it needs task information to be provided at both training and inference time
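
For reference, a minimal sketch of what context-dependent gating amounts to: each task gets a fixed random binary mask over the hidden units, and the mask is selected by an explicitly supplied task ID. The helper names and the `keep_fraction` value are illustrative assumptions, not taken from the XdG or Numenta code.

```python
import numpy as np

def make_task_masks(num_tasks, n_hidden, keep_fraction=0.2, seed=0):
    """One fixed random binary mask per task; 1 means the unit stays active."""
    rng = np.random.default_rng(seed)
    return (rng.random((num_tasks, n_hidden)) < keep_fraction).astype(float)

def gated_hidden(x, W, b, masks, task_id):
    """ReLU hidden layer whose units are gated by the mask of the given task."""
    h = np.maximum(0.0, x @ W + b)
    # task_id must be supplied by the caller, at training AND at inference time.
    return h * masks[task_id]

masks = make_task_masks(num_tasks=3, n_hidden=50)
rng = np.random.default_rng(1)
W = rng.normal(size=(10, 50))
b = np.zeros(50)
x = rng.normal(size=(4, 10))

h0 = gated_hidden(x, W, b, masks, task_id=0)
h1 = gated_hidden(x, W, b, masks, task_id=1)
print(int(masks[0].sum()), int(masks[1].sum()))
```

The masks are never learned here, which is the contrast with a gate that has to be trained or inferred from the input.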

Not if they multiplex data streams categorized as different, same as different input streams per dendrite / SDR. Catastrophic forgetting would be if those streams are actually active at the same time, not if only one has a significant number of non-zero elements.

Regarding why they tested against a “classical” MLP instead of other “context routing” proposals, my guess is that the latter is a very tricky thing to do because:

  • all these methods are still in a “cool tricks we have done” infancy stage; none of them has reached its full potential. There is a lot of improvement and optimization still to come, and if the methods aren’t properly configured, chasing synthetic results such as accuracy at this stage could be very misleading
  • different methods presented in papers might have been tested against different problems/methodologies. Reproducing other works under similar conditions could easily lead to suboptimal models.
  • at this stage, comparing to a well-known and well-understood baseline like an MLP on various key points could be more informative.

PS: it is still unclear how relevant the argument over whether learned tasks are better than “labeled” ones even is. Sure, unsupervised training is apparently superior, but:

  • in real world problems labeling a task or purposely training on a specific task is way less costly than labeling millions of pictures or documents.
  • also, for us humans, knowing what we are learning is often trivially obvious. E.g. when a kid is learning to ride a bike, there are lots of cues that it isn’t about Pythagoras’ theorem or about swimming.
  • very often task inference is easier than the task itself. Let’s take Permuted-MNIST. Due to the nature of how digits are printed in their 28x28 square, with >200 points out of 784 almost (>99.9%) always white, it’s trivial to implement a pretty good “task inferencer” with a simple unsupervised method like clustering. So in this particular test (Permuted-MNIST) I don’t think any self-taught task selection feature is really relevant.
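
A minimal sketch of the kind of “task inferencer” meant in the last point, under loud assumptions: synthetic data stands in for MNIST (the first 200 features are always blank, mimicking the always-empty pixels), and each task’s prototype is the mean input of that task computed with known task IDs. A clustering step would replace that per-task mean in a fully unsupervised version.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, num_tasks = 784, 5

def fake_digits(n):
    """Stand-in for MNIST: random 'ink', with the first 200 pixels always blank."""
    x = rng.random((n, n_pixels))
    x[:, :200] = 0.0
    return x

# One fixed pixel permutation per task, as in Permuted MNIST.
perms = [rng.permutation(n_pixels) for _ in range(num_tasks)]

# Prototype of a task = mean input vector over that task's training samples.
train = [fake_digits(200)[:, p] for p in perms]
prototypes = np.stack([t.mean(axis=0) for t in train])   # (num_tasks, n_pixels)

def infer_task(x, prototypes):
    """Assign each sample to the task whose prototype is nearest (Euclidean)."""
    d = ((x[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# Fresh samples per task, labelled only for scoring the inference.
test = [fake_digits(50)[:, p] for p in perms]
accuracy = np.mean([(infer_task(t, prototypes) == k).mean()
                    for k, t in enumerate(test)])
print(f"task inference accuracy: {accuracy:.3f}")
```

Each permutation moves the blank pixels to different positions, so the per-task means differ strongly and nearest-prototype assignment has an easy job on this stand-in data.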

Self-taught task selection becomes interesting if/when it handles very similar, but different inputs in less obvious ways - learning to reuse resources on similar inputs while allocating new resources for differences.

All good, but that really defeats the purpose - because at that point I might as well just use if/else and load task-specific models :slight_smile: This isn’t robust, nor is it generalizable, nor is it biological (which is pretty much what Numenta aims for). It can be claimed as an advantage over DL, which has no such extra information, but that seems like a minor nitpick that IMO I am willing to let pass.

Which is indeed pretty weird - they compare to an absolutely vanilla MLP, which I find a bit unexpected. One wouldn’t compare to an MLP if the task were CV or NLP, right? There’s a reason why papers accepted at big conferences test against a ton of baselines :man_shrugging: I would prefer some other DL architecture (preferably MoE, but honestly even vanilla RL stuff like A3C wouldn’t have been bad).

It’s not about showing whether other baselines can outperform Numenta’s work, but more about properly testing and giving an unbiased view of one’s findings - being genuinely interested in discovering new ideas - especially in a publication, rather than going against weak baselines just to get a paper out. I may be naive here to expect such fair play, but I can’t help but yearn for a slightly more balanced community :slight_smile:

Simply put, it’s a weak baseline. You wouldn’t claim to have made a new discovery because dictionary lookup outperformed linear regression in NLP. Rather, you would try older baselines like LSTMs and, if serious, go up against the big boys like transformers.

Using a specific RL task without comparing to RL baselines is a bit…unsettling for me. Again, I might be sounding like a reviewer here, but that’s just my opinion :slight_smile:

It’s a very simple reason - the moment you get into hardcoding tricks, it becomes more and more like cheating. That’s the way GOFAI went, and you end up with Gary Marcus swelling with pride over how his hardcoded manual-labour-spent-coding algo outperformed other competitors in NetHack.

Intelligent behaviour is emergent, no reason to expect our “intelligent” systems to not do so too.

If it doesn’t matter for this paper (and you think you can do without it) - simply put, remove it. Try it. Publish it.

Lastly, the thing that’s really striking is how the paper shifts from a one-hot encoded context vector to “prototypes”. The scientific approach would be to run both one-hot encoding and prototypes in both experiments and compare the results to confirm validity.

I love what Numenta is doing here, but my points are about these unexplained inconsistencies in their testing methodology rather than the ideas, which make the work a little less credible IMHO
