Review by Yannic Kilcher: Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments (Review) - YouTube
Pretty good idea to open a thread on the paper
It's nice to see Numenta actually doing experiments and competing against benchmarks, but I suppose overall this paper was pretty mild. Some aspects of the experiments were "weird", to put it politely, and the DL baseline they compared against wasn't an MoE, despite that being the simplest and most comparable architecture to what the authors propose, minus the sparsity.
Using the one-hot encoded context vector also seems a bit off IMO - shouldn't this be derived from the environment automatically rather than supplied during training at all? And then later on they ditch that idea for "prototypes", which is an even more bizarre concept.
(This was brought up by Kilcher in the video in some form or other too, so clarification regarding those points would be much appreciated.)
Overall, I think the results I was looking for just weren't there - though I applaud the effort and find their approach very interesting. Looking forward to more papers aggressively grounding Numenta's work.
P.S.: I'm still a newbie in this field; my aim is not to be offensive but simply to express curiosity regarding some inconsistencies.
I believe there will be an interview with some of the authors too. That will hopefully be even more interesting.
I agree it is good to see something concrete. That the approach uses SGD gives me the impression that HTM has been abandoned.
Thanks.
Can someone please explain some details of how the Permuted MNIST test works?
For MNIST we have a 60,000-digit training set, 10 classes, and a 10,000-digit test set.
What I get is that in Permuted MNIST with T tasks (e.g. T=10):
- each task gets assigned a fixed random permutation of the input space (the 784 pixels of a digit image)
- these T permutations are used to generate a 60000xT training dataset.
- whatever "classifier" is used to identify digits is then trained with this "expanded" dataset.
What I don't get is:
- Do the output classes remain 10, or will there be 10xT classes too? Which would mean the classifier has to identify not only the value of a scrambled input digit but also the task to which it belongs.
- Is the classifier trained one task at a time, e.g. first trained on task 1, then 2, 3, ..., or are the 60000xT digits all shuffled into one large dataset? I mean, the ordered case would provide a more meaningful measurement of forgetting, by evaluating the classifier against the first dataset after it has been trained on the 1, 2, ..., T subsequent datasets.
@neel_g - what's a MoE?
MoE: majority of experts. It's sort of like an ensemble voting mechanism. Depending on the context it could refer to a panel of humans attempting the same task, or an ensemble of classifiers.
Unless I am mistaken, the categories are the same for each task. The purpose of using one specific permutation per task is to see if the network is susceptible to forgetting the previously learned associations when those samples are no longer included in the training set. In other words, does the internal mapping from inputs to outputs drift from the original (or any previous) dataset.
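To make that concrete, here is a minimal sketch of the usual Permuted-MNIST continual-learning protocol (a generic example assuming PyTorch/torchvision, not the paper's actual code): as far as I understand, the labels stay 0-9 for every task, tasks are presented one after another, and earlier tasks are re-tested after each new one.

```python
import torch
from torch import nn
from torchvision import datasets, transforms

T = 10  # number of tasks
perms = [torch.randperm(28 * 28) for _ in range(T)]  # one fixed pixel permutation per task

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
test_set = datasets.MNIST("data", train=False, download=True, transform=transforms.ToTensor())

# A single 10-way classifier shared by all tasks: the labels stay 0-9 for every task.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def run_epoch(dataset, perm, train_mode):
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=train_mode)
    correct = 0
    for x, y in loader:
        x = x.view(x.size(0), -1)[:, perm]  # apply this task's permutation to the 784 pixels
        out = model(x)
        if train_mode:
            opt.zero_grad()
            loss_fn(out, y).backward()
            opt.step()
        correct += (out.argmax(dim=1) == y).sum().item()
    return correct / len(dataset)

# Tasks are presented sequentially; after finishing task t we re-test every earlier task,
# and the accuracy drop on those earlier tasks is the "forgetting" being measured.
for t in range(T):
    run_epoch(train_set, perms[t], train_mode=True)
    accs = [round(run_epoch(test_set, perms[s], train_mode=False), 3) for s in range(t + 1)]
    print(f"after task {t}: accuracies on tasks 0..{t} = {accs}")
```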
Yeah, just a tiny correction - "Mixture of Experts" or MoE is basically just like that, except it's all learnt and routed implicitly, rather than having explicit connections like what Numenta did. Analogously, the context vector and gating mechanism are learnt and regulated by the backpropped gradients themselves.
My issue was that they didn't try such an obviously comparable architecture, leaning instead on something like XdG. I feel like I sound like a reviewer, but it's a little weird all in all.
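For reference, this is roughly what a toy Mixture-of-Experts layer with a learned gate looks like (a generic dense-gated sketch of my own, not the routing from any particular paper):

```python
import torch
from torch import nn

class MoELayer(nn.Module):
    """Toy dense-gated MoE: every expert runs and a learned gate mixes their outputs."""
    def __init__(self, d_in, d_out, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)  # gating is learned from the input itself

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_out)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # gate-weighted mixture

layer = MoELayer(784, 10)
y = layer(torch.randn(32, 784))  # -> (32, 10)
```

The gate and the experts are trained end-to-end by backprop, which is the "implicit routing" contrast with a manually supplied context vector.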
Was it? I was under the impression that the TBT theory in his book relies heavily on HTM?
It actually was not…
Processor REZA SANAYE
Seems AMD's next-gen hardware leaked early
I hope you aren't a troll, because you are pasting your name literally everywhere as if you're some crown prince I am supposed to be in awe of…
Anyways, mind elaborating more?
TBT does not go into much detail about the architecture of each "brain", and I think it relies more on the concept of reference frames than on HTM. It is not clear to me that reference frames will be implemented using HTM - the lack of progress on the HTM repository and the focus on deep learning suggest a move away from HTM, but that is just a guess on my part.
Seems to be similar: Multiplexing Could Give Neural Networks a Big Boost - IEEE Spectrum
Basically, multiplexing data streams is like how a neuron is supposed to multiplex its dendrites?
Seems quite different to me. "It could use this combined input to carry out multiple tasks that it is trained on at the same time, such as both recognizing names and classifying sentences," notes study senior author Karthik Narasimhan. This makes it sound like they have not actually benchmarked with multiple tasks yet.
I guess the multiplexing could allow for more "overlap" in the different tasks and might share more activity between tasks. It would also be interesting to see if sparsity improves the multiplexing approach.
I guess you could combine the multiplexing approach with the dendrites approach, for example, speeding up the training of the dendritic approach. The degree of sparsity might need to decrease to keep performance.
Something closer to the dendritic approach might be to have N tasks and N distinct sparse networks, and then train each network on one task. It would be interesting to know whether this performs better or worse.
This is a somewhat related paper that raises an interesting question: [1803.03635] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. Should sparsity be imposed by the architecture, as per Numenta's approach, or should sparsity be found as an optimization of a potentially dense network? It seems likely to me that brains find an optimal sparse network, but I don't know the biology well enough to support that claim.
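To illustrate the "find the sparsity by optimization" side: the lottery-ticket recipe is roughly train dense, keep the largest-magnitude weights, rewind the survivors to their initial values, and retrain. A bare-bones sketch of that step (my own simplification, not the paper's code):

```python
import torch
from torch import nn

def magnitude_mask(weight, keep_frac=0.2):
    """Binary mask keeping only the largest-magnitude keep_frac of the weights."""
    k = max(1, int(weight.numel() * keep_frac))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

layer = nn.Linear(784, 256)
init_weight = layer.weight.detach().clone()  # remember the initialization

# ... train the dense layer as usual here ...

mask = magnitude_mask(layer.weight.detach())
with torch.no_grad():
    layer.weight.copy_(init_weight * mask)   # rewind surviving weights to init, zero the rest

# ... retrain; after every optimizer step multiply layer.weight.data by `mask`
#     so the pruned connections stay at zero ...
```

The contrast with Numenta's approach is that here the sparse sub-network is discovered from a dense one by training, rather than built into the architecture up front.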
I think the difference is that they are multiplexing data streams through the whole multi-layer network, while an HTM neuron is only multiplexing streams within one node of the network?
But that node has compartmentalized dendrites, so it's effectively a 5-8 layer network itself: Single cortical neurons as deep artificial neural networks - ScienceDirect
My assumption is that they are combining multiple inputs into a single input. So instead of sending in X samples over X iterations you send in two samples at the same time i.e. X samples in X/2 iterations. The multiplexing approach is training one dense network whereas the dendrites approach can be thought of as training multiple sub-networks. The multiplexing approach should still suffer from catastrophic forgetting if they attempt continual learning.
It's multiple inputs into multiple outputs: multiplexing and then demultiplexing.
Obviously, there is an output stage, but I was referring to what the input is, i.e. it is not "data streams" going through the ANN but one "multiplexed" data stream. But I have not read the paper (just the article), so I may be off on a tangent.
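Purely as a guess at the mechanism (I haven't read the paper either), the multiplex/demultiplex idea in miniature would be something like this: two inputs are projected and summed into a single hidden stream, the shared body processes that one stream, and separate heads recover one prediction per original input.

```python
import torch
from torch import nn

class TinyMux(nn.Module):
    """Guess at the idea: 2 inputs are mixed into one hidden stream, then demuxed."""
    def __init__(self, d_in=784, d_hidden=256, n_classes=10):
        super().__init__()
        self.mux = nn.ModuleList(nn.Linear(d_in, d_hidden) for _ in range(2))
        self.body = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.demux = nn.ModuleList(nn.Linear(d_hidden, n_classes) for _ in range(2))

    def forward(self, x1, x2):
        h = self.mux[0](x1) + self.mux[1](x2)      # multiplex: one combined stream
        h = self.body(h)                           # shared processing of the single stream
        return self.demux[0](h), self.demux[1](h)  # demultiplex: one prediction per input

net = TinyMux()
y1, y2 = net(torch.randn(8, 784), torch.randn(8, 784))  # 2 samples per forward pass
```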
OK, so it's a mixture, not a majority. Which means there can be one expert for each subdomain, in which case there is no vote - it's the designated expert's output that counts.
That makes sense. To address some of your (@neel_g) issues with this paper:
- In their test on Permuted MNIST the context selector is trained too. Only for the multitask RL problem is the context/task manually designated, and they give some reasons for why they do so. All I got is that it's quite tricky to do that (task inference) in RL. (See the sketch after this list for what I understand the prototype-based context selection to be.)
- XdG, man - MoE was easy to find, but this one is really cryptic. Context-dependent gating. Who would have thought of that? Numenta actually has a previous discussion on a relevant paper. XdG is even cited in Numenta's paper, stating that it needs task information to be provided at both training and inference time.
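A rough sketch of what I understand the prototype-based context selection to be: the prototype for each task is (something like) the mean of that task's training inputs, and at test time the context whose prototype is closest to the current input is used. This is my paraphrase, not Numenta's actual code.

```python
import torch

class PrototypeContext:
    """Per-task prototype = mean of that task's inputs; at test time pick the nearest."""
    def __init__(self):
        self.prototypes = []  # one vector per task, in the order tasks were seen

    def update(self, task_inputs):      # task_inputs: (N, 784); the task is known at train time
        self.prototypes.append(task_inputs.mean(dim=0))

    def infer(self, x):                 # x: (784,); the task is unknown at test time
        dists = torch.stack([(x - p).norm() for p in self.prototypes])
        return int(dists.argmin())      # index of the inferred task / context

ctx = PrototypeContext()
ctx.update(torch.rand(1000, 784))   # task 0
ctx.update(torch.rand(1000, 784))   # task 1
task_id = ctx.infer(torch.rand(784))
```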
Not if they multiplex data streams that are categorized as different - the same as having different input streams per dendrite / SDR. Catastrophic forgetting would occur if those streams were actually active at the same time, not if only one has a significant number of non-zero elements.
Regarding why they test against a "classical" MLP instead of other "context routing" proposals, my guess is that it is a very tricky thing to do because:
- all these methods are still in a "cool tricks we have done" infancy stage; none of them has reached its full potential. There is a lot of room for improvement and optimization, and if they aren't properly configured, chasing some synthetic result (e.g. accuracy) at this stage could be very misleading
- different methods presented in papers might have been tested against different problems/methodologies. Reproducing other works under similar conditions could easily lead to sub-optimal models.
- at this stage, comparing against a well-known and well-understood baseline like an MLP on various key points could be more informative.
P.S.: even for the argument over whether learned tasks are better than "labeled" ones, it is still unclear how relevant it is. Sure, unsupervised training is apparently superior, but:
- in real-world problems, labeling a task or purposely training on a specific task is way less costly than labeling millions of pictures or documents.
- also for us humans, knowing what we are learning is often trivially obvious. E.g. for a kid learning to ride a bike, there are lots of cues that it isn't about Pythagoras' theorem or about swimming.
- very often task inference is easier than the task itself. Let's take Permuted-MNIST. Due to the nature of how digits are drawn in their 28x28 square, with >200 of the 784 pixels almost always (>99.9%) blank, it's trivial to implement a pretty good "task inferencer" with a simple unsupervised method like clustering (see the sketch after this post). So in this particular test (Permuted-MNIST) I don't think any self-taught task selection feature is really relevant.
Self-taught task selection becomes interesting if/when it handles very similar but different inputs in less obvious ways - learning to reuse resources on similar inputs while allocating new resources for differences.
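To illustrate that last point, a quick sketch of unsupervised task inference by clustering - here with synthetic stand-in data rather than real MNIST, just to show the shape of the idea (k-means over the raw permuted inputs, then checking how well clusters line up with tasks):

```python
import numpy as np
from sklearn.cluster import KMeans

T, n_per_task = 10, 1000
rng = np.random.default_rng(0)
perms = [rng.permutation(784) for _ in range(T)]

# Stand-in for MNIST digits: sparse "ink", with ~200 pixel positions that are almost always blank.
base = (rng.random((n_per_task, 784)) < 0.2).astype(np.float32)
base[:, :200] = 0.0

# Each task sees the same kind of images, but with its own fixed pixel permutation.
x = np.concatenate([base[:, p] for p in perms])
true_task = np.repeat(np.arange(T), n_per_task)

# Unsupervised "task inferencer": cluster the inputs and check cluster/task agreement.
pred = KMeans(n_clusters=T, n_init=10, random_state=0).fit_predict(x)
for c in range(T):
    members = true_task[pred == c]
    purity = np.bincount(members, minlength=T).max() / max(len(members), 1)
    print(f"cluster {c}: {len(members)} samples, purity {purity:.2f}")
```

Because each permutation scatters the almost-always-blank positions to a different set of columns, inputs from different tasks end up far apart and the clusters should line up with the tasks.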
All good, but that really defeats the purpose - because at that point I might as well just use if/else and load task-specific models. This isn't robust, nor is it generalizable, nor is it biological (which is pretty much what Numenta attempts to do). It can be claimed as an advantage over DL, which has no such extra information, but that seems like a minor nitpick IMO that I am willing to let pass.
Which is indeed pretty weird - they compare to an absolutely vanilla MLP, as before, which I find a bit unexpected. One doesn't compare to an MLP if the task is CV or NLP, right? There's a reason why papers accepted at big conferences test against a ton of baselines. I would prefer some other DL architecture (preferably MoE, but honestly even vanilla RL stuff like A3C wouldn't have been bad).
It's not about showing whether other baselines can outperform Numenta's work, but more about properly testing and giving an unbiased view of one's findings - being genuinely interested in discovering new ideas, especially in a publication - rather than going against weak baselines just to get a paper out. I may be naive to expect such fair play, but I can't help but yearn for a slightly more balanced community.
Simply put, it's a weak baseline. You wouldn't claim to have made a new discovery because a dictionary lookup outperformed linear regression in NLP. Rather, you would try older baselines like LSTMs and, if serious, go against the big boys like transformers.
Using a specific RL task without comparing to RL baselines is a bit... unsettling for me. Again, I might be sounding like a reviewer here, but that's just my opinion.
It's a very simple reason - the moment you get into hardcoding tricks, it becomes more and more like cheating. That's the way GOFAI went, and you end up with Gary Marcus gloating over how his hardcoded, manual-labour-heavy algo outperformed other competitors in NetHack.
Intelligent behaviour is emergent; there's no reason to expect our "intelligent" systems not to be the same.
If it doesn't matter for this paper (and you think you can do without it) - simply put, remove it. Try it. Publish it.
Lastly, the thing that's really striking is how the paper shifts from a one-hot encoded context vector to "prototypes". The scientific method would be to compare one-hot encoding and prototypes across both experiments to confirm validity.
I love what Numenta is doing here, but my points are about these unexplained inconsistencies in their testing methodology rather than the ideas; they lend it a little less credibility IMHO.