A good reason to compare it to an MLP is that it applies certain “mutations” to an MLP, and it makes sense to compare the “mutant” with its ancestor.
Also, the MLP in itself is not as outdated as you suggest. Many if not most deep learning models have, somewhere in their structure, one or more feed-forward fully connected layers, which are equivalent to one or more layers of an MLP.
So it is not so much about bragging “look how much better we are than the MLP” as about sending a signal to future explorers: “look, we found something that could improve anything involving one or more fully connected dense layers”. IMO that is a much stronger proposition than comparing with a couple of alternative architectures which cannot easily and readily improve an MLP layer.
Regarding the self-learned selector … one point of the permuted MNIST test is to show that as long as you are in the realms of:
having labelled datasets (== supervised learning)
knowing a good/bad answer beforehand, e.g. language models
Then in these quite relevant realms, the task selection is self-learned and, contrary to your argument here, one does not need task-specific hand-coding or tuning.
Also, about biological plausibility, be honest: when was the last time you learned something without having a pretty good prior about what your task is? “Go solve your math homework, kid!”
Sure, we might have a more complex self-taught “task selector”, but that can be treated as a separate problem - an external specialist, if you like. You can design, train & test many kinds of selectors without having to change the basic model architecture.
That does not mean anything - you can’t use kWTA as a drop-in replacement for all FF layers, especially without interfacing with dense representations. Nor does this guarantee that those layers would hold up in specific architectures without experimentation - this is a highly invalid argument, based on lots of assumptions.
I’m talking about the Meta-World v2 Multi-Task 10 (MT10) task where there’s an explicit one-hot encoded context vector
I have no idea what you mean here - are you implying that the model presented in the paper doesn’t need such tuning?
The state vector provided to the model should be enough of a prior, wouldn’t you think so? Or do you have examples where RL algos are explicitly provided with such vectors and are highly successful in other benchmarks?
This is NOT a prior, and pretty much cheating. You don’t shift priors over experiments, from one-hot encoded context vectors to the so-called “prototypes”. It is pretty much exploiting what we mean by a prior - you might as well give the correct answers and call them priors as well?
This is pretty ridiculous if I am going to be honest
Do it then. Revise the paper. It’s pretty clear that this paper looks like a weak attempt, without results, to show “progress” in biological methods.
I would’ve been much more impressed if they hadn’t resorted to such base tricks and had still shown the results as they are - I am interested in the method, not benchmarks. Falsification only gets you so far, and you end up like Gary Marcus, eking out blog posts to attract laymen and get attention.
The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to help DL researchers train their models in a distributed manner. Although current DL frameworks scale well for image classification models, there remain opportunities for scalable distributed training on natural language processing (NLP) models.
Are you sure? Because at the end of page 3 in that paper it is stated:
The benchmark is NOT about figuring out what the problem is, but about the ability to learn to solve multiple problems.
And it makes sense if you consider that one task is “open the window” and another - in the same set - is “close the window”.
If you claim “the agent should infer a policy from the state of the window”, then it will end up opening the window when it is closed and closing it when it is open.
Try to apply that policy in a real application and see what happens: you get stuck in a window open/close loop forever. What would be the purpose of that?
Inferring local goals within various contexts might be an important step towards general AI, but that is NOT the aim of the Meta-World benchmark as you claim.
From what I understand, they don’t do that either. kWTA is only used to generate a selector SDR the size of the hidden layer, which has the following effect:
on the forward step, only those parameters corresponding to a “1” bit in the SDR are passed, which is equivalent to multiplying the output of the layer with the SDR.
on the backpropagation step, only the weights & biases selected by the 1s in the SDR are updated.
And there is no need to do it in all FF layers; doing it in only one (the most expensive one) is sufficient to obtain a certain degree of context-specific behavior with a certain degree of forgetting resistance. A minimal sketch of the masking idea follows below.
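For concreteness, here is a minimal PyTorch-style sketch of that masking mechanism as I read it. Everything here (the class name `SDRMaskedLinear`, deriving the SDR with a top-k over a context score vector, the shapes) is my own assumption for illustration, not the paper’s actual code:

```python
# Minimal sketch of an SDR-masked dense layer, assuming the selector SDR is
# derived from a per-context score vector via kWTA. Illustrative only.
import torch
import torch.nn as nn

class SDRMaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, k):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.k = k  # number of active ("1") bits in the selector SDR

    def forward(self, x, context_scores):
        # kWTA: keep the k largest context scores, zero the rest.
        # `context_scores` is assumed to have size out_features (per sample).
        top = torch.topk(context_scores, self.k, dim=-1).indices
        sdr = torch.zeros_like(context_scores).scatter_(-1, top, 1.0)
        # Multiplying the layer output by the SDR zeroes the non-selected
        # units; autograd then yields zero gradients for the weight rows and
        # biases of those units, so only the selected parameters get updated.
        return self.linear(x) * sdr
```

Because the mask zeroes the non-selected outputs, the “only selected weights & biases are updated” behavior above falls out of backpropagation for free; no custom gradient code is needed in this sketch.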
Well, what you replied earlier was so off-topic to what the thread was literally about (not to mention your strange behaviour of pasting your signature everywhere) that you sound like someone trying to use GPT-3 on the forum.
I couldn’t care less how many you need - my question is simple: can you substitute the same network in a similar family, like a transformer, and have it work? I have no idea how you interpreted it.
My point was that simply demonstrating it on an MLP doesn’t guarantee convergence on all architectures, while you went off on a tangent explaining their usage.
Nobody makes guarantees in this domain; everything here is almost alchemy: testing different ideas and watching the results. E.g. no one guaranteed the transformer paradigm would outperform CNNs on visual tasks at the time of the “Attention Is All You Need” paper.
There is, however, a strong indication of why it might work: the simple fact that it potentially allows an increase in the number of trainable parameters with relatively small penalties in training/inference computational costs.
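A back-of-envelope illustration of that cost argument (the numbers are mine and purely illustrative, not from the paper): with `d_in` inputs, a hidden layer of `N` units holds `d_in * N` trainable weights, but if the selector activates only `k` of those units per context, the per-step forward/backward work scales with `d_in * k`.

```python
# Illustrative arithmetic (my own numbers): capacity grows with layer width N,
# while per-step compute grows only with the number of selected units k.
d_in, N, k = 1024, 16384, 512

total_weights   = d_in * N   # trainable parameters available across all contexts
active_per_step = d_in * k   # weights actually touched in one forward/backward pass

print(f"total weights : {total_weights:,}")
print(f"used per step : {active_per_step:,} ({active_per_step / total_weights:.1%})")
```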
The simple fact that similar contexts will activate an overlapping subset of active parameters and dissimilar contexts will not (see the small sketch after this list) gives sufficient plausibility to the:
presumption that what is learned in a certain context (defined as state+goal) won’t be easily overridden by training in unrelated contexts.
presumption that what is learned in a certain context will be reusable in a similar context.
presumption that what is learned in any exact particular context will converge as the context-specific SDR will not select different active parameters.
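To put a number on that overlap intuition, here is a tiny NumPy sketch. It is entirely my own construction: in the actual model the context scores would presumably come from a learned projection, not raw random vectors.

```python
# Tiny illustration: kWTA masks from similar contexts overlap heavily,
# masks from unrelated contexts overlap only by chance (about k*k/N bits).
import numpy as np

rng = np.random.default_rng(0)
N, k = 1024, 64

def kwta_mask(scores, k):
    """Binary SDR keeping the k largest entries of `scores`."""
    mask = np.zeros_like(scores)
    mask[np.argpartition(scores, -k)[-k:]] = 1.0
    return mask

ctx_a = rng.normal(size=N)
ctx_b = ctx_a + 0.1 * rng.normal(size=N)   # slightly perturbed, "similar" context
ctx_c = rng.normal(size=N)                 # unrelated context

m_a, m_b, m_c = (kwta_mask(c, k) for c in (ctx_a, ctx_b, ctx_c))
print("overlap, similar contexts  :", int((m_a * m_b).sum()), "of", k)
print("overlap, unrelated contexts:", int((m_a * m_c).sum()), "of", k)
```

A nearby context keeps most of the same k winners, while an unrelated one shares only a handful by chance, which is the intuition behind the three presumptions above.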
Which pretty much contradicts your earlier posts, which is why I was bringing this up.
The paper clearly showed that scaling the model does absolutely nothing for performance. So the scalability of the proposed model remains in doubt unless proven otherwise.
Which phrase of mine sounds like I made any promise?
Any technique which can increase the number of parameters one can use for a given computational cost has a certain potential (notice the word doesn’t mean a guarantee) of increasing both performance and scalability.
If their performance target was “resistance to catastrophic forgetting” then it clearly showed a performance boost.