NN, centroid clustering, search

OK, I was not exactly using standard terminology above, because I don't know of any that fits. So I ran it by ChatGPT; it has no clue either:

Lateral vs. vertical learning: connectivity-based clustering is done within each layer, while centroid-based clustering and backprop work through feedback between layers. Have they been framed in terms of this dichotomy before?

ChatGPT
As of my last knowledge update in September 2021, the terms “lateral learning” and “vertical learning” are not standard terminology in the field of machine learning to describe clustering and neural network learning techniques. However, the distinction you are making between operations within a single layer and feedback between layers is an interesting perspective.

To elaborate on this perspective:

Connectivity-Based Clustering: In hierarchical clustering, which is a form of connectivity-based clustering, you typically focus on the relationships between data points in a single data representation (which can be thought of as a single “layer” of representation). The algorithm measures distances or similarities between data points and groups them based on connectivity.

Centroid-Based Clustering: Similar to connectivity-based clustering, centroid-based clustering such as k-means also operates on a single data representation, measuring distances to centroids.

Backpropagation in Neural Networks: This is fundamentally different from clustering. Backpropagation is an algorithm used for training neural networks, where weights are adjusted based on errors. This involves feedback between layers, as you mention, where the error gradients are propagated back through the network layer by layer.

It seems like you are making a distinction between algorithms that operate on a single representation of the data (single “layer”) and those that involve interactions between multiple layers (as in neural networks). While this distinction is conceptually valid, as of September 2021, I’m not aware of standard terminology that frames this distinction as “lateral learning” versus “vertical learning.”

In academia and industry, clustering techniques (both connectivity-based and centroid-based) are generally considered as types of unsupervised learning algorithms, whereas backpropagation is an optimization technique used in training neural networks as part of supervised learning.

Yes, they have tried most of the combinations one can think of, e.g. https://www.cs.toronto.edu/~hinton/absps/Outrageously.pdf

The basic points are:

  • have many expert networks, e.g. 1000.
  • have a trainable router network that outputs a 1000-sized vector,
  • take the top K (e.g. 50) router outputs and pass the input through the corresponding 50 experts.
  • the outputs of those 50 experts are simply summed and fed into the next layer, which can be a following block of (router + experts); a minimal sketch follows below.

The router could be fed the same input as the main network or be trained on some contextual data. Lots of options there too.
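
To make the top-K routing concrete, here is a minimal NumPy sketch of the scheme as described above. The sizes and the single-matrix "experts" are placeholders of my own, not anything taken from the paper:

```python
import numpy as np

# Minimal sketch of the routing scheme described above, with made-up sizes
# (1000 experts, top-K = 50, each "expert" reduced to one dense layer).
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 1000, 50

router_w = rng.standard_normal((d_model, n_experts)) * 0.02           # router
expert_w = rng.standard_normal((n_experts, d_model, d_model)) * 0.02  # experts

def moe_block(x):
    """One (router + experts) block: x is a (d_model,) activation vector."""
    scores = x @ router_w                 # 1000-sized router output
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-K experts
    # Pass the input through the selected experts and simply sum their outputs,
    # as described above (many implementations also weight by the router scores).
    return sum(x @ expert_w[e] for e in chosen)

y = moe_block(rng.standard_normal(d_model))  # output fed to the next block
print(y.shape)                               # (64,)
```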

Personally, I would attempt to figure out a way to train the router first (e.g. by some form of self-supervised clustering), such that the experts are consistently selected by the router.
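
One possible reading of that suggestion (my assumption, not necessarily what was meant): fit k-means centroids on a sample of inputs and use nearest-centroid assignment as a fixed router, so the same kind of input always selects the same experts:

```python
import numpy as np

# Sketch: pretrain the router as a self-supervised clustering step, then keep
# the routing fixed while the experts train. Toy sizes, plain k-means.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

def kmeans(data, k, iters=20):
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # squared distance of every point to every centroid, shape (N, k)
        d2 = ((data ** 2).sum(1, keepdims=True)
              - 2 * data @ centroids.T
              + (centroids ** 2).sum(1))
        labels = d2.argmin(1)
        for c in range(k):                 # recompute centroid means
            if (labels == c).any():
                centroids[c] = data[labels == c].mean(0)
    return centroids

def route(x, centroids):
    """Fixed routing: the experts whose centroids are closest to the input."""
    d2 = ((centroids - x) ** 2).sum(1)
    return np.argsort(d2)[:top_k]

sample = rng.standard_normal((1000, d_model))   # unlabeled inputs
centroids = kmeans(sample, n_experts)
print(route(sample[0], centroids))              # consistent expert choice
```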

No. In MoEs, the E stands for "expert" because it's a nice analogous concept - which I guess Americans really love. They aren't "experts" in anything; AFAWK there hasn't been any evidence to suggest that. The idea that they would specialize in some domain was the driving factor behind their development, but in reality SGD finds that offloading tasks to a multitude of experts and taking a cumulative output works better.

The reason people don't like MoEs is that they're insanely hard to train, very unstable, and require clever (specific) tricks to scale properly - which is just not worth the effort.

They still have specializations, just overlapping and not easily defined?

Or hard to design properly. But not as hard as general parameterized connectivity clustering, especially without a coherent theory. I think this will change though: coding will get a lot easier with LLM tools.

And we get rapidly diminishing returns from current brute-forcing, especially with raw data: video, etc.
Intelligent design will ultimately win :).

No - they have very little specialization. The routing uses top-k, so realistically all information just diffuses through k experts at a given time and is then aggregated by the MoE.

Sure, just that MoEs are far from the optimal and stable architecture we need for scale; they carry no helpful inductive biases, and are more of a hindrance than anything due to noisy gradients and sensitivity to initialization…

OK, thanks, I assumed there was some truth in the labels.
But that's for conventional MoEs, where there are hundreds or thousands of experts.
ChatGPT is supposed to have only 8 experts; couldn't they have different, manually defined training sets: social media, literature, scientific literature, code, etc.?

I doubt ChatGPT is a MoE - and even if it is, I highly doubt it's 8 experts alone. None of those experts would be capable, individually or otherwise, of exhibiting the same reasoning capability GPT-4 exhibits. Clearly, they're doing something entirely different here…

The “leaks” keep insisting GPT-4 is actually a MoE, the most recent one being:

  • 16 experts, 110B parameters each
  • 2 out of 16 are evaluated on each step
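
Taken at face value, that would mean roughly 16 × 110B ≈ 1.76T parameters in total (fewer unique ones if some layers are shared across experts, as speculated below), with only about 2 × 110B ≈ 220B active on any given step.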

I think a few of the issues that give MoEs poor results are:

  • an unstable router, meaning that if you train both the “experts” and the router at the same time, and the router later changes its choices during training, whatever expert weights were updated during the initial stage of training will not impact the end result. A significant part of the training compute budget would be wasted.
    A couple of ideas to address that:
    • use a small, pretrained model of the same depth as the router, running it
      in inference mode during both training and inference of the larger model.
    • or even get by with a raw reservoir as the router.
  • The second one you pointed out yourself: as the SDR-educated folks here know, a low-resolution SDR (e.g. 1/8 or 2/16) at the router lacks expressivity. A nice-to-have MoE would pick 100 of 1000 “experts”, but that makes branchless GPU training even more challenging.
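
To put rough numbers on that expressivity point, one can simply count the distinct expert subsets each routing resolution can express:

```python
from math import comb

# Number of possible expert subsets at each routing resolution.
print(comb(8, 1))       # 1-of-8  routing:   8 possible subsets
print(comb(16, 2))      # 2-of-16 routing: 120 possible subsets
print(comb(1000, 100))  # 100-of-1000 routing: ~6e139 possible subsets
```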

There is a particular challenge in using MoEs with transformers, which stems from the fact that transformer training efficiency comes from processing a whole token window in a single forward step, while a context-sensitive router would pick some experts for the phrase “my name is” and different experts once another token arrives, e.g. for “my name is John”.
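
A toy illustration of that context sensitivity, using a purely made-up router that scores experts from the mean embedding of the prefix seen so far (nothing here reflects an actual transformer router), so appending one token can change the selected experts:

```python
import numpy as np

# Hypothetical prefix-conditioned router: expert choice depends on the prefix.
rng = np.random.default_rng(1)
d_model, n_experts, top_k = 16, 8, 2

embed = {t: rng.standard_normal(d_model) for t in ["my", "name", "is", "John"]}
router_w = rng.standard_normal((d_model, n_experts))

def chosen_experts(tokens):
    prefix = np.mean([embed[t] for t in tokens], axis=0)   # prefix summary
    return sorted(np.argsort(prefix @ router_w)[-top_k:])

print(chosen_experts(["my", "name", "is"]))          # experts for "my name is"
print(chosen_experts(["my", "name", "is", "John"]))  # may differ after "John"
```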

Maybe OpenAI stumbled onto a relatively simple algorithmic trick to efficiently solve this conundrum, and that would explain the secrecy around the very fact that they used a MoE - so as not to hint competitors toward discovering the same trick.

PS: part of the trick could be that, according to the above leak, only the FF layers within each transformer block are MoE-fied; the attention part - which would be a real nightmare to split into MoEs - is still monolithic.

To keep speculating here, the leaks also suggested that, unlike usual transformer training, the training was done twice on the same dataset.

It could be that the first training stage was simply training one single 110B transformer, and then they forked it into 16 instances which were trained as a MoE in a second stage on the same dataset.

This would save costs and should alleviate the instability issues of training a MoE from scratch. They could even freeze the attention layers in the second stage to save compute and use their outputs for stable routing.
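
Purely to make the speculation concrete, here is a toy PyTorch rendering of that two-stage idea; the block, the sizes, and the forking/freezing steps are my own illustration of the guess above, not a known recipe:

```python
import copy
import torch.nn as nn

# Stage 1: train one dense transformer block. Stage 2: fork its FF part into
# 16 experts and freeze the shared attention weights (speculative sketch).

class Block(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return self.ff(x + a)

dense = Block()
# ... stage 1: train `dense` normally on the dataset ...

# Stage 2: fork the trained FF layer into 16 experts; attention stays shared
# and frozen, so its (stable) outputs could also drive the routing.
experts = nn.ModuleList([copy.deepcopy(dense.ff) for _ in range(16)])
for p in dense.attn.parameters():
    p.requires_grad = False
```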
