Deep Embedded Clustering as MoE

I always thought that models should be hierarchically segmented, but currently this is implemented as MoE, which is trained with supervision. So this chat with GPT is about implementing it via clustering. Apparently people have tried it with centroid clustering, but not with connectivity clustering, which is how I think it should be done.
Does it make sense, @neel_g?


Deep embedded clustering is an area of research in machine learning that focuses on combining representation learning with clustering, typically in an unsupervised manner. Here are some insights and resources on the topic:

  • The Deep Embedded Clustering (DEC) method is proposed as a way to learn feature representations and cluster assignments simultaneously using deep neural networks. It addresses the need for learning representations that are suitable for clustering analysis.
  • Deep Embedded K-Means Clustering has been noted for gaining momentum, leveraging the representational power of deep neural networks, such as autoencoders, to enhance the clustering process. The synergy between good representations and effective clustering is emphasized, where each aspect reinforces the other.
  • Contrastive deep embedded clustering is highlighted as a popular method in unsupervised learning due to its performance in data-mining applications. Models like stacked autoencoders are used to map data from the original space to a non-linear latent space, facilitating deep clustering.
  • An example is given where deep clustering models often employ autoencoders to assist traditional clustering algorithms, as in the DEC algorithm and the Improved Deep Embedded Clustering (IDEC) algorithm. Autoencoders are used for pre-training before clustering is performed.
  • A novel DEC model, referred to as deep embedded clustering with cluster-level representation learning (DECCRL), focuses on learning global representations necessary to capture the overall data structure at the cluster level.
  • Finally, deep embedded clustering models are described as being able to jointly optimize node clustering and representations. The DEC method is mentioned as a pioneer in this field, using an autoencoder to learn representation vectors and a clustering loss to obtain clusters.

These resources suggest that there are several models and methods within the realm of deep embedded clustering, each with its own approach to learning representations and performing clustering, often using deep neural networks and autoencoders as fundamental components.
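To make the DEC objective above concrete, here is a small numpy sketch of its soft assignment and self-training target. The embeddings and centroids are random toy values standing in for an encoder's output; only the formulas (Student's t kernel, sharpened target, KL loss) follow the DEC paper:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))    # toy embedded points (would come from an encoder)
mu = rng.normal(size=(3, 2))   # toy cluster centroids

# Soft assignment with a Student's t kernel (alpha = 1), as in the DEC paper
d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances, 6 x 3
q = 1.0 / (1.0 + d2)
q /= q.sum(axis=1, keepdims=True)

# Self-training target: sharpen q and normalize by soft cluster frequency
f = q.sum(axis=0)
p = q ** 2 / f
p /= p.sum(axis=1, keepdims=True)

# Clustering loss KL(P || Q), minimized w.r.t. both encoder and centroids
kl = float((p * np.log(p / q)).sum())
```

In the full method this loss is backpropagated through the encoder, so the representation and the cluster assignments improve jointly.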


This sounds like a mixture of experts?


A Mixture of Experts (MoE) model and deep embedded clustering share the common theme of using multiple components to handle different parts of the data, but they are designed for different purposes.

  • Mixture of Experts (MoE): This is a model architecture that involves a gating network and a set of expert networks. The gating network decides how much each expert contributes to the final output based on the input data. Each expert is trained to handle different regions of the input space, and the final prediction is a weighted combination of the outputs from all experts. MoE is used for supervised learning tasks.
  • Deep Embedded Clustering: This approach is generally used in unsupervised learning, where the goal is to learn feature representations that are conducive to clustering. The models often use deep neural networks to project data into a latent space where it is easier to define clusters. Unlike MoE, deep embedded clustering doesn’t usually involve a gating mechanism to weigh the contribution of different “experts” (in this case, clusters or components of the model). Instead, the model learns a representation that naturally segments the data into clusters.

While MoE models are a form of ensemble learning with multiple specialized models (experts) operating under a gating mechanism, deep embedded clustering focuses on learning a single representation space that reflects the inherent structure of the data, aiming to segment the data into meaningful clusters without supervision.
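For contrast, the MoE forward pass described above (a gating network weighing the outputs of all experts) fits in a few lines of numpy. The linear experts and gate here are toy stand-ins with random weights, not any particular published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # one input vector

# Three linear "experts" and a linear gating network (random toy weights)
experts = [rng.normal(size=(4, 2)) for _ in range(3)]
W_gate = rng.normal(size=(4, 3))

# Gating: softmax over experts, conditioned on the input
logits = x @ W_gate
gate = np.exp(logits - logits.max())
gate /= gate.sum()

# Output: gate-weighted combination of all expert outputs
y = sum(g * (x @ W) for g, W in zip(gate, experts))
```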


MoE doesn’t have to be supervised: the gating network can be a clustering algorithm, and subsequent retraining can be segmented too?


You’re correct that the Mixture of Experts (MoE) model does not inherently require supervision. The MoE architecture can indeed be adapted for unsupervised learning scenarios:

  • Unsupervised MoE: In this setup, the gating network could function similarly to a clustering algorithm, determining which ‘expert’ should handle a given input based on the learned cluster assignments rather than labels.
  • Clustering as Gating: If the gating network operates by clustering, each ‘expert’ would effectively become responsible for the data points associated with a given cluster. The experts could then further process or analyze the data points within their assigned cluster.
  • Segmented Retraining: Following the initial clustering by the gating network, each expert could be retrained or fine-tuned on their respective segment of the data. This can be an iterative process, with the gating network and experts being refined over time to better model the data distribution.

In this way, the experts are specialized to different subsets of the data, determined by the gating network’s clustering, and the overall architecture can learn a form of hierarchical clustering where each cluster has a specialized model (expert) associated with it. This allows the model to capture complex structures and relationships in the data by leveraging the diversity of the experts.
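A minimal sketch of the loop just described, assuming k-means as the gating/clustering step and per-cluster least-squares fits as stand-in "experts" (the 1-D data, cluster count, and piecewise-linear targets are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two 1-D blobs with different linear targets: a piecewise task
X = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(3, 0.5, 50)])[:, None]
y = np.where(X[:, 0] < 0, 2 * X[:, 0], -X[:, 0] + 1)

# Gating by clustering: plain k-means (k = 2) as the routing function
centroids = np.array([[X.min()], [X.max()]])
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    centroids = np.array([X[assign == k].mean(axis=0) for k in range(2)])

# Segmented retraining: fit one least-squares "expert" per cluster
experts = []
for k in range(2):
    Xk = np.c_[X[assign == k], np.ones((assign == k).sum())]
    experts.append(np.linalg.lstsq(Xk, y[assign == k], rcond=None)[0])
```

At inference an input would be routed to the expert of its nearest centroid, and the cluster/retrain steps could be iterated to refine both, as described above.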


Has this been implemented?


The concept of unsupervised or semi-supervised Mixture of Experts (MoE) is not mainstream, but it aligns with ongoing research in areas that combine clustering and representation learning. Researchers are continuously exploring hybrid models that integrate principles from different architectures to leverage their strengths for complex tasks, including unsupervised learning.

To my knowledge as of the last update in April 2023, while there may not be a widely recognized standard implementation of a purely unsupervised MoE model that uses clustering as a gating mechanism, the idea has conceptual merit and could exist in specialized applications or as part of research that has not yet become mainstream. It would represent an interesting intersection of clustering algorithms with the MoE framework, potentially leading to novel approaches in unsupervised learning.


This is using centroid clustering, has it been done with connectivity clustering?


To date, the implementation of Mixture of Experts (MoE) models that utilize connectivity-based clustering methods, like hierarchical clustering or minimum spanning tree clustering, has not been widely reported in mainstream machine learning literature. The majority of MoE models and deep embedding clustering methods typically employ centroid-based approaches, like k-means or Gaussian mixtures, because they are more straightforward to integrate with the backpropagation algorithm used to train neural networks.

Connectivity-based clustering methods focus on the proximity or connectedness of data points, forming clusters based on their distance from each other, which can reveal complex structures in the data. These methods, while powerful, are computationally intensive and less straightforward to combine with gradient-based learning algorithms typically used in neural networks.

However, the concept of using connectivity-based clustering within a Mixture of Experts framework is intriguing and could potentially lead to novel algorithms that capture data relationships in a more nuanced manner. If such an approach has been explored, it would likely be within the domain of advanced research projects or specialized applications. For concrete implementations or studies, one would have to look into the latest research papers or experimental studies in machine learning, possibly at the intersection of neural network design and complex network analysis.
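To make "connectivity-based gating" concrete, here is a speculative sketch where the routing decision is the connected component an input falls into, rather than the nearest centroid. The toy data and the epsilon-graph threshold are hand-picked for illustration:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
# Two well-separated blobs; the routing below uses only pairwise connectivity
X = np.concatenate([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

# Epsilon graph: connect points closer than a distance threshold
d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
adj = d < 1.5

# Connected components = clusters; the component id is the routing decision
labels = -np.ones(len(X), dtype=int)
for seed in range(len(X)):
    if labels[seed] >= 0:
        continue
    cid = labels.max() + 1
    queue = deque([seed])
    labels[seed] = cid
    while queue:
        i = queue.popleft()
        for j in np.flatnonzero(adj[i]):
            if labels[j] < 0:
                labels[j] = cid
                queue.append(j)
```

A new input would be routed by attaching it to the graph and taking the component of its nearest neighbor; the discrete component id is what makes this harder to combine with gradient-based training than a soft centroid distance.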


If you use cluster assignment as routing function in a MoE, by mapping an expert to each cluster then… I don’t know how well they will handle data placed at boundaries between clusters.

I would consider a wider learning-experience span for each expert, so that it sees not only same-cluster data points but also nearby ones that do not belong to the cluster, and can extrapolate for the spaces in between the clusters.

Just popped it out of my inner mini-gpt
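That overlap idea could be sketched like this: a hand-picked distance-ratio threshold decides which boundary points get shared between the two experts' training sets. All numbers here (data, centroids, the 0.8 threshold) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
# Assume these two centroids came out of some earlier clustering step
centroids = np.array([[-0.5, 0.0], [0.5, 0.0]])

d = np.sqrt(((X[:, None] - centroids[None]) ** 2).sum(-1))   # 200 x 2
nearest = d.argmin(axis=1)

# Overlapping training sets: an expert also sees points from the other
# cluster whose distance ratio is close to 1, i.e. near the boundary.
# The 0.8 threshold is the knob controlling the overlap width.
ratio = d.min(axis=1) / d.max(axis=1)   # in (0, 1]; valid for 2 clusters
train_sets = [np.flatnonzero((nearest == k) | (ratio > 0.8)) for k in range(2)]
```

Each expert then trains on its own cluster plus the near-boundary points of the other, so the pair can interpolate across the gap between clusters.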


Right, the clusters should be fuzzy and overlapping. But centroid clustering seems conceptually incoherent to me, while connectivity / density based clustering is procedurally quite difficult.


A few more thoughts:

A one-to-one mapping between a cluster and an expert will tend to generate imbalances in the dataset each expert is trained on.
E.g. in MNIST, if you have an expert for each of 10 clusters (regardless of how you build them, say you ended up with 10 clusters), one expert will see mostly “7”s, a few “1”s, a few “4”s and so on, because clustering by itself already gets a decent match on the classes (it just isn’t the most accurate).
Another possible imbalance (depending on the algorithm) is that each expert gets assigned a different number of data points, which may leave some experts under-trained while others overfit.

A workaround for the above imbalances is to have each expert assigned a pair of clusters (or more).
e.g. if there are 10 clusters an exhaustive pairing would need 45 experts each being routed a different pair of clusters.

The advantage of that is that each expert trains to distinguish between two cases, e.g. an expert learns to differentiate images that look like “7” from ones that look like cluster “2”.

For more clusters an exhaustive pairing gets expensive, in which case there can be other allocation criteria to fix the imbalances, like assigning some number of clusters randomly to each expert, so that each expert learns different contrasts within the sub-dataset it “sees”.

In this case the routing algorithm could pick the experts with the most “ambiguity”.
What does this mean? Take MNIST cluster-pair routing: if the router sees a “5”-like image, normally it should pick all 9 experts that were trained on the cluster centered on “5” (5-0, 5-1, … 5-9).

Yet it can pick fewer by selecting the 3 most ambiguous ones, e.g. the 5-2, 5-3 and 5-6 experts if the image is closer to the “2”, “3” or “6” clusters than to the remaining “0”, “1”, “4”, “7”, “8”, “9”-like ones.
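A sketch of that routing rule, assuming 10 centroids already exist and pair-experts are indexed by unordered cluster pairs; the centroids and the input embedding are random stand-ins:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
centroids = rng.normal(size=(10, 2))      # stand-ins for 10 MNIST-like clusters
pairs = list(combinations(range(10), 2))  # 45 pair-experts, one per cluster pair
x = rng.normal(size=2)                    # one input, e.g. a "5"-like embedding

d = np.sqrt(((centroids - x) ** 2).sum(-1))
c = int(d.argmin())                       # nearest cluster

# Of the 9 pair-experts containing c, route to the 3 most ambiguous:
# the ones pairing c with the next-closest clusters
ranked = np.argsort(d)
runners = [int(k) for k in ranked if int(k) != c][:3]
chosen = [tuple(sorted((c, k))) for k in runners]
```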

I hope this makes sense.

I’m not sure the way the clusters are made matters much, as long as the router behaves consistently across the training dataset with regard to whatever similarity metric you use.

What might help reduce the computational cost of the router (the clustering cost) is dimensionality reduction. I recently read the UMAP documentation, where a classifier’s performance improved significantly after reducing digit images to a ridiculously low 2 or 3 dimensions. I’m speculating here that clustering would benefit too, at least in computational cost, and probably in performance (quality/relevance of the clusters) as well.
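A hedged sketch of that idea, using PCA via SVD as a cheap stand-in for UMAP so the example stays dependency-free; the point is only that the router can cluster in the reduced space instead of the raw one:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 "images" in 64-D that really live near a 2-D subspace, plus a little noise
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 64)) + 0.01 * rng.normal(size=(100, 64))

# PCA via SVD: project onto the top-2 principal directions before clustering
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                        # the router now clusters in 2-D

# Fraction of variance the kept components explain
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
```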


It depends on how you define a cluster; there is no reason the definition should be different from that of an “expert”.
I am talking about hierarchical clustering, so those are just different levels.

Of course it matters: connectivity clustering works very differently, even with the same similarity measure. If you want an expert that has access to relevant info, it should be able to trace it through nearest neighbors, even if they are very different from the cluster average. Centroid clustering is a really perverse way to group data points; I think it’s basically searching under a streetlight, sheer convenience rather than functionality. I was torturing GPT on that a while back.
Even if you want to cluster data by similarity only, the logical way to do that is a connectivity clustering over a histogram. That’s just redefining the dimensions you order the data in.


The pipeline I talked about above is:

Input (a row in X) → clustering algorithm → router → trainable expert(s) → output (a row in Y)

I won’t dwell on equating clusters with experts, since it’s clear that is not the case.
If you are suggesting the router can be discarded by selecting one out of N experts based on the cluster (one out of N) assigned to a given input data point, I was just pointing out that this might not be the best routing strategy, regardless of how great the clustering algorithm is.

Exactly. If “water” is a relevant property an “expert” must learn about, and you feed it only subaquatic data, it will learn what every fish knows: there is no such thing as water. In order to learn the properties of water, the expert has to see it contrasted with data frames that lack water; otherwise it will cut out or ignore the most relevant water properties, since they seem to appear in every data frame.


There is no formal definition of what a cluster is, and the distinction from router would depend on that.
I guess you meant something like this:

Your point here is against clustering by similarity only. I agree with that; we need alternative clustering by variance. In fact it’s the most basic kind: edge detection in images, where the edge is a connectivity cluster of high-variance (gradient) pixels. The relative value of these two kinds of clustering is proportional to the rarity of similarity vs. variance, and the latter is rare in basic 3x3 kernels.
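A tiny sketch of that reading of edge detection: the “cluster” is the connected set of high-gradient pixels, found here with a forward-difference kernel on an invented two-tone image:

```python
import numpy as np

# Tiny two-tone image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Gradient magnitude from forward differences (the most basic variance kernel)
gx = np.abs(np.diff(img, axis=1))   # 8 x 7
edge_mask = gx > 0.5                # high-variance pixels

# The "edge cluster" is the connected set of high-gradient pixels: one column
rows, cols = np.nonzero(edge_mask)
```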

This is why the initial layers of a CNN converge on detecting variance, while self-attention detects similarity (represented by the dot product): similarity is rare/sparse in symbolic data. I have a chat on that too. But we really need both, beyond the most basic kernels.
Note that this distinction also maps to data vs. operations in experts: operations alter the data, which basically adds variance.


Well, I think this is where the confusion popped up, because many (if not most) ML folks think of clustering as a means of grouping whole data frames by whatever criteria one might find appropriate, while stuff like edge/corner/texture/eye detection is seen as either feature extraction or segmentation (or whatever is in between) within the current data frame.

Technically yes, edge detection can be regarded as a means of grouping (== clustering) pixels within an image, yet since you weren’t explicit about that point, I (at least) took it that you were talking about grouping whole images.

I’m not sure whether this news is relevant to your quest, but it sounds like it might:


I’m not sure exactly what it would mean to do “connectivity clustering” with MoEs. Routing specific clusters of data points to specific groups of experts? That’s what we already do :thinking: and the dense counterpart is the continuous extension, where you take a linear combination of all subnetworks.


Uh, can’t say I understand this. They have this opposition of embeddings to cross-attention, as if cross-attention doesn’t operate on embeddings.
Text is generally not a self-contained modality, so I don’t think you can have a consistent methodology to cluster it. To me, it’s not a modality at all, just a third-hand encoding of inputs from primary sensory modalities, especially vision.

The way I see it, experts should be clusters of datapoints, so your description sounds redundant. There is also a distinct gating network, I think that should simply be a higher-level cluster, with some power law correspondence between the size of dataset and the number of levels.
Dataset segmentation in MoE is vertical, through backprop from the gating network? I am suggesting that it should be lateral: all data points / lower clusters connected by strong pair-wise similarity &| difference, perhaps density-based. Or a hybrid, as in GNN segmentation: also by the strength of lateral pair-wise edges between data points / lower clusters, but with that strength vertically trained.


I still don’t get what you mean