Deep Embedded Clustering as MoE

I always thought that models should be hierarchically segmented, but currently this is implemented as MoE, which is supervised. So this chat with GPT is about implementing it via clustering. Apparently people have tried it with centroid clustering, but not with connectivity clustering, as I think it should be done.
Does it make sense, @neel_g?


Deep embedded clustering models are an area of research in machine learning that focus on combining representation learning with clustering, typically in an unsupervised manner. Here are some insights and resources on the topic:

  • The Deep Embedded Clustering (DEC) method is proposed as a way to learn feature representations and cluster assignments simultaneously using deep neural networks. It addresses the need for learning representations that are suitable for clustering analysis.
  • Deep Embedded K-Means Clustering has been noted for gaining momentum, leveraging the representational power of deep neural networks, such as autoencoders, to enhance the clustering process. The synergy between good representations and effective clustering is emphasized, where each aspect reinforces the other.
  • Contrastive deep embedded clustering is highlighted as a popular method in unsupervised learning due to its performance in data-mining applications. Models like Stacked Autoencoders are used to map data from the original space to a non-linear latent space, facilitating deep clustering.
  • An example is given where deep clustering models often employ autoencoders to assist traditional clustering algorithms, such as in the DEC algorithm and the Improved Deep Embedded Clustering (IDEC) algorithm. Autoencoders are used for pre-training before clustering is performed.
  • A novel DEC model, referred to as deep embedded clustering with cluster-level representation learning (DECCRL), focuses on learning global representations necessary to capture the overall data structure at the cluster level.
  • Finally, deep embedded clustering models are described as being able to jointly optimize node clustering and representations. The DEC method is mentioned as a pioneer in this field, using an autoencoder to learn representation vectors and a clustering loss to obtain clusters.

These resources suggest that there are several models and methods within the realm of deep embedded clustering, each with its own approach to learning representations and performing clustering, often using deep neural networks and autoencoders as fundamental components.
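To make the "clustering loss" mentioned above concrete, here is a minimal NumPy sketch of the quantities DEC works with: a Student's t soft assignment of embeddings to centroids, a sharpened target distribution, and the KL divergence between the two. The encoder is left out; `z`, `centroids` and the toy data are assumptions for illustration only, not taken from any implementation discussed here.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's t kernel between embeddings z (n, d) and centroids (k, d)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpen q: square it and normalise by the soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    """KL(P || Q), minimised with respect to the encoder and the centroids."""
    return float((p * np.log(p / q)).sum())

# Toy embeddings: two blobs standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])

q = soft_assign(z, centroids)
p = target_distribution(q)
loss = clustering_loss(p, q)
```

Note that the assignment is centroid-based by construction: each cluster is summarised by a single point in the latent space.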


This sounds like a mixture of experts?


A Mixture of Experts (MoE) model and deep embedded clustering share the common theme of using multiple components to handle different parts of the data, but they are designed for different purposes.

  • Mixture of Experts (MoE): This is a model architecture that involves a gating network and a set of expert networks. The gating network decides how much each expert contributes to the final output based on the input data. Each expert is trained to handle different regions of the input space, and the final prediction is a weighted combination of the outputs from all experts. MoE is used for supervised learning tasks.
  • Deep Embedded Clustering: This approach is generally used in unsupervised learning, where the goal is to learn feature representations that are conducive to clustering. The models often use deep neural networks to project data into a latent space where it is easier to define clusters. Unlike MoE, deep embedded clustering doesn’t usually involve a gating mechanism to weigh the contribution of different “experts” (in this case, clusters or components of the model). Instead, the model learns a representation that naturally segments the data into clusters.

While MoE models are a form of ensemble learning with multiple specialized models (experts) operating under a gating mechanism, deep embedded clustering focuses on learning a single representation space that reflects the inherent structure of the data, aiming to segment the data into meaningful clusters without supervision.
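As a rough illustration of the gating mechanism described above, the sketch below computes a dense MoE forward pass: a softmax gate weighs the outputs of linear experts. The shapes, the linear experts and the random weights are all assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, gate_w, expert_ws):
    """x: (n, d_in); gate_w: (d_in, n_experts); expert_ws: (n_experts, d_in, d_out)."""
    gates = softmax(x @ gate_w)                     # (n, n_experts), rows sum to 1
    outs = np.einsum("nd,edo->neo", x, expert_ws)   # every expert applied to every input
    return (gates[:, :, None] * outs).sum(axis=1), gates

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
gate_w = rng.normal(size=(4, 3))
expert_ws = rng.normal(size=(3, 4, 2))
y, gates = moe_forward(x, gate_w, expert_ws)
```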


MoE doesn’t have to be supervised: the gating network can be a clustering algorithm, and subsequent retraining can be segmented too?


You’re correct that the Mixture of Experts (MoE) model does not inherently require supervision. The MoE architecture can indeed be adapted for unsupervised learning scenarios:

  • Unsupervised MoE: In this setup, the gating network could function similarly to a clustering algorithm, determining which ‘expert’ should handle a given input based on the learned cluster assignments rather than labels.
  • Clustering as Gating: If the gating network operates by clustering, each ‘expert’ would effectively become responsible for the data points associated with a given cluster. The experts could then further process or analyze the data points within their assigned cluster.
  • Segmented Retraining: Following the initial clustering by the gating network, each expert could be retrained or fine-tuned on their respective segment of the data. This can be an iterative process, with the gating network and experts being refined over time to better model the data distribution.

In this way, the experts are specialized to different subsets of the data, determined by the gating network’s clustering, and the overall architecture can learn a form of hierarchical clustering where each cluster has a specialized model (expert) associated with it. This allows the model to capture complex structures and relationships in the data by leveraging the diversity of the experts.
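A minimal sketch of "clustering as gating" followed by segmented training, under strong simplifying assumptions: the router is a plain k-means (centroid-based, used here only because it is short), the experts are least-squares linear models, and the data is a made-up piecewise-linear toy. Only one round is shown; the iterative refinement described above would repeat the two steps.

```python
import numpy as np

def assign(X, centroids):
    """Router: index of the nearest centroid for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20):
    """Plain Lloyd iterations with deterministic farthest-first initialisation."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d2.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            if (labels == j).any():   # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def with_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def fit_experts(X, Y, labels, k):
    """Segmented training: one least-squares linear expert per cluster."""
    return [np.linalg.lstsq(with_bias(X[labels == j]), Y[labels == j], rcond=None)[0]
            for j in range(k)]

def predict(X, centroids, experts):
    labels = assign(X, centroids)
    out = np.empty((len(X), experts[0].shape[1]))
    for j, W in enumerate(experts):
        out[labels == j] = with_bias(X[labels == j]) @ W
    return out

rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(-4, -2, (100, 1)), rng.uniform(2, 4, (100, 1))])
Y = np.where(X < 0, 2 * X + 1, -3 * X + 5)   # a different linear rule in each region
centroids = kmeans(X, k=2)
experts = fit_experts(X, Y, assign(X, centroids), k=2)
pred = predict(X, centroids, experts)
```

No labels are used for routing: the only supervision is the regression target each expert fits within its own segment.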


Has this been implemented?


The concept of unsupervised or semi-supervised Mixture of Experts (MoE) is not mainstream, but it aligns with ongoing research in areas that combine clustering and representation learning. Researchers are continuously exploring hybrid models that integrate principles from different architectures to leverage their strengths for complex tasks, including unsupervised learning.

To my knowledge as of the last update in April 2023, while there may not be a widely recognized standard implementation of a purely unsupervised MoE model that uses clustering as a gating mechanism, the idea has conceptual merit and could exist in specialized applications or as part of research that has not yet become mainstream. It would represent an interesting intersection of clustering algorithms with the MoE framework, potentially leading to novel approaches in unsupervised learning.


This is using centroid clustering, has it been done with connectivity clustering?


To date, the implementation of Mixture of Experts (MoE) models that utilize connectivity-based clustering methods, like hierarchical clustering or minimum spanning tree clustering, has not been widely reported in mainstream machine learning literature. The majority of MoE models and deep embedded clustering methods typically employ centroid-based approaches, like k-means or Gaussian mixtures, because they are more straightforward to integrate with the backpropagation algorithm used to train neural networks.

Connectivity-based clustering methods focus on the proximity or connectedness of data points, forming clusters based on their distance from each other, which can reveal complex structures in the data. These methods, while powerful, are computationally intensive and less straightforward to combine with gradient-based learning algorithms typically used in neural networks.

However, the concept of using connectivity-based clustering within a Mixture of Experts framework is intriguing and could potentially lead to novel algorithms that capture data relationships in a more nuanced manner. If such an approach has been explored, it would likely be within the domain of advanced research projects or specialized applications. For concrete implementations or studies, one would have to look into the latest research papers or experimental studies in machine learning, possibly at the intersection of neural network design and complex network analysis.
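For contrast with centroid methods, here is a small sketch of one connectivity-based scheme: link every pair of points closer than `eps` and take the connected components of that graph (equivalent to cutting a single-linkage hierarchy at `eps`). The threshold `eps` and the two-line toy data are assumptions for illustration.

```python
import numpy as np

def connectivity_clusters(X, eps):
    """Label points by connected components of the graph linking pairs closer than eps."""
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    for i, j in zip(*np.where(np.triu(d < eps, k=1))):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

# Two long parallel lines of points, 3 units apart, neighbours 0.5 apart.
xs = np.arange(0, 10.5, 0.5)
line_a = np.column_stack([xs, np.zeros_like(xs)])
line_b = np.column_stack([xs, np.full_like(xs, 3.0)])
X = np.vstack([line_a, line_b])
labels = connectivity_clusters(X, eps=1.0)
```

Each line comes out as one cluster because its points chain together through near neighbours, even though the ends of a line are far from its mean. A two-centroid method on the same data would tend to cut across both lines instead, since that split has the lower within-cluster variance. The all-pairs distance matrix also hints at the computational cost, and nothing in the procedure is differentiable.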


If you use cluster assignment as routing function in a MoE, by mapping an expert to each cluster then… I don’t know how well they will handle data placed at boundaries between clusters.

I would consider a wider learning experience span for each expert so they see not only same cluster data points but also nearby ones that do not belong to the cluster, so they can extrapolate for spaces in between the clusters.
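One possible way to widen each expert's training set, sketched under assumptions: a point trains expert j if cluster j is its nearest centroid or nearly so. The `margin` parameter and the centroid-distance criterion are hypothetical choices, not something proposed in the thread.

```python
import numpy as np

def overlapping_membership(X, centroids, margin=0.5):
    """Boolean (n, k) mask: point i trains expert j if centroid j is within
    (1 + margin) times the distance to the point's nearest centroid."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    return d <= (1.0 + margin) * d.min(axis=1, keepdims=True)

centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
X = np.array([[0.5, 0.0],    # deep inside cluster 0
              [1.9, 0.0],    # near the boundary
              [3.8, 0.0]])   # deep inside cluster 1
mask = overlapping_membership(X, centroids)
```

Only the boundary point ends up in both training sets, so both experts get to see the region between the clusters.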

Just popped it out of my inner mini-gpt


Right, the clusters should be fuzzy and overlapping. But centroid clustering seems conceptually incoherent to me, while connectivity / density based clustering is procedurally quite difficult.


A few more thoughts:

A one-to-one mapping between a cluster and an expert will tend to generate imbalances in the dataset each expert is trained on.
E.g. in MNIST, if you have an expert for each of 10 clusters (regardless of how you build them, let’s say you ended up with 10 clusters), one expert will see mostly “7”s plus a few “1”s, a few “4”s and so on, because clustering by itself already gets a decent match on classes (it just isn’t the most accurate).
Another possible imbalance (depending on the algorithm) is that each expert gets assigned a different number of data points, which may leave some experts undertrained while others overfit.

A workaround for the above imbalances is to have each expert assigned a pair of clusters (or more).
E.g. if there are 10 clusters, an exhaustive pairing would need 45 experts, each being routed a different pair of clusters.
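The count follows from choosing 2 clusters out of 10. A quick check, which also lists the experts that involve one particular cluster:

```python
from itertools import combinations
from math import comb

n_clusters = 10
pair_experts = list(combinations(range(n_clusters), 2))   # (0, 1), (0, 2), ..., (8, 9)
assert len(pair_experts) == comb(n_clusters, 2) == 45

# Experts that involve cluster 5: one per other cluster.
experts_for_5 = [pair for pair in pair_experts if 5 in pair]
```

The number of pairs grows quadratically: 100 clusters would already need `comb(100, 2) == 4950` experts.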

The advantage is that each expert trains to distinguish between two cases, e.g. an expert learns to differentiate images looking like “7” from ones looking like “2”.

For more clusters an exhaustive pairing gets expensive, in which case other allocation criteria can be used to fix imbalances. For example, some number of clusters could be assigned randomly to each expert, so that each expert learns different contrasts within the sub-dataset it “sees”.

In this case the routing algorithm could pick the experts with the most “ambiguity”.
What does this mean?
(E.g. MNIST cluster-pair routing:) if the router sees a “5”-like image, normally it should pick all 9 experts that were trained with the cluster centered on “5” (5-0, 5-1, … 5-9).

Yet it can pick fewer by selecting the 3 most ambiguous ones, e.g. the 5-2, 5-3 and 5-6 experts, if the image is closer to the “2”, “3” or “6” clusters than to the remaining “0”, “1”, “4”, “7”, “8”, “9”-like ones.
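A minimal sketch of that routing rule, assuming clusters are summarised by centroids and "ambiguity" is measured by distance to the rival centroids. The function name and the 1-D centroid positions are made up so that the example reproduces the 5-2, 5-3, 5-6 case.

```python
import numpy as np

def route_by_ambiguity(x, centroids, n_experts=3):
    """Return the pair-experts (as sorted cluster pairs) that contrast x's home
    cluster with the n_experts rival clusters closest to x."""
    d = np.linalg.norm(centroids - x, axis=1)
    order = np.argsort(d)
    home, rivals = int(order[0]), order[1:1 + n_experts]
    return [tuple(sorted((home, int(r)))) for r in rivals]

# Hypothetical 1-D centroid positions for clusters 0..9.
centroids = np.array([[9.0], [8.0], [2.0], [1.0], [7.0],
                      [0.1], [1.5], [6.0], [5.0], [10.0]])
x = np.array([0.0])                       # a "5"-like input
chosen = route_by_ambiguity(x, centroids, n_experts=3)
```

Here 3 experts are evaluated instead of 9; `n_experts` trades compute against coverage of the less likely confusions.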

I hope this makes sense.

I’m not sure the way clusters are made matters much, as long as the router behaves consistently across the training dataset with regard to whatever metric of similarity you use.

What might help to reduce the computational cost of the router (clustering cost) is dimensionality reduction. I recently read the UMAP documentation, where a classifier’s performance was improved significantly after reducing digit images to a ridiculously low 2 or 3 dimensions. I’m speculating here that clustering would benefit too, at least regarding computation costs and probably performance (quality/relevance of clusters) too.
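A sketch of the "reduce first, cluster after" idea. To stay self-contained it swaps UMAP for a plain PCA projection computed with NumPy, which is a much cruder, linear reduction; with the umap-learn package installed, `umap.UMAP(n_components=2).fit_transform(X)` would take the place of `pca_project`. The 64-dimensional blobs are made-up data.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Linear stand-in for UMAP: project centred data onto its top principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 1.0, size=(100, 64))
blob_b = rng.normal(3.0, 1.0, size=(100, 64))
X = np.vstack([blob_a, blob_b])

Z = pca_project(X, n_components=2)   # (200, 2): the router now works on 2 numbers per point
```

Any distance-based router then touches 2 coordinates per point instead of 64. Whether cluster quality improves as well depends on the data and on the reduction method.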


It depends on how you define cluster, there is no reason the definition should be different from “expert”.
I am talking about hierarchical clustering, so that’s just different levels.

Of course it matters: connectivity clustering works very differently, even with the same similarity measure. If you want an expert that has access to relevant info, it should be able to trace it through nearest neighbors, even if they are very different from cluster average. Centroid clustering is a really perverse way to group data points, I think that’s basically searching under a streetlight, sheer convenience rather than functionality. I was torturing GPT on that a while back.
Even if you want to cluster data by similarity only, the logical way to do that is a connectivity clustering over a histogram. That’s just redefining the dimensions you order the data in.


The pipeline I was talking about above is:

Input (a row in X) → clustering algorithm → router → trainable expert(s) → output (a row in Y)

I won’t dwell on equating clusters with experts, since it’s clear that is not the case.
If you suggest the router can be discarded by selecting one out of N experts based on the (one out of N) cluster assigned to any given input datapoint, I was just pointing out that this might not be the best routing strategy, regardless of how great the clustering algorithm is.

Exactly, if “water” is a relevant property an “expert” must learn about, then if you feed it only subaquatic data it will learn what every fish knows: there is no such thing as water. In order to learn the properties of water, the expert must see it contrasted with data frames that lack water; otherwise it will cut out or ignore the most relevant water properties, as they seem to appear in every data frame.


There is no formal definition of what a cluster is, and the distinction from router would depend on that.
I guess you meant something like this:

Your point here is against clustering by similarity only. I agree with that, we need alternative clustering by variance. In fact it’s the most basic kind: edge detection in images, where the edge is a connectivity cluster of high-variance (gradient) pixels. Relative value of these two kinds of clustering is proportional to the rarity of similarity vs. variance, and the latter is rare in basic 3x3 kernels.
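To illustrate "an edge is a connectivity cluster of high-gradient pixels", here is a small sketch: threshold the gradient magnitude, then group the surviving pixels by 8-neighbour connectivity. The threshold and the toy image (a bright band on a dark background) are assumptions.

```python
import numpy as np
from collections import deque

def edge_clusters(img, threshold):
    """Label connected groups of pixels whose gradient magnitude exceeds threshold."""
    gy, gx = np.gradient(img.astype(float))
    strong = np.hypot(gx, gy) > threshold
    labels = np.zeros(img.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(strong)):
        if labels[start]:
            continue                      # already part of an earlier edge
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:                      # breadth-first flood fill over 8-neighbours
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < img.shape[0] and 0 <= cc < img.shape[1]
                            and strong[rr, cc] and not labels[rr, cc]):
                        labels[rr, cc] = current
                        queue.append((rr, cc))
    return labels, current

img = np.zeros((6, 12))
img[:, 3:9] = 1.0                         # bright vertical band: two step edges
labels, n_edges = edge_clusters(img, threshold=0.25)
```

The two sides of the band come out as two separate edges. Pixels inside the band and in the background have zero gradient and stay unlabeled, so the clusters are formed by difference between neighbours, not by similarity of values.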

This is why initial layers of CNN converge on detecting variance, while self-attention detects similarity (represented by dot product): similarity is rare/sparse in symbolic data. I have a chat on that: But we really need both, beyond the most basic kernels.
Note that this distinction also maps to data vs. operations in experts: operations alter the data, which basically adds variance.


Well, I think here’s where the confusion came from, because many (if not most) ML folks think of clustering as a means of grouping whole data frames by whatever criteria one might find appropriate, while stuff like edge/corner/texture/eye detection is seen as either feature extraction or segmentation (or whatever is in between) within the current data frame.

Technically yes, edge detection can be regarded as a means of grouping (== clustering) pixels within an image, yet since you weren’t clear about that point, I (at least) understood that you were talking about grouping whole images.

I’m not sure whether this news is relevant to your quest, but it sounds like it might:


I’m not sure exactly what it would mean to do “connectivity clustering” with MoEs. Routing specific clusters of datapoints to specific groups of experts? That’s what we already do :thinking: and the dense counterpart is the continuous extension, where you take a linear combination of all subnetworks.
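The relation between sparse routing and its dense counterpart can be sketched in a few lines: top-k gating keeps only the k largest gate logits per input, and with k equal to the number of experts it reduces to the dense linear combination. The function names and random logits are illustrative assumptions; production routers typically add noise and load-balancing terms that are omitted here.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def top_k_gates(logits, k):
    """Keep the k largest logits per row, softmax over them, zero out the rest."""
    idx = np.argsort(logits, axis=1)[:, -k:]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=1), axis=1)
    return softmax(masked)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))      # 4 inputs, 8 experts
sparse = top_k_gates(logits, k=2)     # each input routed to 2 experts
dense = top_k_gates(logits, k=8)      # every expert contributes
```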


Uh, can’t say I understand this. They have this opposition of embeddings to cross-attention, as if cross-attention doesn’t operate on embeddings.
Text is generally not a self-contained modality, so I don’t think you can have a consistent methodology to cluster it. To me, it’s not a modality at all, just a third-hand encoding of inputs from primary sensory modalities, especially vision.

The way I see it, experts should be clusters of datapoints, so your description sounds redundant. There is also a distinct gating network, I think that should simply be a higher-level cluster, with some power law correspondence between the size of dataset and the number of levels.
Dataset segmentation in MoE is vertical, through backprop from the gating network? I am suggesting that it should be lateral: all datapoints / lower clusters connected by strong pair-wise similarity &| difference, perhaps density-based. Or hybrid as GNN segmentation: also by the strength of lateral pair-wise edges between datapoints / lower clusters, but that strength is vertically trained.


I still don’t get what you mean


It’s confusing because I am talking about top-down hierarchical MoE, which would predict the input according to its “world model”, starting from the most general expert (prediction is already a basic fitness function in current models). It would map predicted input locations in space-time to corresponding bottom-layer experts. For question-answering MoE, the “space” is mapping to known users and their surroundings, and time is delay from and context of their latest query.

This is radically different from current bottom-up MoEs, where dedicated gating network is mapping input to the bottom-layer expert according to the content of the former, etc.
Basically an active globally integral system that maps input to expert even before it arrives, vs. reactive and localized ones we have now.


I guess the experts should really be bidirectional, with the encoder part specializing on lower-layer experts and the decoder part on higher-layer experts. The latter would correspond to the gating network?


Your idea of configuring the experts in a hierarchical Mixture of Experts (MoE) model to be bidirectional, with an encoder part focusing on lower-layer experts and a decoder part on higher-layer experts, is an innovative approach. This setup can indeed create a more dynamic and interactive structure within the MoE framework. Let’s explore how this could work and its implications:

  1. Bidirectional Experts:
  • Encoder Part: The encoder part of each expert would specialize in processing and understanding the outputs from lower-layer experts. This part would effectively analyze and compress information from more basic, detailed data into a more abstract representation.
  • Decoder Part: The decoder part would focus on interacting with higher-layer experts. It would take the abstract representation from the encoder and map it to the higher-layer experts.
  2. Integration with Hierarchical Structure:
  • This bidirectional setup allows for a more nuanced interaction between layers. Lower layers can provide detailed, context-specific information, while higher layers can interpret this information in a broader context.
  • The encoder-decoder structure within each expert facilitates a two-way flow of information, potentially leading to a more cohesive and integrated overall model.
  3. Role of the Decoder as a Gating Network:
  • In a traditional MoE model, a separate gating network is responsible for determining which expert handles a given input. In your proposed structure, the decoder part of each expert could assume a role similar to a gating network.
  • The decoder could make decisions on whether to process the information further within its layer, pass it on to higher-layer experts, or even send feedback to lower layers. This creates a form of dynamic routing that’s more integrated with the processing capabilities of the experts themselves.
  4. Challenges and Considerations:
  • Complexity: This structure adds complexity to the model, both in terms of architecture and the required training algorithms.
  • Training and Optimization: Training such a model to effectively balance the encoding and decoding responsibilities, and to optimize the interactions between layers, would be a significant challenge.
  • Computational Resources: The increased complexity might demand more computational resources, especially if the model needs to process information in real-time.
  5. Potential Benefits:
  • This approach could lead to a more adaptive and efficient allocation of tasks within the MoE framework.
  • The bidirectional flow of information might enhance the model’s ability to handle complex, hierarchical data structures and make more informed predictions or decisions.

This doesn’t have to be one system, rather a global market of various-generality experts, developed by different companies / teams and periodically fine-tuned on new context. Bottom-layer user-specific expert may sit on a local machine, collect new info during day and fine-tune on it at night.


The concept of a global market of various-generality experts, developed and maintained by different entities, combined with user-specific bottom-layer experts operating locally, is a fascinating and forward-looking idea. It blends elements of distributed computing, collaborative development, and personalized machine learning. Let’s explore the components and implications of this model:

  1. Global Market of Experts:
  • Diverse Expertise: Different teams or companies could develop experts specialized in various domains or tasks. This diversity allows for a wide range of expertise to be available in the market.
  • Collaborative Development: This approach encourages collaboration and competition in the AI field, potentially accelerating innovation and the development of more sophisticated models.
  • Periodic Fine-Tuning: Experts can be periodically updated or fine-tuned based on new data or contexts, ensuring that they remain effective and up-to-date.
  2. Local User-Specific Experts:
  • Personalized Learning: A bottom-layer expert residing on a local machine would focus on individual user behavior and preferences. It can collect data throughout the day and learn from it, becoming increasingly personalized.
  • Privacy Considerations: Local processing of personal data addresses privacy concerns, as sensitive information does not need to be transmitted or stored centrally.
  • Nightly Fine-Tuning: The idea of the local system updating or fine-tuning itself during periods of low activity (e.g., at night) is an efficient use of resources and minimizes interference with the user’s daily activities.
  3. Challenges and Considerations:
  • Integration and Compatibility: Ensuring seamless integration and compatibility among experts developed by different entities would be a significant challenge. Standardized interfaces and protocols might be necessary.
  • Quality Control and Trust: Maintaining high standards of quality and reliability across a diverse market of experts is crucial. Mechanisms for evaluating and certifying the performance of these experts might be needed.
  • Economic and Ethical Aspects: The creation of a market for AI experts raises questions about economic models, intellectual property rights, and ethical considerations in AI development and deployment.
  4. Potential Applications:
  • Versatile AI Solutions: Such a system could offer highly versatile and adaptable AI solutions, catering to a wide range of tasks and user needs.
  • Rapid Deployment and Scalability: The ability to quickly deploy and scale AI solutions based on a combination of global and local experts could be beneficial in many sectors, from personalized healthcare to smart city management.
  5. Technological Infrastructure:
  • Robust Networking and Security: Robust and secure networking infrastructure would be essential to support the communication and data exchange among various experts and local systems.

Still doesn’t make sense to me. I’ll need a diagram here :upside_down_face:


Here is an example:

"Our approach therefore is to use a MIXture of AutoEncoders (MIXAE), each of which should identify a non-linear mapping suitable for a particular cluster. The autoencoders are trained simultaneously with a mixture assignment network via a composite objective function, thereby jointly motivating low reconstruction error (for learning each manifold) and cluster identification error. This kind of joint optimization has been shown to have good performance in other unsupervised architectures [31] as well. The main advantage in combining clustering with representation learning this way is that the two parts collaborate with each other to reduce the complexity and improve representation ability–the latent vector itself is a low-dimensional data representation that allows a much smaller classifier network for mixture assignment, and is itself learned to be well-separated by clustering"


But they use K-means, which is centroid-based. I am suggesting that clustering should be connectivity-based: those AutoEncoders should map to graphs.


Actually at page 2 they say they do not use k-means:

  1. Mixture of Autoencoders
    Consider the problem of clustering a set of n points x_1, ..., x_n ∈ R^d into k clusters. The k-means algorithm represents each cluster by a centroid. In our approach, rather than representing a cluster by a centroid, we represent each cluster by an autoencoder that is specialized in reconstructing objects belonging to that cluster.

Intriguing paper, thanks. An abstract “thing” there is represented by an autoencoder expert which is trained on various incarnations of its own “thing”. When given a foreign “thing”, it will reject it simply by observing that the encoded-decoded output is a poor match for the input.
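The "reject by poor reconstruction" idea can be sketched with a deliberately simple stand-in: instead of neural autoencoders, each expert is a linear autoencoder (a PCA fit to its own cluster), and a query is assigned to the expert that reconstructs it best. This is not either paper's training procedure; the class name, `latent_dim` and the two toy "things" are assumptions.

```python
import numpy as np

class LinearAutoencoder:
    """Stand-in expert: encode by projecting onto the top principal directions
    of its own training data, decode by mapping back."""
    def __init__(self, X, latent_dim=1):
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.basis = vt[:latent_dim]

    def reconstruct(self, X):
        code = (X - self.mean) @ self.basis.T
        return code @ self.basis + self.mean

    def error(self, X):
        return ((X - self.reconstruct(X)) ** 2).sum(axis=1)

# Two "things", each living near its own line in 3-D (hypothetical data).
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=(100, 1))
thing_a = t * np.array([1.0, 0.0, 0.0]) + rng.normal(0, 0.01, (100, 3))
thing_b = (t * np.array([0.0, 1.0, 1.0]) + np.array([0.0, 0.0, 2.0])
           + rng.normal(0, 0.01, (100, 3)))
experts = [LinearAutoencoder(thing_a), LinearAutoencoder(thing_b)]

queries = np.array([[0.5, 0.0, 0.0],    # lies on A's line
                    [0.0, 0.5, 2.5]])   # lies on B's line
errors = np.stack([e.error(queries) for e in experts], axis=1)   # (n_queries, n_experts)
assigned = errors.argmin(axis=1)
```

Each query is reconstructed almost perfectly by its own expert and badly by the other, which is the "is this a duck?" test in numeric form. The same error could also be thresholded to reject inputs that no expert recognises.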

Sounds like what we do when in doubt. “Is this a duck?”


Yes, sorry, I misunderstood. I actually meant another, very similar paper: [1712.07788] Deep Unsupervised Clustering Using Mixture of Autoencoders. Interestingly, they don’t seem to cite each other.

But I think the middle layer of the autoencoder here is a kind of fuzzy group of centroids too.
I think experts should be formed by segmenting a graph:

Graph Mixture of Experts [2304.02806] Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
"Inspired by the recent progress of MoEs in large language models (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Clark et al., 2022; Roller et al., 2021), GMoE comprises multiple “experts” at each layer, with each expert being an independent aggregation function with its own trainable parameters. During training, the model learns to select the most appropriate aggregation experts for each node, resulting in nodes with similar neighborhood information being routed to the same aggregation experts. In this way, each expert is trained to focus on its own subgroup of training samples that share similar neighborhood information pattern (e.g., shorter or longer range, less or more to aggregate). To further take advantage of the diversity, GMoE employs aggregation experts with different inductive biases. Specifically, each layer of GMoE includes aggregation experts with varying aggregation hop sizes, where experts with larger hop sizes focus on nodes requiring longer-range information and vice versa. "

Or [2311.05185] Mixture of Weak & Strong Experts on Graphs
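To make the GMoE quote above a bit more tangible, here is a toy layer in the same spirit: each expert aggregates neighbourhood features over a different hop size, and a per-node gate mixes the experts. It is a sketch under heavy assumptions (dense softmax gating, mean aggregation with self-loops, random weights, a 4-node path graph), not the architecture from either paper.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hop_aggregate(A, X, hops):
    """Mean-aggregate node features over `hops` steps of the self-looped graph."""
    A_hat = A + np.eye(len(A))
    P = A_hat / A_hat.sum(axis=1, keepdims=True)    # row-normalised adjacency
    for _ in range(hops):
        X = P @ X
    return X

def toy_graph_moe_layer(A, X, expert_ws, hop_sizes, gate_w):
    gates = softmax(X @ gate_w)                     # (n_nodes, n_experts)
    outs = np.stack([hop_aggregate(A, X, h) @ W     # one aggregation expert per hop size
                     for h, W in zip(hop_sizes, expert_ws)], axis=1)
    return (gates[:, :, None] * outs).sum(axis=1), gates

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)           # path graph 0-1-2-3
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
hop_sizes = [1, 2]                                  # short-range and longer-range expert
expert_ws = rng.normal(size=(2, 3, 2))
gate_w = rng.normal(size=(3, 2))
H, gates = toy_graph_moe_layer(A, X, expert_ws, hop_sizes, gate_w)
```

The experts here differ in how far they look along graph edges, but the gate still routes on node features; segmenting the graph itself into experts would be a further step.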
