Cluster-Aware Upcycling Overview
- Cluster-Aware Upcycling is a methodology that leverages unsupervised clustering to adapt and specialize neural network experts in MoE architectures.
- It partitions normalized activations with spherical k-means and employs truncated SVD to initialize expert weights and router parameters effectively.
- Empirical results demonstrate improved zero-shot/few-shot performance and reduced redundancy, enabling robust adaptation in domain and transfer learning.
Cluster-Aware Upcycling (CAU) is a set of methodologies that leverage unsupervised clustering of data representations to facilitate model "upcycling": the adaptation or initialization of neural network architectures for new learning regimes, particularly in Mixture-of-Experts (MoE) architectures and domain adaptation scenarios. By aligning model structure and initialization with the intrinsic structure of observed data, CAU aims to promote expert specialization, reduce redundancy, enable robust adaptation under domain/category shift, and deliver improved performance across transfer and few-shot settings (Chu et al., 15 Apr 2026, Qu et al., 2023).
1. Cluster-Aware Upcycling in Mixture-of-Experts Specialization
Cluster-Aware Upcycling, as applied to Mixture-of-Experts, addresses critical limitations of prior sparse upcycling methods, where all experts are initialized identically from pretrained dense weights and the router is randomly initialized. This leads to expert symmetry and limited early specialization. CAU introduces semantic structure at initialization by partitioning layer activations into clusters, initializing expert weights to the principal subspaces uncovered within those clusters, and setting the router parameters directly to the cluster centroids. This breaks expert symmetry and encourages early specialization aligned with the dominant data distribution (Chu et al., 15 Apr 2026).
The procedure comprises the following steps:
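- Activation Partitioning: For each dense FFN block, collect $\ell_2$-normalized input activations $x_i$ from a small calibration dataset. Partition them into $K$ semantic clusters using spherical k-means, which maximizes the total cosine similarity between each activation and its assigned centroid:
$$\max_{\{c_k\},\,a}\ \sum_i x_i^\top c_{a(i)} \quad \text{subject to } \|c_k\|_2 = 1,$$
where $a(i) \in \{1, \dots, K\}$ is the cluster assigned to $x_i$. Each centroid $c_k$ is computed by averaging the activations in its cluster and renormalizing to unit length.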
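- Expert Weight Initialization: For each expert $k$, use its assigned cluster activations $X_k$ (one activation per column). Compute the whitening matrix $S_k$ such that $S_k S_k^\top = X_k X_k^\top$. Apply SVD to the whitened dense FFN weight $W S_k$:
$$W S_k = U \Sigma V^\top$$
Truncate to the smallest rank $r$ whose leading singular values retain a target fraction $\tau$ of the spectral energy, giving the truncated factors $U_r$, $\Sigma_r$, and $V_r$. The expert is initialized as:
$$W_k = U_r \Sigma_r V_r^\top S_k^{-1}$$
This guarantees that $W_k X_k$ closely approximates the original $W X_k$ while ignoring low-energy directions outside the cluster.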
- Router Initialization: Define the router weight $W_r$ as the stack of centroid row-vectors $c_k^\top$. For a normalized input $x$, the initial logits $W_r x$ encode the cosine similarity to each centroid, producing semantically meaningful initial routing.
- Expert-Ensemble Self-Distillation Loss (EESD): Maintain an Exponential Moving Average (EMA) of all model parameters. For each input, compute outputs from both the sparse student (top-$k$ routed) and the dense EMA teacher (ensemble over all experts), and minimize the KL divergence between their routing logits and outputs. The overall loss is
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{EESD}} + \mathcal{L}_{\text{bal}},$$
where $\lambda$ is the EESD weight and $\mathcal{L}_{\text{bal}}$ is a load-balancing regularizer.
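The initialization steps above (activation partitioning, expert weight initialization, router initialization) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Cholesky-based whitening, the small ridge term `eps`, the squared-singular-value energy criterion, and the default `tau=0.9` are all choices made for this example. Activations are stored one per row here, so the Gram matrix is written `Xk.T @ Xk`.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit L2 norm."""
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)

def spherical_kmeans(X, K, iters=25, seed=0):
    """Cluster unit-norm rows of X by cosine similarity; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)      # assign to the most similar centroid
        for k in range(K):
            total = X[labels == k].sum(axis=0)   # direction of the cluster mean
            norm = np.linalg.norm(total)
            if norm > 0:                         # keep the old centroid if the cluster is empty
                C[k] = total / norm              # renormalize onto the unit sphere
    return C, np.argmax(X @ C.T, axis=1)

def init_expert(W, Xk, tau=0.9, eps=1e-6):
    """Whitened, truncated-SVD copy of dense weight W (d_out, d_in) for one cluster.

    Assumption: "energy" is the cumulative share of squared singular values.
    """
    d_in = W.shape[1]
    G = Xk.T @ Xk + eps * np.eye(d_in)           # regularized Gram matrix of the cluster
    S = np.linalg.cholesky(G)                    # whitening factor with S @ S.T == G
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = min(int(np.searchsorted(energy, tau)) + 1, len(s))
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]       # rank-r approximation of W @ S
    Wk = np.linalg.solve(S.T, low_rank.T).T      # low_rank @ inv(S) without forming the inverse
    return Wk, r

# Toy usage with random data standing in for calibration activations.
rng = np.random.default_rng(0)
X = l2_normalize(rng.normal(size=(400, 8)))      # 400 activations, d_in = 8
W = rng.normal(size=(16, 8))                     # dense FFN weight, d_out = 16
K = 4
centroids, labels = spherical_kmeans(X, K)
experts, ranks = zip(*(init_expert(W, X[labels == k]) for k in range(K)))
router_logits = X @ centroids.T                  # cosine similarity to each centroid
```

Because the router rows are the unit-norm centroids and the inputs are normalized, the top-1 routing decision at initialization coincides with the cluster assignment.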
2. Algorithmic Workflow and Implementation Details
The initialization and training pipeline for Cluster-Aware Upcycling in MoE consists of:
- Collecting a calibration dataset to extract activations for clustering.
- Preprocessing activations with $\ell_2$-normalization and PCA (e.g., 8× dimension reduction), followed by spherical k-means with $K$ clusters for each MoE layer.
- For each cluster, constructing the data-aware whitening, performing truncated SVD on the transformed weight, selecting the energy fraction $\tau$, and computing expert weights accordingly.
- Initializing the router directly with normalized centroids.
- Copying all initialized parameters to the EMA copy.
- During training, for each mini-batch, executing the sparse student forward pass (routing and top-$k$ composition), computing the teacher (EMA/dense) outputs, and applying the combined loss. EMA parameters are updated at each step.
This procedure introduces low overhead: EESD adds ≈5.3% wall-clock time and ≈2.8% additional memory relative to standard MoE training (Chu et al., 15 Apr 2026).
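The training-time part of the workflow (sparse student pass, dense EMA teacher pass, KL-based distillation, EMA update) might look roughly like the sketch below. It is illustrative only: the linear experts, the dictionary layout of the parameters, the KL direction (teacher to student), the softmax applied to outputs before the KL term, and the `decay` value are assumptions made for this example rather than details taken from the source.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch for row-stochastic p and q."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def moe_outputs(x, router, experts, top_k=None):
    """Return (routing probabilities, mixed output) for linear experts.

    top_k=None mixes all experts (dense teacher); otherwise only the top_k
    experts per token receive non-zero gates (sparse student).
    """
    logits = x @ router.T                                        # (n, E)
    masked = logits
    if top_k is not None:
        kth = np.sort(logits, axis=1)[:, -top_k][:, None]        # k-th largest logit per token
        masked = np.where(logits >= kth, logits, -np.inf)
    gates = softmax(masked)                                      # masked experts get gate 0
    expert_out = np.stack([x @ W.T for W in experts], axis=1)    # (n, E, d_out)
    return softmax(logits), np.einsum("ne,neo->no", gates, expert_out)

def eesd_loss(x, student, teacher, top_k=2):
    """KL between teacher and student routing plus KL between their outputs."""
    s_route, s_out = moe_outputs(x, student["router"], student["experts"], top_k)
    t_route, t_out = moe_outputs(x, teacher["router"], teacher["experts"], None)
    return kl(t_route, s_route) + kl(softmax(t_out), softmax(s_out))

def ema_update(teacher, student, decay=0.99):
    """In-place exponential moving average of the student parameters."""
    teacher["router"] = decay * teacher["router"] + (1 - decay) * student["router"]
    teacher["experts"] = [decay * Wt + (1 - decay) * Ws
                          for Wt, Ws in zip(teacher["experts"], student["experts"])]

# Toy usage: 4 linear experts, teacher starts as an exact copy of the student.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
student = {"router": rng.normal(size=(4, 8)),
           "experts": [rng.normal(size=(6, 8)) for _ in range(4)]}
teacher = {"router": student["router"].copy(),
           "experts": [W.copy() for W in student["experts"]]}
loss = eesd_loss(x, student, teacher, top_k=2)
ema_update(teacher, student)
```

In an actual training loop the teacher pass would run without gradient tracking, and this term would be added to the task and load-balancing losses.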
3. Empirical Results and Analysis
Experiments were conducted on CLIP ViT-B/32 and ViT-B/16 models, upcycled at 4B samples, with further fine-tuning on 1.3B additional samples. Comparing Cluster-Aware Upcycling (CAU) to conventional Sparse Upcycling demonstrates:
| Metric | ViT-B/32 Sparse | ViT-B/32 CAU | ViT-B/16 Sparse | ViT-B/16 CAU |
|---|---|---|---|---|
| Recall@1 (Zero-shot) | 39.4% | 39.6% | 42.9% | 43.5% |
| ImageNet-1K Accuracy | 43.7% | 44.1% | 50.3% | 50.8% |
| Few-shot (10-shot) Accuracy | – | – | 57.8% | 58.2% |
| Inter-expert cosine sim. | 0.85 | 0.75 | 0.85 | 0.75 |
| Routing entropy (nats) | 0.9 | 0.6 | 0.9 | 0.6 |
Ablation studies identify the cluster count $K$ and SVD energy fraction $\tau$ that give the best trade-off between specialization and representational diversity. CAU achieves lower relative compactness (a trace-based measure of the expert weights), supporting enhanced diversity and disentanglement among experts. Routing entropy is reduced, indicating more confident and distinct expert assignment per token.
CAU avoids expert collapse: expert utilization remains balanced, with load-balancer loss at or below baseline. Pairwise expert cosine similarity drops from 0.85 to 0.75, indicating lower redundancy.
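The two diagnostics reported above, inter-expert cosine similarity and routing entropy in nats, can be computed along the following lines. This is a sketch under assumptions: how the source flattens expert weights and averages entropy over tokens is not specified, so the helpers below (hypothetical names) flatten each expert into one vector and take a plain mean over tokens.

```python
import numpy as np

def mean_pairwise_cosine(experts):
    """Mean cosine similarity over all distinct pairs of flattened expert weights."""
    F = np.stack([W.ravel() for W in experts])
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    G = F @ F.T                                   # pairwise cosine similarities
    upper = np.triu_indices(len(experts), k=1)    # each unordered pair once
    return float(G[upper].mean())

def routing_entropy(probs, eps=1e-12):
    """Mean Shannon entropy (in nats) of per-token routing distributions."""
    safe = np.clip(probs, eps, 1.0)               # avoid log(0) for zero gates
    return float(np.mean(-np.sum(probs * np.log(safe), axis=1)))

# Identical experts are maximally redundant; uniform routing is maximally uncertain.
W = np.arange(6.0).reshape(2, 3) + 1.0
redundant = mean_pairwise_cosine([W, W.copy(), W.copy()])
uniform = routing_entropy(np.full((5, 4), 0.25))
one_hot = routing_entropy(np.eye(4))
```

For $E$ experts the entropy ranges from 0 (one-hot routing) to $\ln E$ (uniform routing), which gives a scale for reading the reported values.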
4. Cluster-Aware Upcycling for Domain and Category Shift
A distinct application of Cluster-Aware Upcycling is realized in the GLC (Global and Local Clustering) framework for Source-Free Universal Domain Adaptation (SF-UniDA), targeting scenarios with both domain shift and category shift, where the relationship between the source and target label sets is arbitrary (partial-set, open-set, or open-partial-set). The goal is to extract "known" target samples and reject "unknowns" solely by reusing a standard closed-set source model (Qu et al., 2023).
Key components:
1. Adaptive One-vs-All Global Clustering: For each source class $c$, positive and negative pools of target features are defined by ranking softmax scores. A prototype for each class is computed by averaging its positive pool, and multiple negative prototypes are computed by $k$-means on its negative pool. Cosine similarity to these prototypes, together with a suppression weight, controls the hard one-vs-all assignment of each target sample. Conflicts are resolved by maximal similarity, and if every one-vs-all decision is negative the sample is declared "unknown."
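2. Local k-NN Consensus: For every target sample $x_t$, the soft label predictions of its $k$ nearest neighbors are averaged, and local consistency is enforced by minimizing a loss that pulls the prediction for $x_t$ toward this neighborhood average.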
3. Silhouette-Based Estimation of Cluster Number: The target cluster count is chosen from a set of candidate values by maximizing the average Silhouette score of k-means clustering on the target features.
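Two of the components above, the local k-NN consensus and the Silhouette-based choice of cluster count, lend themselves to a compact NumPy sketch. The code below is a simplified illustration, not GLC's implementation: the farthest-point k-means initialization, the Euclidean Silhouette, the cosine-similarity neighbor search, and all function names are assumptions made for this example.

```python
import numpy as np

def kmeans(X, K, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])           # farthest point from chosen centers
    C = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=-1), axis=1)
        for k in range(K):
            if np.any(labels == k):               # leave empty clusters unchanged
                C[k] = X[labels == k].mean(axis=0)
    return np.argmin(np.linalg.norm(X[:, None] - C[None], axis=-1), axis=1)

def silhouette(X, labels):
    """Average Silhouette score with Euclidean distances."""
    ks = np.unique(labels)
    if len(ks) < 2:
        return -1.0                               # undefined for a single cluster
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                              # singleton clusters score 0
        a = D[i, same].sum() / (same.sum() - 1)   # mean distance within own cluster
        b = min(D[i, labels == k].mean() for k in ks if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def select_num_clusters(X, candidates):
    """Pick the candidate cluster count with the highest average Silhouette score."""
    scores = {K: silhouette(X, kmeans(X, K)) for K in candidates}
    return max(scores, key=scores.get), scores

def knn_consensus(features, probs, k=3):
    """Average the soft predictions of each sample's k nearest neighbors (cosine)."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = F @ F.T
    np.fill_diagonal(sim, -np.inf)                # a sample is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]
    return probs[idx].mean(axis=1)

# Toy usage: three well-separated blobs standing in for target features.
rng = np.random.default_rng(0)
blobs = np.concatenate([rng.normal(loc=c, scale=0.5, size=(30, 2))
                        for c in [(0, 0), (10, 0), (0, 10)]])
best_K, sil_scores = select_num_clusters(blobs, candidates=[2, 3, 4, 5])
```

The neighborhood averages returned by `knn_consensus` would serve as soft targets for a consistency loss; the exact loss used by GLC is not reproduced here.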
This global-local cluster-aware approach delivers state-of-the-art performance for SF-UniDA and related transfer settings. For example, on VisDA open-partial DA, GLC reports a higher H-score than the prior method UMAD (Qu et al., 2023).
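For reference, the H-score mentioned above is commonly defined in the universal domain adaptation literature as the harmonic mean of the accuracy on known classes and the accuracy on the "unknown" class. A minimal sketch, with a hypothetical function name and made-up accuracy values used purely for illustration:

```python
def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class accuracy and unknown-class accuracy."""
    if known_acc + unknown_acc == 0:
        return 0.0                      # both accuracies zero: define the score as 0
    return 2 * known_acc * unknown_acc / (known_acc + unknown_acc)

# Illustrative inputs only; these are not results from the paper.
balanced = h_score(0.8, 0.8)
lopsided = h_score(1.0, 0.0)
```

Because it is a harmonic mean, the score is high only when a model both classifies known samples correctly and rejects unknown ones; a model that never predicts "unknown" scores 0 regardless of its known-class accuracy.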
5. Theoretical and Practical Implications
Cluster-Aware Upcycling advances the utilization of pretrained models in both MoE specialization and domain adaptation by explicitly integrating semantic structure discovered in the observed input distribution. Breaking expert symmetry at initialization ensures that MoE systems do not suffer from redundancy and collapse but instead drive early specialization trajectories along relevant subspaces, as evidenced by lower pairwise cosine similarity and lower relative compactness in learned expert weights.
In domain adaptation, the global/local clustering approach leverages feature space geometry to achieve robust known/unknown separation under severe domain and category shift, without requiring access to source data or architecture modification.
A plausible implication is that such clustering-based upcycling could enable more adaptive, generalized transfer learning pipelines for large-scale DNNs by automating expert assignment, improving capacity use, and reducing brittleness under shift. For MoEs, early expert disentanglement leads to higher, more stable performance in zero-shot and few-shot settings. In domain adaptation, using clustering for unknown rejection directly applies the structure-matching principle, which is essential under non-closed-set shift.
6. Limitations and Extensions
Notable limitations include tuning the number of clusters (the expert cluster count $K$ in the MoE setting or the target cluster count in GLC), the overhead of clustering and SVD computation in large-scale settings, and sensitivity to hyperparameters (e.g., the SVD energy threshold $\tau$ and the EESD weight $\lambda$). While clustering overhead is minor in comparison to training, operational costs are non-zero.
Potential extensions identified include:
- Dynamic clustering during training to continually align experts as representations evolve.
- Hierarchical MoE, clustering at multiple granularities to map hierarchical data structure to expert allocations.
- Applicability to modalities beyond vision (e.g., language), as long as meaningful activation clustering is feasible.
- More sophisticated cluster number estimation (e.g., Bayesian nonparametrics for the cluster count).
- Generalization of the global/local clustering framework to semi-supervised or multi-source adaptation contexts.
Cluster-Aware Upcycling has been validated across standard vision benchmarks and in domains such as remote sensing, wildlife open-set adaptation, and single-cell genomics, indicating generality for upcycling under both architectural and data distribution shift (Chu et al., 15 Apr 2026, Qu et al., 2023).