
Cluster-Aware Upcycling Overview

Updated 22 April 2026
  • Cluster-Aware Upcycling is a methodology that leverages unsupervised clustering to adapt and specialize neural network experts in MoE architectures.
  • It partitions normalized activations with spherical k-means and employs truncated SVD to initialize expert weights and router parameters effectively.
  • Empirical results demonstrate improved zero-shot/few-shot performance and reduced redundancy, enabling robust adaptation in domain and transfer learning.

Cluster-Aware Upcycling (CAU) is a set of methodologies that leverage unsupervised clustering of data representations to facilitate model "upcycling": the adaptation or initialization of neural network architectures for new learning regimes, particularly in Mixture-of-Experts (MoE) architectures and domain adaptation scenarios. By aligning model structure and initialization with the intrinsic structure of observed data, CAU aims to promote expert specialization, reduce redundancy, enable robust adaptation under domain/category shift, and deliver improved performance across transfer and few-shot settings (Chu et al., 15 Apr 2026, Qu et al., 2023).

1. Cluster-Aware Upcycling in Mixture-of-Experts Specialization

Cluster-Aware Upcycling, as applied to Mixture-of-Experts, addresses critical limitations of prior sparse upcycling methods, where all experts are initialized identically from pretrained dense weights and the router is randomly initialized. This leads to expert symmetry and limited early specialization. CAU introduces semantic structure at initialization by partitioning layer activations into clusters, initializing expert weights to the principal subspaces uncovered within those clusters, and setting the router parameters directly to the cluster centroids. This breaks expert symmetry and encourages early specialization aligned with the dominant data distribution (Chu et al., 15 Apr 2026).

The procedure comprises the following steps:

  1. Activation Partitioning: For each dense FFN block, collect $M$ $\ell_2$-normalized input activations $X = \{x_1, \ldots, x_M\} \subset \mathbb{R}^d$ via a small calibration dataset. Partition $X$ into $K$ semantic clusters, with assignments $c(j) \in \{1, \ldots, K\}$, using spherical k-means, optimizing:

$$\max_{\{\mu_k \,|\, \|\mu_k\|_2=1\}} \sum_{j=1}^M \max_{k} (\mu_k^\top x_j)$$

Cluster centroids $\mu_k$ are computed via averaging and renormalization.

  2. Expert Weight Initialization: For each expert $k$, use its assigned cluster activations $X_k \in \mathbb{R}^{n_k \times d}$. Compute the whitening matrix $S_k$ such that $S_k S_k^\top = X_k^\top X_k$. Apply SVD to the whitened dense FFN weight $W S_k$:

$$W S_k = U \Sigma V^\top$$

Truncate to rank $r_k$ such that $\sum_{i \le r_k} \sigma_i^2 / \sum_i \sigma_i^2 \ge \tau$, where $\sigma_i$ are the singular values on the diagonal of $\Sigma$ and $\tau \in (0, 1]$ is the retained energy fraction. The expert is initialized as:

$$W_k = U_{:r_k} \Sigma_{:r_k} V_{:r_k}^\top S_k^{-1}$$

This guarantees that $W_k X_k^\top$ closely approximates the original $W X_k^\top$ while ignoring low-energy directions outside the cluster.

  3. Router Initialization: Define router weight $W_{\mathrm{router}} \in \mathbb{R}^{K \times d}$ as the stack of centroid row-vectors $\mu_k^\top$. For a normalized input $x$, the initial logits $W_{\mathrm{router}}\, x$ encode cosine similarity to each centroid, producing semantically meaningful initial routing.
  4. Expert-Ensemble Self-Distillation Loss (EESD): Maintain an Exponential Moving Average (EMA) of all model parameters. For each input, compute outputs from both the sparse student (top-$k$ routed) and the dense EMA teacher (ensemble over all experts), and minimize the KL divergence between their routing logits and outputs. The overall loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\, \mathcal{L}_{\mathrm{EESD}} + \mathcal{L}_{\mathrm{LB}}$$

where $\lambda$ is the EESD weight and $\mathcal{L}_{\mathrm{LB}}$ is a load-balancing regularizer.
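
As a rough illustration of steps 1 and 3, the NumPy sketch below clusters normalized activations with spherical k-means and reuses the centroids as router weights. The function names, the random-point initialization, and the fixed iteration count are assumptions made for this example; they are not details taken from Chu et al.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def spherical_kmeans(x, num_clusters, num_iters=50, seed=0):
    """Cluster rows of x by cosine similarity.

    Returns (centroids [K, d], assignments [M]).
    """
    rng = np.random.default_rng(seed)
    x = l2_normalize(x)
    # Assumption: centroids start from randomly chosen data points.
    centroids = x[rng.choice(len(x), size=num_clusters, replace=False)].copy()
    for _ in range(num_iters):
        # Assignment: each point joins the centroid with the largest dot product.
        assign = np.argmax(x @ centroids.T, axis=1)
        for k in range(num_clusters):
            members = x[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                # Update: average the members, then renormalize.
                centroids[k] = l2_normalize(members.mean(axis=0, keepdims=True))[0]
    assign = np.argmax(x @ centroids.T, axis=1)
    return centroids, assign

def init_router(centroids):
    """Router weight as the stack of centroid row-vectors, shape [K, d]."""
    return centroids.copy()

def router_logits(router_weight, x):
    """Initial logits: cosine similarity of each input to each centroid."""
    return l2_normalize(x) @ router_weight.T
```

Because the centroids have unit norm, the arg-max of these initial logits reproduces the cluster assignment, so routing starts out aligned with the clustering.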

2. Algorithmic Workflow and Implementation Details

The initialization and training pipeline for Cluster-Aware Upcycling in MoE consists of:

  • Collecting a calibration dataset to extract activations for clustering.
  • Preprocessing activations with $\ell_2$-normalization and PCA (e.g., 8× dimension reduction), followed by spherical k-means with $K$ clusters for each MoE layer.
  • For each cluster, constructing the data-aware whitening, performing truncated SVD on the transformed weight, selecting the rank from the energy fraction $\tau$, and computing expert weights accordingly.
  • Initializing the router directly with normalized centroids.
  • Copying all initialized parameters to the EMA copy.
  • During training, for each mini-batch, executing the sparse student forward pass (routing and top-$k$ composition), computing teacher (EMA/dense) outputs, and applying the combined loss. EMA parameters are updated at each step.
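
The whitening and truncated-SVD step in this pipeline can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration under stated assumptions: the Cholesky factor is one valid choice of whitening matrix, the small `ridge` term is a hypothetical stabilizer, and the default `energy=0.9` is an illustrative value rather than the setting used in the paper.

```python
import numpy as np

def init_expert_weight(dense_weight, cluster_acts, energy=0.9, ridge=1e-6):
    """Cluster-aware low-rank initialization of one expert weight.

    dense_weight: [d_out, d] pretrained dense FFN weight.
    cluster_acts: [n_k, d] activations assigned to this expert's cluster.
    energy: fraction of squared singular-value mass to keep (illustrative).
    ridge: hypothetical stabilizer keeping the Gram matrix positive definite.
    """
    d = dense_weight.shape[1]
    gram = cluster_acts.T @ cluster_acts + ridge * np.eye(d)
    # Whitening factor s with s @ s.T == gram (Cholesky is one valid choice).
    s = np.linalg.cholesky(gram)
    u, sigma, vt = np.linalg.svd(dense_weight @ s, full_matrices=False)
    # Smallest rank whose cumulative squared singular values reach `energy`.
    cum = np.cumsum(sigma**2) / np.sum(sigma**2)
    rank = min(int(np.searchsorted(cum, energy)) + 1, len(sigma))
    low_rank = (u[:, :rank] * sigma[:rank]) @ vt[:rank]
    # Undo the whitening (low_rank @ inv(s)) so the expert acts on raw inputs.
    expert_weight = np.linalg.solve(s.T, low_rank.T).T
    return expert_weight, rank
```

With `energy=1.0` the function returns the dense weight up to numerical error. Lower values discard directions that carry little energy on the cluster's activations, so the squared output error on those activations is bounded by roughly the discarded energy fraction.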

This procedure introduces low overhead: EESD adds ≈5.3% wall-clock time and ≈2.8% additional memory relative to standard MoE training (Chu et al., 15 Apr 2026).
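
A toy version of the training-time component (sparse student, dense EMA teacher, KL terms) might look like the sketch below. Several choices here are assumptions made for illustration only: experts are plain linear maps, outputs are turned into distributions with a softmax before the KL term, the two KL terms are weighted equally, and the KL direction is teacher-to-student. The source does not specify these details.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over rows of two probability matrices."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def ema_update(teacher, student, decay=0.999):
    """In-place EMA of teacher parameters toward student parameters."""
    for name in teacher:
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student[name]

def moe_forward(x, router_w, expert_ws, top_k=None):
    """Linear-expert MoE. top_k=None mixes all experts (dense ensemble)."""
    logits = x @ router_w.T                                # [B, K]
    if top_k is not None:
        # Mask everything outside each row's top_k logits before the softmax.
        kth = np.sort(logits, axis=1)[:, -top_k][:, None]
        gate_logits = np.where(logits >= kth, logits, -np.inf)
    else:
        gate_logits = logits
    gates = softmax(gate_logits)                           # [B, K]
    expert_out = np.einsum("bd,kod->bko", x, expert_ws)    # [B, K, d_out]
    return logits, np.einsum("bk,bko->bo", gates, expert_out)

def eesd_loss(x, student, teacher, top_k=2):
    """Assumed form: KL on routing distributions plus KL on output distributions."""
    s_logits, s_out = moe_forward(x, student["router"], student["experts"], top_k)
    t_logits, t_out = moe_forward(x, teacher["router"], teacher["experts"], None)
    route_kl = kl_divergence(softmax(t_logits), softmax(s_logits))
    out_kl = kl_divergence(softmax(t_out), softmax(s_out))
    return route_kl + out_kl
```

In a training loop, `eesd_loss` would be added to the task and load-balancing terms, and `ema_update` would be called once per optimizer step.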

3. Empirical Results and Analysis

Experiments were conducted on CLIP ViT-B/32 and ViT-B/16 models, upcycled at 4B samples, with further fine-tuning on 1.3B additional samples. Comparing Cluster-Aware Upcycling (CAU) to conventional Sparse Upcycling demonstrates:

| Metric | ViT-B/32 Sparse | ViT-B/32 CAU | ViT-B/16 Sparse | ViT-B/16 CAU |
|---|---|---|---|---|
| Recall@1 (zero-shot) | 39.4% | 39.6% | 42.9% | 43.5% |
| ImageNet-1K accuracy | 43.7% | 44.1% | 50.3% | 50.8% |
| Inter-expert cosine sim. | 0.85 | 0.75 | 0.85 | 0.75 |
| Routing entropy (nats) | 0.9 | 0.6 | 0.9 | 0.6 |

Few-shot (10-shot) accuracy is reported as 57.8% (Sparse) versus 58.2% (CAU).

Ablation studies over the number of clusters $K$ and the energy threshold $\tau$ identify the settings that best trade off specialization against representational diversity. CAU achieves lower relative compactness (a trace-based measure), supporting enhanced diversity and disentanglement among experts. Routing entropy falls from 0.9 to 0.6 nats, indicating more confident and distinct expert assignment per token.

CAU avoids expert collapse: expert utilization remains balanced, with the load-balancing loss at or below baseline. Pairwise expert cosine similarity drops from 0.85 to 0.75, indicating lower redundancy.
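
The two diagnostics quoted here can be computed along the following lines. This is a sketch under assumptions: similarity is taken over flattened expert weight tensors and entropy over softmax routing distributions, which may differ from the exact definitions used in the paper.

```python
import numpy as np

def mean_pairwise_cosine(expert_ws):
    """Mean cosine similarity over distinct pairs of flattened expert weights."""
    flat = expert_ws.reshape(len(expert_ws), -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    upper = np.triu_indices(len(flat), k=1)  # each unordered pair once
    return float(sim[upper].mean())

def mean_routing_entropy(logits):
    """Mean per-token entropy (in nats) of softmax routing distributions."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

Identical experts give a similarity of 1, and uniform routing over $K$ experts gives an entropy of $\ln K$, so lower values of both indicate more distinct experts and more decisive routing.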

4. Cluster-Aware Upcycling for Domain and Category Shift

A distinct application of Cluster-Aware Upcycling is realized in the GLC (Global and Local Clustering) framework for Source-Free Universal Domain Adaptation (SF-UniDA), targeting scenarios with both domain shift and arbitrary relationships between source and target label sets (partial-set, open-set, and open-partial-set). The goal is to extract "known" target samples and reject "unknowns" by solely re-using a standard closed-set source model (Qu et al., 2023).

Key components:

1. Adaptive One-vs-All Global Clustering: For each source class $c$, positive/negative pools in target features are defined by softmax score ranking. Prototypes for each class and multi-cluster negatives are computed via averaging and $k$-means, respectively. Cosine similarity to these prototypes, together with a suppression weight, controls the hard assignment of each target sample to class $c$. Conflicts are resolved by maximal similarity, and if no class accepts the sample, it is declared "unknown."

2. Local k-NN Consensus: For every target sample $x_i$, its $k$ nearest neighbors are used to average soft label predictions, and a consistency loss over these neighborhoods enforces locally consistent predictions.

3. Silhouette-Based Estimation of Cluster Number: The target cluster count is chosen from a set of candidate values by maximizing the average Silhouette score of K-means clusterings of the target features.
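
The local consensus and the Silhouette-based selection can be sketched in NumPy as below. The helper names, the farthest-point k-means initialization, and the use of cosine similarity for neighbor search are assumptions made for this illustration, not details confirmed by Qu et al.

```python
import numpy as np

def kmeans(x, k, num_iters=50):
    """Plain Lloyd k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min(np.linalg.norm(x[:, None] - np.array(centers)[None], axis=-1), axis=1)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(num_iters):
        labels = np.argmin(np.linalg.norm(x[:, None] - centers[None], axis=-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels

def silhouette_score(x, labels):
    """Average Silhouette coefficient with Euclidean distances."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as worst case
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # convention: singleton clusters score 0
        a = dist[i, same].sum() / (same.sum() - 1)   # mean intra-cluster distance
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def select_num_clusters(x, candidates):
    """Pick the candidate cluster count with the highest average Silhouette."""
    scores = {k: silhouette_score(x, kmeans(x, k)) for k in candidates}
    return max(scores, key=scores.get), scores

def knn_consensus(features, probs, k=4):
    """Average the soft predictions of each sample's k nearest neighbors."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)           # exclude the sample itself
    nn = np.argsort(-sim, axis=1)[:, :k]     # indices of the k most similar
    return probs[nn].mean(axis=1)            # [N, C]
```

The pairwise-distance matrices make this quadratic in the number of samples, so a real pipeline would likely subsample or use an approximate neighbor index.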

This global-local cluster-aware approach delivers state-of-the-art performance for SF-UniDA and related transfer settings. For example, on VisDA open-partial DA, GLC improves the H-score over the prior UMAD method by 14.8% (Qu et al., 2023).

5. Theoretical and Practical Implications

Cluster-Aware Upcycling advances the utilization of pretrained models in both MoE specialization and domain adaptation by explicitly integrating semantic structure discovered in the observed input distribution. Breaking expert symmetry at initialization helps MoE systems avoid redundancy and collapse and steers early specialization along relevant subspaces, as evidenced by lower pairwise cosine similarity and lower relative compactness in learned expert weights.

In domain adaptation, the global/local clustering approach leverages feature space geometry to achieve robust known/unknown separation under severe domain and category shift, without requiring access to source data or architecture modification.
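
To make the known/unknown separation concrete, here is a hypothetical one-vs-all assignment rule in NumPy. It assumes that a class accepts a sample when the sample is closer (in cosine similarity) to the class's positive prototype than to any of its negative cluster centers, and it omits the suppression weight. The exact rule in Qu et al. may differ.

```python
import numpy as np

UNKNOWN = -1

def one_vs_all_assign(features, pos_protos, neg_protos):
    """Assign each sample to a known class or to UNKNOWN.

    features:   [N, d] target features.
    pos_protos: [C, d] one positive prototype per source class.
    neg_protos: [C, L, d] L negative cluster centers per source class.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    f, p, n = unit(features), unit(pos_protos), unit(neg_protos)
    pos_sim = f @ p.T                                      # [N, C]
    neg_sim = np.einsum("nd,cld->ncl", f, n).max(axis=2)   # [N, C]
    # Assumed rule: class c accepts a sample if it beats every negative center.
    accepted = pos_sim > neg_sim
    # Conflicts: among accepting classes, keep the most similar prototype.
    masked = np.where(accepted, pos_sim, -np.inf)
    labels = masked.argmax(axis=1)
    # Samples accepted by no class are rejected as unknown.
    labels[~accepted.any(axis=1)] = UNKNOWN
    return labels
```

Only target features and prototypes derived from the source model's predictions are needed, which matches the source-free setting described above.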

A plausible implication is that such clustering-based upcycling could enable more adaptive, generalized transfer learning pipelines for large-scale DNNs, automating expert assignment, improving capacity use, and reducing the brittleness under shift. For MoEs, early expert disentanglement leads to higher, more stable performance in zero/few-shot settings. In domain adaptation, leveraging clustering for unknown rejection directly models the structure-matching principle, which is essential under non-closed-set shift.

6. Limitations and Extensions

Notable limitations include tuning the number of clusters (per MoE layer in CAU, or per target domain in GLC), the overhead of clustering and SVD computation in large-scale settings, and sensitivity to hyperparameters (e.g., the SVD energy threshold $\tau$ and the EESD weight $\lambda$). While clustering overhead is minor in comparison to training, operational costs are non-zero.

Potential extensions identified include:

  • Dynamic clustering during training to continually align experts as representations evolve.
  • Hierarchical MoE, clustering at multiple granularities to map hierarchical data structure to expert allocations.
  • Applicability to modalities beyond vision (e.g., language), as long as meaningful activation clustering is feasible.
  • More sophisticated cluster number estimation (e.g., Bayesian nonparametrics for choosing the number of clusters).
  • Generalization of the global/local clustering framework to semi-supervised or multi-source adaptation contexts.

Cluster-Aware Upcycling has been validated across standard vision benchmarks and in domains such as remote sensing, wildlife open-set adaptation, and single-cell genomics, indicating generality for upcycling under both architectural and data distribution shift (Chu et al., 15 Apr 2026, Qu et al., 2023).
