Agglomerative Mixture-of-Experts Vision Models
- AMoE is a vision foundation model that unifies representations from distinct teacher models using a Mixture-of-Experts architecture and multi-teacher distillation.
- It employs an asymmetric relation-knowledge distillation loss to preserve teacher geometry, resulting in improved zero-shot and kNN transfer performance.
- Advanced token-balanced batching and agglomerative hierarchical clustering ensure efficient processing of multi-resolution data and balanced sample representation.
Agglomerative Mixture-of-Experts (AMoE) models are a class of vision foundation models trained via multi-teacher distillation, integrating the representational strengths of distinct teacher models (specifically SigLIP2 and DINOv3) into a unified Mixture-of-Experts (MoE) student. AMoE advances multi-teacher distillation with a specialized asymmetric relation-knowledge distillation objective, token-balanced multi-resolution batching, and agglomerative data curation via hierarchical clustering and sampling, leading to markedly improved data efficiency, stable learning dynamics, and superior zero-shot and kNN transfer performance across visual tasks (Chaybouti et al., 23 Dec 2025).
1. Mixture-of-Experts (MoE) Student Architecture
The MoE student in AMoE consists of an 18-layer Transformer backbone. Each MoE layer comprises 28 expert sub-networks with a top-6 routing policy: for each input token, a lightweight router projects the token features to logits over the 28 experts, applies a softmax to compute routing probabilities, and dispatches the token to its top-6 experts. Only 6 of the 28 experts are therefore active per token, which keeps roughly half of the model's parameters active (0.3B of 0.6B) and enhances parameter efficiency.
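The routing step described above can be sketched in a few lines of numpy; the function and parameter names here are illustrative, not the paper's implementation:

```python
import numpy as np

def top_k_route(x, W_router, k=6):
    """Toy top-k router: softmax over expert logits, keep the top-k
    probabilities per token and renormalize them.

    x: (n_tokens, d) token features; W_router: (d, n_experts).
    Returns (expert indices, renormalized routing weights) per token.
    """
    logits = x @ W_router                                  # (n_tokens, n_experts)
    logits -= logits.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax routing probabilities
    top_idx = np.argsort(probs, axis=-1)[:, -k:]           # top-k experts per token
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p /= top_p.sum(axis=-1, keepdims=True)             # renormalize over the top-k
    return top_idx, top_p
```

Each token's output is then the `top_p`-weighted sum of its `top_idx` experts' outputs, so the remaining 22 experts contribute no compute for that token.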
FlexAttention is employed to mask cross-image self-attention for sequences where multiple images are packed, isolating per-image computations. For teacher-specific representational adaptation, the shared backbone features are projected via distinct 1-layer MLP adapters into the SigLIP2 space (dimension 1152) and the DINOv3 space (dimension 1024). For SigLIP2, its attention-pooler is reused (kept frozen) to compute global summary embeddings.
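The per-image isolation can be pictured with a dense boolean mask (FlexAttention realizes this block-diagonal pattern without materializing the full matrix; `cross_image_mask` is a hypothetical helper name):

```python
import numpy as np

def cross_image_mask(image_ids):
    """Attention mask for a packed sequence: token i may attend to token j
    only if both come from the same image. A minimal dense stand-in for
    the document-masking pattern that FlexAttention computes efficiently.

    image_ids: length-L sequence assigning each token to its source image.
    Returns an (L, L) boolean matrix, True where attention is allowed.
    """
    ids = np.asarray(image_ids)
    return ids[:, None] == ids[None, :]
```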
The output of each MoE layer is the routing-weighted combination of the selected experts applied to the token features:

$$y = \sum_{i \in \mathrm{Top\text{-}6}(p)} p_i\, E_i(x), \qquad p = \mathrm{softmax}(W_r x),$$

where $E_i$ denotes the $i$-th expert sub-network and $W_r$ the router projection.

After adding a class token and 4 register tokens per image, token features are processed as

$$H = \mathrm{MoETransformer}\big([x_{\mathrm{cls}};\, x_{\mathrm{reg}};\, x_{\mathrm{patch}}]\big)$$

and projected to teacher spaces via the 1-layer adapters:

$$z_{\mathrm{dino}} = H W_{\mathrm{dino}}, \qquad H_{\mathrm{sig}} = H W_{\mathrm{sig}}.$$

For SigLIP2, $H_{\mathrm{sig}}$ is partitioned into patch tokens and summarized via the frozen SigLIP2 attention-pooler to produce the global summary embedding.
2. Asymmetric Relation-Knowledge Distillation Loss
Knowledge distillation in AMoE targets both global and local feature spaces per teacher. For each teacher $t$ and image $q$:

- CLS loss (global summary): $\mathcal{L}^{t,q}_{\mathrm{cls}} = 1 - \cos\!\big(z^{s,t}_{q,\mathrm{cls}},\, z^{t}_{q,\mathrm{cls}}\big)$
- Patch loss: $\mathcal{L}^{t,q}_{\mathrm{patch}} = \frac{1}{N_q} \sum_{p=1}^{N_q} \big\| z^{s,t}_{q,p} - z^{t}_{q,p} \big\|_2^2$
- Register loss (DINOv3 only): mean squared error between student and teacher register-token features
- Total per-image loss: $\mathcal{L}^{q} = \sum_{t} \big( \mathcal{L}^{t,q}_{\mathrm{cls}} + \mathcal{L}^{t,q}_{\mathrm{patch}} + \mathcal{L}^{t,q}_{\mathrm{reg}} \big)$
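These per-image terms can be sketched directly; a minimal numpy version for a single teacher, with names and tensor layouts chosen for illustration:

```python
import numpy as np

def distill_loss(s_cls, t_cls, s_patch, t_patch, s_reg=None, t_reg=None):
    """Per-image distillation loss for one teacher: cosine loss on the
    global summary, mean squared error over patch tokens, and (optionally,
    for DINOv3) mean squared error over register tokens."""
    cos = s_cls @ t_cls / (np.linalg.norm(s_cls) * np.linalg.norm(t_cls))
    l_cls = 1.0 - cos
    l_patch = np.mean(np.sum((s_patch - t_patch) ** 2, axis=-1))
    l_reg = 0.0
    if s_reg is not None:
        l_reg = np.mean(np.sum((s_reg - t_reg) ** 2, axis=-1))
    return l_cls + l_patch + l_reg
```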
Losses are averaged over the global batch. To further preserve the geometric structure of teacher embeddings, asymmetric relational knowledge distillation (ARKD) is applied. For image pairs $(i, j)$ in a batch, the ARKD loss pushes or pulls student summary embeddings to match the pairwise teacher geometry, but splits "shrink" and "expand" directions at the teacher's pairwise median distance and applies a Smooth-$L_1$ penalty only in the appropriate direction for each pair. This asymmetric loss yields substantial improvements over both vanilla and symmetric relational KD in zero-shot and kNN accuracy.
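A sketch of an ARKD-style objective under the stated median-split and Smooth-$L_1$ design; the distance normalization and the exact residual form are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def arkd_loss(s_emb, t_emb, beta=1.0):
    """Asymmetric relational KD sketch: match pairwise student distances to
    teacher distances, splitting pairs at the teacher's median distance.
    Close teacher pairs are only penalized for being too far apart in the
    student ("shrink"); distant pairs only for being too close ("expand").
    Residuals pass through a Smooth-L1 (Huber-style) penalty."""
    def pdist(x):
        d2 = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)
        return np.sqrt(np.maximum(d2, 0.0))

    D_s, D_t = pdist(s_emb), pdist(t_emb)
    D_s = D_s / (D_s.mean() + 1e-8)            # scale normalization (assumed)
    D_t = D_t / (D_t.mean() + 1e-8)
    m = np.median(D_t)
    # Keep residuals only in the "wrong" direction relative to the split.
    r = np.where(D_t <= m,
                 np.maximum(D_s - D_t, 0.0),   # close pairs: shrink student
                 np.maximum(D_t - D_s, 0.0))   # far pairs: expand student
    huber = np.where(r < beta, 0.5 * r**2 / beta, r - 0.5 * beta)
    iu = np.triu_indices(len(s_emb), k=1)      # each unordered pair once
    return huber[iu].mean()
```

When student and teacher geometries agree, every residual is zero; a collapsed student embedding is penalized only through the "expand" branch.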
3. Token-Balanced Batching for Multi-Resolution Data
Vision data exhibits large variation in native resolution, so per-image token counts vary widely between small and large images. To address rank-to-rank (device-to-device) imbalance and unstable gradients, AMoE employs token-balanced batching, wherein images are dynamically packed into per-device sequences constrained by a global maximum context length (e.g., 2304 tokens per sequence):
```python
def TokenBalancedBatch(images, C_max):
    """Greedily pack images into sequences of at most C_max tokens.

    Each image contributes 1 CLS token + 4 register tokens + its patch
    tokens; a new sequence starts whenever the next image would overflow
    the context limit."""
    batch = []
    curr_seq, curr_len = [], 0
    for img in images:
        L = 1 + 4 + num_patches(img)
        if curr_len + L > C_max and curr_seq:
            batch.append(curr_seq)
            curr_seq, curr_len = [], 0
        curr_seq.append(img)
        curr_len += L
    if curr_seq:
        batch.append(curr_seq)
    return batch
```
For each batch, FlexAttention masks cross-image attention; per-image losses are normalized, ensuring all images contribute equally regardless of resolution.
This strategy increases per-GPU throughput from 7.5k to 20k tokens/s and prevents catastrophic forgetting for low-resolution images, as established experimentally.
4. Agglomerative Hierarchical Clustering and Data Sampling
AMoE adopts hierarchical data curation, diverging from random web-scale sampling. The process, adapted from Vo et al. (2024), is as follows:
- Embedding: Encode 2.3B images (sourced from LAION-5B and DFN) using DINOv3 ViT-B.
- Subsampling and Clustering: Uniformly subsample 1B images. Learn a 4-level k-means cluster hierarchy with designated centroid counts per level: 20M (L1), 500k (L2), 50k (L3), 20k (L4).
- Assignment: Assign the remaining 1.3B images to L1 centroids.
- Hierarchical Sampling: Sample 200M images by traversing the k-means tree, uniformly selecting nodes at each level to flatten the long-tail distribution and ensure balanced representation of fine-grained concepts.
Pseudocode (as published, in runnable form):

```python
import random

def HierarchicalSample(tree, N_target):
    """Sample N_target images by walking the cluster tree root-to-leaf,
    choosing a uniformly random child at each level, then a uniformly
    random member of the reached leaf cluster."""
    selected = []
    while len(selected) < N_target:
        node = tree.root
        while node.children:            # descend until a leaf cluster
            node = random.choice(node.children)
        selected.append(random.choice(node.members))
    return selected
```
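The flattening effect can be seen in a toy setup (a hypothetical two-cluster tree, not the paper's actual hierarchy): uniform child selection visits each cluster equally often regardless of cluster size.

```python
import random
from collections import Counter

class Node:
    def __init__(self, children=None, members=None):
        self.children = children or []
        self.members = members or []

# One rare cluster (2 members) and one huge cluster (1000 members).
root = Node(children=[
    Node(members=[("rare", i) for i in range(2)]),
    Node(members=[("common", i) for i in range(1000)]),
])

random.seed(0)
draws = []
for _ in range(10000):
    node = root
    while node.children:                 # uniform choice at each level
        node = random.choice(node.children)
    draws.append(random.choice(node.members)[0])
counts = Counter(draws)
# Each cluster is hit roughly 50% of the time despite the 2-vs-1000 imbalance.
```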
This yields the OpenLVD200M corpus: a 200M-image dataset spanning a wide range of native resolutions, refined for balanced fine-grained coverage.
5. Experimental Protocol and Empirical Results
Models are trained on 32×A100 GPUs using AdamW with weight decay 0.02, with a learning rate that warms up from 0 over 500 steps and then decays along a cosine schedule. Distillation proceeds in two resolution stages: lower-resolution images for 50k steps, then higher-resolution images for 90k steps. The full MoE student, with 0.6B parameters (0.3B active), sees 230B tokens in total, 4.7× fewer than RADIOv2.5's 1.1T.
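The warmup-then-cosine schedule described above can be sketched as follows; `base_lr` and `min_lr` are placeholders, since the paper's exact values are not reproduced here:

```python
import math

def lr_schedule(step, total_steps, base_lr, warmup_steps=500, min_lr=0.0):
    """Linear warmup from 0 over `warmup_steps`, then cosine decay from
    base_lr down to min_lr over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```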
Table: Zero-shot and kNN Top-1 Accuracy at 512×512
| Method | Budget (T tokens) | INet (I–T) | INet (kNN) | C101 | CUB | Food | Flow | DTD | Air |
|---|---|---|---|---|---|---|---|---|---|
| RADIOv2.5-H | 1.1 | 78.69 | 79.96 | 88.69 | 81.47 | 94.09 | 88.23 | 69.57 | 70.32 |
| AMoE (ensemble) | 0.23 | 80.17 | 82.78 | 88.76 | 82.78 | 94.67 | 89.20 | 70.16 | 83.18 |
On retrieval (MSCOCO5k T2I/I2T, F30k T2I/I2T), AMoE consistently surpasses RADIOv2.5-H. On segmentation (Cityscapes/ADE20k/Pascal-VOC mIoU), AMoE matches or exceeds prior foundation models:
| Method | Cityscapes | ADE20k | Pascal-VOC |
|---|---|---|---|
| RADIOv2.5-H | 64.11 | 51.13 | 85.65 |
| AMoE | 64.89 | 51.37 | 84.40 |
Ablation studies confirm:
- ARKD vs symmetric RKD: ARKD delivers higher zero-shot and kNN accuracy than both symmetric relational KD and the vanilla multi-teacher baseline.
- Data curation: OpenLVD200M improves average image–text accuracy, average kNN accuracy, and retrieval T2I@1 relative to random 200M sampling.
- Token-balanced batching: improves low-resolution performance and compute throughput.
6. Algorithmic Workflows and Implementation
The forward-pass workflow and per-batch loss computation are summarized in the published pseudocode:
Forward pass:
```python
def StudentForward(packed_tokens, mask):
    x = AddSpecialTokens(packed_tokens)      # add CLS + 4 register tokens per image
    H = MoETransformer(x, mask)              # FlexAttention mask blocks cross-image attention
    z_dino = H @ W_dino                      # DINOv3 adapter (dim 1024)
    H_sig = H @ W_sig                        # SigLIP2 adapter (dim 1152)
    z_sig_summ = SigLIPPooler(H_sig, mask)   # frozen SigLIP2 attention-pooler summary
    z_sig_patch = H_sig[patch_indices]
    return {"dino": z_dino, "siglip": (z_sig_summ, z_sig_patch)}
```
Loss computation:
```
def ComputeLoss(student_feats, teacher_feats, B_global):
    L = 0
    for t in ["dino", "siglip"]:
        s_sum, s_patch, s_reg = student_feats[t]
        t_sum, t_patch, t_reg = teacher_feats[t]
        # Patch loss: squared error normalized by each image's token count N_q
        L_patch = sum(‖s_p - t_p‖² / N_q for each image q)
        # Cosine loss on global summaries
        L_sum = sum(1 - cos(s_sum_q, t_sum_q))
        L_reg = MSE(s_reg, t_reg) if t == "dino" else 0
        # ARKD on batchwise summary geometry
        D_t = pairwise_distances(t_sum) / scale
        D_s = pairwise_distances(s_sum) / scale
        m = median(D_t)
        L_arkd = mean(...)   # asymmetric Smooth-L1 residuals, split at m (elided in source)
        L += (L_patch + L_sum + L_reg + L_arkd) / B_global
    return L
```
7. Significance and Implications
AMoE demonstrates that combining MoE architectures, multi-teacher distillation balanced by geometric preservation, token-balanced batch strategies, and hierarchical data curation results in state-of-the-art zero-shot and transfer performance at a substantially reduced training cost—4.7× fewer tokens compared to prior agglomerative baselines. The OpenLVD200M corpus and the distilled models are made publicly available, enabling further research into efficient large-scale vision model distillation (Chaybouti et al., 23 Dec 2025). A plausible implication is that agglomerative hierarchical sampling and ARKD loss will become standard in future large-scale vision distillation pipelines where compute efficiency and multi-source model unification are critical.