
Agglomerative Mixture-of-Experts Vision Models

Updated 30 December 2025
  • AMoE is a vision foundation model that unifies representations from distinct teacher models using a Mixture-of-Experts architecture and multi-teacher distillation.
  • It employs an asymmetric relation-knowledge distillation loss to preserve teacher geometry, resulting in improved zero-shot and kNN transfer performance.
  • Advanced token-balanced batching and agglomerative hierarchical clustering ensure efficient processing of multi-resolution data and balanced sample representation.

Agglomerative Mixture-of-Experts models (AMoE) define a class of vision foundation models trained via multi-teacher distillation, integrating the representational strengths of distinct teacher models (specifically SigLIP2 and DINOv3) into a unified Mixture-of-Experts (MoE) student. AMoE advances multi-teacher distillation with a specialized asymmetric relation-knowledge distillation objective, token-balanced multi-resolution batching, and agglomerative data curation via hierarchical clustering and sampling, leading to markedly improved data efficiency, stable learning dynamics, and superior zero-shot and kNN transfer performance across visual tasks (Chaybouti et al., 23 Dec 2025).

1. Mixture-of-Experts (MoE) Student Architecture

The MoE student in AMoE consists of an 18-layer Transformer backbone with hidden dimension $d = 768$. Each MoE layer comprises 28 expert sub-networks with a top-$k = 6$ routing policy: for each input token, a lightweight router projects $h \in \mathbb{R}^d$ to logits over the 28 experts, applies a softmax to compute routing probabilities $p_k$, and dispatches the token to its top-6 experts. Only 6 of the 28 expert FFNs are therefore active per token in each layer, enhancing parameter efficiency.

FlexAttention is employed to mask cross-image self-attention for sequences where multiple images are packed, isolating per-image computations. For teacher-specific representational adaptation, the shared backbone features are projected via distinct 1-layer MLP adapters into the SigLIP2 space (dimension 1152) and the DINOv3 space (dimension 1024). For SigLIP2, its attention-pooler is reused (kept frozen) to compute global summary embeddings.

Each MoE layer's expert mixture is computed as:

$$\mathrm{FFN}_{\mathrm{MoE}}(h) = \sum_{k=1}^{E} p_k(h)\,\mathrm{FFN}_k(h), \qquad p = \mathrm{Softmax}(W_{\mathrm{router}}\,h)$$

where $E = 28$.
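As a minimal sketch of this routing rule (NumPy, a single token; the shapes and the `experts` list of callables are illustrative, not the paper's implementation):

```python
import numpy as np

def moe_ffn(h, W_router, experts, top_k=6):
    """Top-k MoE FFN for a single token h of shape (d,).
    `experts` is a list of E callables standing in for expert FFNs."""
    logits = W_router @ h                    # (E,) router logits
    p = np.exp(logits - logits.max())
    p = p / p.sum()                          # softmax routing probabilities
    top = np.argsort(p)[-top_k:]             # indices of the top-k experts
    # only the selected experts are evaluated; the rest stay inactive
    return sum(p[k] * experts[k](h) for k in top)
```

In the full model this runs batched inside each Transformer layer; the sketch only shows why compute scales with $k$, not with $E$.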

After adding a class token and 4 register tokens per image, token features $x \in \mathbb{R}^{N_{\mathrm{seq}} \times d}$ are processed as:

$$H = \mathrm{MoETransformer}(x) \in \mathbb{R}^{N_{\mathrm{seq}} \times d}$$

and projected to teacher spaces:

$$\hat{Z}^{(\mathrm{dino})} = H W_{\mathrm{dino}}, \qquad \hat{H}^{(\mathrm{sig})} = H W_{\mathrm{sig}}$$

For SigLIP2, $\hat{H}^{(\mathrm{sig})}$ is partitioned into patches and summarized via the frozen SigLIP2 attention-pooler.

2. Asymmetric Relation-Knowledge Distillation Loss

Knowledge distillation in AMoE targets both global and local feature spaces per teacher. For each teacher $t \in \{\mathrm{dino}, \mathrm{siglip}\}$ and image $q$:

  • CLS Loss (global summary):

$$\mathcal{L}^{(t)}_{\mathrm{CLS}}(q) = 1 - \cos\!\big(z^{(t,s)}_{q},\,\hat z^{(t,s)}_{q}\big)$$

  • Patch Loss:

$$\mathcal{L}^{(t)}_{\mathrm{patch}}(q) = \frac{1}{N_q} \sum_{\ell=1}^{N_q} \big\| z^{(t,p)}_{q,\ell} - \hat z^{(t,p)}_{q,\ell} \big\|_2^2$$

  • Register Loss (DINOv3 only):

$$\mathcal{L}^{(t)}_{\mathrm{reg}}(q) = \mathbf{1}_{t=\mathrm{dino}}\,\frac{1}{K} \sum_{k=1}^K \big\| z^{(t,\mathrm{reg})}_{q,k} - \hat z^{(t,\mathrm{reg})}_{q,k} \big\|_2^2$$

  • Total per-image loss:

$$\mathcal{L}^{(t)}(q) = \mathcal{L}^{(t)}_{\mathrm{CLS}}(q) + \mathcal{L}^{(t)}_{\mathrm{patch}}(q) + \mathcal{L}^{(t)}_{\mathrm{reg}}(q)$$

Losses are averaged over the global batch. To further preserve the geometric structure of teacher embeddings, asymmetric relational knowledge distillation (ARKD) is applied: for image pairs $(i, j)$, a batchwise loss pushes or pulls student summary embeddings to match the pairwise teacher geometry, but splits "shrink" and "expand" directions at the teacher's pairwise median distance and applies a Smooth-$L_1$ penalty only in the appropriate direction for each pair.

This asymmetric loss yields substantial improvements over both vanilla and symmetric relational KD in zero-shot and kNN accuracy.
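A minimal NumPy sketch of such a one-sided relational loss is given below. The median split follows the description above, but the exact convention for which side is penalized on "close" versus "far" pairs is an assumption made for illustration, not the paper's published formula:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise Smooth-L1 (Huber) penalty."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * x * x / beta, ax - 0.5 * beta)

def arkd_loss(Z_student, Z_teacher, eps=1e-8):
    """One-sided relational KD on pairwise distances, split at the
    teacher's median pairwise distance (direction convention assumed)."""
    def pdist(Z):
        d = Z[:, None, :] - Z[None, :, :]
        return np.linalg.norm(d, axis=-1)
    D_s = pdist(Z_student)
    D_t = pdist(Z_teacher)
    D_s = D_s / max(D_s.mean(), eps)   # scale-normalize both geometries
    D_t = D_t / max(D_t.mean(), eps)
    m = np.median(D_t)
    close = D_t < m
    # close pairs: penalize only expansion; far pairs: only shrinkage
    err = np.where(close,
                   np.maximum(D_s - D_t, 0.0),
                   np.maximum(D_t - D_s, 0.0))
    return float(smooth_l1(err).mean())
```

When the student reproduces the teacher geometry exactly, every pairwise error vanishes and the loss is zero, so the objective only penalizes geometry-breaking moves.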

3. Token-Balanced Batching for Multi-Resolution Data

Vision data exhibits large variation in native resolution, leading to per-image token counts ranging from $\approx 256$ (for $256\times256$ images) to $\approx 2304$ (for $768\times768$ images). To address rank-to-rank (device-to-device) imbalance and unstable gradients, AMoE employs token-balanced batching, wherein images are dynamically packed into per-device sequences constrained by a global maximum context length $C_{\max}$ (e.g., 2304 tokens per sequence):

function TokenBalancedBatch(images, C_max):
    batch = []
    curr_seq, curr_len = [], 0
    for img in images:
        L = 1 + 4 + num_patches(img)
        if curr_len + L > C_max:
            batch.append(curr_seq)
            curr_seq, curr_len = [], 0
        curr_seq.append(img)
        curr_len += L
    if curr_seq: batch.append(curr_seq)
    return batch

For each batch, FlexAttention masks cross-image attention; per-image losses are normalized, ensuring all images contribute equally regardless of resolution.
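The per-image isolation can be illustrated with a plain block-diagonal boolean mask (a sketch only; FlexAttention evaluates this pattern lazily rather than materializing the full matrix):

```python
import numpy as np

def packed_attention_mask(tokens_per_image):
    """Boolean self-attention mask (True = may attend) for one packed
    sequence: each token attends only within its own image, giving the
    block-diagonal pattern that isolates per-image computation."""
    ids = np.repeat(np.arange(len(tokens_per_image)), tokens_per_image)
    return ids[:, None] == ids[None, :]
```

For a sequence packing images of 2 and 3 tokens, this produces a 5×5 mask with two diagonal blocks and no cross-image attention.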

This strategy increases per-GPU throughput from 7.5k to 20k tokens/s and prevents catastrophic forgetting for low-resolution images, as established experimentally.
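The per-image sequence length `L` used by the packing routine above can be computed as follows; the 16-pixel patch size is an assumption consistent with the quoted token counts (≈256 patch tokens at $256\times256$, ≈2304 at $768\times768$):

```python
def tokens_per_image(h, w, patch=16):
    """Sequence length for one image: 1 class token + 4 register tokens
    + patch tokens, assuming a 16-pixel patch grid."""
    return 1 + 4 + (h // patch) * (w // patch)
```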

4. Agglomerative Hierarchical Clustering and Data Sampling

AMoE adopts hierarchical data curation, diverging from random web-scale sampling. The process, adapted from Vo et al. (2024), is as follows:

  • Embedding: Encode 2.3B images (sourced from LAION-5B and DFN) using DINOv3 ViT-B.
  • Subsampling and Clustering: Uniformly subsample 1B images. Learn a 4-level $k$-means cluster hierarchy with designated centroid counts per level: 20M (L1), 500k (L2), 50k (L3), 20k (L4).
  • Assignment: Assign the remaining 1.3B images to L1 centroids.
  • Hierarchical Sampling: Sample 200M images by traversing the $k$-means tree, uniformly selecting nodes at each level to flatten the long-tail distribution and ensure a balanced representation of fine-grained concepts.

Pseudocode (as published):

function HierarchicalSample(tree, N_target):
    selected = []
    while |selected| < N_target:
        node = tree.root
        while node is not leaf:
            node = random.choice(node.children)
        selected.append(random.choice(node.members))
    return selected

This yields the OpenLVD200M corpus—a 200M-image dataset with native resolutions from $64^2$ to $1536^2$, refined for balanced fine-grained coverage.
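A runnable version of the published pseudocode is sketched below; the tree-as-nested-dicts layout (`{"children": [...]}` for internal nodes, `{"members": [...]}` for leaves) is an illustrative assumption:

```python
import random

def hierarchical_sample(tree, n_target, seed=0):
    """Descend from the root, picking a uniform child at each level,
    then a uniform member of the reached leaf."""
    rng = random.Random(seed)
    selected = []
    while len(selected) < n_target:
        node = tree
        while node.get("children"):            # descend until a leaf
            node = rng.choice(node["children"])
        selected.append(rng.choice(node["members"]))
    return selected
```

Because every leaf is reached with equal probability regardless of its size, a 1000-member cluster and a 2-member cluster contribute roughly equally to the sample—exactly the long-tail flattening described above.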

5. Experimental Protocol and Empirical Results

Models are trained across 32×A100 GPUs, using AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-15}$, weight decay 0.02), with a learning rate that warms up from 0 to $1\times10^{-3}$ (500 steps) and decays cosine-wise to $1\times10^{-4}$. Distillation is performed in two stages: $256\times256$ images for 50k steps, then up to $768\times768$ for 90k steps. The full MoE student, with 0.6B parameters (0.3B active), sees 230B tokens in total, 4.7× fewer than RADIOv2.5's 1.1T.
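The stated schedule can be sketched as follows; treating the two stages (50k + 90k steps) as one 140k-step cosine schedule is an assumption, since the source does not specify how the stages share the decay:

```python
import math

def lr_at(step, warmup=500, total=140_000, peak=1e-3, final=1e-4):
    """Linear warmup from 0 to `peak` over `warmup` steps, then cosine
    decay from `peak` down to `final` by `total` steps."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(total - warmup, 1)
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * t))
```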

Table: Zero-shot and kNN Top-1 Accuracy at 512×512

Method | Budget (T tokens) | INet (I–T) | INet (kNN) | C101 | CUB | Food | Flow | DTD | Air
RADIOv2.5-H | 1.1 | 78.69 | 79.96 | 88.69 | 81.47 | 94.09 | 88.23 | 69.57 | 70.32
AMoE (ensemble) | 0.23 | 80.17 | 82.78 | 88.76 | 82.78 | 94.67 | 89.20 | 70.16 | 83.18

On retrieval (MSCOCO5k T2I/I2T, F30k T2I/I2T), AMoE consistently surpasses RADIOv2.5-H. On segmentation (Cityscapes/ADE20k/Pascal-VOC mIoU), AMoE matches or exceeds prior foundation models:

Method | Cityscapes | ADE20k | Pascal-VOC
RADIOv2.5-H | 64.11 | 51.13 | 85.65
AMoE | 64.89 | 51.37 | 84.40

Ablation studies confirm:

  • ARKD vs. symmetric RKD: ARKD delivers higher zero-shot and kNN accuracy ($(80.2\%, 83.6\%)$ vs. $(79.5\%, 82.6\%)$, against vanilla multi-teacher distillation's $(62.0\%, 81.6\%)$).
  • Data curation: OpenLVD200M increases image–text avg ($74.96\% \to 79.11\%$), kNN avg ($82.66\% \to 85.08\%$), and retrieval T2I@1 ($57.63\% \to 59.14\%$) relative to random 200M sampling.
  • Token-balanced batching: improves low-resolution performance and compute throughput.

6. Algorithmic Workflows and Implementation

The forward-pass workflow and per-batch loss computation are summarized in the published pseudocode:

Forward pass:

def StudentForward(packed_tokens, mask):
    x = AddSpecialTokens(packed_tokens)
    H = MoETransformer(x, mask)
    z_dino = H @ W_dino
    H_sig = H @ W_sig
    z_sig_summ = SigLIPPooler(H_sig, mask)
    z_sig_patch = H_sig[patch_indices]
    return {"dino": z_dino, "siglip": (z_sig_summ, z_sig_patch)}

Loss computation:

def ComputeLoss(student_feats, teacher_feats, B_global):
    L = 0
    for t in ["dino", "siglip"]:
        s_sum, s_patch, s_reg = student_feats[t]
        t_sum, t_patch, t_reg = teacher_feats[t]
        L_patch = sum(||s_patch[q] - t_patch[q]||² / N_q for each image q)
        L_sum = sum(1 - cos(s_sum[q], t_sum[q]) for each image q)
        L_reg = MSE(s_reg, t_reg) if t == "dino" else 0
        D_t = pairwise_distances(t_sum) / scale
        D_s = pairwise_distances(s_sum) / scale
        m = median(D_t)                      # split point for shrink/expand terms
        L_arkd = mean(asymmetric_smooth_l1(D_s, D_t, m))
        L += (L_patch + L_sum + L_reg + L_arkd) / B_global
    return L

7. Significance and Implications

AMoE demonstrates that combining MoE architectures, multi-teacher distillation balanced by geometric preservation, token-balanced batch strategies, and hierarchical data curation results in state-of-the-art zero-shot and transfer performance at a substantially reduced training cost—4.7× fewer tokens compared to prior agglomerative baselines. The OpenLVD200M corpus and the distilled models are made publicly available, enabling further research into efficient large-scale vision model distillation (Chaybouti et al., 23 Dec 2025). A plausible implication is that agglomerative hierarchical sampling and ARKD loss will become standard in future large-scale vision distillation pipelines where compute efficiency and multi-source model unification are critical.
