Cross-Modal Compositional Self-Distillation
- The covered frameworks distill cross-modal knowledge from full to partial modalities, achieving state-of-the-art performance in brain tumor segmentation and vision-language tasks.
- It combines a shared-specific encoder-decoder architecture with hierarchical (HMSD) and decremental (DMCD) self-distillation strategies to maintain robust predictions under missing modality conditions.
- Empirical evaluations demonstrate significant improvements in Dice scores and zero-shot retrieval metrics, highlighting the framework’s practical impact and potential for future multi-modal applications.
Cross-Modal Compositional Self-Distillation (CCSD) refers to a family of frameworks that unify cross-modal feature learning and self-distillation to improve the robustness, generalization, and compositionality of deep networks under modality-missing conditions. The term covers techniques for both medical image segmentation with missing MRI modalities and self-supervised vision-language pretraining, as instantiated in two independently developed frameworks: one for MRI segmentation (Xie et al., 18 Nov 2025) and one for vision-language models, COSMOS (Kim et al., 2 Dec 2024). The central idea is to hierarchically and/or compositionally distill cross-modal knowledge through internal alignment mechanisms, ensuring robust predictions under arbitrary subsetting of input modalities.
1. Architectural Foundations
Medical Segmentation (Brain Tumors with Missing MRI)
The CCSD framework deploys a “shared-specific” encoder-decoder backbone. Each input MRI modality $m$ is simultaneously processed by a global shared encoder producing modality-invariant features $f^{\mathrm{sh}}_m$, and by a modality-specific encoder yielding domain-specific features $f^{\mathrm{sp}}_m$. These are fused via a learnable compositional module to obtain $f_m$. For any subset of modalities $S \subseteq \mathcal{M}$, the concatenation of the fused features $f_m$ for present modalities ($m \in S$) and placeholder features for missing ones ($m \notin S$) forms the high-dimensional vector $F_S$. A shared decoder maps $F_S$ to the final segmentation map (Xie et al., 18 Nov 2025).
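As a rough illustration of this composition, the sketch below combines a shared encoder, per-modality specific encoders, and a learnable fusion, with zero-filled placeholders standing in for missing modalities; module names, channel sizes, and the 3D convolution blocks are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedSpecificEncoder(nn.Module):
    """Sketch of the shared-specific composition: one shared encoder for all
    modalities, one specific encoder per modality, and a learnable fusion.
    Layer choices and channel sizes are illustrative assumptions."""
    def __init__(self, num_modalities=4, feat_ch=32):
        super().__init__()
        self.feat_ch = feat_ch
        self.shared = nn.Conv3d(1, feat_ch, 3, padding=1)   # modality-invariant path
        self.specific = nn.ModuleList(
            [nn.Conv3d(1, feat_ch, 3, padding=1) for _ in range(num_modalities)]
        )
        self.fuse = nn.Conv3d(2 * feat_ch, feat_ch, 1)       # learnable compositional fusion

    def forward(self, x, present):
        # x: (B, M, D, H, W) stacked MRI volumes; present: list of booleans per modality
        parts = []
        for m, spec in enumerate(self.specific):
            xm = x[:, m:m + 1]
            if present[m]:
                f_sh, f_sp = self.shared(xm), spec(xm)
                parts.append(self.fuse(torch.cat([f_sh, f_sp], dim=1)))
            else:
                # Missing modality: zero-masked placeholder features (cf. training setup)
                parts.append(torch.zeros(xm.size(0), self.feat_ch, *xm.shape[2:],
                                         device=xm.device, dtype=xm.dtype))
        return torch.cat(parts, dim=1)   # F_S, passed to the shared decoder
```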
Vision-Language Pretraining
COSMOS introduces cross-modal self-distillation atop standard dual-encoder architectures. A student model receives global and local crops of images and texts, processes them via an image encoder $E_I$ and a text encoder $E_T$, and applies lightweight cross-attention modules ($\mathrm{CA}_{T \to I}$, $\mathrm{CA}_{I \to T}$) to produce cross-modal tokens (e.g., $z_{I|T}$ and $z_{T|I}$). The teacher, obtained as an EMA of the student weights, processes only global crops. The cross-modal embeddings generated by the student on local crops are then distilled to match the teacher’s global outputs (Kim et al., 2 Dec 2024).
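A minimal sketch of one such lightweight cross-attention head is shown below; the single-layer design, embedding dimension, and residual/LayerNorm placement are assumptions, not the COSMOS implementation.

```python
import torch
import torch.nn as nn

class CrossModalHead(nn.Module):
    """Sketch of a lightweight cross-attention head applied on top of the
    dual-encoder token outputs; dimensions and layout are assumptions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # e.g. image patch tokens attending over text tokens (or vice versa)
        out, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + out)   # cross-modal tokens, e.g. z_{I|T} or z_{T|I}

# Two heads, one per direction, applied to the student's encoder outputs:
#   z_img_given_txt = CrossModalHead()(img_tokens, txt_tokens)
#   z_txt_given_img = CrossModalHead()(txt_tokens, img_tokens)
```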
2. Self-Distillation Strategies
Hierarchical Modality Self-Distillation (HMSD)
In brain tumor segmentation, the HMSD mechanism aligns predictions from all subsets of modalities to the output of the full-modality “teacher” via temperature-scaled KL divergence. Given the full-modality prediction $p^{(\ell)}_{\mathcal{M}}$ and the prediction $p^{(\ell)}_S$ for each subset $S \subset \mathcal{M}$, the HMSD loss aggregates:

$$\mathcal{L}_{\mathrm{HMSD}} = \sum_{\ell} \sum_{S \subset \mathcal{M}} \tau^2 \, \mathrm{KL}\!\left( \sigma\!\big(p^{(\ell)}_{\mathcal{M}} / \tau\big) \,\Big\|\, \sigma\!\big(p^{(\ell)}_S / \tau\big) \right),$$

where $\tau$ is the temperature, $\sigma(\cdot)$ is the softmax, and $\ell$ indexes the hierarchy level (Xie et al., 18 Nov 2025).
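A minimal sketch of the temperature-scaled KL term at a single hierarchy level, assuming segmentation logits of shape (B, C, ...) and a hypothetical list of per-subset predictions; the temperature value and averaging are illustrative.

```python
import torch
import torch.nn.functional as F

def hmsd_loss(teacher_logits, subset_logits, tau=2.0):
    """Temperature-scaled KL between the full-modality teacher prediction and
    each modality-subset prediction (one hierarchy level shown; tau=2.0 is an
    illustrative choice, not the paper's reported value)."""
    t = F.softmax(teacher_logits / tau, dim=1).detach()      # teacher is not back-propagated
    loss = 0.0
    for s_logits in subset_logits:                           # one entry per subset S ⊂ M
        log_s = F.log_softmax(s_logits / tau, dim=1)
        loss = loss + tau ** 2 * F.kl_div(log_s, t, reduction="batchmean")
    return loss / len(subset_logits)
```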
Decremental Modality Combination Distillation (DMCD)
DMCD simulates worst-case progressive modality loss. For each modality in a combination, a “criticality score” based on feature cosine similarity is computed; the most critical modality is iteratively removed, producing a decremental path $S_0 \supset S_1 \supset \cdots \supset S_K$. Distillation occurs sequentially along this path:

$$\mathcal{L}_{\mathrm{DMCD}} = \sum_{k=1}^{K} \mathrm{KL}\!\left( \sigma\!\big(p_{S_{k-1}}\big) \,\Big\|\, \sigma\!\big(p_{S_k}\big) \right),$$

with $\sigma(\cdot)$ a softmax over feature channels (Xie et al., 18 Nov 2025).
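The sketch below illustrates the decremental path under the stated cosine-similarity criticality, assuming a hypothetical `features` dictionary (modality → feature map) and a hypothetical `logits_for_subset` callable; the exact scoring and removal rule in the paper may differ.

```python
import torch
import torch.nn.functional as F

def dmcd_loss(features, logits_for_subset, full_set):
    """Sketch of decremental distillation: repeatedly drop the modality deemed
    most critical (here scored by cosine similarity to the fused features, an
    assumption), and distill each richer subset's prediction into the poorer one."""
    subset = list(full_set)
    loss = torch.zeros(())
    while len(subset) > 1:
        fused = torch.stack([features[m] for m in subset]).mean(0)
        scores = {m: F.cosine_similarity(features[m].flatten(1), fused.flatten(1)).mean()
                  for m in subset}
        critical = max(scores, key=scores.get)               # most critical modality
        reduced = [m for m in subset if m != critical]
        p_rich = F.softmax(logits_for_subset(subset), dim=1).detach()
        log_poor = F.log_softmax(logits_for_subset(reduced), dim=1)
        loss = loss + F.kl_div(log_poor, p_rich, reduction="batchmean")
        subset = reduced
    return loss
```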
Cross-Modality Self-Distillation in VLMs
COSMOS applies multi-view self-distillation: local (fine-grained) student cross-modal representations are aligned with teacher outputs from global crops using a symmetric InfoNCE loss. The objective is

$$\mathcal{L}_{\mathrm{COSMOS}} = \mathcal{L}_{\mathrm{CLIP}} + \mathcal{L}_{\mathrm{CMSD}},$$

where $\mathcal{L}_{\mathrm{CLIP}}$ is the standard dual-encoder CLIP loss and $\mathcal{L}_{\mathrm{CMSD}}$ symmetrically distills cross-modal tokens between student and teacher across global/local image-text crops (Kim et al., 2 Dec 2024).
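A sketch of one symmetric InfoNCE-style distillation term between student embeddings from local crops and teacher embeddings from global crops; the variable names, temperature, and the way the two directions are combined are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(student_local, teacher_global, temperature=0.07):
    """Sketch of a symmetric InfoNCE alignment: student cross-modal embeddings
    from local crops are pulled toward the EMA teacher's global-crop embeddings
    (temperature and normalization details are assumptions)."""
    s = F.normalize(student_local, dim=-1)
    t = F.normalize(teacher_global, dim=-1).detach()     # teacher provides targets only
    logits = s @ t.t() / temperature                      # (B, B) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)     # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Sketch of the total objective, with one such term per cross-modal direction:
#   loss = clip_loss + symmetric_infonce(z_img_given_txt_local, z_img_given_txt_teacher) \
#                    + symmetric_infonce(z_txt_given_img_local, z_txt_given_img_teacher)
```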
3. Training Objectives and Optimization
In brain tumor segmentation, the full objective combines a cross-entropy (or Dice) segmentation loss with both HMSD and DMCD terms:

$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda_1 \mathcal{L}_{\mathrm{HMSD}} + \lambda_2 \mathcal{L}_{\mathrm{DMCD}},$$

where $\lambda_1$ and $\lambda_2$ weight the two distillation terms. Training uses the Adam optimizer with a cosine learning-rate schedule and standard data augmentation; missing modalities are simulated via zero-masking (Xie et al., 18 Nov 2025).
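A minimal sketch of how the terms can be combined and how missing modalities can be simulated by zero-masking a random subset during training; the uniform subset sampling and the unit loss weights are placeholder assumptions, not the paper's reported settings.

```python
import random

def sample_missing_mask(num_modalities=4):
    """Simulate missing modalities by zero-masking a random non-empty subset
    (uniform per-modality dropout is an assumption about the sampling scheme)."""
    present = [random.random() < 0.5 for _ in range(num_modalities)]
    if not any(present):
        present[random.randrange(num_modalities)] = True   # keep at least one modality
    return present

def total_segmentation_loss(seg_loss, hmsd, dmcd, lam1=1.0, lam2=1.0):
    # Placeholder weights; the paper's exact lambda values are not reproduced here.
    return seg_loss + lam1 * hmsd + lam2 * dmcd
```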
In COSMOS, parameter updates employ AdamW, a large-batch regime (batch sizes up to 4,096), and extended multi-modal augmentations. Teacher weights are updated via an EMA of the student weights (momentum of, e.g., 0.99). No additional hyper-parameter tuning is applied to the weighting between the CLIP and self-distillation losses (Kim et al., 2 Dec 2024).
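The EMA teacher update itself is standard; a minimal sketch, assuming PyTorch modules and one of the momentum values mentioned above:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average of student parameters into the teacher
    (momentum value taken from the text; buffer handling omitted for brevity)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```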
4. Empirical Evaluation and Results
Brain Tumor Segmentation (BraTS 2018/2020)
CCSD achieves state-of-the-art performance under missing-modality scenarios, surpassing mmFormer, ShaSpec, M3AE, and MIFPN baselines. For BraTS 2018, Dice scores averaged over the 15 modality combinations for Enhancing Tumor (ET), Tumor Core (TC), Whole Tumor (WT), and their mean are as follows:
| Method | ET | TC | WT | Avg. |
|---|---|---|---|---|
| mmFormer | 59.85 | 72.97 | 82.94 | 71.25 |
| ShaSpec | 60.69 | 77.93 | 85.65 | 74.09 |
| M3AE | 59.85 | 77.37 | 85.82 | 74.35 |
| MIFPN | 60.77 | 74.63 | 84.63 | 73.34 |
| CCSD | 62.70 | 78.23 | 86.47 | 75.80 |
On BraTS 2020, CCSD reaches a mean Dice of 78.56, with the best scores in all three tumor regions (Xie et al., 18 Nov 2025).
Ablations show that both HMSD and DMCD are necessary for optimal robustness; discarding either produces up to a 2% drop in average performance. Area-under-robustness-curve (AURC) analysis confirms CCSD’s graceful degradation as modalities are lost.
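A sketch of how an AURC-style robustness summary could be computed from mean Dice at each number of available modalities; the paper's exact AURC definition is not reproduced here, and the numbers in the usage comment are purely illustrative.

```python
import numpy as np

def area_under_robustness_curve(dice_by_num_modalities):
    """Sketch of an AURC-style summary: mean Dice as a function of the number
    of available modalities, integrated with the trapezoidal rule and
    normalized by the range (the paper's definition may differ)."""
    x = np.arange(1, len(dice_by_num_modalities) + 1)   # 1..M available modalities
    y = np.asarray(dice_by_num_modalities, dtype=float)
    return np.trapz(y, x) / (x[-1] - x[0])

# Example with illustrative (not reported) mean Dice for 1, 2, 3, 4 modalities:
# aurc = area_under_robustness_curve([62.0, 71.0, 76.0, 79.0])
```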
Vision-Language Benchmarks
COSMOS advances zero-shot retrieval on MSCOCO by substantial margins over DreamLIP, CLIP, and OpenCLIP-1B, reaching an R@1 of 68.0% (I2T, Merged-30M pretraining) (Kim et al., 2 Dec 2024). Average top-1 zero-shot classification accuracy is 58.6% (Merged-30M). Zero-shot segmentation (mIoU) exceeds OpenCLIP-1B by 3.5 absolute points (20.0% vs. 16.5%). Ablations reveal that the combination of multi-crop augmentation, cross-attention, and cross-modal self-distillation is necessary for optimal performance, with each component yielding incremental gains.
GPU cost scales linearly with the number of local views; six local crops yield the highest retrieval and accuracy metrics.
5. Interpretability and Feature Analysis
In the segmentation domain, feature disentanglement into shared and specific components, with fusion at early decoder stages, encourages the model to “bridge” semantic gaps between partial and full modality sets. The decremental distillation path explicitly teaches the network to transfer knowledge from richer (multi-modal) to weaker (partial-modal) settings, supporting compensation for missing critical modalities (Xie et al., 18 Nov 2025).
For vision-language models, cross-attention and multi-crop augmentations yield stronger grounding and a less foreground-biased representation, addressing the feature-suppression limitations of standard CLIP-style contrastive learning. This suggests that compositional self-distillation strategies are broadly effective in making cross-modal representations more robust to partial or missing data, label sparsity, or local-view occlusion (Kim et al., 2 Dec 2024).
6. Limitations and Future Directions
CCSD in brain tumor segmentation relies on cosine-similarity-based criticality for decremental paths; alternate importance measures (e.g., learned or attention-derived) could enhance selectivity. There remains a gap between single- and multi-modality performance in worst-case settings. Extensions to more sophisticated fusion modules (e.g., attention-based) and application to other imaging tasks (e.g., PET–CT, organ-wise MRI) are proposed (Xie et al., 18 Nov 2025).
In cross-modal pretraining, computational efficiency is bounded by the number of local crops; the method is contingent on the diversity of synthetic long captions (for text) and augmentation strategies. A plausible implication is that optimizing self-distillation regimes for domain-specific challenges remains an open area for research (Kim et al., 2 Dec 2024).
7. Summary and Synthesis
Cross-Modal Compositional Self-Distillation frameworks provide a unified approach for robust, flexible learning under partial/missing modality conditions. By integrating feature disentanglement, hierarchical and pathwise self-distillation, and multi-view cross-modal alignment, these methods set new benchmarks in both medical image segmentation and vision-language understanding. The general principle—compositional internal knowledge transfer across modalities and views—demonstrates strong empirical improvements in robustness and generalization, with ongoing directions exploring theoretical underpinnings, optimization of distillation paths, and transfer to new domains (Xie et al., 18 Nov 2025, Kim et al., 2 Dec 2024).