Cross-Modal Complementarity Screening
- Cross-Modal Complementarity Screening is a strategy that filters multimodal data to retain only samples where every modality is indispensable for achieving high performance.
- It employs a quantitative complementarity margin and ablation testing to systematically eliminate tasks that can be solved using unimodal shortcuts.
- Applications span benchmark construction, materials informatics, and robotics, demonstrating enhanced accuracy and efficiency through enforced fusion-dependent reasoning.
Cross-Modal Complementarity Screening (CMCS) is a principled strategy within multimodal machine learning and benchmark construction to ensure that each retained sample or task instance is genuinely fusion-dependent, i.e., solvable only by integrating non-redundant, essential information from multiple heterogeneous modalities. CMCS explicitly quantifies and enforces the necessity for true cross-modal reasoning, systematically eliminating samples amenable to unimodal shortcuts. This concept has become central in large-scale benchmark creation (e.g., FysicsWorld), efficient materials informatics (COFAP), medical screening (CC-CMR), and contextually adaptive robotics and sentiment analysis frameworks.
1. Core Principles and Motivation
The emergence of multimodal LLMs (MLLMs) and omni-modal architectures has exposed a fundamental limitation in benchmark and application design: the prevalence of modality redundancy. Existing datasets often combine image, audio, video, and text without ensuring that all supplied modalities are truly necessary. Standard “stitching” approaches allow current models to answer tasks by leveraging only a subset of modalities, undermining the ambition of robust cross-modal fusion and multi-sensor generalization.
CMCS was developed to counteract this redundancy by actively screening instances to retain only those where every modality provides indispensable, complementary evidence. Only instances where ablating any single modality leads to a significant quantitative drop in model performance are preserved. This enforces fusion-dependent reasoning and increases benchmark and model rigor (Jiang et al., 14 Dec 2025).
2. Theoretical Formulation
CMCS is formalized over a collection of candidate samples, each described as a tuple of $M$ modalities, $s = (x_1, \dots, x_M)$. For a strong MLLM or ensemble $\mathcal{E}$, the performance (e.g., accuracy or task score) on the complete input is denoted $\mathrm{Acc}_{\text{full}}(s)$. For each modality $x_i$, construct an ablated sample $s_{-i}$ by removing $x_i$ and compute $\mathrm{Acc}_{-i}(s)$. The complementarity margin is

$$\Delta_{\min}(s) = \min_{1 \le i \le M} \left[ \mathrm{Acc}_{\text{full}}(s) - \mathrm{Acc}_{-i}(s) \right].$$

Samples are retained if $\Delta_{\min}(s) \ge \tau$ for a chosen threshold $\tau > 0$. This process generalizes to any number of modalities $M$ by iterating over all single-modality ablations. The selection threshold $\tau$ quantifies the minimal acceptable complementarity necessary for sample inclusion (Jiang et al., 14 Dec 2025).
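A toy numeric check of the retention rule; the accuracies below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical ensemble accuracies for one three-modality sample.
acc_full = 0.82
acc_ablated = {"audio": 0.55, "image": 0.61, "text": 0.74}

# Complementarity margin: the smallest drop caused by removing any one modality.
delta_min = min(acc_full - a for a in acc_ablated.values())

tau = 0.05
retain = delta_min >= tau  # even the least-needed modality (text) costs ~8 points
print(retain)
```

Here the binding constraint is the text ablation (a drop of 0.08), which still exceeds $\tau = 0.05$, so the sample is kept.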
3. Algorithmic Realization
CMCS is machine-driven and systematic. The high-level procedure is as follows:
- For each candidate sample in the initial pool, compute the ensemble-averaged performance on the full input ($\mathrm{Acc}_{\text{full}}$), using multiple evaluators (e.g., GPT-5, Gemini-2.5-Pro).
- For each modality $x_i$, evaluate performance with that modality ablated ($\mathrm{Acc}_{-i}$).
- Compute the complementarity margin $\Delta_{\min} = \min_i \left( \mathrm{Acc}_{\text{full}} - \mathrm{Acc}_{-i} \right)$.
- Retain the sample if $\Delta_{\min} \ge \tau$.
This approach is integrated after context/question generation and before final manual inspection or downstream processing. Only CMCS-filtered, fusion-dependent tasks are propagated into certain benchmark subsets (e.g., FysicsWorld-Omni).
A representative pseudocode excerpt:
```
for each sample s in S_raw:
    Acc_full = average_over_models(Performance(model, s_full))
    for i = 1 to M:
        Acc_ablated[i] = average_over_models(Performance(model, s_{-i}))
    Δ_min = min_i(Acc_full - Acc_ablated[i])
    if Δ_min >= τ:
        retain s
```
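The loop above can be turned into a small runnable sketch. The evaluator functions and modality keys here are illustrative stand-ins, not the paper's evaluation API:

```python
from statistics import mean

def cmcs_filter(samples, evaluators, modalities, tau=0.05):
    """Retain only samples whose minimum single-ablation margin is >= tau."""
    retained = []
    for s in samples:
        acc_full = mean(ev(s) for ev in evaluators)
        margins = []
        for m in modalities:
            ablated = {k: v for k, v in s.items() if k != m}  # drop one modality
            margins.append(acc_full - mean(ev(ablated) for ev in evaluators))
        if min(margins) >= tau:
            retained.append(s)
    return retained

# Toy evaluators: one needs every modality, one solves via an image-only shortcut.
needs_all = lambda s: len(s) / 3.0
image_shortcut = lambda s: 1.0 if "image" in s else 0.0

sample = {"audio": "...", "image": "...", "text": "..."}
print(cmcs_filter([sample], [needs_all], ["audio", "image", "text"]))       # retained
print(cmcs_filter([sample], [image_shortcut], ["audio", "image", "text"]))  # filtered out
```

The shortcut evaluator loses nothing when audio or text is ablated, so the sample's margin collapses to zero and it is correctly filtered out.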
4. Applications Across Domains
Benchmark Construction
CMCS is central to FysicsWorld, which requires that all tasks in its “Omni” subset be fusion-dependent. Ablations show that without CMCS, approximately 30% of candidate fusion tasks are answerable unimodally; with CMCS, this drops to ∼12%. Tasks so selected are demonstrably harder, with leading OmniLLMs dropping by 10–15 accuracy points compared to their unimodal baselines (Jiang et al., 14 Dec 2025).
Materials Informatics
In COFAP, complementary structural, topological, and chemical “views” are extracted by frozen deep encoders (SP-cVAE, PH-NN, BiG-CAE) and fused by cross-modal attention. CMCS is realized by cross-attention mechanisms that ensure each modality substantially contributes to predictive accuracy—ablation studies show that each modality adds distinct information, and the full three-way fusion yields best-in-class performance on unseen COF datasets (Li et al., 3 Nov 2025).
| Domain | Modalities | CMCS Implementation |
|---|---|---|
| Benchmarking (FysicsWorld) | Audio, Image, Video, Text | Marginal accuracy drop under ablation |
| Materials informatics (COFAP) | Geometry, Topology, Chemistry | Cross-attention weighting and ablation |
| Sentiment analysis (VCAN) | Video, Audio | Audio-driven keyframe selection |
| Robotics (CMA) | Vision, Tactile, Proprio | Transformer-based modality attention |
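The cross-attention weighting used in several of these systems can be sketched with a simple softmax over per-modality scores. The mean-vector query below is an illustrative stand-in for a trained cross-attention module such as COFAP's, not its actual architecture:

```python
import numpy as np

def attention_fuse(features):
    """Fuse per-modality embeddings with softmax attention weights.

    `features` maps modality name -> 1-D embedding (all the same length).
    Returns the fused vector and the per-modality weights."""
    names = list(features)
    F = np.stack([features[n] for n in names])  # (M, d) modality embeddings
    query = F.mean(axis=0)                      # shared query (illustrative choice)
    scores = F @ query                          # one relevance score per modality
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over modalities
    fused = w @ F                               # attention-weighted sum, shape (d,)
    return fused, dict(zip(names, w))

feats = {"geometry": np.array([1.0, 0.0]),
         "topology": np.array([0.0, 1.0]),
         "chemistry": np.array([1.0, 1.0])}
fused, weights = attention_fuse(feats)
```

Ablating a modality here simply drops its row from the stack, which is how attention-based fusion models are probed in the ablation studies cited above.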
Sentiment Analysis
The Cross-Modal Selection Module (CMSM) in VCAN leverages audio features (from MEL-spectrograms) as auxiliary signals to select a minimal subset of key video frames. Selection is based on deterministic scoring (variance of pitch statistics), providing a lightweight instantiation of CMCS by filtering redundant visual data while preserving emotionally-relevant moments. This yields a substantial reduction in computational cost and empirically outperforms alternative selection methods (Chen et al., 2022).
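A deterministic variance-based scoring rule of this kind can be sketched as follows; VCAN's exact audio features and window sizes may differ, so treat this as an illustration of the idea:

```python
import numpy as np

def select_keyframes(pitch, frame_idx, k=2, half_window=5):
    """Score each candidate video frame by the variance of the audio pitch
    track around it, and keep the top-k highest-variance frames."""
    scores = []
    for t in frame_idx:
        lo = max(0, t - half_window)
        hi = min(len(pitch), t + half_window + 1)
        scores.append(np.var(pitch[lo:hi]))      # local pitch variance as score
    top = np.argsort(scores)[::-1][:k]           # indices of highest scores
    return sorted(int(frame_idx[i]) for i in top)

# A flat pitch track with one emotionally "active" burst around sample 30.
pitch = np.zeros(60)
pitch[28:33] = [5.0, -5.0, 5.0, -5.0, 5.0]
frames = np.array([10, 30, 50])
```

With this input, only the frame near the burst carries audio variance, so `select_keyframes(pitch, frames, k=1)` keeps frame 30 and discards the redundant ones.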
Robotics and Skill Segmentation
Cross-Modality Attention (CMA) modules apply CMCS by forming joint representations over all modalities and timesteps. Attention scores quantify momentary complementarity, and the system can “screen” and selectively utilize only those modalities with the highest data-driven attention weights for each task primitive. Empirical results show interpretability gains (skill-to-modality assignment), segmentation of tasks, and 5–10× higher sample efficiency in policy training (Jiang et al., 20 Apr 2025).
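Screening by attention weight can be sketched as a greedy selection that keeps the smallest modality set covering most of the attention mass. This is an illustrative rule; CMA's own selection criterion may differ:

```python
import numpy as np

def screen_modalities(attn, names, keep_mass=0.9):
    """Keep the smallest set of modalities covering `keep_mass` of the
    normalized attention mass for one task primitive."""
    w = np.asarray(attn, dtype=float)
    w = w / w.sum()                        # normalize to a distribution
    kept, mass = [], 0.0
    for i in np.argsort(w)[::-1]:          # highest attention first
        kept.append(names[i])
        mass += w[i]
        if mass >= keep_mass:
            break
    return kept
```

For example, with weights `[0.70, 0.25, 0.05]` over vision, tactile, and proprioception, the rule keeps vision and tactile and screens out proprioception for that primitive.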
5. Experimental Evidence and Ablation Studies
Empirical validation is crucial for assessing CMCS effectiveness:
- In FysicsWorld, human annotator agreement with CMCS retention was 92%. Lowering $\tau$ from 0.05 to 0.02 increased recall but reduced average complementarity; increasing $\tau$ decreased sample diversity (Jiang et al., 14 Dec 2025).
- In COFAP, full cross-modal fusion consistently outperformed ablated (modality-removed) models across adsorption and separation metrics. For example, the fused model's score for VSA separation on unseen COFs reached 0.9446, surpassing prior models requiring explicit gas-related descriptors (Li et al., 3 Nov 2025).
- In VCAN, CMSM-based selection reached ACC-7 = 0.79 on RAVDESS (vs. 0.50–0.61 for comparators) and reduced processing by a factor of ≥6 (Chen et al., 2022).
- In robotic policy learning, CMA yielded success rates of 96%, with per-primitive training converging an order of magnitude faster than end-to-end baselines, demonstrating that data-driven complementarity screening substantially improves efficiency and final task separation (Jiang et al., 20 Apr 2025).
6. Limitations, Extensions, and Best Practices
CMCS, while rigorous and model-driven, has several limitations:
- Evaluator dependency: Effectiveness depends on whether the underlying models fully exploit relevant modalities—if not, genuine fusion may be missed.
- Single-modality ablations: Current implementations typically ablate one modality at a time; some tasks may require combinatorial or pairwise ablations to fully capture inter-modality dependencies.
- Computational cost: The approach scales linearly with the number of modalities and evaluators.
- Threshold effects: Tuning $\tau$ is critical; high values prune diversity, low values admit redundancy.
Extensions proposed include pairwise or higher-order ablation screening, continuous-valued complementarity scores (as difficulty meters), and generalization to new modalities such as haptics or 3D point clouds (Jiang et al., 14 Dec 2025). In application, best practices advocate the use of multiple high-capacity evaluator models, careful threshold pilot studies, and, in niche domains, manual or expert annotation to supplement model-based screening.
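The proposed pairwise/higher-order extension amounts to taking the margin over all ablation subsets up to a chosen order. A minimal sketch (the cited work itself ablates one modality at a time):

```python
from itertools import combinations
from statistics import mean

def min_margin(sample, evaluators, modalities, max_order=2):
    """Complementarity margin generalized to joint ablations of up to
    `max_order` modalities at once."""
    acc_full = mean(ev(sample) for ev in evaluators)
    margins = []
    for r in range(1, max_order + 1):
        for subset in combinations(modalities, r):
            # Remove every modality in this subset before re-evaluating.
            ablated = {k: v for k, v in sample.items() if k not in subset}
            margins.append(acc_full - mean(ev(ablated) for ev in evaluators))
    return min(margins)
```

With a toy evaluator whose accuracy is proportional to the number of modalities present, the single-modality ablations remain the binding constraint, but evaluators with hidden pairwise dependencies would be caught only at `max_order >= 2`.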
7. Broader Impact and Future Directions
CMCS establishes a robust, model-centric standard for task and sample selection in multimodal AI development, shifting the field from ad hoc data fusion to principled, empirically validated fusion dependency. Its adoption generates more challenging and diagnostically rich benchmarks (as in FysicsWorld), enables interpretable and efficient cross-modal architectures (COFAP, VCAN, CMA), and offers a generalized, domain-agnostic toolkit for scientific, medical, and robotic applications. Future work is likely to focus on unsupervised discovery of multi-way dependency patterns, continuous calibration of complementarity, and integration with active learning and lifelong agent scenarios (Jiang et al., 14 Dec 2025, Li et al., 3 Nov 2025, Jiang et al., 20 Apr 2025).