Cross-Modal Semantic Sieve (CS2)

Updated 12 March 2026

CS2 names two frameworks, one for dynamic audio-visual token reduction and one for semantic dataset pruning, both aimed at more efficient multimodal learning.
The token-reduction variant uses a unified cross-modal Transformer encoder to allocate tokens adaptively by content importance, reaching up to a 2.96× speedup while retaining 91.0% of full-model accuracy at a 5% token budget.
The dataset-pruning variant filters image-text pairs by comparing generated captions with alt-text via semantic similarity, addressing systematic failure modes of CLIPScore-based filtering.

The Cross-Modal Semantic Sieve (CS2) refers to two distinct but conceptually parallel frameworks in large-scale multimodal learning: (1) a dynamic audio-visual token selection module for efficient Audio-Visual LLMs (AV-LLMs) as introduced in “EchoingPixels” (Gong et al., 11 Dec 2025), and (2) a semantic dataset pruning signal for filtering vision-language pairs as developed in “Sieve: Multimodal Dataset Pruning Using Image Captioning Models” (Mahmoud et al., 2023). Both utilize joint cross-modal signal processing to improve data or token efficiency, and both have demonstrated substantial empirical improvements over unimodal or unidimensional selection methods.

1. Overview and Core Purpose

The first variant of CS2, as realized in EchoingPixels (Gong et al., 11 Dec 2025), is a token reduction module designed for AV-LLMs that dynamically fuses and sparsifies audio and video tokens by leveraging joint cross-modal attention. Rather than compressing audio and visual modalities independently, CS2 operates on a single audio-visual token pool and adaptively selects the top-$k$ most salient tokens, thereby allocating the token budget across modalities based on instantaneous content importance. This approach overcomes the limitations of fixed, unimodal token budgets in heterogeneous or dynamic multimodal streams.

In the dataset pruning context (Mahmoud et al., 2023), CS2 denotes a signal for evaluating and ranking image–text pairs using synthetic captions generated from a pretrained captioning model and computing their semantic similarity to the associated alt-text. Here, CS2 acts as a cross-modal validation sieve, retaining only highly-aligned pairs for downstream training.

2. Architecture and Mathematical Formulation (EchoingPixels CS2)

In EchoingPixels, CS2 is situated early in the AV-LLM pipeline, immediately following initial audio and visual encoding. Given $T_v \in \mathbb{R}^{L_v \times D}$ (video tokens), $T_a \in \mathbb{R}^{L_a \times D}$ (audio tokens), and $T_t \in \mathbb{R}^{L_t \times D}$ (text instruction tokens), the module concatenates these streams:

$T = [T_v; T_a; T_t] \in \mathbb{R}^{(L_v+L_a+L_t) \times D}$

A bidirectional cross-modal Transformer encoder $\mathcal{E}$ processes $T$:

$T' = \mathcal{E}(T)$

The first $L_v+L_a$ tokens (audio and video) are assigned importance scores via a learned MLP:

$s_i = \mathrm{MLP}(T'_i), \quad i \in \{1, \dots, L_v+L_a\}$

With global compression ratio $p$, the top $k=\lfloor p(L_v+L_a) \rfloor$ tokens by $s_i$ are selected; instruction tokens are always preserved. Selection is a hard top-$k$. Because this operation is non-differentiable, a straight-through estimator (STE) supplies a surrogate gradient during training, where $y_i$ is the binary keep indicator for token $i$:

$\frac{\partial y_i}{\partial s_i} \approx 1, \quad y_i \in \{0, 1\}$

This design results in a single, adaptively allocated pool of salient multimodal tokens for the downstream LLM decoder.
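The forward pass of the selection step above can be sketched in a few lines. This is a minimal illustration, not the EchoingPixels implementation: the function name `select_tokens`, the tie-breaking rule, and the choice to return kept indices in their original order are assumptions, and the STE backward pass is not shown.

```python
import math

def select_tokens(scores, num_text_tokens, p):
    """Return indices kept by a hard top-k over the audio-visual tokens.

    scores: importance scores s_i for the L_v + L_a audio-visual tokens,
            in concatenated order (video first, then audio).
    num_text_tokens: L_t; instruction tokens follow the audio-visual
            tokens in T and are always kept.
    p: global compression ratio in (0, 1].
    """
    n_av = len(scores)
    k = math.floor(p * n_av)  # k = floor(p * (L_v + L_a))
    # Rank by descending score; ties go to the lower index (assumption).
    ranked = sorted(range(n_av), key=lambda i: (-scores[i], i))
    # Re-sort kept indices so the original token order is preserved (assumption).
    keep_av = sorted(ranked[:k])
    keep_text = list(range(n_av, n_av + num_text_tokens))
    return keep_av + keep_text

# Hypothetical scores for 10 audio-visual tokens, followed by 2 text tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6, 0.4, 0.5]
print(select_tokens(scores, num_text_tokens=2, p=0.5))
# prints [1, 3, 5, 7, 9, 10, 11]
```

With $p = 0.5$ and ten audio-visual tokens, $k = 5$, so the five highest-scoring tokens survive alongside both instruction tokens.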

3. Dynamic, Unified Token Allocation

A key feature of CS2 in EchoingPixels is its dynamic allocation of the global token budget. When, for example, the audio stream is silent, nearly all selected tokens are drawn from the visual channel; conversely, if the video is static but the audio is informative, the allocation inverts. This adaptivity sidesteps the inefficiency of static, per-modality quotas and endows the model with improved sensitivity to contextual modality salience, as evidenced by empirical ablations: fixed per-modality quotas degrade overall accuracy by 2.4 points at a 20% token budget (Gong et al., 11 Dec 2025).
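A toy comparison makes the difference between a shared budget and fixed per-modality quotas concrete. The scores below are invented for illustration, and the helper names `unified_allocation` and `fixed_allocation` are hypothetical; the sketch only counts how many tokens each modality receives under the two policies.

```python
import math

def unified_allocation(video_scores, audio_scores, p):
    """Count tokens per modality kept by one top-k over the shared pool."""
    pool = [(s, "video") for s in video_scores]
    pool += [(s, "audio") for s in audio_scores]
    k = math.floor(p * len(pool))
    top = sorted(pool, key=lambda item: -item[0])[:k]
    n_video = sum(1 for _, modality in top if modality == "video")
    return {"video": n_video, "audio": k - n_video}

def fixed_allocation(video_scores, audio_scores, p):
    """Count tokens per modality when each modality gets its own quota."""
    return {
        "video": math.floor(p * len(video_scores)),
        "audio": math.floor(p * len(audio_scores)),
    }

# Hypothetical clip: busy video, near-silent audio.
video = [0.9, 0.8, 0.7, 0.6, 0.85, 0.75]
audio = [0.05, 0.02, 0.01, 0.03, 0.04, 0.02]

print(unified_allocation(video, audio, p=0.5))  # prints {'video': 6, 'audio': 0}
print(fixed_allocation(video, audio, p=0.5))    # prints {'video': 3, 'audio': 3}
```

Under the fixed policy, half of the budget is spent on uninformative audio tokens; swapping the two score lists inverts the unified allocation while the fixed one stays the same.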

4. Semantic Sieve for Dataset Pruning (Sieve CS2)

In dataset pruning, the Cross-Modal Semantic Sieve evaluates the alignment within each noisy web-crawled image–text pair. For each tuple $(x_i, t_i)$, where $x_i$ is an image and $t_i$ is its alt-text, a pretrained image captioner $G$ generates a set of $r$ synthetic captions $\{c_i^{(j)}\}_{j=1}^{r}$. Boilerplate phrases are masked out via $M(\cdot)$. Semantic similarity scores are then computed as:

$s_i = \max_{j=1\ldots r} \cos\big(E_{LM}(M(c_i^{(j)})),\; E_{LM}(M(t_i))\big)$

where $E_{LM}$ is a lightweight sentence-level embedding model. Image-text pairs are ranked by $s_i$; the top-$k$ pairs are retained. This pipeline (image $\rightarrow$ caption $\rightarrow$ text similarity) addresses systematic CLIPScore failure modes, such as noisy correlations and rare but correct labelings, achieving substantial improvements in both classification and retrieval on DataComp: e.g., 2.6 percentage points average improvement when fused with CLIPScore at medium scale (Mahmoud et al., 2023).
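The scoring rule can be illustrated with a small self-contained sketch. Everything model-related here is a stand-in: `embed` is a toy bag-of-words vector rather than a real sentence encoder $E_{LM}$, the `BOILERPLATE` list is hypothetical (the text does not enumerate the masked phrases), and the captions are hand-written instead of produced by a captioner $G$. Only the structure (mask, embed, cosine, max over captions, keep top-$k$) follows the description above.

```python
import math
import re
from collections import Counter

# Hypothetical boilerplate phrases for the masking function M.
BOILERPLATE = ("an image of", "a photo of", "a picture of")

def mask(text):
    """Lowercase the text and blank out boilerplate phrases."""
    text = text.lower()
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    return text

def embed(text):
    """Toy bag-of-words embedding standing in for the sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text))

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sieve_score(captions, alt_text):
    """s_i: best cosine similarity between any caption and the alt-text."""
    target = embed(mask(alt_text))
    return max(cosine(embed(mask(c)), target) for c in captions)

def prune(pairs, k):
    """pairs: (pair_id, captions, alt_text) tuples; keep the k best ids."""
    ranked = sorted(pairs, key=lambda pair: -sieve_score(pair[1], pair[2]))
    return [pair_id for pair_id, _, _ in ranked[:k]]

pairs = [
    ("a", ["a photo of a dog running on grass", "a dog in a field"],
     "dog running on grass"),
    ("b", ["a photo of a red car"], "click here to buy now"),
]
print(prune(pairs, k=1))  # prints ['a']
```

Taking the maximum over several captions gives an aligned pair more than one chance to match, which matters when a single generated caption misses the detail the alt-text describes.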

5. Integration with Auxiliary Temporal and Semantic Modules

In EchoingPixels, following token selection, compressed sequences are processed with Synchronization-Augmented Rotary Position Embeddings (Sync-RoPE). Standard RoPE relies solely on high-frequency positional encoding, which is susceptible to loss of temporal monotonicity after sparsification. Sync-RoPE re-partitions the $d$-dimensional positional encoding across $[t, h, w, t]$, assigning low-frequency temporal signals at both sequence boundaries, ensuring robust temporal distance encoding even as token density varies. Drops in sequence modeling performance upon ablation of Sync-RoPE reinforce its necessity under aggressive CS2 pruning (Gong et al., 11 Dec 2025).

In the Sieve context, semantic textual similarity (STS) encoders used for cross-text alignment are not re-aligned with CLIP space, as their sole function is robust text-to-text comparison, enabling the system to bridge the inherent diversity gap between generated captions and diverse web-derived alt-texts (Mahmoud et al., 2023).

6. Empirical Performance and Ablation

In EchoingPixels, with Qwen2.5-Omni-3B backbone, CS2 achieves:

At 20% token budget ($p=0.2$): 99.0% of full-model accuracy, 2.23× speedup, 2.26× memory reduction
At 10%: 95.2% relative accuracy, 2.67× speedup, 2.45× memory reduction
At 5%: 91.0% relative accuracy, 2.96× speedup, 2.61× memory reduction

Key ablations demonstrate that removing cross-modal attention or replacing the unified pool approach with per-modality quotas substantially reduces performance (by up to 2.9 and 2.4 points, respectively). Sync-RoPE is essential for optimal temporal and event alignment (Gong et al., 11 Dec 2025).

In Sieve CS2, DataComp experiments demonstrate that Sieve alone outperforms CLIPScore on retrieval-centric tasks (e.g., 28.9% vs 25.1% at medium scale; 52.0% vs 46.6% at large scale), with further gains via late-fusion of both scores (Mahmoud et al., 2023).
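The text does not specify how the two scores are combined. One possible late-fusion scheme, shown here purely as an assumption, is to rank-normalize each score so that the two scales are comparable and then take a weighted average. The function names, the `weight` parameter, and the example scores are all hypothetical.

```python
def rank_normalize(scores):
    """Map scores to [0, 1] by rank: lowest score -> 0, highest -> 1."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (n - 1) if n > 1 else 1.0
    return ranks

def fuse(sieve_scores, clip_scores, weight=0.5):
    """Weighted average of rank-normalized scores (assumed fusion rule)."""
    a = rank_normalize(sieve_scores)
    b = rank_normalize(clip_scores)
    return [weight * x + (1 - weight) * y for x, y in zip(a, b)]

def top_k_ids(fused, k):
    """Indices of the k pairs with the highest fused score."""
    return sorted(range(len(fused)), key=lambda i: -fused[i])[:k]

# Invented scores for four image-text pairs.
sieve = [0.9, 0.2, 0.6, 0.1]
clip = [0.31, 0.35, 0.30, 0.10]
print(top_k_ids(fuse(sieve, clip), k=2))  # prints [0, 1]
```

Rank normalization avoids having to calibrate cosine similarities from two different embedding spaces against each other; other fusion rules (for example, intersecting the two top-$k$ sets) would fit the same description.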

7. Significance, Failure Modes, and Extensions

Both uses of CS2 address inherent limitations in unimodal or static-prior approaches within multimodal learning. In EchoingPixels, the cross-modal fusion and unified selection strategy allow for granular, context-sensitive token budget allocation, which is particularly critical in scenarios with fluctuating information density across modalities. The dataset pruning version mitigates systematic scoring errors (false positives and negatives) of CLIPScore by decoupling the image–text assessment into a caption generation–similarity computation pipeline.

Challenges remain in both settings. For EchoingPixels, the main open questions involve scaling to longer or more diverse temporal contexts and the robustness of STE-based top-$k$ selection. For dataset pruning, the limited diversity of compact captioning models and the sensitivity of STS models to phrasing suggest that more powerful generative and semantic encoders, or complementary filtering mechanisms (e.g., image-only heuristics, class priors), may yield further improvements (Mahmoud et al., 2023).

Taken together, CS2 methods exemplify the advantages of early, dynamic, and truly cross-modal fusion for both efficient large-model inference and data quality enhancement in multimodal learning.


References

EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs (2025)

Sieve: Multimodal Dataset Pruning Using Image Captioning Models (2023)
