Ref-AVS: Multimodal Object Segmentation
- Ref-AVS is a task that segments specific video objects by aligning audio, visual, and linguistic inputs for precise localization.
- Benchmark datasets like Ref-AVS Bench provide standardized, per-frame annotations that enable rigorous evaluation of multi-source referential grounding.
- Recent methods leverage multi-stage architectures with explicit reasoning, fusion strategies, and token propagation to achieve state-of-the-art segmentation performance.
Referring Audio-Visual Segmentation (Ref-AVS) is defined as the task of segmenting specific objects in video data, guided by rich multi-modal queries—typically involving natural language expressions, audio signals, and visual context. Unlike classical segmentation or even standard audio-visual segmentation, the Ref-AVS formulation requires precise alignment of cross-modal cues for fine-grained localization and category prediction of the referred target, often in dynamic, cluttered environments with overlapping sources and ambiguous referents.
1. Problem Definition and Motivation
Ref-AVS seeks to produce pixel-wise segmentation masks for objects described by composite expressions that intertwine linguistic, aural, and visual information. This task represents a natural evolution from (i) video object segmentation (VOS), (ii) referring expression segmentation (using only language and visual data), and (iii) classical audio-visual segmentation, which segments all sounding entities irrespective of any specific natural-language instruction. Ref-AVS introduces multi-cue grounding, demanding that the model understand and fuse audio-visual correspondences, temporal scene context, and referential semantics. The motivation arises from the complex interplay of cues in human multimedia comprehension: users routinely employ spatial language, auditory signatures, and temporal context to refer to objects (e.g., “the person playing the loud trumpet standing leftmost on stage”) (Wang et al., 15 Jul 2024).
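To make the input-output contract concrete, the following is a minimal sketch of a Ref-AVS inference call; the wrapper function, argument names, and tensor shapes are illustrative assumptions rather than any published interface.

```python
import numpy as np

def ref_avs_segment(model, frames, audio, expression):
    """Hypothetical Ref-AVS inference call (interface is illustrative only).

    frames:     (T, H, W, 3) uint8 RGB frames of the video clip
    audio:      (S,) float32 mono waveform aligned with the clip
    expression: natural-language referring expression, e.g.
                "the person playing the loud trumpet standing leftmost on stage"
    returns:    (T, H, W) boolean masks; all-False frames indicate that the
                expression refers to no object (a null query)
    """
    masks = model(frames=frames, audio=audio, text=expression)  # assumed signature
    return np.asarray(masks).astype(bool)
```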
2. Benchmarks and Datasets
The introduction of the Ref-AVS Bench addresses the shortage of standardized datasets for this task. This benchmark consists of approximately 4,000 ten-second YouTube video clips (covering >11 hours) over 48 object categories. Each video snippet is annotated with per-frame, pixel-level ground truth and is paired with multimodal reference expressions, which are meticulously constructed to (i) guarantee uniqueness and referential necessity, and (ii) explicitly intertwine audio, visual, and/or temporal cues in natural language form. Reference expressions are crafted to cover a spectrum of complexity, supporting rigorous evaluation over three splits:
- Seen: Categorical overlap with the training set.
- Unseen: Probing model generalization to new object categories.
- Null: Reference expressions that correspond to no object, measuring response sparsity and error rates for non-referent queries.
Compared to earlier AVS-focused datasets, Ref-AVS Bench enforces multimodal-expressive disambiguation and goes beyond merely segmenting all sound sources, enabling research on the structured reasoning and fine-grained selection inherent to natural multimodal queries (Wang et al., 15 Jul 2024).
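As a concrete illustration of how a benchmark sample of this kind might be organized, the sketch below defines a hypothetical per-clip record with the fields described above; the field names and types are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RefAVSSample:
    """Hypothetical Ref-AVS Bench record; field names are illustrative."""
    video_id: str        # YouTube clip identifier
    frames: np.ndarray   # (T, H, W, 3) RGB frames from the ten-second clip
    audio: np.ndarray    # (S,) mono waveform for the same clip
    expression: str      # multimodal referring expression in natural language
    masks: np.ndarray    # (T, H, W) per-frame, pixel-level ground truth
    category: str        # one of the 48 object categories
    split: str           # "seen", "unseen", or "null"

    def is_null(self) -> bool:
        # Null-split expressions refer to no object, so every mask is empty.
        return self.split == "null" or not self.masks.any()
```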
3. Methodological Advances
Recent methods for Ref-AVS have converged on multi-stage architectures that systematically fuse textual, auditory, and visual features before segmentation:
3.1. Multimodal Representation and Expression Enhancement
Modalities are processed by independent encoders—for instance, VGGish for audio, Swin Transformer for visuals, and RoBERTa for text—followed by dimensional alignment and concatenation. Each modality can be distinguished by specific tokens ([aud], [vis]) (Wang et al., 15 Jul 2024). Temporal bi-modal transformers are employed to enhance expression-aligned features, incorporating cached memory modules to capture the accumulated context and support difficult referential grounding in dynamic scenes.
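A minimal sketch of this encode-align-concatenate stage is given below. It assumes generic pretrained feature extractors have already produced per-modality sequences, and the module and dimension names are placeholders rather than the exact EEMC implementation.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Illustrative encode-align-concatenate stage (not the exact EEMC code)."""

    def __init__(self, d_model=256, d_aud=128, d_vis=1024, d_txt=768):
        super().__init__()
        # Project each modality (e.g., VGGish audio, Swin visual, RoBERTa text
        # features) into a shared d_model-dimensional space.
        self.proj_aud = nn.Linear(d_aud, d_model)
        self.proj_vis = nn.Linear(d_vis, d_model)
        self.proj_txt = nn.Linear(d_txt, d_model)
        # Learnable modality tokens playing the role of [aud] and [vis].
        self.tok_aud = nn.Parameter(torch.randn(1, 1, d_model))
        self.tok_vis = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, f_aud, f_vis, f_txt):
        # f_aud: (B, T_a, d_aud), f_vis: (B, T_v, d_vis), f_txt: (B, T_t, d_txt)
        B = f_aud.size(0)
        a = torch.cat([self.tok_aud.expand(B, 1, -1), self.proj_aud(f_aud)], dim=1)
        v = torch.cat([self.tok_vis.expand(B, 1, -1), self.proj_vis(f_vis)], dim=1)
        t = self.proj_txt(f_txt)
        # Concatenate along the sequence dimension for downstream fusion.
        return torch.cat([a, v, t], dim=1)  # (B, T_a + T_v + T_t + 2, d_model)
```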
3.2. Multimodal Fusion and Cross-Modal Prompting
Self-attention and cross-attention mechanisms in multimodal transformers are used to merge expression-enhanced features. Ref-AVS models (e.g., EEMC) form a joint reference representation $\mathbf{f}_{\text{ref}} = \mathrm{MF}(\mathbf{f}_{\text{aud}}, \mathbf{f}_{\text{vis}}, \mathbf{f}_{\text{txt}})$, where MF(·) is a self-attention-based fusion over the aligned audio, visual, and text features. Mask queries are then updated via cross-attention transformers conditioned on these multimodal reference vectors, providing a prompt for foundation visual segmentation models such as Mask2Former or SAM2 (Wang et al., 2 Jun 2025).
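The fusion-and-prompting step can be sketched with standard PyTorch attention modules as follows; the class is an illustrative stand-in for MF(·) and the query-update transformer, not the published EEMC or SAM2-LOVE code, and the downstream segmentation-head interface is omitted.

```python
import torch
import torch.nn as nn

class FusionAndPrompt(nn.Module):
    """Illustrative MF(·) fusion plus cross-attention mask-query update."""

    def __init__(self, d_model=256, n_heads=8, n_queries=10):
        super().__init__()
        # Self-attention fusion over the concatenated multimodal sequence.
        self.fuse = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Learnable mask queries, updated by cross-attending to fused features.
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, multimodal_seq):
        # multimodal_seq: (B, L, d_model) from the encode-align-concatenate stage.
        fused = self.fuse(multimodal_seq)                      # MF(·)
        q = self.queries.expand(multimodal_seq.size(0), -1, -1)
        prompts, _ = self.cross_attn(q, fused, fused)          # updated mask queries
        # `prompts` would then condition a segmentation head such as
        # Mask2Former's decoder or SAM2's prompt encoder (interface not shown).
        return prompts
```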
3.3. Spatio-Temporal Consistency and Token Strategies
SAM2-LOVE introduces a learnable [seg] token that accumulates and propagates multimodal information across frames:
- Token Propagation: Forward transfer of the segmentation token ensures that the evolving multimodal context is respected in sequential frames.
- Token Accumulation: A backward-flowing [his] token aggregates historical [cls] representations, preventing loss of earlier frame context and mitigating target drift in dynamic content (Wang et al., 2 Jun 2025); see the sketch after this list.
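The sketch below is one simplified reading of this token flow, with hypothetical module names (`TokenPropagation`, a GRU-style history merge) standing in for the authors' design; it shows a [seg] token carried forward frame by frame while a [his] state accumulates per-frame summaries.

```python
import torch
import torch.nn as nn

class TokenPropagation(nn.Module):
    """Illustrative [seg]/[his] token flow across frames (simplified)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.seg_token = nn.Parameter(torch.randn(1, 1, d_model))  # propagated forward
        self.his_token = nn.Parameter(torch.randn(1, d_model))     # accumulates history
        self.frame_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge_hist = nn.GRUCell(d_model, d_model)

    def forward(self, frame_feats):
        # frame_feats: list of per-frame multimodal features, each (B, L, d_model).
        B = frame_feats[0].size(0)
        seg = self.seg_token.expand(B, 1, -1)
        his = self.his_token.expand(B, -1).contiguous()
        seg_per_frame = []
        for feats in frame_feats:
            # Token propagation: the current [seg] token attends to this frame's
            # multimodal features, carrying context forward to the next frame.
            seg, _ = self.frame_attn(seg, feats, feats)
            # Token accumulation: fold the frame-level summary (a stand-in for the
            # historical [cls] representation) into the [his] state.
            his = self.merge_hist(seg.squeeze(1), his)
            seg_per_frame.append(seg + his.unsqueeze(1))  # history-aware prompt
        return seg_per_frame  # one prompt token per frame for the mask decoder
```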
3.4. Explicit Reasoning Agents
More recent agents such as TGS-Agent decompose the process into “Think-Ground-Segment”:
- Think: An instruction-tuned multimodal LLM (Ref-Thinker) explicitly reasons and outputs an object-aware description, based on linguistic, visual, and audio analysis.
- Ground: The explicit description is passed to an open-vocabulary detector (Grounding-DINO) to produce coarse spatial localization (bounding boxes).
- Segment: The grounded region is then refined by SAM2 to produce fine-scale pixel-wise segmentation, all with frozen or off-the-shelf pre-trained modules, bypassing the need for extensive pixel-level supervision (Zhou et al., 6 Aug 2025); see the sketch after this list.
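The three-stage flow can be summarized as in the sketch below; `ref_thinker`, `grounding_dino`, and `sam2_refine` are placeholder callables for the frozen modules named above, and their signatures are assumptions.

```python
def tgs_agent(frames, audio, expression, ref_thinker, grounding_dino, sam2_refine):
    """Illustrative Think-Ground-Segment pipeline (placeholder module interfaces).

    ref_thinker:    instruction-tuned multimodal LLM that reasons over the
                    expression, frames, and audio and emits an explicit,
                    object-aware textual description of the referred target.
    grounding_dino: open-vocabulary detector mapping that description to
                    coarse bounding boxes per frame.
    sam2_refine:    promptable segmenter turning boxes into pixel-wise masks.
    """
    # Think: explicit reasoning produces an object-aware description
    # (or a "no referent" verdict for null expressions).
    description = ref_thinker(expression=expression, frames=frames, audio=audio)
    if description is None:
        return [None for _ in frames]  # null expression: nothing to segment

    masks = []
    for frame in frames:
        # Ground: open-vocabulary detection gives coarse localization.
        boxes = grounding_dino(frame, text=description)
        # Segment: SAM2 refines the grounded region into a fine-grained mask.
        masks.append(sam2_refine(frame, boxes=boxes))
    return masks
```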
4. Training Objectives and Loss Functions
Ref-AVS frameworks employ multimodal alignment losses to ensure robust referential accuracy:
- Target-Consistent Semantic Alignment Loss (SimToken): Aggregates embeddings from divergent expressions referring to the same target, enforcing a consistent semantic center. For video $i$ with $K$ such expressions, the loss takes a contrastive form, $\mathcal{L}_{\text{align}}^{(i)} = -\frac{1}{K}\sum_{k=1}^{K}\log\frac{\exp(\mathrm{sim}(q_k, p)/T)}{\sum_{j}\exp(\mathrm{sim}(q_k, p_j)/T)}$, where $q_k$ and $p$ are token representations (the per-expression tokens and their shared semantic center, with $p_j$ ranging over centers in the batch), $\mathrm{sim}(\cdot,\cdot)$ is a similarity such as cosine, and $T$ is a temperature parameter (Jin et al., 22 Sep 2025).
- Segmentation Feature Distillation Loss (AURORA): Forces the student segmentation model to track the output of a pure segmentation teacher, preserving spatial precision during reasoning-driven training. Mean-squared error is minimized between the student's and the teacher's spatial feature maps (Luo et al., 4 Aug 2025).
Supervision strategies range from classic per-pixel cross-entropy and Dice losses to contrastive and mutual-information regularization, depending on the architecture. Instruction-tuning datasets with explicit chain-of-thought “think–answer” traces have also proven essential, especially for interpretable modules (e.g., Ref-Thinker in TGS-Agent).
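To make these objectives concrete, here is a minimal sketch of an InfoNCE-style alignment term and an MSE feature-distillation term, following the formula and description above rather than the exact SimToken or AURORA implementations; tensor shapes and the in-batch negative scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(expr_tokens, centers, target_idx, temperature=0.07):
    """InfoNCE-style alignment of expression tokens to a shared semantic center.

    expr_tokens: (K, D) tokens q_k from K expressions referring to the same target
    centers:     (M, D) candidate semantic centers p_j (e.g., one per video in a batch)
    target_idx:  index of the center p belonging to this target's video
    """
    q = F.normalize(expr_tokens, dim=-1)
    p = F.normalize(centers, dim=-1)
    logits = q @ p.t() / temperature                     # (K, M) cosine similarities
    labels = torch.full((q.size(0),), target_idx, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)               # pull each q_k toward the shared p

def feature_distillation_loss(student_feats, teacher_feats):
    """MSE between student and (frozen) teacher segmentation feature maps."""
    return F.mse_loss(student_feats, teacher_feats.detach())
```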
5. Evaluation and Empirical Findings
Recent benchmarks report performance using the Jaccard index ($\mathcal{J}$), the F-score ($\mathcal{F}$), or their mean (e.g., $\mathcal{J}\&\mathcal{F}$). The SimToken framework improved $\mathcal{J}\&\mathcal{F}$ by 29.0% on seen splits and 5.1% on unseen splits relative to SAM2-LOVE, and also demonstrated state-of-the-art robustness to non-referent expressions (null set) (Jin et al., 22 Sep 2025). SAM2-LOVE achieved an 8.5% increase in $\mathcal{J}\&\mathcal{F}$ over EEMC (Wang et al., 2 Jun 2025). TGS-Agent, which uses explicit reasoning and open-vocabulary detection, led the field on both Ref-AVSBench and the more linguistically and semantically challenging R²-AVSBench (Zhou et al., 6 Aug 2025).
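For reference, both metrics can be computed per frame as in the sketch below, with $\mathcal{J}$ as mask IoU and $\mathcal{F}$ shown as a simple pixel-level F-measure (benchmarks typically use a contour-based variant); this mirrors common video-segmentation practice rather than a specific evaluation script.

```python
import numpy as np

def jaccard(pred, gt):
    """J: region similarity (IoU) between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both empty, e.g., a null expression
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def f_measure(pred, gt, eps=1e-8):
    """F: pixel-level F-score (simplified; benchmarks often use a contour-based F)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

def j_and_f(pred, gt):
    """Mean of J and F, the headline metric reported above."""
    return 0.5 * (jaccard(pred, gt) + f_measure(pred, gt))
```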
A table summarizing recent empirical results is shown below:
| Method | Dataset/Split | Mean $\mathcal{J}\&\mathcal{F}$ | Null-set (lower is better) |
|:---------------|:--------------|:--------------------------------|:---------------------------|
| EEMC | Ref-AVSBench | — | Best on null set |
| SAM2-LOVE | Ref-AVSBench | — | — |
| SimToken | Ref-AVSBench | +29.0% (seen); +5.1% (unseen) vs. SAM2-LOVE | Matches EEMC |
| TGS-Agent | R²-AVSBench | SOTA | Maintains performance |
6. Challenges and Future Directions
Despite significant progress, Ref-AVS remains a demanding task due to multi-source ambiguity, open-vocabulary reference resolution, and cross-modal reasoning complexity. Key future directions include:
- Temporal Reasoning: Improved modeling of temporal continuity where the target object(s) may disappear and re-emerge, drift, or overlap dynamically (Wang et al., 2 Jun 2025).
- Explicit Reasoning and Interpretability: Structured chain-of-thought prompting, as in AURORA and TGS-Agent, enables more transparent decision processes and facilitates correction or debugging (Luo et al., 4 Aug 2025, Zhou et al., 6 Aug 2025).
- Generalization to Unseen Categories/References: Benchmarks such as R²-AVSBench and ablation analyses show that generalization across unseen object categories and non-trivial referring expressions remains a challenge.
- Integration with Foundation Models: Incorporating frozen or lightly tuned foundation vision and LLMs (e.g., SAM2, VideoLLaMA2, CLIP) allows researchers to separate supervision for reasoning and segmentation, leveraging strong prior knowledge while minimizing the need for dense annotation (Wang et al., 2 Jun 2025, Luo et al., 4 Aug 2025, Zhou et al., 6 Aug 2025).
- Cross-Modal Alignment Losses: Continued improvement of loss functions (e.g., semantic alignment, mutual information, distillation) is expected to drive finer cross-modal referencing and robustness to diversity in referring expressions (Jin et al., 22 Sep 2025).
Plausible implications are that advances in cross-modal semantic representation, interpretability of agent reasoning, and token-based prompt engineering will play an increasingly central role in bridging the gap between human referring strategies and automated segmentation in rich audio-visual environments.
7. Applications
Ref-AVS systems have immediate applications in interactive video editing, multimedia content retrieval, augmented reality, surveillance, and robotics—areas where spatially and temporally precise grounding of a user’s multimodal reference is essential. The ability to bridge natural language, complex audio, and multi-object scenes will enable next-generation systems for semantic multimedia indexing, situation awareness, and embodied agents capable of nuanced multimodal interaction.