Sa2VA: Unified Dense Vision-Language Segmentation

Updated 4 July 2026

Sa2VA is a unified framework for dense grounded understanding that integrates a multimodal LLM with SAM-2 for pixel-level segmentation guided by natural language.
It enables tasks such as referring image/video segmentation, grounded captioning, and conversational interaction via its distinctive [SEG] token interface.
Extensions like Sa2VA-i, SaSaSa2VA, and audio-adapted variants refine temporal modeling, resolve training–inference gaps, and adapt the approach to domain-specific challenges.

Sa2VA is a name used for multiple research systems in contemporary arXiv literature, but its dominant usage refers to a multimodal framework for dense grounded understanding of images and videos that couples a large multimodal LLM with SAM2 for pixel-level segmentation from natural-language instructions (Yuan et al., 7 Jan 2025). In that sense, Sa2VA is a unified model for tasks such as referring image segmentation, referring video object segmentation, grounded captioning, and grounded conversation, with a special instruction token, [SEG], linking language-conditioned reasoning to mask generation (Yuan et al., 7 Jan 2025). Subsequent work has extended, modified, or repurposed this framework in several directions, including temporally enhanced video segmentation in SaSaSa2VA (Niu et al., 21 Sep 2025), training–inference consistency corrections in Sa2VA-i (Nekrasov et al., 23 Sep 2025), audio-conditioned referring segmentation through ASR-SaSaSa2VA (Wang et al., 27 Apr 2026), and domain-specific medical adaptation in ARIADNE (Jin et al., 19 Mar 2026). The name also appears in unrelated contexts, including Sound Adversarial Audio-Visual Navigation, abbreviated in one source as “Sa2VA” (Yu et al., 2022), and an autonomous-driving anomaly framework where “Sa2VA” is given as an alternative shorthand for SAVANT (Brusnicki et al., 20 Oct 2025). This ambiguity makes disambiguation essential when interpreting the term in current literature.

1. Terminological scope and disambiguation

In the literature on dense grounded vision–language modeling, Sa2VA most commonly denotes “Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos” (Yuan et al., 7 Jan 2025). In this usage, Sa2VA is a unified model for dense grounded understanding of both images and videos, designed to support referring segmentation and conversation with minimal one-shot instruction tuning (Yuan et al., 7 Jan 2025).

A distinct but related line of work treats Sa2VA as the baseline backbone for referring video object segmentation systems. “4th PVUW MeViS 3rd Place Report: Sa2VA” uses Sa2VA-26B, built from InternVL2.5-26B and SAM-2, as a challenge system and modifies only test-time key-frame selection (Yuan et al., 1 Apr 2025). “Sa2VA-i” explicitly describes itself as an improved version of Sa2VA that rectifies training–inference inconsistencies while using the same Sa2VA checkpoints (Nekrasov et al., 23 Sep 2025). “SaSaSa2VA” is presented as an enhanced version of Sa2VA that adds Key Frame Compression and multiple [SEG] tokens for stronger temporal modeling (Niu et al., 21 Sep 2025). “ASR-SaSaSa2VA” then repurposes SaSaSa2VA as a text-based segmentation module within an audio-to-text pipeline for audio-guided video segmentation (Wang et al., 27 Apr 2026).

Other papers use the same string differently. One source states that “Sa2VA” in the query refers to Sound Adversarial Audio-Visual Navigation, a two-player zero-sum game for audio-visual navigation with an explicit sound attacker (Yu et al., 2022). Another states that SAVANT and “Sa2VA” are the same framework for semantic anomaly detection in autonomous driving (Brusnicki et al., 20 Oct 2025). These usages are methodologically unrelated to the dense grounded segmentation line.

This distribution of meanings suggests that, in current technical practice, “Sa2VA” is not a globally unambiguous identifier. A plausible implication is that the term should be interpreted from venue, task, and cited title rather than from the acronym alone.

2. Core formulation in dense grounded understanding

In its primary sense, Sa2VA is a simple, unified, end-to-end model that marries SAM-2 with a LLaVA-style multimodal LLM to achieve dense, grounded understanding across both images and videos (Yuan et al., 7 Jan 2025). Its stated objective is to unify text, image, and video in a shared LLM token space, so that the LLM can emit instruction tokens guiding SAM-2 toward precise masks (Yuan et al., 7 Jan 2025).

The model takes a text prompt and either an image or a video, optionally with visual prompts, and produces text and/or masks:

$T_o, M_o = \text{LLM}(\{I_i, V_i, VP_i\}, T)$

as summarized in the original paper (Yuan et al., 7 Jan 2025). The central mechanism is the special [SEG] token. Given text tokens and visual tokens, the LLM produces a last-layer hidden state at the [SEG] position,

$I = f_\theta(T, V) = h_{\text{SEG}},$

which is mapped by a small MLP into a SAM-2 prompt embedding,

$P_l = W_2 \sigma(W_1 h_{\text{SEG}}),$

and then consumed by SAM-2 to generate a mask,

$M_o = g_\phi(F, P_l)$

(Yuan et al., 7 Jan 2025).
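The projection from the [SEG] hidden state to a SAM-2 prompt embedding can be illustrated with a minimal sketch. It assumes that $\sigma$ is a ReLU and uses toy dimensions and weights; the function names (`matvec`, `relu`, `project_seg`) are hypothetical and not taken from the Sa2VA code base.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU; the choice of nonlinearity is an assumption."""
    return [max(0.0, a) for a in v]


def project_seg(h_seg, W1, W2):
    """Compute P_l = W2 * sigma(W1 * h_SEG) for a single [SEG] hidden state."""
    return matvec(W2, relu(matvec(W1, h_seg)))


# Toy sizes (hypothetical): hidden size 3, MLP width 2, prompt size 2.
h_seg = [1.0, -2.0, 0.5]
W1 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0]]
W2 = [[1.0, 1.0],
      [0.5, -1.0]]

prompt = project_seg(h_seg, W1, W2)
print(prompt)  # [2.0, 1.0]
```

In the real system the resulting vector would be passed to the SAM-2 decoder as a prompt; here it is only printed.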

This architecture creates a clear division of labor. The multimodal LLM performs open-ended grounding and instruction following, while SAM-2 supplies high-quality segmentation and temporal propagation (Yuan et al., 7 Jan 2025). In image tasks, SAM-2 operates as an image segmenter; in video tasks, masks are first predicted for key frames and then propagated with SAM-2 memory across the sequence (Yuan et al., 7 Jan 2025). The same basic interface underlies later descriptions of Sa2VA in challenge reports and derivative systems (Yuan et al., 1 Apr 2025, Nekrasov et al., 23 Sep 2025, Niu et al., 21 Sep 2025).

The original paper positions this as the first unified model for dense grounded understanding of both images and videos (Yuan et al., 7 Jan 2025). That claim is tied specifically to the combination of a shared LLM token space, instruction-token prompting, and SAM-2-based mask decoding across both static and dynamic visual inputs (Yuan et al., 7 Jan 2025).

3. Architecture and inference mechanics

Sa2VA couples a multimodal LLM backbone with SAM-2. The original implementation is evaluated primarily with InternVL2/2.5 as the base MLLM, and also tested with Qwen2-VL (Yuan et al., 7 Jan 2025). The visual encoder maps images or sampled video frames to visual tokens, a projection layer maps them into the LLM token space, and the LLM autoregressively fuses visual and text tokens (Yuan et al., 7 Jan 2025). SAM-2 is kept frozen by default and supplies encoder features, prompt handling, decoder inference, and temporal memory (Yuan et al., 7 Jan 2025).

In the video case, a concise formulation given in the original work is that key frames are segmented first,

$M_i = g_\phi(F_i, P_l),$

then memory is updated, and subsequent frames are segmented conditioned on memory (Yuan et al., 7 Jan 2025). “4th PVUW MeViS 3rd Place Report: Sa2VA” describes essentially the same inference structure: key-frame selection, multimodal reasoning on the selected frames and text, [SEG] extraction and projection to a language prompt embedding, SAM-2 key-frame segmentation with memory initialization, and full-video propagation (Yuan et al., 1 Apr 2025).
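The control flow of this inference structure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the evenly spaced sampling rule, the function names, and the callables `segment_key_frame` and `propagate` are all hypothetical stand-ins for the MLLM-plus-SAM-2 components, and the memory is a plain list rather than SAM-2's streaming memory bank.

```python
def sample_key_frames(num_frames, k=5):
    """Pick up to k evenly spaced frame indices (a hypothetical sampling rule)."""
    if num_frames <= k:
        return list(range(num_frames))
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def segment_video(frames, prompt, segment_key_frame, propagate, k=5):
    """Segment key frames from the prompt, then fill in the remaining frames."""
    masks = {}
    memory = []  # simplified stand-in for SAM-2 memory
    for i in sample_key_frames(len(frames), k):
        masks[i] = segment_key_frame(frames[i], prompt)
        memory.append((i, masks[i]))
    for i in range(len(frames)):
        if i not in masks:
            # Non-key frames are predicted conditioned on the key-frame memory.
            masks[i] = propagate(frames[i], memory)
    return [masks[i] for i in range(len(frames))]


# Toy usage with string placeholders instead of images and masks.
frames = [f"f{i}" for i in range(9)]
masks = segment_video(
    frames,
    "prompt-embedding",
    segment_key_frame=lambda frame, prompt: "key:" + frame,
    propagate=lambda frame, memory: "prop:" + frame,
)
print(masks)
```

Swapping `sample_key_frames` for another rule changes only test-time frame selection and leaves the rest of the pipeline untouched.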

The following table summarizes the principal components as described in the dense grounded understanding line.

Component	Role	Representative details
Multimodal LLM	Grounded reasoning over video/image and text	InternVL2/2.5 or similar MLLM (Yuan et al., 7 Jan 2025)
[SEG] token	Segmentation control interface	Hidden state mapped to SAM-2 prompt embedding (Yuan et al., 7 Jan 2025)
SAM-2	Mask decoding and video propagation	Frozen by default in original Sa2VA (Yuan et al., 7 Jan 2025)
Key frames	LLM temporal context for video	Original Sa2VA samples 5 frames (Yuan et al., 7 Jan 2025)

This design is intentionally decoupled. The original paper emphasizes that the decoupled architecture preserves the strengths of both subsystems: SAM-2’s segmentation and memory capabilities, and the LLM’s visual-linguistic competence (Yuan et al., 7 Jan 2025). Later papers identify this same decoupling as both a strength and a source of practical mismatches, particularly in how SAM-2 memory is used at inference (Nekrasov et al., 23 Sep 2025).

4. Training data, objectives, and benchmark performance

Sa2VA is trained with one-shot supervised instruction tuning over a mixed dataset spanning image QA, video QA, image segmentation, video segmentation, and grounded captioning (Yuan et al., 7 Jan 2025). The data mixture is reported as about 1.2M examples, including LLaVA 1.5 image QA, ChatUniVi video QA, RefCOCO/+/g, GLaMM-style grounded conversation generation, Ref-YTVOS, MeViS, ReVOS, and Ref-SAV (Yuan et al., 7 Jan 2025). The overall supervised instruction-tuning loss is

$L_{\text{instruction}} = L_{\text{text}} + L_{\text{mask}},$

with

$L_{\text{mask}} = L_{\text{CE}} + L_{\text{DICE}}$

(Yuan et al., 7 Jan 2025).
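A minimal sketch of the mask objective on flattened masks is given below. It assumes a per-pixel binary cross-entropy, a soft Dice term with a small smoothing constant, and an unweighted sum as in the formulas above; the function names and the `eps` value are illustrative choices, not details from the paper.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy over flattened masks."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), with smoothing eps."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def mask_loss(pred, target):
    """L_mask = L_CE + L_DICE."""
    return bce(pred, target) + dice_loss(pred, target)


def instruction_loss(text_loss, pred, target):
    """L_instruction = L_text + L_mask, with the text loss supplied by the caller."""
    return text_loss + mask_loss(pred, target)


target = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]  # close to the target mask
bad = [0.1, 0.9, 0.2, 0.8]   # nearly inverted mask
print(mask_loss(good, target) < mask_loss(bad, target))  # True
```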

The model is reported to achieve strong performance across multiple tasks. The original paper reports, for Sa2VA-8B, MeViS $J\&F = 57.0$, Ref-DAVIS17 $J\&F = 75.2$, and ReVOS $J\&F = 57.6$; it also reports MeViS and Ref-DAVIS17 results for Sa2VA-26B (Yuan et al., 7 Jan 2025). On referring image segmentation, Sa2VA-26B is reported at 82.5 on RefCOCO, 78.8 on RefCOCO+, and 79.7 on RefCOCOg (Yuan et al., 7 Jan 2025). These results are presented together with strong image and video chat scores, which the paper uses to support the claim that dense grounding was added without sacrificing conversational ability (Yuan et al., 7 Jan 2025).

A central empirical element is Ref-SAV, an auto-labeled Ref-VOS dataset introduced with the original paper (Yuan et al., 7 Jan 2025). It contains 72,509 object expressions across 37,311 videos, and a manually validated subset is provided for benchmark evaluation (Yuan et al., 7 Jan 2025). The paper reports that training with Ref-SAV improves performance on complex video grounding and that even strong prior baselines remain far below Sa2VA on the Ref-SAV benchmark (Yuan et al., 7 Jan 2025).

Later challenge reports provide additional benchmark context. “4th PVUW MeViS 3rd Place Report: Sa2VA” reports 56.3 $J\&F$, 52.7 $J$, and 59.8 $F$ on the PVUW MeViS challenge using Sa2VA-26B with only an inference modification and no further training (Yuan et al., 1 Apr 2025). This indicates that the base Sa2VA checkpoints were already competitive enough to serve as the backbone for high-ranking challenge submissions.
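As a consistency check, read the three challenge numbers as $J\&F$, $J$, and $F$, and assume the standard video-segmentation convention that $J\&F$ is the mean of region similarity $J$ and contour accuracy $F$. The second and third values then average to the first:

$J\&F = \frac{J + F}{2} = \frac{52.7 + 59.8}{2} = 56.25 \approx 56.3$

The small difference is expected because the reported $J$ and $F$ are themselves rounded.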

5. Variants and extensions

Several papers extend Sa2VA while keeping its central MLLM-plus-SAM2 structure.

Sa2VA-i addresses a specific inconsistency between training and inference. During original Sa2VA training in the video setting, only SAM2’s mask decoder is fine-tuned and memory components are frozen and unused; during original inference, however, memory-conditioned streaming propagation is used (Nekrasov et al., 23 Sep 2025). Sa2VA-i enforces training–inference parity for initial mask prediction by using the fine-tuned mask decoder without memory conditioning on sampled frames, then propagating with the original off-the-shelf SAM2 decoder and memory stack (Nekrasov et al., 23 Sep 2025). Using the same checkpoints, it reports improvements of up to +11.6 $J\&F$ on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS, and +4.1 on ReVOS (Nekrasov et al., 23 Sep 2025). It also reports that Sa2VA-i-1B reaches 52.2 $J\&F$ on MeViS, matching original Sa2VA-26B at 52.1 $J\&F$ (Nekrasov et al., 23 Sep 2025).
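The two-stage inference described for Sa2VA-i can be sketched as a control-flow outline. Everything here is hypothetical: `finetuned_decode` stands in for the fine-tuned mask decoder run without memory, `sam2_propagate` for the off-the-shelf SAM2 decoder and memory stack, and the nearest-frame toy propagator exists only to make the example runnable.

```python
def sa2va_i_inference(frames, prompt, sampled_ids, finetuned_decode, sam2_propagate):
    """Two-stage inference: memory-free initial masks, then propagation."""
    # Stage 1: predict masks on sampled frames WITHOUT memory conditioning,
    # mirroring how the decoder was used during training.
    init_masks = {i: finetuned_decode(frames[i], prompt) for i in sampled_ids}
    # Stage 2: hand the initial masks to a separate propagator for all frames.
    return sam2_propagate(frames, init_masks)


def toy_propagate(frames, init_masks):
    """Toy stand-in: copy the mask of the nearest sampled frame."""
    out = []
    for i in range(len(frames)):
        nearest = min(init_masks, key=lambda j: abs(j - i))
        out.append(init_masks[nearest])
    return out


frames = ["f0", "f1", "f2", "f3"]
result = sa2va_i_inference(
    frames,
    "prompt-embedding",
    sampled_ids=[0, 3],
    finetuned_decode=lambda frame, prompt: f"mask({frame})",
    sam2_propagate=toy_propagate,
)
print(result)  # ['mask(f0)', 'mask(f0)', 'mask(f3)', 'mask(f3)']
```

The point of the separation is that the component used for initial prediction sees the same kind of input at inference as it did in training, while propagation is delegated to an unmodified model.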

SaSaSa2VA enhances temporal modeling. It identifies two bottlenecks in Sa2VA: sparse frame sampling and reliance on a single [SEG] token for an entire video (Niu et al., 21 Sep 2025). Its remedy is Segmentation Augmented and Selective Averaged Sa2VA, which introduces Key Frame Compression and one [SEG] token per clip (Niu et al., 21 Sep 2025). In its challenge configuration, it supervises key frames while decoding clip-wise masks with SAM2 (Niu et al., 21 Sep 2025). It ranked first in the 7th LSVOS Challenge RVOS track (Niu et al., 21 Sep 2025).
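The idea of one [SEG] token per clip, instead of one for the whole video, can be illustrated with a small indexing sketch. The clip length and function names are hypothetical, and Key Frame Compression is not modeled here.

```python
def split_into_clips(num_frames, clip_len):
    """Partition frame indices into consecutive clips of at most clip_len frames."""
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def assign_seg_tokens(num_frames, clip_len):
    """Map each frame index to the index of the [SEG] token of its clip."""
    return {
        frame: clip_id
        for clip_id, clip in enumerate(split_into_clips(num_frames, clip_len))
        for frame in clip
    }


clips = split_into_clips(7, 3)
print(clips)  # [[0, 1, 2], [3, 4, 5], [6]]

seg_of_frame = assign_seg_tokens(7, 3)
print(seg_of_frame[4])  # 1, so frame 4 is decoded from the second [SEG] token
```

With a single clip covering the whole video, this mapping reduces to the original single-token behavior.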

DeSa2VA modifies the prompt representation itself. It argues that Sa2VA’s direct feature fusion entangles dynamic visual information and static semantics inside a single [SEG] token, which degrades segmentation accuracy (Jisheng et al., 28 Jun 2025). DeSa2VA introduces text understanding pre-training, linear decoupling into text and visual subspaces, dynamic mask fusion, and triple supervision (Jisheng et al., 28 Jun 2025). On the reported benchmarks, DeSa2VA-4B improves over Sa2VA-4B on RefCOCO, RefCOCO+, RefCOCOg, MeViS, Ref-DAVIS17, and ReVOS while leaving QA scores unchanged (Jisheng et al., 28 Jun 2025).

Pixel-SAIL represents a more radical reinterpretation. It is presented as an expert-free path within the broader Sa2VA effort, replacing expert-assisted segmentation with a single-transformer design (Zhang et al., 14 Apr 2025). The paper states that code and models are released under the Sa2VA repository and frames Pixel-SAIL as extending the Sa2VA line from expert-integrated dense grounding to expert-free dense grounding (Zhang et al., 14 Apr 2025). This suggests that “Sa2VA” also names a broader repository and research program, not only a single architecture.

6. Audio-conditioned, domain-specific, and challenge-specific reuse

Sa2VA has also been repurposed as a modular segmentation backbone in tasks outside standard text-only Ref-VOS.

In audio-aware referring video object segmentation, “ASR-SaSaSa2VA” presents a resource-efficient framework that converts audio into textual motion descriptions via automatic speech recognition and then applies a pre-trained text-based referring video segmentation model, specifically SaSaSa2VA, for mask prediction (Wang et al., 27 Apr 2026). The paper explicitly states that Sa2VA is the original text-based referring video segmentation framework that marries an MLLM with a segmentation backbone, and that SaSaSa2VA builds upon it rather than abbreviating it (Wang et al., 27 Apr 2026). The pipeline first transcribes the audio into a text description, then applies a no-target expression detector to decide whether to skip segmentation, before feeding the text description and video into SaSaSa2VA (Wang et al., 27 Apr 2026). It reports a final score of 80.7, ranking second in the 5th PVUW Challenge MeViS-v2-Audio track (Wang et al., 27 Apr 2026).
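The three stages of this pipeline (transcription, no-target detection, text-based segmentation) can be sketched as a simple function composition. The callables `transcribe`, `is_no_target`, and `segment` are hypothetical placeholders for the ASR model, the detector, and SaSaSa2VA, and returning `None` per frame for skipped videos is an assumption made for illustration.

```python
def audio_guided_segmentation(audio, video, transcribe, is_no_target, segment):
    """Audio -> text -> optional skip -> text-based video segmentation."""
    text = transcribe(audio)
    if is_no_target(text):
        # Skip segmentation entirely; emit an empty prediction per frame.
        return [None] * len(video)
    return segment(video, text)


# Toy stand-ins so the sketch runs end to end.
def toy_transcribe(audio):
    return audio.strip().lower()


def toy_is_no_target(text):
    return text == ""


def toy_segment(video, text):
    return [f"mask[{text}]@{frame}" for frame in video]


video = ["f0", "f1"]
print(audio_guided_segmentation(" The Running Dog ", video,
                                toy_transcribe, toy_is_no_target, toy_segment))
print(audio_guided_segmentation("   ", video,
                                toy_transcribe, toy_is_no_target, toy_segment))
```

Because the segmenter only ever sees text, the audio front-end can be replaced without retraining the segmentation model.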

APRVOS, the first-place solution in the same track, also uses Sa2VA as the coarse semantic segmenter after audio transcription and visual existence verification (Miao et al., 20 Apr 2026). There Sa2VA is treated as a full-video multimodal segmenter that is called on the video together with the transcribed expression (Miao et al., 20 Apr 2026). The report argues that Sa2VA alone is insufficiently robust to noisy ASR and target absence, and therefore wraps it with VibeVoice-ASR, Omni-based existence verification, and SAM3-based refinement (Miao et al., 20 Apr 2026). In its ablation table, Sa2VA-4B without judgment scores 0.45, Sa2VA-26B without judgment 0.53, Sa2VA-4B + Omni judgment 0.55, and the full system 0.67 (Miao et al., 20 Apr 2026).

In medical imaging, ARIADNE configures Sa2VA as the perception-stage vision–language foundation model for coronary angiography analysis (Jin et al., 19 Mar 2026). It instantiates Sa2VA with InternViT-6B-448px as a frozen image encoder, InternLM2 adapted via LoRA, and a SAM-2 mask decoder (Jin et al., 19 Mar 2026). Sa2VA is then fine-tuned with Dice loss, Direct Preference Optimization, and hard-sample-focused training to prioritize topological coherence (Jin et al., 19 Mar 2026). On 1,400 clinical angiograms, ARIADNE reports centerline Dice, overall Dice, and IoU, with false positives reduced by 41% compared to geometric baselines (Jin et al., 19 Mar 2026). This usage shows that Sa2VA can function as a promptable dense segmenter in specialized scientific imaging, provided the alignment objective is redefined around domain-specific structure.
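For readers unfamiliar with the overlap metrics named here, the following sketch computes plain Dice and IoU on flattened binary masks. It uses the standard definitions, not ARIADNE's evaluation code; centerline Dice requires skeletonization and is not modeled, and treating two empty masks as a perfect match is a convention chosen for this example.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)  # 0.5 and one third

# The two metrics are monotonically related: Dice = 2 * IoU / (1 + IoU).
print(abs(dice - 2 * iou / (1 + iou)) < 1e-12)  # True
```

Both metrics count pixels only, which is why a high score does not by itself guarantee a connected, structurally coherent vessel tree.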

These adaptations suggest that Sa2VA’s practical significance lies not only in its original benchmark numbers but also in the portability of its [SEG]-to-SAM2 interface.

7. Limitations, misconceptions, and competing meanings

Several recurring limitations appear across the Sa2VA literature. The original model samples only a small number of frames for video understanding, which later works identify as inadequate for long-range temporal reasoning and motion-centric expressions (Yuan et al., 7 Jan 2025, Niu et al., 21 Sep 2025, Yuan et al., 1 Apr 2025). A single shared [SEG] token can be under-expressive for long videos with substantial appearance or motion changes (Niu et al., 21 Sep 2025). Sa2VA-i further argues that original inference misuses SAM2 memory relative to training, causing feature-distribution mismatch and degraded performance (Nekrasov et al., 23 Sep 2025). Audio-conditioned adaptations show sensitivity to transcription quality and to no-target or ambiguous audio queries (Wang et al., 27 Apr 2026, Miao et al., 20 Apr 2026). Medical adaptation work identifies a semantic–topological gap: pixel-wise overlap does not ensure structurally coherent vessel trees (Jin et al., 19 Mar 2026).

One common misconception is to treat SaSaSa2VA as a shorthand for Sa2VA. The audio-track report explicitly rejects that interpretation: Sa2VA is the original framework, whereas SaSaSa2VA is an enhanced version that extends temporal modeling with Key Frame Compression and multiple [SEG] tokens (Wang et al., 27 Apr 2026). Another misconception is that all mentions of “Sa2VA” refer to the dense grounded segmentation model. In fact, at least two unrelated uses are documented: Sound Adversarial Audio-Visual Navigation (Yu et al., 2022) and SAVANT’s alternative shorthand (Brusnicki et al., 20 Oct 2025). Those systems operate in audio-visual navigation and semantic anomaly detection rather than dense segmentation.

The table below summarizes the main meanings documented in the supplied literature.

Usage of “Sa2VA”	Task area	Relationship
“Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos” (Yuan et al., 7 Jan 2025)	Dense grounded image/video understanding	Primary usage
Sa2VA baseline in RVOS challenge reports (Yuan et al., 1 Apr 2025, Nekrasov et al., 23 Sep 2025, Niu et al., 21 Sep 2025)	Referring video object segmentation	Direct continuation
Sa2VA within APRVOS and ASR-SaSaSa2VA (Miao et al., 20 Apr 2026, Wang et al., 27 Apr 2026)	Audio-aware Ref-VOS	Reused as segmentation backbone
Sa2VA in ARIADNE (Jin et al., 19 Mar 2026)	Coronary angiography segmentation	Domain adaptation
“Sa2VA” for SAAVN (Yu et al., 2022)	Audio-visual navigation	Unrelated acronym usage
“Sa2VA” for SAVANT (Brusnicki et al., 20 Oct 2025)	Autonomous-driving anomaly detection	Unrelated shorthand

Overall, Sa2VA is best understood as a central architecture in the recent development of dense grounded multimodal systems: an MLLM-guided, SAM2-backed segmentation framework whose defining abstraction is the [SEG] token as a language-conditioned prompt interface (Yuan et al., 7 Jan 2025). Its subsequent history is characterized by improvements to temporal modeling, inference consistency, prompt factorization, audio front-ends, and domain-specific preference alignment (Niu et al., 21 Sep 2025, Nekrasov et al., 23 Sep 2025, Jisheng et al., 28 Jun 2025, Wang et al., 27 Apr 2026, Jin et al., 19 Mar 2026). At the same time, the acronym’s drift across unrelated subfields means that any technical reading of “Sa2VA” requires explicit citation-based disambiguation.