Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

Published 9 Apr 2026 in cs.CV | (2604.07916v1)

Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal LLM (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.


Authors (8)

Summary

The paper introduces a training-free framework that combines MLLM reasoning with DINOv3-driven mask refinement to address both explicit and implicit referring expression segmentation.
It reports 75.5 gIoU on RefCOCO testA and strong generalization in open-world scenarios, outperforming previous training-free baselines.
Component ablation confirms that the ERI and MSR modules are critical for enhancing segmentation accuracy and scalability without relying on annotated training data.

Tarot-SAM3: Training-Free SAM3 for Any Referring Expression Segmentation

Motivation and Problem Formulation

Referring Expression Segmentation (RES) is a core vision-language task that aims to segment the image regions specified by a natural-language expression. Traditional RES systems rely heavily on large annotated datasets and specialize in either explicit expressions (direct object identification) or implicit ones (contextual, relational, or commonsense-based), which limits their scalability to open-world scenarios. Recent models such as Segment Anything Model 3 (SAM3) offer robust promptable concept segmentation, but applying them to RES raises two problems: SAM3 alone struggles with longer or implicit expressions, and naively coupling it with a multimodal LLM (MLLM) makes the result overly dependent on the MLLM's reasoning while offering no way to refine SAM3's masks.

Tarot-SAM3 addresses these bottlenecks with a training-free framework that combines the reasoning capabilities of an MLLM with object-aware feature coherence from DINOv3, enabling accurate segmentation for any referring expression, explicit or implicit.

Tarot-SAM3 Framework: System Design

Tarot-SAM3 employs a two-phase strategy:

1. Expression Reasoning Interpreter (ERI) Phase:

ERI decomposes the input expression into structured prompt types via reasoning-assisted prompt options and evaluation-aware rephrasing. It stabilizes MLLM responses, extracts target object names and refer-object sets, and generates heterogeneous prompts for SAM3, including explicit text prompts, bounding boxes, and evaluation-guided rephrasings.

2. Mask Self-Refining (MSR) Phase:

MSR selects the best mask across prompt types and refines it using DINOv3-driven feature relationships. It compares discriminative regions among the ERI outputs to detect over- or under-segmentation, then corrects the mask through targeted modification of prompt anchors.
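As a rough illustration of the region-affiliation idea behind MSR, the sketch below compares two candidate masks, treats the patches they agree on as a trusted core, and keeps a disputed patch only if its feature is similar to the core's mean feature. Everything here is an assumption made for illustration: masks are toy sets of patch indices, `features` stands in for DINOv3 patch features, and the function names, the cosine rule, and the `threshold` value are hypothetical rather than the paper's actual procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refine_mask(mask_a, mask_b, features, threshold=0.8):
    """Hypothetical refinement: resolve disagreement between two candidate masks.

    mask_a, mask_b: sets of patch indices (toy stand-ins for SAM3 masks).
    features: dict mapping patch index -> feature vector (stand-in for DINOv3).
    """
    core = mask_a & mask_b          # patches both candidates agree on
    disputed = mask_a ^ mask_b      # patches only one candidate includes
    if not core:
        return set(mask_a)          # nothing to compare against; keep mask_a
    prototype = mean_vector([features[p] for p in core])
    kept = {p for p in disputed if cosine(features[p], prototype) >= threshold}
    return core | kept


# Toy example: patch 3 resembles the core, patch 2 does not.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.95, 0.05]}
refined = refine_mask({0, 1, 2}, {0, 1, 3}, features)
```

In this toy setup, a patch wrongly included by one candidate (over-segmentation) is dropped, and a patch missed by the other (under-segmentation) is recovered, which mirrors the two correction cases described above.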

Figure 1: Overview of the Tarot-SAM3 framework, illustrating the ERI and MSR pipeline for training-free robust RES.

This architecture bridges linguistic reasoning and geometric segmentation, improving adaptability and visual consistency without task-specific training.
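The two-phase flow can be sketched as a small driver that takes the interpreter, the segmenter, and a mask scorer as plain callables. This is a minimal sketch under stated assumptions: `Prompt`, `run_pipeline`, `interpret`, `segment`, and `score` are hypothetical names introduced here, no real SAM3, MLLM, or DINOv3 API is called, and the toy stubs only mimic the shape of the data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

Mask = Set[int]  # toy mask: a set of pixel (or patch) indices


@dataclass(frozen=True)
class Prompt:
    """One heterogeneous prompt produced by a (hypothetical) ERI step."""
    kind: str                                        # "text", "box", or "rephrase"
    text: Optional[str] = None
    box: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)


def run_pipeline(
    expression: str,
    interpret: Callable[[str], List[Prompt]],  # stands in for ERI (MLLM reasoning)
    segment: Callable[[Prompt], Mask],         # stands in for a SAM3 call
    score: Callable[[Mask], float],            # stands in for MSR mask selection
) -> Tuple[Prompt, Mask]:
    """Generate prompts, segment each one, and return the best-scoring pair."""
    prompts = interpret(expression)
    if not prompts:
        raise ValueError("interpreter produced no prompts")
    candidates = [(p, segment(p)) for p in prompts]
    return max(candidates, key=lambda pair: score(pair[1]))


# Toy stubs so the sketch runs end to end.
def toy_interpret(expression: str) -> List[Prompt]:
    return [Prompt("text", text=expression), Prompt("box", box=(0, 0, 2, 2))]


def toy_segment(prompt: Prompt) -> Mask:
    return {1, 2} if prompt.kind == "text" else {1, 2, 3}


best_prompt, best_mask = run_pipeline("the dog", toy_interpret, toy_segment, len)
```

Passing the stages as callables keeps the driver independent of any particular model, which matches the training-free, modular framing of the framework.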

Performance Evaluation and Comparative Analysis

Explicit RES Benchmarks

Tarot-SAM3 is evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. In the zero-shot setting, Tarot-SAM3 with a Qwen2.5-VL 7B backbone achieves 75.5 gIoU on RefCOCO testA, outperforming previous training-free baselines such as EVOL-SAM3 (by +1.8) and SAM3 Agent (by +11.2).

Figure 2: Example visualizations on explicit (left) and implicit (right) RES benchmarks, showing mask quality.

Implicit RES and ReasonSeg

On the ReasonSeg benchmark, Tarot-SAM3 consistently outperforms both dataset-specific fine-tuned and training-free competitors. With Qwen2.5-VL 7B, it achieves 74.3 gIoU and 68.8 cIoU on the ReasonSeg test set, surpassing EVOL-SAM3 (+1.8) and SAM3 Agent (+11.3) and even outscoring fine-tuned methods (SAM-Veteran: +11.7 gIoU).
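For readers unfamiliar with the two metrics, the sketch below uses the definitions commonly adopted for ReasonSeg-style evaluation: gIoU is the mean of per-image IoU values, while cIoU is the cumulative intersection divided by the cumulative union over the whole dataset. That this matches the paper's exact evaluation code is an assumption, as is the convention of scoring two empty masks as 1.0. Masks are toy sets of pixel indices.

```python
def iou(pred, gt):
    """IoU of two masks given as sets of pixel indices."""
    union = len(pred | gt)
    # Assumed convention: two empty masks count as a perfect match.
    return len(pred & gt) / union if union else 1.0


def giou(pairs):
    """Mean of per-image IoUs over (prediction, ground truth) pairs."""
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)


def ciou(pairs):
    """Cumulative intersection over cumulative union across all pairs."""
    inter = sum(len(pred & gt) for pred, gt in pairs)
    union = sum(len(pred | gt) for pred, gt in pairs)
    return inter / union if union else 1.0


# Toy dataset: one perfect small mask, one poor large mask.
pairs = [
    ({1, 2}, {1, 2}),                       # IoU = 2/2 = 1.0
    ({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}),     # IoU = 2/8 = 0.25
]
g = giou(pairs)   # (1.0 + 0.25) / 2 = 0.625
c = ciou(pairs)   # (2 + 2) / (2 + 8) = 0.4
```

The toy numbers show why the two metrics can diverge: cIoU is dominated by large objects, whereas gIoU weights every image equally.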

Open-World Generalization

Tarot-SAM3 remains robust in open-world settings, including domain-shifted images (e.g., anime), which indicates that it generalizes and adapts beyond standard benchmarks.

Figure 3: Open-world visual comparisons and Tarot-SAM3 failure cases; left panel shows robust segmentation under diverse queries, right panel illustrates remaining ambiguity challenges.

Ablation Studies and System Component Contributions

Individual ablations highlight the vital contributions of the ERI and MSR modules:

ERI: Reasoning-assisted prompts (RPO), text augmentation, and bbox augmentation cumulatively yield substantial performance gains (e.g., +6.4 gIoU for RPO, +11.5 for text augmentation, +20.0 for bbox augmentation).
MSR: Inter-prompt selection (IPS) and object-awareness prompt modification (OPM) further enhance segmentation (+1.1 gIoU for OPM).

Figure 4: Sequential visualization of ERI and MSR phases, demonstrating progressive mask refinement.

Figure 5: Ablation visualizations show the impact of text prompt augmentation (left) and rephrased expression generation (right) on mask accuracy.

Practical and Theoretical Implications

Tarot-SAM3 demonstrates scalable, training-free RES. By integrating structured reasoning with prompt consistency filtering and feature-level refinement, it achieves closed-loop segmentation, reducing reliance on both the MLLM's interpretation and static SAM3 outputs. This architecture enables:

Practical deployment for multimodal reasoning, visual dialogue, and embodied navigation tasks without annotated datasets.
Theoretical advancements in unified processing of explicit/implicit expressions and prompt-adaptive segmentation.

Tarot-SAM3’s modularity should allow extension beyond still images, for example to video-based referring tasks, and to more complex multi-object reasoning.

Limitations and Prospective Directions

Despite achieving state-of-the-art results among training-free methods, Tarot-SAM3 has remaining limitations:

Difficulty with spatially ambiguous queries (“far left crate”), resulting in occasional mis-segmentation.
Challenges in region-level granularity for queries targeting sub-object regions.

Future work should emphasize accelerating inference, optimizing ERI and MSR for greater efficiency and generalization, and extending the framework to temporal and video-centric tasks, such as referring object tracking and sequential reasoning.

Conclusion

Tarot-SAM3 introduces a robust, training-free solution for universal referring expression segmentation, unifying ERI-driven multimodal reasoning with MSR-based mask refinement. Comprehensive experiments validate its superiority on explicit, implicit, and open-world benchmarks. Component-wise ablations confirm each module's effectiveness. This framework sets a new benchmark for scalable, annotation-free RES, with broad implications for future multimodal AI systems.
