X-SAM: Unified Multimodal Segmentation

Updated 8 August 2025
  • X-SAM is a unified multimodal large language model framework that extends SAM to support any segmentation using integrated visual and language features.
  • It employs dual encoders and MLP projectors to fuse global and fine-grained features, enabling robust pixel-level segmentation across diverse tasks.
  • X-SAM introduces Visual GrounDed segmentation, using interactive visual prompts for cross-image, open-vocabulary instance segmentation.

X-SAM is a unified Multimodal LLM (MLLM) framework that generalizes the "segment anything" paradigm of SAM to encompass "any segmentation," enabling multimodal, pixel-level visual understanding across a wide diversity of segmentation tasks. Unlike the original SAM—which is limited in multi-mask prediction, category-specific segmentation, and cross-task architectural unification—X-SAM is designed to offer a single, streamlined architecture supporting generic, open-vocabulary, referring, interactive, reasoning, and the novel Visual GrounDed (VGD) segmentation within a unified system (Wang et al., 6 Aug 2025).

1. Unified Multimodal Architecture

X-SAM is architected to combine visual and language modalities for integrated, holistic segmentation. Its core elements include:

  • Dual Encoder Streams: One image encoder (e.g., SigLIP2-so400m) extracts global image-level features (Z_v), while a segmentation encoder (e.g., SAM-L) extracts fine-grained features (Z_s) for precise mask delineation.
  • MLP Projectors: Each encoder's output is projected into the LLM's embedding space via lightweight MLP heads: H_v = W_i Z_v and H_s = W_s Z_s.
  • LLM Integration: Projected features are fed, together with text input, into an LLM using a unified input protocol: both text (in special tokens) and visual queries (e.g., region, box, points, masks via a <region> token) are formatted so that the LLM seamlessly fuses multimodal context.
  • Segmentation Connector and Decoder: The connector merges fine-grained, multi-scale features from the segmentation encoder (via patch-merge/expand and pixel shuffle operations) for input to a Mask2Former-style segmentation decoder. The decoder operates with mask query tokens, including an explicit <SEG> token reflecting the segmentation intent provided by the LLM.

This unified formatting and bridging of vision–language information enables X-SAM to support and tightly link diverse segmentation tasks, surpassing the single-task focus of classic SAM.
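
As a concrete illustration of the dual-encoder design, the following is a minimal PyTorch-style sketch of how global and fine-grained features could be projected and concatenated with text embeddings. The feature dimensions, the two-layer MLP projectors, and the nn.Identity encoder stand-ins are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the dual-encoder + MLP-projector fusion (assumed shapes).
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, img_dim=1152, seg_dim=1024, llm_dim=4096):
        super().__init__()
        # Stand-ins for the image encoder (global features Z_v)
        # and the segmentation encoder (fine-grained features Z_s).
        self.image_encoder = nn.Identity()   # placeholder for SigLIP2-so400m
        self.seg_encoder = nn.Identity()     # placeholder for SAM-L
        # MLP projectors into the LLM embedding space: H = W . Z
        self.proj_v = nn.Sequential(nn.Linear(img_dim, llm_dim), nn.GELU(),
                                    nn.Linear(llm_dim, llm_dim))
        self.proj_s = nn.Sequential(nn.Linear(seg_dim, llm_dim), nn.GELU(),
                                    nn.Linear(llm_dim, llm_dim))

    def forward(self, z_v, z_s, text_embeds):
        h_v = self.proj_v(self.image_encoder(z_v))   # (B, N_v, llm_dim)
        h_s = self.proj_s(self.seg_encoder(z_s))     # (B, N_s, llm_dim)
        # Unified input: visual tokens are prepended to the text tokens.
        return torch.cat([h_v, h_s, text_embeds], dim=1)

fusion = DualEncoderFusion()
z_v = torch.randn(1, 256, 1152)      # global image tokens
z_s = torch.randn(1, 1024, 1024)     # fine-grained segmentation tokens
text = torch.randn(1, 32, 4096)      # embedded instruction tokens
print(fusion(z_v, z_s, text).shape)  # torch.Size([1, 1312, 4096])
```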

2. Visual GrounDed (VGD) Segmentation

VGD segmentation is a new task proposed in X-SAM, formulated to segment all object instances using interactive visual prompts as guidance:

  • Interactive Prompting: Rather than solely relying on textual or prompt-limited queries, VGD segmentation uses points, scribbles, boxes, or mask regions as direct visual guidance. These are mapped into the model as “<region>” tokens within a unified input template.
  • Grounding Across Images: VGD can utilize visual cues from one image to guide segmentation in another, enabling cross-image grounded instance segmentation.
  • Pixel-Level Interpretability: By grounding segmentation in user-supplied visual prompts, VGD empowers the MLLM with explicit, interpretable pixel-wise reasoning—expanding segmentation capabilities to complex scenarios with many objects or ambiguous classes.

This task is designed to enrich the MLLM’s perceptual understanding for both human-guided and fully automated workflows.
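
To make the prompt protocol concrete, here is a minimal sketch of packing interactive visual prompts into the unified input template. The <p> and <region> token strings follow the protocol described in this article, but the helper functions, the box-to-embedding projector, and all dimensions are hypothetical.

```python
# Illustrative packing of box prompts into <region> placeholders.
import torch

def build_vgd_prompt(num_visual_prompts: int) -> str:
    """Builds a text template with one <region> placeholder per visual prompt."""
    regions = "".join("<p><region></p>" for _ in range(num_visual_prompts))
    return f"Segment all instances matching the given visual prompts: {regions}"

def embed_visual_prompts(boxes: torch.Tensor, llm_dim: int = 4096) -> torch.Tensor:
    """Maps box prompts (x1, y1, x2, y2) to <region> embeddings (untrained, for illustration)."""
    proj = torch.nn.Linear(4, llm_dim)         # hypothetical prompt projector
    return proj(boxes)                         # (num_prompts, llm_dim)

boxes = torch.tensor([[0.10, 0.20, 0.45, 0.60],
                      [0.50, 0.15, 0.90, 0.70]])
print(build_vgd_prompt(len(boxes)))
print(embed_visual_prompts(boxes).shape)       # torch.Size([2, 4096])
```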

3. Multi-Stage Unified Training Regime

X-SAM is trained using a multi-stage, multi-source co-training procedure that achieves high efficiency and generalization:

  • Stage 1: Segmentor Fine-Tuning. The segmentation encoder and decoder are first tuned on classical segmentation datasets (e.g., COCO Panoptic), optimizing a composite loss:

\mathcal{L}_{seg} = \mathcal{L}_{cls} + \mathcal{L}_{mask} + \mathcal{L}_{dice}

to refine multi-class, panoptic mask prediction.
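
As a rough illustration, here is a minimal sketch of this composite objective, assuming query-to-target matching has already been done; the real training uses Mask2Former-style Hungarian matching, sampled mask points, and per-term weights that are omitted here.

```python
# Sketch of L_seg = L_cls + L_mask + L_dice on already-matched query/target pairs.
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target_masks, eps=1.0):
    pred = pred_logits.sigmoid().flatten(1)
    tgt = target_masks.flatten(1)
    inter = (pred * tgt).sum(-1)
    union = pred.sum(-1) + tgt.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(cls_logits, cls_targets, mask_logits, mask_targets):
    l_cls = F.cross_entropy(cls_logits, cls_targets)                        # L_cls
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # L_mask
    l_dice = dice_loss(mask_logits, mask_targets)                           # L_dice
    return l_cls + l_mask + l_dice

cls_logits = torch.randn(8, 134)            # 8 matched queries, 133 classes + "no object"
cls_targets = torch.randint(0, 134, (8,))
mask_logits = torch.randn(8, 64, 64)
mask_targets = (torch.rand(8, 64, 64) > 0.5).float()
print(segmentation_loss(cls_logits, cls_targets, mask_logits, mask_targets))
```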

  • Stage 2: Vision–Language Alignment. Dual MLP projectors are tuned to align the projected visual features with pre-trained LLM word embeddings, using an auto-regressive loss:

\mathcal{L}_{regressive} = -\sum_{i=1}^{N} \log p_\theta\left(Y_q^{[P+i]} \mid Y_q^{[:i-1]}, X_q^{[:i-1]}\right)

where X_q is the instruction, Y_q the generated output, and P the length of the instruction prefix.
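
A minimal sketch of this objective follows, assuming a standard next-token cross-entropy in which only positions after the P instruction tokens contribute; batch size, sequence length, and vocabulary size are arbitrary.

```python
# Sketch of the auto-regressive loss over answer tokens only.
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, tokens, prompt_len):
    """logits: (B, T, V) next-token predictions; tokens: (B, T) input ids."""
    # Shift so position t predicts token t+1, then mask out the instruction X_q.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    tgt = tokens[:, 1:].reshape(-1)
    keep = torch.arange(tokens.size(1) - 1) >= (prompt_len - 1)   # answer positions only
    keep = keep.unsqueeze(0).expand(tokens.size(0), -1).reshape(-1)
    return F.cross_entropy(pred[keep], tgt[keep])

B, T, V, P = 2, 48, 32000, 16
logits = torch.randn(B, T, V)
tokens = torch.randint(0, V, (B, T))
print(autoregressive_loss(logits, tokens, prompt_len=P))
```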

  • Stage 3: Mixed Fine-Tuning. The model is co-trained across a heterogeneous mixture of segmentation tasks and image conversation datasets, with a total loss:

\mathcal{L}_{total} = \mathcal{L}_{regressive} + \mathcal{L}_{seg}

for segmentation tasks (only \mathcal{L}_{regressive} for dialog tasks), enforcing cross-task feature-sharing and robustness.

This unified protocol allows for effective co-training on diverse data sources—crucial for any-segmentation generalizability.
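
A minimal sketch of how this mixed objective could be routed per sample, reusing the illustrative loss functions above; the has_masks flag and the batch structure are assumptions for illustration.

```python
# Sketch of the Stage-3 per-sample loss routing.
import torch

def mixed_training_loss(batch, l_regressive, l_seg=None):
    """L_total = L_regressive + L_seg for segmentation samples,
    L_regressive alone for image-conversation samples."""
    if batch["has_masks"]:
        assert l_seg is not None, "segmentation samples need mask supervision"
        return l_regressive + l_seg
    return l_regressive

# Dialog-only sample: only the auto-regressive term contributes.
print(mixed_training_loss({"has_masks": False}, l_regressive=torch.tensor(2.3)))
# Segmentation sample: both terms are summed.
print(mixed_training_loss({"has_masks": True},
                          l_regressive=torch.tensor(2.3), l_seg=torch.tensor(1.1)))
```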

4. Comprehensive Segmentation Capabilities

X-SAM is designed for maximal segmentation diversity in a single model, supported by:

  • Unified Input and Output Protocol: Both instructions and prompts, whether textual (“<p>segment the [object]</p>”) or visual (“<region>”), are embedded in a consistent format. The output mask is marked by a special <SEG> token, enabling the LLM to explicitly trigger segmentation prediction (see the sketch after this list).
  • Scalable Prompt Handling: Visual queries span points, scribbles, boxes, and regions; text queries range from simple categories to complex, multi-step instructions or referring expressions.
  • Latent Background Embedding: For tasks where objects must be detected against “ignore” categories, an explicit background embedding is used, harmonizing semantic, panoptic, open-vocabulary, and interactive segmentation.
  • Cross-task Generalization: By integrating all cues and supervision types under this architecture, X-SAM unifies formerly siloed segmentation pipelines.
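
To illustrate how a <SEG> token could trigger mask prediction, the following sketch collects the LLM hidden states at <SEG> positions and passes them to a stand-in decoder; the token id, the decoder stub, and all shapes are hypothetical.

```python
# Sketch of routing <SEG> hidden states to a mask decoder.
import torch
import torch.nn as nn

SEG_TOKEN_ID = 32001                      # hypothetical id of the <SEG> token

class MaskDecoderStub(nn.Module):
    """Stand-in for the Mask2Former-style decoder driven by <SEG> queries."""
    def __init__(self, llm_dim=4096, mask_hw=64):
        super().__init__()
        self.to_mask = nn.Linear(llm_dim, mask_hw * mask_hw)
        self.mask_hw = mask_hw

    def forward(self, seg_queries):        # (num_seg, llm_dim)
        masks = self.to_mask(seg_queries)
        return masks.view(-1, self.mask_hw, self.mask_hw)

def decode_segmentation(output_ids, hidden_states, decoder):
    """Collects hidden states at <SEG> positions and predicts one mask per token."""
    seg_positions = (output_ids == SEG_TOKEN_ID).nonzero(as_tuple=True)[0]
    if seg_positions.numel() == 0:
        return None                        # plain text answer, no masks requested
    return decoder(hidden_states[seg_positions])

ids = torch.tensor([101, 2023, SEG_TOKEN_ID, 2003, SEG_TOKEN_ID])
hidden = torch.randn(5, 4096)
print(decode_segmentation(ids, hidden, MaskDecoderStub()).shape)  # torch.Size([2, 64, 64])
```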

5. Experimental Performance and Benchmarks

Benchmarking demonstrates that X-SAM delivers state-of-the-art results across more than 20 segmentation datasets, encompassing:

| Task Type | Metric Improvement | Representative Datasets |
|---|---|---|
| Panoptic/Instance/Semantic Segmentation | mIoU, PQ, AP on par with or superior to task-specific models | COCO Panoptic, ADE20K |
| Open-Vocabulary and Referring Segmentation | cIoU, gIoU several points above leading MLLMs | RefCOCO |
| GCG and VGD Segmentation | AP gains of over 45% across prompt modalities | Newly introduced benchmarks |

This unified approach yields not only strong quantitative performance (including large AP improvements on VGD segmentation) but also fine-grained segmentation quality, particularly in multimodal and interactive contexts.

6. Technical Details

Key details enabling X-SAM’s extensibility and reproducibility:

  • Architecture: The segmentation decoder is inspired by Mask2Former, with multi-scale features and latent mask tokens. Patch-merge/expand (pixel shuffle) operations in the connector provide scale flexibility (see the sketch after this list).
  • Input Specification: Text and vision queries are tokenized into the LLM input, e.g.:
    • Text: “<p>segment all apples</p>”
    • Visual Region: “<p><region></p>”
    • Segmentation Output: “<SEG>”
  • Loss Functions: Multi-term (classification, mask, dice, autoregression) to enable multi-task co-optimization.
  • Implementation: Runs on high-performance infrastructure (16 × A100 GPUs for large dataset experiments); code is open-sourced at https://github.com/wanghao9610/X-SAM, with dependencies outlined in the XTuner codebase.
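
A minimal sketch of the patch-merge/expand idea referenced above, using pixel (un)shuffle to build a small multi-scale pyramid; channel counts and the 1×1 projections are illustrative assumptions rather than the paper's exact connector design.

```python
# Sketch of a pixel-shuffle-based connector producing multi-scale features.
import torch
import torch.nn as nn

class SegmentationConnectorStub(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        # Patch merge: space-to-depth halves resolution (x4 channels), then project back.
        self.merge = nn.Sequential(nn.PixelUnshuffle(2), nn.Conv2d(in_ch * 4, in_ch, 1))
        # Patch expand: depth-to-space doubles resolution (/4 channels), then project back.
        self.expand = nn.Sequential(nn.PixelShuffle(2), nn.Conv2d(in_ch // 4, in_ch, 1))

    def forward(self, feat):               # feat: (B, C, H, W) from the segmentation encoder
        coarse = self.merge(feat)          # (B, C, H/2, W/2)
        fine = self.expand(feat)           # (B, C, 2H, 2W)
        return [fine, feat, coarse]        # multi-scale pyramid for the mask decoder

connector = SegmentationConnectorStub()
feat = torch.randn(1, 256, 64, 64)
print([f.shape for f in connector(feat)])
# [torch.Size([1, 256, 128, 128]), torch.Size([1, 256, 64, 64]), torch.Size([1, 256, 32, 32])]
```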

7. Significance, Limitations, and Future Directions

X-SAM represents a transition from “segment anything”—with SAM’s task- and prompt-limited focus—to “any segmentation” within a single architecture, accommodating the breadth of multimodal segmentation demands (Wang et al., 6 Aug 2025). The introduction of VGD tasks expands the landscape for pixel-level, prompt-grounded interpretability.

The architecture’s modularity and efficient training regimen suggest extensibility to additional modalities and larger scale; future extensions (e.g., temporal/video segmentation, deeper dialogue integration, clinical applications) could build on the unified framework and codebase.

There remain plausible limitations: reliance on large-scale compute for optimal performance, and the architectural assumption that all segmentation types are bridgeable via a unified connector/LLM protocol. A plausible implication is that, as segmentation tasks and prompts become increasingly heterogeneous, further development of the connector and input format may be needed for continued scalability.

In summary, X-SAM provides a unified, scalable platform for multimodal and interactive image segmentation, establishing new baselines for pixel-level visual understanding and flexible integration of language and vision.
