
Grounded-SAM: Open-World Text-to-Mask Model

Updated 25 February 2026
  • Grounded-SAM is a modular assembly of Grounding DINO and SAM, enabling direct text-to-mask segmentation with robust zero-shot generalization.
  • It leverages Grounding DINO to generate bounding boxes from natural language prompts, followed by SAM's high-resolution mask decoding without additional fine-tuning.
  • Validated across open-world tasks, GSAM improves pseudo-labeling, dense annotation, and prompt-driven image editing, bridging the label-efficiency gap.

Grounded-SAM is a model assembly paradigm integrating open-vocabulary grounding and segmentation capabilities by composing Grounding DINO, an open-set language-driven object detector, and the Segment Anything Model (SAM), a promptable segmentation network based on Vision Transformers. This architecture is designed to realize direct text-to-mask functionality, enabling detection and segmentation of arbitrary user-specified objects from natural-language prompts with robust zero-shot generalization. Grounded-SAM has been validated for open-world segmentation, data-frugal self-supervised learning, and as a modular core for diverse downstream tasks, including automatic dense annotation, controllable image editing, and prompt-based 3D pose reconstruction (Ren et al., 2024, Pijarowski et al., 2024).

1. Architectural Composition and Text-to-Mask Pipeline

Grounded-SAM explicitly composes two state-of-the-art open-world models:

  • Grounding DINO: Accepts an image I and natural-language text T (e.g., “soldier”), generating a set {(b_i, p_i)} of object bounding boxes b_i and associated confidence scores p_i via cross-modal attention between image and text features.
  • Segment Anything Model (SAM): Receives I and prompt(s) (points, boxes, or polygons), then outputs high-resolution segmentation mask(s) m_i. The mask decoder combines a ViT image tokenization with the prompt embedding e_prompt for each region.

The operational pipeline is strictly modular:

  1. The text prompt T is processed by Grounding DINO to yield up to N regions b_i with confidences p_i.
  2. Proposals are filtered by p_i ≥ t_c (with t_c typically in [0.15, 0.3]); at least one candidate per image is guaranteed by falling back to argmax_i p_i.
  3. Each b_i is transformed into a box prompt and fed, together with I, to SAM, yielding segmentation masks M_i.
  4. The selected M_i (highest-confidence or union) serve as output or as “pseudo-labels” for downstream model training.

Notably, both Grounding DINO and SAM are used in their frozen, pre-trained state; no parameter finetuning is performed during assembly. Adaptation for new scenarios is limited to prompt selection and confidence thresholding (Ren et al., 2024, Pijarowski et al., 2024).
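The four pipeline steps above can be sketched as follows. Here `detector` and `segmenter` are stand-in callables for the frozen Grounding DINO and SAM models; their names and signatures are illustrative, not the real project APIs:

```python
import numpy as np

def grounded_sam_pipeline(image, prompt, detector, segmenter, t_c=0.25):
    """Sketch of the Grounded-SAM assembly: text -> boxes -> masks."""
    boxes, scores = detector(image, prompt)                 # step 1: grounding
    keep = [i for i, p in enumerate(scores) if p >= t_c]    # step 2: filter
    if not keep:                                            # fallback: best box
        keep = [int(np.argmax(scores))]
    masks = [segmenter(image, box=boxes[i]) for i in keep]  # step 3: box prompts
    best = max(range(len(keep)), key=lambda j: scores[keep[j]])
    return masks[best], [scores[i] for i in keep]           # step 4: select

# Toy stand-ins: one 4x4 image, two candidate boxes (x0, y0, x1, y1).
def toy_detector(image, prompt):
    return [np.array([0, 0, 2, 2]), np.array([1, 1, 3, 3])], [0.1, 0.6]

def toy_segmenter(image, box):
    mask = np.zeros(image.shape[:2], dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True          # box interior as a crude mask
    return mask

mask, kept_scores = grounded_sam_pipeline(np.zeros((4, 4)), "soldier",
                                          toy_detector, toy_segmenter)
```

With t_c = 0.25, only the second box survives the filter, so the returned mask covers its 2×2 interior.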

2. Mathematical Formalisms and Evaluation

Grounded-SAM’s architecture and learning are analytically characterized as follows:

  • Detection loss (Grounding DINO):

\mathrm{cost}(i,j) = -\log p_i(c_j) + \lambda_1 \|b_i - g_j\|_1 + \lambda_2 \,(1 - \mathrm{IoU}(b_i, g_j))

\mathcal{L}_\mathrm{det} = \sum_{j=1}^{N} \bigl[ \mathrm{FocalLoss}(p_{\sigma(j)}, c_j) + \lambda_1 \|b_{\sigma(j)} - g_j\|_1 + \lambda_2 (1 - \mathrm{IoU}(b_{\sigma(j)}, g_j)) \bigr]

  • Mask loss (SAM):

\mathcal{L}_\mathrm{mask}(m_i, M_i^*) = \mathrm{BCE}(m_i, M_i^*) + \alpha\,\mathrm{Dice}(m_i, M_i^*)

\mathcal{L}_\mathrm{pseudo}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(p_i(\theta), \hat{y}_i\bigr)

where \hat{y}_i are GSAM-derived masks, p_i(\theta) is the foreground probability from the downstream detector, and \ell is the pixel-wise cross-entropy.
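The mask loss above admits a minimal NumPy rendering; the α weight and the ε smoothing constant are illustrative choices, not values fixed by the papers:

```python
import numpy as np

def bce_dice_loss(pred, target, alpha=1.0, eps=1e-7):
    """L_mask = BCE(m, M*) + alpha * Dice(m, M*).

    pred: per-pixel foreground probabilities; target: binary mask
    (ground truth or pseudo-label). alpha and eps are illustrative.
    """
    pred = np.clip(pred, eps, 1 - eps)          # avoid log(0)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    dice = 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + alpha * dice

# A perfect prediction should score near zero; an inverted one, much higher.
perfect = bce_dice_loss(np.array([1.0, 1.0, 0.0]), np.array([1, 1, 0]))
wrong = bce_dice_loss(np.array([0.0, 0.0, 1.0]), np.array([1, 1, 0]))
```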

  • Evaluation (e.g., Segmentation in the Wild, CPD1K):
    • Mean Average Precision (mAP) over open-vocabulary classes
    • Weighted F-measure F^w_β (with β² = 1)
    • Mean Absolute Error (MAE), S-measure (S), E-measure (E_φ)
    • False-Positive Rate (FPR), True-Negative Rate (TNR) on background scenes

GSAM achieves competitive performance, e.g., 48.7 mAP on SGinW (Grounding DINO-base + SAM-huge) (Ren et al., 2024); for data-efficient camouflaged human detection, optimized pseudo-labeling allows the downstream HitNet model, fine-tuned on only 30 images, to match “30-shot” supervised baselines and come within 10% of full training on CPD1K (Pijarowski et al., 2024).
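The pairwise matching cost above can be made concrete with a small sketch. The λ weights and the brute-force matcher are illustrative: DETR-style detectors such as Grounding DINO solve the proposal-to-ground-truth assignment with the Hungarian algorithm, which the permutation search below only approximates for tiny N:

```python
import numpy as np
from itertools import permutations

def iou(b1, b2):
    """IoU of two [x0, y0, x1, y1] boxes."""
    x0, y0 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x1, y1 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def match_cost(p_cls, box, gt, lam1=5.0, lam2=2.0):
    """cost(i,j) = -log p_i(c_j) + lam1*|b_i - g_j|_1 + lam2*(1 - IoU)."""
    return (-np.log(p_cls)
            + lam1 * np.abs(np.array(box) - np.array(gt)).sum()
            + lam2 * (1 - iou(box, gt)))

def best_assignment(cost):
    """Optimal permutation by brute force (fine for a sketch with tiny N)."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))

# Two proposals that exactly match the two ground-truth boxes, but swapped.
boxes = [[0, 0, 1, 1], [2, 2, 3, 3]]
gts = [[2, 2, 3, 3], [0, 0, 1, 1]]
cost = [[match_cost(0.9, b, g) for g in gts] for b in boxes]
assign = best_assignment(cost)   # proposal 0 -> gt 1, proposal 1 -> gt 0
```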

3. Pseudo-Labeling, Self-Supervised Frugal Learning, and Post-Processing

Pseudo-labeling in Grounded-SAM is central for label-efficient downstream transfer:

  • GSAM Pseudo-Labeling Algorithm:

Input: image I, text prompt T, threshold t_c
B, p = GroundingDINO(I, prompt=T)
keep = {i | p_i >= t_c}; if keep is empty: keep = {argmax_i p_i}
Masks = [SAM(I, box=B_i) for i in keep]
Output: choose highest-confidence M* or union(Masks)

  • Confidence Thresholding: Only masks above detector confidence t_c are retained.
  • Loss Re-weighting: Each pseudo-label loss is scaled by the normalized confidence p'_i: L'_i = p'_i L_i.
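The re-weighting step can be sketched as follows; normalizing each p_i by the batch maximum is an assumption for illustration, since the exact normalization scheme is not pinned down here:

```python
import numpy as np

def reweighted_pseudo_loss(losses, confidences):
    """L'_i = p'_i * L_i, with p'_i = p_i / max_j p_j.

    The max-normalization is an illustrative choice; the source text only
    states that losses are scaled by a normalized detector confidence.
    """
    p = np.asarray(confidences, dtype=float)
    p_norm = p / p.max()                     # most confident sample -> weight 1
    return float(np.mean(p_norm * np.asarray(losses, dtype=float)))

# Two equal per-sample losses: the half-confidence sample contributes half.
loss = reweighted_pseudo_loss([1.0, 1.0], [0.5, 0.25])
```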

Comparative experiments on various pseudo-label strategies (GSAM, MAT-inpainting error, HitNet-animal) establish that GSAM supplies higher-quality, high-recall pseudo-labels for previously unseen classes (e.g., camouflaged humans) than both inpainting-based and naive class transfer approaches.

Optimized GSAM pseudo-labels (with selection/filtering and reweighting) allow transformer-based COD architectures (e.g., HitNet) to close the performance gap to fully supervised learning using minimal or zero manual mask annotation (Pijarowski et al., 2024).

4. Downstream Applications and Extensions

The modular design of Grounded-SAM enables rapid adaptation for a variety of open-world vision tasks:

  • Automatic Dense Annotation: Integration with image tagging (e.g., Recognize Anything) or BLIP image captioning allows automatic extraction of either coarse tags or fine-grained noun phrases, which are then grounded and segmented, producing dense instance-level annotations without human labeling.
  • Promptable Image Editing: By localizing target objects (e.g., “the dog’s fur”) through Grounded-SAM, instance masks m_{edit, T'} are provided to inpainting models such as Stable Diffusion, supporting precise, text-driven image manipulation.
  • Prompt-based 3D Human Analysis: Cropping the localized detected region and passing it to models like OSX enables instance-specific, promptable full-body 3D mesh reconstruction and motion capture.

A summary of core application modules is given below:

| Application Area | Companion Model | Downstream Utility |
|---|---|---|
| Dense annotation | BLIP, Recognize Anything | Rapid mask generation per user/entity |
| Image editing | Stable Diffusion | Precise inpainting guided by grounded mask |
| 3D human motion analysis | OSX | Prompt-to-mesh for user-referred individuals |
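The dense-annotation composition described above can be sketched with stubbed components; `tagger`, `grounder`, and `segmenter` are hypothetical stand-ins for Recognize Anything, Grounding DINO, and SAM, and none of the names reflect the real APIs:

```python
def auto_dense_annotate(image, tagger, grounder, segmenter):
    """Automatic dense annotation: tag -> ground -> segment."""
    annotations = []
    for tag in tagger(image):                    # coarse tags or noun phrases
        for box, score in grounder(image, tag):  # ground each tag to boxes
            annotations.append({"label": tag,
                                "box": box,
                                "mask": segmenter(image, box),
                                "score": score})
    return annotations

# Toy stand-ins exercising the composition: one tag grounds, one does not.
anns = auto_dense_annotate(
    image=None,
    tagger=lambda img: ["dog", "ball"],
    grounder=lambda img, t: [((0, 0, 1, 1), 0.9)] if t == "dog" else [],
    segmenter=lambda img, box: [[1]],
)
```

Ungrounded tags simply produce no annotations, so the output contains only instances that survived both the grounding and segmentation stages.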

5. Benchmarking, Ablation Studies, and Performance Insights

Comprehensive benchmarking on open-vocabulary segmentation and camouflaged object detection supports several key conclusions:

  • Open-vocabulary segmentation (SGinW):
    • Grounded-SAM with Grounding DINO-base + SAM-Huge: 48.7 mAP zero-shot, outperforming prior unified approaches.
    • Upgrading to HQ-SAM: 49.6 mAP.
    • Prompting SAM with DINO-derived boxes yields masks of significantly higher quality for open-set phrases than alternative SAM prompts (points, scribbles).
  • Frugal Self-supervised COD (CPD1K, HitNet):
    • Baseline: zero-shot F^w_β = 0.564; fully supervised (k = 600): F^w_β = 0.828.
    • GSAM zero-shot: F^w_β = 0.722; fine-tuned on 30 GSAM pseudo-labels: F^w_β = 0.748.
    • GSAM pseudo-labels cleanly bridge the label-efficiency gap with 30-shot supervised approaches (Pijarowski et al., 2024).
  • Module ablations reveal:
    • Using SAM-Large in place of SAM-Huge degrades mAP by 2.7.
    • Removing the text encoder from Grounding DINO reduces mAP by >5.
    • Using suboptimal SAM prompts causes up to a 10 mAP loss, establishing box-prompting as critical for segmentation precision (Ren et al., 2024).

A notable limitation: direct GSAM pseudo-labels hallucinate foregrounds on empty images (FPR = 0.68), but frugal downstream finetuning corrects the background prior (FPR = 0.02 after HitNet adaptation).

6. Limitations, Systematic Weaknesses, and Comparative Analysis

Although Grounded-SAM exhibits compelling zero-shot and frugal learning performance, it manifests structural and practical limitations:

  • Over-segmentation and Hard Label-Mask Assignment: As a prompt-driven pipeline, Grounded-SAM binds each DINO proposal box (and its corresponding segmentation mask) to a single class or phrase, and cannot refine instances when boundary coherence is poor or when multiple fragments are produced.
  • No End-to-End Optimization: Model assembly uses completely frozen weights, limiting adaptation to new domains or granularity beyond what DINO/SAM were pre-trained for.
  • Computational Cost: Each mask requires a full forward pass through SAM’s decoder; use of dense or coarse prompts can induce significant latency.

Compared to advanced mask-injection approaches (e.g., SAM-MI (Chen et al., 25 Nov 2025)), Grounded-SAM remains susceptible to SAM’s over-segmentation and rigid class-assignment weaknesses. SAM-MI introduces sparse point prompting (reducing decoder calls by 96%), shallow mask aggregation, and decoupled mask injection, achieving a 16.7% relative mIoU improvement and a 1.6× speedup over Grounded-SAM on open-vocabulary benchmarks. This suggests that “mask injection” paradigms address critical structural weaknesses inherent to the prompt-driven box+SAM model assembly strategy (Chen et al., 25 Nov 2025).

7. Research Impact and Prospective Directions

Grounded-SAM establishes a foundational paradigm for open-world vision by demonstrating that modular assembly of strong pretrained detectors and segmentation models enables broad, zero-shot text-to-mask capability with robust performance on real-world and open-vocabulary challenges (Ren et al., 2024, Pijarowski et al., 2024). Its success in frugal self-supervised learning and annotation automation has spurred follow-up studies focusing on:

  • Efficient Prompting and Aggregation: Mask aggregation (e.g., to correct over-segmentation) and sparse adaptive prompting (e.g., TSPP in SAM-MI) are active areas for reducing computational load and improving mask quality.
  • Soft Mask-Label Integration: Moving beyond 1:1 mask-label assignment via decoupled mask fusion or injection into segmentation heads.
  • Domain Specialization: While generalist, the architecture’s performance on highly non-canonical images (e.g., camouflaged, highly occluded, or compositionally complex objects) is contingent on effective pseudo-label filtering and postprocessing.

A plausible implication is that future open-world segmentation systems will increasingly favor tightly integrated, end-to-end trainable modules with mask-injection mechanisms, rather than modular prompt-driven pipelines, particularly as annotation and computational efficiency become central constraints (Chen et al., 25 Nov 2025).
