
Grounded SAM: Open-Vocab Vision Pipeline

Updated 19 January 2026
  • Grounded SAM is an integrated vision pipeline combining open-vocabulary detection (Grounding DINO) with promptable segmentation (SAM) to enable flexible image analysis.
  • It achieves state-of-the-art zero-shot segmentation performance and modular plug-and-play capabilities for applications like automated annotation, image editing, and 3D human motion analysis.
  • The framework supports customizable task pipelines without monolithic retraining, though challenges include detection bias and computational latency.

Grounded SAM is an integrated vision pipeline that enables open-vocabulary detection and segmentation of image regions conditioned on arbitrary text inputs. By assembling Grounding DINO, a text-driven open-set object detector, with the Segment Anything Model (SAM), a promptable segmentation backbone, Grounded SAM performs diverse visual tasks without monolithic retraining, providing modularity and extensibility for applications such as automated annotation, controllable image editing, and promptable 3D human motion analysis. The assembled architecture achieves state-of-the-art zero-shot performance on segmentation benchmarks, highlighting its utility as a plug-and-play assembly for open-world visual modeling (Ren et al., 2024).

1. Component Architecture and Workflow

Grounded SAM operates as a composition rather than a single end-to-end network. The main components are:

  • Grounding DINO (Liu et al., 2023): An open-set object detector that accepts an image $I \in \mathbb{R}^{H \times W \times 3}$ and a text prompt $T$, and outputs $N$ bounding boxes $\mathcal{B} = \{(b_i, s_i)\}_{i=1}^{N}$, where $b_i$ denotes spatial coordinates and $s_i$ is a grounding score.
  • Segment Anything Model (SAM) (Kirillov et al., 2023): For each detected box $b_i$, SAM predicts a binary mask $M_i \in \{0,1\}^{H \times W}$, with the box serving as the prompt.

The process is illustrated as follows:

Input: (I, T)
    ├─→ [Grounding DINO] ──→ Boxes {(b_i, s_i)}
    │
    └─→ For each b_i:
         └─→ [SAM mask-predictor ◂ box b_i] ──→ Mask M_i

Pseudocode:

def GroundedSAM(I, T):
    # Stage 1: open-set detection with Grounding DINO -> boxes and grounding scores
    Boxes = GroundingDINO.detect(I, T)  # ℬ = {(b_i, s_i)}
    Masks = []
    # Stage 2: each detected box prompts SAM for a binary instance mask
    for (b, s) in Boxes:
        M = SAM.predict_mask(I, box_prompt=b)
        Masks.append((b, s, M))
    return Masks

This architecture supports modular plug-ins and expert components for extended pipelines (e.g., BLIP, RAM, Stable-Diffusion, OSX).
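
As a concrete illustration, the following Python sketch wires the two stages together using the public segment_anything package; the detect_boxes helper and the checkpoint path are placeholders standing in for a Grounding DINO wrapper (assumptions, not code from the source), while the SamPredictor calls follow the published SAM interface.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def grounded_sam(image_rgb: np.ndarray, text_prompt: str, detect_boxes, sam_checkpoint: str):
    """Cascade: open-set detection -> box-prompted segmentation.

    detect_boxes(image, text) is assumed to wrap Grounding DINO and return
    a list of (box_xyxy, score) pairs in absolute pixel coordinates.
    """
    # Load SAM once and bind it to the image (computes the image embedding).
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)

    results = []
    for box_xyxy, score in detect_boxes(image_rgb, text_prompt):
        # SAM accepts a box prompt as a length-4 array in xyxy format.
        masks, _, _ = predictor.predict(
            box=np.asarray(box_xyxy),
            multimask_output=False,  # one mask per box prompt
        )
        results.append((box_xyxy, score, masks[0]))  # masks[0] is an HxW boolean mask
    return results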

2. Mathematical Formulation

The detection and segmentation stages are mathematically defined as follows:

Open-set Detection (Grounding DINO):

  • Text embedding: $E_{\text{text}}(T) \in \mathbb{R}^{d}$
  • Visual features: $E_{\text{img}}(I) \in \mathbb{R}^{H' \times W' \times C}$
  • Learned object queries $\{q_j\}_{j=1}^{N_q}$ attend to $(E_{\text{img}}, E_{\text{text}})$.
  • Each query $q_i$ predicts a classification score $\hat p_i$ and a bounding box $\hat b_i$.
  • Grounding score:

$$s_i = \sigma\bigl(f_{\text{cls}}(q_i, E_{\text{img}}, E_{\text{text}})\bigr)$$

where $\sigma$ denotes the sigmoid function.

  • Training loss:

$$\mathcal{L}_{\text{det}} = \sum_{i=1}^{N_q}\Bigl[\, \mathcal{L}_{\text{cls}}(\hat p_i, p_i^*) + \mathbb{1}_{\{p_i^*>0\}}\bigl( \lambda_1 \|\hat b_i - b_i^*\|_1 + \lambda_2\,\mathcal{L}_{\mathrm{GIoU}}(\hat b_i, b_i^*) \bigr) \Bigr]$$
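
To make the per-query loss concrete, here is a minimal PyTorch sketch of the bracketed term for already-matched query/target pairs. It assumes boxes in xyxy format, binary cross-entropy for the classification term, DETR-style default weights, and torchvision's generalized_box_iou for the GIoU loss; these choices are illustrative, not taken from the source.

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def detection_loss(pred_scores, pred_boxes, gt_labels, gt_boxes, lam1=5.0, lam2=2.0):
    """Sketch of L_det for matched query/target pairs.

    pred_scores: (N_q,) raw logits; gt_labels: (N_q,) in {0, 1};
    pred_boxes, gt_boxes: (N_q, 4) in xyxy pixel coordinates.
    """
    # Classification term: binary cross-entropy on the grounding score.
    cls_loss = F.binary_cross_entropy_with_logits(
        pred_scores, gt_labels.float(), reduction="none"
    )

    # Box terms apply only to positive (matched) queries: the 1{p* > 0} indicator.
    pos = gt_labels > 0
    l1 = (pred_boxes - gt_boxes).abs().sum(dim=-1)               # ||b_hat - b*||_1
    giou = generalized_box_iou(pred_boxes, gt_boxes).diagonal()  # matched pairs only
    giou_loss = 1.0 - giou                                       # standard GIoU loss

    box_loss = torch.where(pos, lam1 * l1 + lam2 * giou_loss, torch.zeros_like(l1))
    return (cls_loss + box_loss).sum()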

Promptable Segmentation (SAM):

  • Segmentation mask predicted for each box:

$$\hat M = \sigma\bigl(f_{\text{mask}}(E_{\text{img}}, b)\bigr)$$

  • Mask loss:

$$\mathcal{L}_{\text{mask}} = -\frac{1}{HW}\sum_{u,v}\bigl[ M^*_{u,v}\log\hat M_{u,v} + (1 - M^*_{u,v})\log(1-\hat M_{u,v}) \bigr]$$
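
As a numerical check of this formula, the short NumPy sketch below evaluates the per-pixel binary cross-entropy between a predicted probability map and a ground-truth mask; the array values and the clipping constant are illustrative assumptions.

import numpy as np

def mask_bce_loss(pred_probs: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-7) -> float:
    """L_mask: mean binary cross-entropy over all H*W pixels.

    pred_probs: (H, W) probabilities in [0, 1] (after the sigmoid);
    gt_mask:    (H, W) binary ground-truth mask M*.
    """
    p = np.clip(pred_probs, eps, 1.0 - eps)  # avoid log(0)
    bce = -(gt_mask * np.log(p) + (1.0 - gt_mask) * np.log(1.0 - p))
    return float(bce.mean())

# Example: a 2x2 mask predicted nearly correctly yields a small loss.
gt = np.array([[1, 0], [0, 1]], dtype=float)
pred = np.array([[0.9, 0.1], [0.2, 0.8]], dtype=float)
print(mask_bce_loss(pred, gt))  # ≈ 0.16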

Overall Objective: Within Grounded SAM no joint fine-tuning is performed; both models are used off-the-shelf. If the cascade were instead trained end-to-end, the combined objective would be:

$$\mathcal{L} = \underbrace{\mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{bbox}}}_{\text{detection}} + \underbrace{\mathcal{L}_{\text{mask}}}_{\text{segmentation}}$$

3. Task Pipelines and Representative Use Cases

Grounded SAM serves as a foundation for a variety of task-specific pipelines:

3.1 Automatic Dense Annotation

  • BLIP-Grounded-SAM: a BLIP captioner produces a caption $C$ from the image $I$; each noun phrase in $C$ prompts Grounded SAM, yielding $(b_i, M_i)$ pairs for object-instance annotation.
  • RAM-Grounded-SAM: the Recognize Anything Model (RAM) generates tags $\{t_k\}$; each tag serves as a prompt for instance mask generation.

Example:

Input                  | Output
I: cow/rainbow image   | Box: (100, 30, 320, 260); Mask: cow
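
A sketch of this tag-driven annotation loop follows; generate_tags and grounded_sam are stand-ins for a tagging model (e.g., RAM, or BLIP noun phrases) and the cascade sketched in Section 1, and the dictionary output format is an illustrative choice rather than the released tooling.

import numpy as np

def auto_annotate(image_rgb: np.ndarray, generate_tags, grounded_sam):
    """Tag-driven dense annotation: tags -> boxes -> masks.

    generate_tags(image) -> list of tag strings (e.g., from RAM)
    grounded_sam(image, text) -> list of (box_xyxy, score, mask)
    """
    annotations = []
    for tag in generate_tags(image_rgb):
        for box, score, mask in grounded_sam(image_rgb, tag):
            annotations.append({
                "label": tag,
                "score": float(score),
                "bbox_xyxy": [float(v) for v in box],
                "mask": mask.astype(np.uint8),  # HxW binary mask
            })
    return annotations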

3.2 Controllable Image Editing

The user supplies a prompt $T$ (e.g., "remove dog"). Grounded SAM produces the relevant mask $M_i$, which is fed into Stable Diffusion inpainting for content manipulation.
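
A minimal sketch of this editing step, assuming a binary mask has already been obtained from Grounded SAM and using the Hugging Face diffusers inpainting pipeline; the checkpoint id, file paths, and prompt are illustrative placeholders.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Inputs assumed available: the original image and a mask from Grounded SAM
# (white = region to edit), both as PIL images of the same size.
image = Image.open("scene.png").convert("RGB")
mask = Image.open("dog_mask.png").convert("L")

# Load a Stable Diffusion inpainting checkpoint (illustrative model id).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# Replace the masked region according to the edit prompt.
edited = pipe(
    prompt="grassy lawn, no dog",
    image=image,
    mask_image=mask,
).images[0]
edited.save("scene_edited.png")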

3.3 Promptable 3D Human Motion Analysis

A prompt $T$ (e.g., "person in pink shirt") yields an instance box and mask; the image is cropped to the box and passed to OSX mesh recovery to obtain parametric 3D body meshes.
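
A schematic sketch of this hand-off is shown below; run_osx is a hypothetical wrapper around the OSX mesh-recovery model (not an actual API from the source), and grounded_sam refers to the cascade sketched in Section 1.

import numpy as np

def mesh_from_prompt(image_rgb: np.ndarray, prompt: str, grounded_sam, run_osx):
    """Crop the highest-scoring detection and run mesh recovery on it.

    grounded_sam(image, text) -> list of (box_xyxy, score, mask)
    run_osx(crop) -> parametric 3D body mesh (hypothetical OSX wrapper)
    """
    detections = grounded_sam(image_rgb, prompt)
    if not detections:
        return None
    # Pick the detection with the highest grounding score.
    box, score, mask = max(detections, key=lambda d: d[1])
    x0, y0, x1, y1 = (int(v) for v in box)
    crop = image_rgb[y0:y1, x0:x1]
    # Optionally suppress background pixels using the instance mask.
    crop = crop * mask[y0:y1, x0:x1, None]
    return run_osx(crop)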

4. Performance on Open-Vocabulary Benchmarks

On the SegInW zero-shot segmentation benchmark spanning 25 datasets, Grounded SAM demonstrates strong results:

Method                                          | Mean SegInW
X-Decoder-L                                     | 32.2
OpenSeeD-L                                      | 36.7
ODISE-L                                         | 38.7
SAN-CLIP-ViT-L                                  | 41.4
UNINEXT-Huge                                    | 42.1
Grounded-SAM (Grounding-DINO-Base + SAM-Huge)   | 48.7
Grounded-HQ-SAM (Grounding-DINO-Base + HQ-SAM)  | 49.6

Grounded SAM (Grounding-DINO-Base + SAM-Huge) achieves a mean AP of 48.7, an improvement of more than 6 points over the next-best competitor (UNINEXT-Huge, 42.1). Substituting the HQ-SAM mask backbone raises performance further to 49.6.

5. Modularity, Extensibility, and Limitations

Strengths:

  • Open vocabulary: Directly localizes arbitrary text queries ("Gazania linearis," "Zale horrida") without fine-tuning.
  • Modularity: Each step (detection, segmentation, downstream application) is independently inspectable.
  • Plug-and-play: Easily replace or augment components (e.g., FastSAM for speed, LLM controllers for prompt routing).

Limitations:

  • Detection bias and coverage depend on pre-trained detectors; exotic or diminutive entities may elude detection.
  • Inference latency arises from cascading two large backbones; lighter alternatives (FastSAM, MobileSAM) can mitigate this.

A plausible implication is that modularity enables rapid assembly of tailored pipelines while maintaining interpretability of intermediate outputs, although limitations in coverage and computational cost persist.
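
To illustrate the plug-and-play point in code, the sketch below defines a minimal segmenter interface behind which SAM, MobileSAM, or FastSAM could be swapped; the Protocol and wrapper class are illustrative constructs, not part of the released Grounded SAM codebase.

from typing import Protocol, Sequence
import numpy as np

class BoxPromptedSegmenter(Protocol):
    """Anything that can turn a box prompt into a binary mask."""
    def segment(self, image_rgb: np.ndarray, box_xyxy: Sequence[float]) -> np.ndarray: ...

class SamSegmenter:
    """Wraps segment_anything's SamPredictor behind the interface above."""
    def __init__(self, predictor):
        self.predictor = predictor  # an initialized SamPredictor

    def segment(self, image_rgb, box_xyxy):
        # Re-embeds the image on every call; a production version would
        # cache the embedding per image before looping over boxes.
        self.predictor.set_image(image_rgb)
        masks, _, _ = self.predictor.predict(
            box=np.asarray(box_xyxy), multimask_output=False
        )
        return masks[0]

def run_pipeline(image_rgb, boxes, segmenter: BoxPromptedSegmenter):
    # The detection stage is unchanged; only the mask backbone varies.
    return [segmenter.segment(image_rgb, b) for b in boxes]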

6. Prospective Directions

Proposed future extensions include:

  • End-to-end fine-tuning: To improve alignment across detection and segmentation stages.
  • Vision–language large model controllers: Automating expert selection and pipeline routing.
  • Video and 3D extension: Incorporation of transformer-tracking (e.g., DEVA), NeRFs, or depth estimation modules for temporal and spatial modeling.

These directions suggest that the framework can be adapted for more complex modalities and enhanced automation, leveraging its inherent modular architecture.

References

  • Kirillov, A., et al. (2023). Segment Anything. arXiv:2304.02643.
  • Liu, S., et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499.
  • Ren, T., et al. (2024). Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv:2401.14159.
