
Grounding DINO: Transformer for Open-Vocabulary Detection

Updated 16 November 2025
  • The paper introduces a transformer-based framework that fuses visual and language features to achieve robust open-set detection, phrase grounding, and referring expression comprehension.
  • It leverages multi-stage cross-modal fusion with a Swin-Transformer or EfficientViT backbone and BERT text encoder, eliminating the need for non-maximum suppression in a fully end-to-end design.
  • The model variants, including Grounding DINO 1.5 and Dynamic-DINO, demonstrate state-of-the-art zero-shot detection and efficient real-time deployment with significant architectural innovations.

Grounding DINO is a transformer-based open-set object detection framework that achieves robust performance on open-vocabulary detection (OVD), phrase grounding (PG), and referring expression comprehension (REC) by performing deep cross-modal fusion of vision and language signals. Initially developed by IDEA Research, it forms a core architecture widely adopted for open-vocabulary and grounding tasks. Grounding DINO and its follow-ons, including Grounding DINO 1.5 and Dynamic-DINO, set new benchmarks in zero-shot detection and efficient real-time deployment, notably extending detection to arbitrary categories specified in free-form text.

1. Foundations and Architectural Principles

Grounding DINO builds on the DINO detector ("DETR with Improved deNoising anchOr boxes"), an end-to-end transformer detector in the DETR family. The core architectural advance of Grounding DINO is the introduction of multi-stage, "tight" fusion of visual and language representations throughout the vision backbone, query initialization, and transformer decoder stack (Liu et al., 2023).

Major Components

  • Image Backbone: Swin-Transformer (Tiny/Large) or, more recently, EfficientViT/ViT-L, extracting multi-scale image features $\{F_1, \dots, F_4\}$.
  • Text Backbone: BERT (uncased), encoding variable-length token embeddings $T \in \mathbb{R}^{L \times d}$.
  • Feature Enhancer: Sequential bi-attention and self-attention blocks align language and vision features at multiple scales, including both image → text and text → image attention.
  • Language-Guided Query Selection: The top-$K$ image token embeddings (e.g., $K=900$ in the original, $K=100$ in 1.5 Pro/Edge) judged most relevant to the text prompt via cosine similarity are used as decoder "position" queries, ensuring grounding sensitivity at initialization.
  • Cross-Modality Decoder: Each transformer decoder layer performs (1) self-attention over queries, (2) cross-attention to vision features, (3) cross-attention to language features, and (4) feed-forward transformation. Deep fusion at this stage aligns each query to both image and phrase context.
  • Prediction Head: The decoder output yields bounding-box predictions $B \in \mathbb{R}^{K \times 4}$ and classification logits $C \in \mathbb{R}^{K \times (L_\text{text}+1)}$ over the text tokens.
  • End-to-End Detection: No non-maximum suppression (NMS), with one-to-one Hungarian bipartite matching between queries and ground-truth objects.

This architecture is fully end-to-end differentiable and fundamentally NMS-free. The selection of queries guided by language ensures responsiveness to open-vocabulary prompts and reduces the reliance on fixed, learned queries.
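
To make the decoder mechanics concrete, the following is a minimal PyTorch sketch of language-guided query selection and one cross-modality decoder layer. It is an illustrative reconstruction from the description above, not the reference implementation; all module names, shapes, and defaults are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_queries(img_tokens, txt_tokens, k=900):
    """Language-guided query selection: pick the k image tokens most
    similar (cosine) to any text token; these seed the decoder's
    position queries.  img_tokens: (N_img, d), txt_tokens: (L, d)."""
    sim = F.normalize(img_tokens, dim=-1) @ F.normalize(txt_tokens, dim=-1).T
    return sim.max(dim=-1).values.topk(k).indices  # (k,) token indices

class CrossModalityDecoderLayer(nn.Module):
    """One decoder layer: query self-attention, image cross-attention,
    text cross-attention, then a feed-forward block, all residual."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(4))

    def forward(self, q, img, txt):
        # q: (B, K, d) queries; img: (B, N_img, d); txt: (B, L, d)
        q = self.norms[0](q + self.self_attn(q, q, q)[0])      # (1) self-attention
        q = self.norms[1](q + self.img_attn(q, img, img)[0])   # (2) vision cross-attn
        q = self.norms[2](q + self.txt_attn(q, txt, txt)[0])   # (3) language cross-attn
        return self.norms[3](q + self.ffn(q))                  # (4) feed-forward
```

In the released models the image cross-attention is deformable attention and queries carry separate content/position parts; plain multi-head attention is used here for brevity.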

2. Model Variants: From Grounding DINO to Dynamic-DINO

The evolutionary pathway from Grounding DINO (original, 1.0) through Grounding DINO 1.5 (Pro and Edge) to Dynamic-DINO involves architectural modifications targeting both accuracy and efficiency.

  • Grounding DINO 1.5 Pro: Employs a larger ViT-L backbone, deep early fusion, and is trained on >20 million grounding-annotated images. The decoder uses $L=6$ layers, $d=256$, and $N=100$ queries.
  • Grounding DINO 1.5 Edge: Optimized for edge deployment. EfficientViT-L1 backbone, single-scale early fusion, efficient feature enhancer (EFE) that cross-attends only the deepest image features to language, and light cross-scale upsampling for shallower layers.
  • Design Reductions: The Edge variant removes heavy deformable attention on non-final feature maps, reducing memory and compute by 30–50% relative to Pro.
  • Mixture-of-Experts (MoE) Decoder: The decoder's FFNs are replaced by a conditional MoE supernet. Each original FFN is decomposed into $N$ groups, each with $k$ expert sub-FFNs, amounting to $kN$ experts per layer (a schematic sketch follows this list).
  • Dynamic Routing: At inference, for each input token, a learned router activates only the $k$ most relevant experts (usually $k=2$), keeping active compute equivalent to the vanilla dense decoder while drawing on a much larger parameter pool.
  • Expert Initialization: Pre-trained weights are partitioned/sliced so that their sum exactly reproduces the pre-trained FFN output, and the router is initialized to activate all $k$ experts from the same FFN group, avoiding cold-start degradation.
  • Auxiliary Load-Balancing Loss: Encourages diverse expert utilization, avoiding router collapse to a few dominant experts.
  • Efficiency: Inference cost per token remains constant; only the pool of parameters grows. On edge devices, Dynamic-DINO maintains real-time inference rates with marginally reduced FPS.
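
A schematic sketch of the routed MoE FFN is given below, in PyTorch. The expert widths, routing scheme, and Switch-Transformer-style load-balancing term are illustrative choices consistent with the description above, not Dynamic-DINO's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Conditional MoE replacing a dense decoder FFN of hidden width
    `hidden`.  Slicing a pre-trained dense FFN into `n_experts` sub-FFNs
    of width hidden // n_experts lets the sum of all expert outputs
    reproduce the dense output exactly (the warm start described above);
    random initialization is used here for brevity."""
    def __init__(self, d=256, hidden=2048, n_experts=16, k=2):
        super().__init__()
        assert hidden % n_experts == 0
        h = hidden // n_experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, d))
            for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts)
        self.k = k
        self.aux_loss = torch.tensor(0.0)

    def forward(self, x):                        # x: (tokens, d)
        gates = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        topv, topi = gates.topk(self.k, dim=-1)  # k active experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):               # dispatch to active experts only
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        # Load-balancing term (one common, Switch-style formulation,
        # assumed here): fraction of tokens routed to each expert times
        # that expert's mean gate probability.
        dispatch = F.one_hot(topi[:, 0], len(self.experts)).float().mean(dim=0)
        self.aux_loss = len(self.experts) * (dispatch * gates.mean(dim=0)).sum()
        return out
```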

3. Training Objectives, Data, and Protocols

Grounding DINO and its descendants employ multiple objectives tailored for open-set detection, grounding, and referring comprehension tasks.

Loss Functions

The overall loss per image-text pair is a sum over matched query-ground-truth pairs:

  • Classification (Contrastive Focal Loss):

$$\mathrm{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where $p_t$ is the predicted probability of the true label.
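
The formula translates directly to code; the function below is a minimal sketch for a single matched prediction, using the $\alpha$/$\gamma$ defaults quoted later in this section.

```python
import torch

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss given the predicted probability p_t of the true label."""
    return -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)

print(focal_loss(torch.tensor(0.9)))  # confident prediction -> tiny loss
print(focal_loss(torch.tensor(0.1)))  # hard example -> much larger loss
```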

  • Box Regression: Sum of $L_1$ and GIoU losses:

$$L_1(B_i, b_j) = \| B_i - b_j \|_1 \qquad L_\text{GIoU}(B_i, b_j) = 1 - \mathrm{GIoU}(B_i, b_j)$$

  • Grounding-specific Loss (1.5): Additional binary cross-entropy on grounding scores.
  • Auxiliary Losses: Included on all decoder layers and also encoder outputs.

Bipartite (Hungarian) matching of predictions to ground truths determines the allocation of supervision, as sketched below. Typical weights: $\lambda_1 = 5.0$ ($L_1$ regression), $\lambda_2 = 2.0$ (GIoU), focal $\alpha = 0.25$, $\gamma = 2$.
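
The following NumPy/SciPy sketch illustrates the Hungarian matching step with the weights above. The GIoU helper assumes (x1, y1, x2, y2) boxes, and the classification cost is simplified to the negative probability of the true label; both are illustrative assumptions rather than the papers' exact cost terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def giou(a, b, eps=1e-7):
    """Generalized IoU for broadcastable (..., 4) boxes in (x1, y1, x2, y2)."""
    ix1, iy1 = np.maximum(a[..., 0], b[..., 0]), np.maximum(a[..., 1], b[..., 1])
    ix2, iy2 = np.minimum(a[..., 2], b[..., 2]), np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter
    ex1, ey1 = np.minimum(a[..., 0], b[..., 0]), np.minimum(a[..., 1], b[..., 1])
    ex2, ey2 = np.maximum(a[..., 2], b[..., 2]), np.maximum(a[..., 3], b[..., 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / (union + eps) - (enclose - union) / (enclose + eps)

def hungarian_match(pred_boxes, pred_probs, gt_boxes, gt_labels,
                    w_l1=5.0, w_giou=2.0, w_cls=1.0):
    """One-to-one matching of K predictions to M ground truths by
    minimizing the combined L1 + GIoU + classification cost.
    pred_boxes: (K, 4); pred_probs: (K, C); gt_boxes: (M, 4); gt_labels: (M,)."""
    l1 = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)   # (K, M)
    g = 1.0 - giou(pred_boxes[:, None], gt_boxes[None])         # (K, M)
    cls = -pred_probs[:, gt_labels]                             # (K, M)
    cost = w_l1 * l1 + w_giou * g + w_cls * cls
    return linear_sum_assignment(cost)  # (pred indices, gt indices)
```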

Dataset Regimes

  • Pre-training datasets: Objects365 V1/V2 (1.0–1.7M), OpenImages-V6 (1.5M), V3Det (245K), GoldG (GQA + Flickr30K, ~150K), GRIT (~9M), plus CC/SBU/YFCC for weak supervision. Grounding-20M (1.5) aggregates diverse web and public grounding annotations.
  • Fine-tuning datasets: COCO, LVIS, RefCOCO/+/g, D³, Flickr30K Entities. Negative sampling is employed to mitigate hallucination.

Training Protocol

  • Optimization: AdamW optimizer, learning rate $\sim 10^{-4}$, weight decay $10^{-4}$–$0.05$.
  • Batching: Batch size $128$ on $32 \times$ RTX 3090 GPUs for MM-Grounding-DINO; $32$–$64$ across up to $64$ GPUs in earlier variants.
  • Duration: Pre-training runs up to $30$ epochs; fine-tuning follows the "1×" schedule of $12$ epochs. In Dynamic-DINO, MoE-tuning substitutes for expensive pre-training, delivering superior accuracy with one-tenth the data (see the optimizer sketch below).
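
A minimal sketch of the optimizer setup described above, in PyTorch. The reduced backbone learning rate is a common DETR-family convention assumed here for illustration, not a value quoted in the papers.

```python
import torch

def build_optimizer(model, lr=1e-4, weight_decay=1e-4, backbone_lr_mult=0.1):
    """AdamW with the hyperparameters quoted above, with a lower
    learning rate for image-backbone parameters (assumed convention)."""
    backbone, rest = [], []
    for name, p in model.named_parameters():
        (backbone if "backbone" in name else rest).append(p)
    return torch.optim.AdamW(
        [{"params": rest, "lr": lr},
         {"params": backbone, "lr": lr * backbone_lr_mult}],
        weight_decay=weight_decay)
```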

4. Quantitative Outcomes and Ablation Insights

Grounding DINO achieves state-of-the-art performance on leading open-set detection and grounding benchmarks, and ablation studies clarify the contribution of each architectural element.

Performance Benchmarks

| Model | COCO AP (ZS) | LVIS AP (ZS) | RefCOCO/+/g Acc | Edge FPS (TensorRT) |
|---|---|---|---|---|
| GLIP-T | 46.6 | 26.7 | 50.4 / 43.1 | – |
| G-DINO-T (O365+...) | 48.4 | 28.8 | 50.8 / 51.6 / 60.4 | 9.4 / 42.6 |
| MM-G-DINO-T (c3) | 50.6 | 41.4 | 53.1 / 52.7 / 62.9 | – |
| G-DINO-L | 52.5 | 33.9 | 91.9 / 88.7 / 84.1 | – |
| G-DINO 1.5 Pro | 54.3 | 55.7 | – | – |
| G-DINO 1.5 Edge | 45.0 | 36.2 | – | 18.5 / 75.2 |
| Dynamic-DINO-16x2 | 46.2 | 36.2 | – | 17.1 / 98.0 |

"ZS" = zero-shot (no target domain training); FPS on NVIDIA A100/TensorRT, 800×1333."

  • Grounding DINO and its 1.5 variants consistently outperform prior OVD/grounding models in mAP/AP and Recall, especially under zero-shot conditions.
  • Dynamic-DINO, fine-tuned on just 1.56M open-source images, matches or exceeds Grounding DINO 1.5 Edge trained on 20M, highlighting efficient data utilization (Lu et al., 23 Jul 2025).

Ablation Insights

  • Removal of encoder fusion, query selection, or text cross-attention each results in significant performance drops, especially on rare classes.
  • Deep multi-stage fusion and sub-sentence tokenization are critical for reliable phrase/object linkages.
  • Dynamic-DINO's MoE tuning provides monotonic gains with increased expert count ($N$), while $k > 2$ active experts may overfit on limited data. Router/expert initialization is critical for stable fine-tuning.

5. Open-Source Implementations and Reproducibility

  • MM-Grounding-DINO (MMDetection): Fully specifies architecture, optimizer, data pipeline, and training/inference scripts for end-to-end reproducibility (Zhao et al., 4 Jan 2024).
    • Example configuration specifies Swin-Transformer backbone, BERT text encoder, feature enhancer, and 900 queries for "Tiny" models.
  • Grounding DINO 1.5 API (IDEA Research): Provides ready model weights (Pro and Edge), /detect and /visualize endpoints, and a parameterized Python API for customized detection and visualization (Ren et al., 16 May 2024); an illustrative call follows this list.
    • Users specify detection/grounding thresholds and natural-language prompts at runtime.
  • Dynamic-DINO: Extends fast Edge variants with MoE tuning; open-sourced inference retains hardware efficiency while scaling total parameters.
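
An illustrative call against a /detect-style endpoint is shown below. The URL, payload fields, and response schema are placeholders, not the documented IDEA Research API; consult the official docs for the real schema and authentication.

```python
import requests

# Placeholder URL and field names; the real endpoint schema differs.
API_URL = "https://api.example.com/grounding_dino/detect"

payload = {
    "image": "https://example.com/street.jpg",      # image URL or base64 string
    "prompt": "a red car . a person on a bicycle",  # free-form text categories
    "box_threshold": 0.25,   # detection confidence threshold
    "text_threshold": 0.25,  # phrase-grounding threshold
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": "Bearer <TOKEN>"})
for det in resp.json().get("detections", []):
    print(det["bbox"], det["phrase"], det["score"])
```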

6. Edge Deployment and Practical Considerations

Grounding DINO 1.5 Edge and Dynamic-DINO directly address the requirements of real-time and resource-constrained environments.

  • Pruning and Scaling: Edge models remove deformable attention in lower-level features, leverage EfficientViT, and consolidate multi-scale features for hardware simplicity. Dynamic-DINO dynamically gates expert FFNs, reducing active parameter count per sample.
  • Hardware Optimization: TensorRT kernel fusion yields up to 75 FPS on A100 GPUs; on Jetson Orin NX, Edge/Dynamic-DINO maintain >10 FPS at $640 \times 640$ input (an export sketch follows this list).
  • Memory and Compute: Edge models reduce memory footprint by up to 50%. Dynamic-DINO increases the total parameter pool (~+25%) without elevating per-sample compute.
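
A minimal export sketch toward a TensorRT engine, assuming PyTorch and a vision-only wrapper module; a real Grounding DINO export would also need the tokenized text prompt as an input.

```python
import torch

def export_onnx(model: torch.nn.Module, path: str = "gdino_edge.onnx"):
    """Trace-export a detector to ONNX at the 640x640 edge resolution;
    a TensorRT engine can then be built from the ONNX file, e.g. with
    `trtexec --onnx=gdino_edge.onnx --fp16`."""
    model.eval()
    dummy = torch.randn(1, 3, 640, 640)  # one RGB image; text inputs omitted
    torch.onnx.export(model, dummy, path, opset_version=17,
                      input_names=["image"], output_names=["boxes", "logits"])
```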

A notable feature of Dynamic-DINO is the emergent specialization and cooperation patterns among decoder experts: shallow-layer experts participate broadly across tokens, while deep-layer experts form stable collaborative structures with only 2–3 partners, reflecting pattern specialization.

7. Extensions, Limitations, and Outlook

Grounding DINO robustly addresses open-vocabulary detection and grounding but exhibits several technical boundaries:

  • Strengths:
    • Tight cross-modality fusion at all model stages yields robust open-vocabulary performance.
    • NMS-free, fully end-to-end training and inference paradigm.
    • Demonstrated generalization across COCO, LVIS, ODinW, and multiple grounding/evaluation protocols.
    • Efficient edge deployment variants suitable for real-time batching and streaming.
  • Limitations:
    • No built-in segmentation head (only bounding boxes).
    • Scaling to very large models and datasets increases training complexity and requires further optimization (e.g., kernel fusion for MoE).
    • Slight drop on rare/long-tail categories, partially due to fixed query/token budget.
  • Suggested Future Directions:
    • Integration of segmentation mask heads for panoptic/instance segmentation.
    • Dynamic adjustment of query count for long-tailed visual domains.
    • Extension of MoE schemes to early encoder and text backbones.
    • Multimodal pre-training with broader data sources, such as videos and 3D environments.
    • Development of resource-aware routing and coarse-to-fine multi-stage expert selection for efficiency.

Grounding DINO exemplifies a comprehensive solution for general-purpose, language-driven open-set object detection and grounding. Its open-source derivatives and efficient tuning approaches, such as MoE-tuned Dynamic-DINO, continue to set the pace for deployable, accurate, and extensible vision-LLMs in both cloud and edge settings (Liu et al., 2023, Zhao et al., 4 Jan 2024, Ren et al., 16 May 2024, Lu et al., 23 Jul 2025).
