Grounding DINO: Transformer for Open-Vocabulary Detection
- The paper introduces a transformer-based framework that fuses visual and language features to achieve robust open-set detection, phrase grounding, and referring expression comprehension.
- It leverages multi-stage cross-modal fusion with a Swin-Transformer or EfficientViT backbone and BERT text encoder, eliminating the need for non-maximum suppression in a fully end-to-end design.
- The model variants, including Grounding DINO 1.5 and Dynamic-DINO, demonstrate state-of-the-art zero-shot detection and efficient real-time deployment with significant architectural innovations.
Grounding DINO is a transformer-based open-set object detection framework that achieves robust performance on open-vocabulary detection (OVD), phrase grounding (PG), and referring expression comprehension (REC) by performing deep cross-modal fusion of vision and language signals. Initially developed by IDEA Research, it forms a core architecture widely adopted for open-vocabulary and grounding tasks. Grounding DINO and its follow-ons, including Grounding DINO 1.5 and Dynamic-DINO, set new benchmarks in zero-shot detection and efficient real-time deployment, notably extending detection to arbitrary categories specified in free-form text.
1. Foundations and Architectural Principles
Grounding DINO builds on the DINO detector ("DETR with Improved deNoising anchOr boxes"), an end-to-end transformer detector. The core architectural advance of Grounding DINO is the introduction of multi-stage, "tight" fusion of visual and language representations throughout the vision backbone, query initialization, and transformer decoder stack (Liu et al., 2023).
Major Components
- Image Backbone: Swin-Transformer (Tiny/Large) or, more recently, EfficientViT/ViT-L, extracting multi-scale image features.
- Text Backbone: BERT-uncased, encoding variable-length token embeddings.
- Feature Enhancer: Sequential bi-attention and self-attention blocks align language and vision features at multiple scales, including both image → text and text → image attention.
- Language-Guided Query Selection: The top-$K$ image token embeddings (e.g., $K = 900$ in the original model) judged most relevant to the text prompt via cosine similarity are used as decoder "position" queries, ensuring grounding sensitivity at initialization (a short sketch follows below).
- Cross-Modality Decoder: Each transformer decoder layer performs (1) self-attention over queries, (2) cross-attention to vision features, (3) cross-attention to language features, and (4) feed-forward transformation. Deep fusion at this stage aligns each query to both image and phrase context.
- Prediction Head: The output from the decoder yields bounding box predictions and classification logits.
- End-to-End Detection: No non-maximum suppression (NMS), with one-to-one Hungarian bipartite matching between queries and ground-truth objects.
This architecture is fully end-to-end differentiable and fundamentally NMS-free. The selection of queries guided by language ensures responsiveness to open-vocabulary prompts and reduces the reliance on fixed, learned queries.
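The language-guided query selection step can be illustrated with a short PyTorch-style sketch. Tensor names, shapes, and the use of a plain dot-product similarity are assumptions for illustration, not the released implementation.

```python
import torch


def select_language_guided_queries(image_feats, text_feats, num_queries=900):
    """Pick the image tokens most similar to the text prompt as decoder queries.

    image_feats: (num_image_tokens, d) fused image token embeddings
    text_feats:  (num_text_tokens, d)  fused text token embeddings
    """
    # Similarity of every image token to every text token (plain dot product
    # here; a cosine variant would L2-normalize both sides first).
    sim = image_feats @ text_feats.t()                  # (num_img, num_txt)
    # Score each image token by its best-matching text token.
    scores = sim.max(dim=-1).values                     # (num_img,)
    # Keep the top-K tokens; their positions seed the decoder "position" queries.
    topk = torch.topk(scores, k=min(num_queries, scores.numel())).indices
    return topk, image_feats[topk]


# Toy usage with random 256-dim features (flattened multi-scale image tokens).
img_tokens = torch.randn(17821, 256)
txt_tokens = torch.randn(12, 256)
idx, queries = select_language_guided_queries(img_tokens, txt_tokens)
print(idx.shape, queries.shape)   # torch.Size([900]) torch.Size([900, 256])
```

The selected token embeddings (or their box-coordinate projections) initialize the decoder queries, so the decoder starts from regions already relevant to the prompt rather than from fixed, learned embeddings.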
2. Model Variants: From Grounding DINO to Dynamic-DINO
The evolutionary pathway from Grounding DINO (original, 1.0) through Grounding DINO 1.5 (Pro and Edge) to Dynamic-DINO involves architectural modifications targeting both accuracy and efficiency.
Grounding DINO 1.5 (Ren et al., 16 May 2024)
- Grounding DINO 1.5 Pro: Employs a larger ViT-L backbone with deep early fusion and is trained on more than 20 million grounding-annotated images.
- Grounding DINO 1.5 Edge: Optimized for edge deployment. EfficientViT-L1 backbone, single-scale early fusion, efficient feature enhancer (EFE) that cross-attends only the deepest image features to language, and light cross-scale upsampling for shallower layers.
- Design Reductions: The Edge variant removes heavy deformable attention on non-final feature maps, reducing memory and compute by 30–50% relative to Pro.
Dynamic-DINO (Lu et al., 23 Jul 2025)
- Mixture-of-Experts (MoE) Decoder: The decoder’s FFNs are replaced by a conditional MoE supernet. Each original FFN is decomposed into groups of smaller expert sub-FFNs, yielding a pool of experts per layer (a routing sketch follows this list).
- Dynamic Routing: At inference, a learned router activates only a small number of the most relevant experts for each input token, keeping active compute equivalent to the vanilla dense decoder while drawing on a much larger parameter pool.
- Expert Initialization: Pre-trained weights are partitioned/sliced such that their sum exactly reproduces the pre-trained FFN output, and the router is initialized to activate all experts from the same FFN group, avoiding cold-start degradation.
- Auxiliary Load-Balancing Loss: Encourages diverse expert utilization, avoiding router collapse to a few dominant experts.
- Efficiency: Inference cost per token remains constant; only the pool of parameters grows. On edge devices, Dynamic-DINO maintains real-time inference rates with marginally reduced FPS.
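The routed-expert decoder FFN can be summarized with a minimal PyTorch sketch of top-k expert routing plus a simple load-balancing penalty. Expert counts, hidden sizes, the routing formula, and the auxiliary-loss form below are illustrative assumptions, not the released Dynamic-DINO code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFFN(nn.Module):
    """Top-k routed mixture-of-experts feed-forward block (illustrative)."""

    def __init__(self, d_model=256, d_ffn=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert covers a slice of the original hidden width, so a suitable
        # initialization of the slices can reproduce the dense pre-trained FFN.
        d_expert = d_ffn // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                               # x: (num_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)     # (tokens, experts)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        # Simple load-balancing signal: penalize uneven average expert usage.
        usage = weights.mean(dim=0)
        aux_loss = (usage ** 2).sum() * weights.shape[-1]
        return out, aux_loss


# Toy usage on 900 decoder query tokens.
queries = torch.randn(900, 256)
moe = MoEFFN()
y, aux = moe(queries)
print(y.shape, float(aux))
```

Because only `top_k` experts run per token, the per-token compute stays close to the dense FFN even though the total parameter pool is much larger.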
3. Training Objectives, Data, and Protocols
Grounding DINO and its descendants employ multiple objectives tailored for open-set detection, grounding, and referring comprehension tasks.
Loss Functions
The overall loss per image-text pair is a sum over matched query-ground-truth pairs:
- Classification (Contrastive Focal Loss): $\mathcal{L}_{\mathrm{cls}} = -\alpha (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the predicted probability of the true label.
- Box Regression: Sum of $\ell_1$ and GIoU losses: $\mathcal{L}_{\mathrm{box}} = \lambda_{L1} \lVert b - \hat{b} \rVert_1 + \lambda_{\mathrm{GIoU}} \, \mathcal{L}_{\mathrm{GIoU}}(b, \hat{b})$.
- Grounding-specific Loss (1.5): Additional binary cross-entropy on grounding scores.
- Auxiliary Losses: Included on all decoder layers and also encoder outputs.
Bipartite (Hungarian) matching of predictions to ground truths determines the allocation of supervision. Typical weights follow DETR/DINO conventions: $\lambda_{L1} = 5$ (regression), $\lambda_{\mathrm{GIoU}} = 2$ (GIoU), focal $\alpha = 0.25$, $\gamma = 2$. A compact matching sketch follows.
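The sketch below combines a focal-style classification cost with $\ell_1$ and GIoU box costs and solves the assignment with SciPy, assuming PyTorch and torchvision's GIoU utility. The weights reuse the conventional DETR/DINO values quoted above, and box formats and shapes are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou


def match(pred_logits, pred_boxes, gt_labels, gt_boxes,
          w_cls=2.0, w_l1=5.0, w_giou=2.0, alpha=0.25, gamma=2.0):
    """pred_logits: (Q, C); pred_boxes/gt_boxes: (Q, 4)/(G, 4) in xyxy; gt_labels: (G,)."""
    prob = pred_logits.sigmoid()[:, gt_labels]                # (Q, G)
    # Focal-style classification cost: positive term minus negative term.
    pos = alpha * (1 - prob) ** gamma * (-(prob + 1e-8).log())
    neg = (1 - alpha) * prob ** gamma * (-(1 - prob + 1e-8).log())
    cost_cls = pos - neg
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)          # (Q, G)
    cost_giou = -generalized_box_iou(pred_boxes, gt_boxes)    # (Q, G)
    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    q_idx, g_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, g_idx


# Toy example: 5 queries, 3 classes, 2 ground-truth boxes (sorted so x1<x2, y1<y2).
ql, qb = torch.randn(5, 3), torch.rand(5, 4).sort(dim=-1).values
gl, gb = torch.tensor([0, 2]), torch.rand(2, 4).sort(dim=-1).values
print(match(ql, qb, gl, gb))
```

The matched pairs then receive the classification and box losses; unmatched queries are supervised toward the background (no-object) target.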
Dataset Regimes
- Pre-training datasets: Objects365 V1/V2 ($1.0$–$1.7$M), OpenImages-V6 ($1.5$M), V3Det (245K), GoldG (GQA+Flickr30K 150K), GRIT (9M), plus CC/SBU/YFCC for weak supervision. Grounding-D20M (1.5) aggregates diverse web and public groundings.
- Fine-tuning datasets: COCO, LVIS, RefCOCO/+/g, D³, Flickr30K Entities. Negative sampling is employed to mitigate hallucination.
Training Protocol
- Optimization: AdamW optimizer with a base learning rate on the order of $10^{-4}$ and weight decay between $10^{-4}$ and $0.05$, depending on the variant (a minimal optimizer sketch follows this list).
- Batching: total batch size of $128$ distributed across multiple GPUs for MM-Grounding-DINO; $32$–$64$ across up to $64$ GPUs in earlier variants.
- Duration: Pre-training runs for up to $30$ epochs; fine-tuning follows the "1×" schedule of $12$ epochs. In Dynamic-DINO, MoE-tuning substitutes for expensive pre-training, delivering superior accuracy with one-tenth the dataset.
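A minimal optimizer setup consistent with the protocol above, assuming PyTorch; the parameter-group split and the backbone learning-rate multiplier are common practice for DETR-style detectors and are assumptions here rather than the published recipe.

```python
import torch


def build_optimizer(model, base_lr=1e-4, backbone_lr_mult=0.1, weight_decay=1e-4):
    """Group parameters so the image backbone trains with a smaller learning rate."""
    backbone, rest = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (backbone if "backbone" in name else rest).append(param)
    groups = [{"params": rest, "lr": base_lr}]
    if backbone:
        groups.append({"params": backbone, "lr": base_lr * backbone_lr_mult})
    return torch.optim.AdamW(groups, lr=base_lr, weight_decay=weight_decay)
```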
4. Quantitative Outcomes and Ablation Insights
Grounding DINO achieves state-of-the-art performance on leading open-set detection and grounding benchmarks, and ablation studies clarify the contribution of each architectural element.
Performance Benchmarks
| Model | COCO AP (ZS) | LVIS AP (ZS) | RefCOCO/g Acc | Edge FPS (TensorRT) |
|---|---|---|---|---|
| GLIP-T | 46.6 | 26.7 | 50.4 / 43.1 | — |
| G-DINO-T (O365+...) | 48.4 | 28.8 | 50.8/51.6/60.4 | 9.4 / 42.6 |
| MM-G-DINO-T (c3) | 50.6 | 41.4 | 53.1/52.7/62.9 | — |
| G-DINO-L | 52.5 | 33.9 | 91.9/88.7/84.1 | — |
| G-DINO 1.5 Pro | 54.3 | 55.7 | — | — |
| G-DINO 1.5 Edge | 45.0 | 36.2 | — | 18.5 / 75.2 |
| Dynamic-DINO-16x2 | 46.2 | 36.2 | — | 17.1 / 98.0 |
"ZS" = zero-shot (no target domain training); FPS on NVIDIA A100/TensorRT, 800×1333."
- Grounding DINO and its 1.5 variants consistently outperform prior OVD/grounding models in mAP/AP and Recall, especially under zero-shot conditions.
- Dynamic-DINO, fine-tuned on just 1.56M open-source images, matches or exceeds Grounding DINO 1.5 Edge trained on 20M, highlighting efficient data utilization (Lu et al., 23 Jul 2025).
Ablation Insights
- Removal of encoder fusion, query selection, or text cross-attention each results in significant performance drops, especially on rare classes.
- Deep multi-stage fusion and sub-sentence tokenization are critical for reliable phrase/object linkages.
- Dynamic-DINO's MoE tuning provides monotonic gains as the expert count increases, while splitting the FFN into too many fine-grained experts may overfit on limited data. Router/expert initialization is critical for stable fine-tuning.
5. Open-Source Implementations and Reproducibility
- MM-Grounding-DINO (MMDetection): Fully specifies architecture, optimizer, data pipeline, and training/inference scripts for end-to-end reproducibility (Zhao et al., 4 Jan 2024).
- Example configuration specifies Swin-Transformer backbone, BERT text encoder, feature enhancer, and 900 queries for "Tiny" models.
- Grounding DINO 1.5 API (IDEA Research): Provides ready model weights (Pro and Edge), `/detect` and `/visualize` endpoints, and a parameterized Python API for customized detection and visualization (Ren et al., 16 May 2024).
- Users specify detection/grounding thresholds and natural-language prompts at runtime (an illustrative client call is sketched after this list).
- Dynamic-DINO: Extends fast Edge variants with MoE tuning; open-sourced inference retains hardware efficiency while scaling total parameters.
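A hypothetical client call against a `/detect`-style endpoint is sketched below. The URL, payload fields, thresholds, and response schema are illustrative assumptions, not the documented Grounding DINO 1.5 API contract.

```python
import base64
import requests

API_URL = "https://example-grounding-dino-api/detect"   # placeholder URL, not the real service

with open("street.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,                          # image encoded for transport (assumed field)
    "prompt": "a person riding a red bicycle",   # free-form text prompt
    "box_threshold": 0.3,                        # detection confidence threshold (assumed field)
    "text_threshold": 0.25,                      # phrase-grounding threshold (assumed field)
}
resp = requests.post(API_URL, json=payload, timeout=30)
for det in resp.json().get("detections", []):    # assumed response schema
    print(det.get("label"), det.get("score"), det.get("box"))
```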
6. Edge Deployment and Practical Considerations
Grounding DINO 1.5 Edge and Dynamic-DINO directly address the requirements of real-time and resource-constrained environments.
- Pruning and Scaling: Edge models remove deformable attention in lower-level features, leverage EfficientViT, and consolidate multi-scale features for hardware simplicity. Dynamic-DINO dynamically gates expert FFNs, reducing active parameter count per sample.
- Hardware Optimization: TensorRT kernel fusion yields up to 75 FPS on A100 GPUs; on Jetson Orin NX, the Edge and Dynamic-DINO variants sustain roughly 10 FPS or better at deployment input resolution (an export sketch follows this list).
- Memory and Compute: Edge models reduce memory footprint by up to 50%. Dynamic-DINO increases the total parameter pool (roughly +25%) without elevating per-sample compute.
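One common deployment path is to export the detector to ONNX and then build a TensorRT engine with `trtexec`. The sketch below is generic and assumes a model whose forward pass takes a single image tensor; Grounding DINO additionally needs text inputs (or pre-computed text embeddings), which are omitted here for brevity.

```python
import torch


def export_to_onnx(model, onnx_path="gdino_edge.onnx", height=640, width=640):
    """Export a detector to ONNX for subsequent TensorRT engine building."""
    model.eval()
    dummy_image = torch.randn(1, 3, height, width)
    # Input/output names and the single-image signature are illustrative;
    # real exports also trace the text branch or bake in fixed text embeddings.
    torch.onnx.export(
        model, (dummy_image,), onnx_path,
        input_names=["image"], output_names=["boxes", "logits"],
        opset_version=17, do_constant_folding=True)


# Afterwards, e.g.: `trtexec --onnx=gdino_edge.onnx --fp16` builds a TensorRT engine.
```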
A notable feature of Dynamic-DINO is the emergent specialization and cooperation patterns among decoder experts: shallow-layer experts participate broadly across tokens, while deep-layer experts form stable collaborative structures with only 2–3 partners, reflecting pattern specialization.
7. Extensions, Limitations, and Outlook
Grounding DINO robustly addresses open-vocabulary detection and grounding but exhibits several technical boundaries:
- Strengths:
- Tight cross-modality fusion at all model stages yields robust open-vocabulary performance.
- NMS-free, fully end-to-end training and inference paradigm.
- Demonstrated generalization across COCO, LVIS, ODinW, and multiple grounding/evaluation protocols.
- Efficient edge deployment variants suitable for real-time batching and streaming.
- Limitations:
- No built-in segmentation head (only bounding boxes).
- Scaling to very large models and datasets increases training complexity and requires further optimization (e.g., kernel fusion for MoE).
- Slight drop on rare/long-tail categories, partially due to fixed query/token budget.
- Suggested Future Directions:
- Integration of segmentation mask heads for panoptic/instance segmentation.
- Dynamic adjustment of query count for long-tailed visual domains.
- Extension of MoE schemes to early encoder and text backbones.
- Multimodal pre-training with broader data sources, such as videos and 3D environments.
- Development of resource-aware routing and coarse-to-fine multi-stage expert selection for efficiency.
Grounding DINO exemplifies a comprehensive solution for general-purpose, language-driven open-set object detection and grounding. Its open-source derivatives and efficient tuning approaches, such as MoE-tuned Dynamic-DINO, continue to set the pace for deployable, accurate, and extensible vision-LLMs in both cloud and edge settings (Liu et al., 2023, Zhao et al., 4 Jan 2024, Ren et al., 16 May 2024, Lu et al., 23 Jul 2025).