Grounding DINO 1.5 Edge
- The paper introduces Grounding DINO 1.5 Edge, a real-time open-set detector that balances detection accuracy with low-latency performance for edge deployment.
- It employs a streamlined transformer architecture with an EfficientViT-L1 backbone and optimized single-scale cross-attention, reaching 75.2 FPS on an A100 GPU with TensorRT.
- Extensive pre-training on the Grounding-20M dataset, combined with INT8 quantization and an efficient inference pipeline, yields robust zero-shot detection suited to robotics, mobile, and embedded systems.
Grounding DINO 1.5 Edge is a real-time, open-set object detector developed by IDEA Research and introduced as a lightweight member of the Grounding DINO 1.5 family, optimized for deployment on edge platforms. It targets applications demanding both low-latency inference and robust zero-shot performance, combining architectural efficiency with expansive dataset pre-training to advance the state of open-set object detection suitable for robotics, mobile, and embedded systems (Ren et al., 16 May 2024).
1. Architecture and Model Design
Grounding DINO 1.5 Edge adopts a streamlined transformer-based architecture, specifically designed to maximize the trade-off between detection accuracy and computational efficiency. The model's backbone is EfficientViT-L1, which significantly reduces the parameter count to approximately 22 million while extracting three feature maps, denoted $P_3$, $P_4$, and $P_5$, at spatial strides of 8, 16, and 32.
To minimize computational overhead, only the stride-32 map $P_5$ participates in cross-attention with the language queries, reducing the cross-modal token count from 8,400 (all three scales) to 400 for a $640 \times 640$ input. Deformable self-attention is replaced with vanilla multi-head self-attention (8 heads), and a lightweight cross-scale fusion module integrates $P_3$ and $P_4$ into the $P_5$ stream via addition and convolutions.
The detection head replicates that of Grounding DINO, employing a stack of transformer decoder layers operating on learnable object queries. Decoder outputs are dispatched to parallel classification and box-regression heads, which together account for roughly 8 million parameters. The full pipeline comprises approximately 35 million parameters and 15.6 GFLOPs per image.
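A minimal PyTorch-style sketch of this single-scale fusion idea is given below; the embedding dimension, layer choices, and downsampling details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class EfficientFeatureEnhancer(nn.Module):
    """Sketch: only the stride-32 map attends to itself and to the text
    tokens; the stride-8/16 maps are folded back in with strided convs and
    addition. Dimensions and layer choices are assumptions."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # vanilla multi-head self-attention over the 400 stride-32 tokens
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # single-scale image-text cross-attention
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # bring P3 (stride 8) and P4 (stride 16) down to stride 32 for fusion
        self.down_p3 = nn.Conv2d(dim, dim, kernel_size=3, stride=4, padding=1)
        self.down_p4 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, p3, p4, p5, text):           # p*: (B, C, H, W); text: (B, T, C)
        b, c, h, w = p5.shape                      # 640x640 input -> h = w = 20
        x = p5.flatten(2).transpose(1, 2)          # (B, 400, C) image tokens
        x = x + self.self_attn(x, x, x)[0]         # vanilla MHSA, 8 heads
        x = x + self.cross_attn(x, text, text)[0]  # fuse language at one scale only
        x = x.transpose(1, 2).reshape(b, c, h, w)
        return x + self.down_p3(p3) + self.down_p4(p4)  # cross-scale fusion by addition

# Shape check for a 640x640 input: strides 8/16/32 give 80/40/20 grids.
enhancer = EfficientFeatureEnhancer()
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
fused = enhancer(p3, p4, p5, torch.randn(1, 16, 256))
print(fused.shape)  # torch.Size([1, 256, 20, 20])
```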
2. Training Data, Objectives, and Procedure
Grounding DINO 1.5 Edge is trained on the "Grounding-20M" dataset, containing over 20 million images with grounding annotations assembled from public sources. Importantly, no COCO or LVIS images are included in pre-training to ensure genuine zero-shot evaluation. Training episodes comprise a random mix of detection and grounding tasks, and employ negative-prompt sampling to mitigate hallucinated detections.
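The negative-prompt idea can be sketched as follows; the helper name, sampling count, and prompt format below are hypothetical illustrations of the described mechanism.

```python
import random

def build_training_prompt(positives, vocabulary, num_negatives=3):
    """Sketch of negative-prompt sampling: pad the text prompt with category
    names absent from the image so the model learns to predict no box for
    them, curbing hallucinated detections. Names and ratios are assumptions."""
    pool = [c for c in vocabulary if c not in positives]
    negatives = random.sample(pool, min(num_negatives, len(pool)))
    labels = positives + negatives
    random.shuffle(labels)  # avoid positional bias between positives and negatives
    # Grounding DINO-style prompts join category names with " . " separators.
    return " . ".join(labels) + " ."

print(build_training_prompt(["dog", "frisbee"],
                            ["dog", "cat", "frisbee", "car", "tree", "person"]))
# e.g. "car . dog . person . frisbee . tree ."
```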
The optimization objective mirrors that of the original Grounding DINO:

$$\mathcal{L} = \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{ground}}\,\mathcal{L}_{\mathrm{ground}},$$

where
- $\mathcal{L}_{\mathrm{cls}}$ is the focal loss for classification: $\mathcal{L}_{\mathrm{cls}} = -\alpha\,(1 - p_t)^{\gamma} \log p_t$
- $\mathcal{L}_{\mathrm{box}}$ is the sum of $\ell_1$ and GIoU regression losses: $\mathcal{L}_{\mathrm{box}} = \lVert b - \hat{b} \rVert_1 + \big(1 - \mathrm{GIoU}(b, \hat{b})\big)$
- $\mathcal{L}_{\mathrm{ground}}$ is a contrastive loss for grounding, aligning object-query embeddings with the text-token embeddings of their matched phrases
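A compact PyTorch sketch of these terms is given below, computed over matched query-target pairs with Hungarian matching and the loss weights omitted; treating the focal term and the grounding term jointly as one focal loss over query-token alignment logits is a simplification, not the released code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def edge_loss(align_logits, pred_boxes, tgt_onehot, tgt_boxes,
              alpha=0.25, gamma=2.0):
    """Sketch of the matched-pair losses. In Grounding DINO the classification
    logits are query-text alignment scores, so the focal term here doubles as
    the contrastive grounding loss (a simplification)."""
    # Focal loss over query-token alignment logits (targets are float one-hot)
    prob = align_logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(align_logits, tgt_onehot, reduction="none")
    p_t = prob * tgt_onehot + (1 - prob) * (1 - tgt_onehot)
    alpha_t = alpha * tgt_onehot + (1 - alpha) * (1 - tgt_onehot)
    focal = (alpha_t * (1 - p_t) ** gamma * ce).mean()

    # l1 + GIoU regression on matched boxes (xyxy format, pred[i] matched to tgt[i])
    l1 = F.l1_loss(pred_boxes, tgt_boxes, reduction="mean")
    giou = (1 - torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))).mean()
    return focal + l1 + giou

# Demo with 4 matched pairs and a 16-token prompt
logits = torch.randn(4, 16)
tgt = torch.zeros(4, 16)
tgt[torch.arange(4), torch.arange(4)] = 1.0
pb = torch.tensor([[0.1, 0.1, 0.5, 0.5]] * 4)
tb = torch.tensor([[0.1, 0.1, 0.6, 0.6]] * 4)
print(edge_loss(logits, pb, tgt, tb))
```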
3. Edge Deployment and Efficiency Optimization
The design of Grounding DINO 1.5 Edge incorporates several edge-specific optimizations:
- Single-Scale Cross-Attention: Only the stride-32 feature map $P_5$ is cross-fused with the text features, avoiding expensive multi-scale attention.
- Standard Multi-Head Self-Attention: Vanilla attention increases compatibility with deployment toolchains such as TensorRT.
- Quantization: The model is converted from FP32 to INT8 via TensorRT’s calibration tool, significantly reducing memory and compute requirements without accuracy degradation.
- Export and Inference Pipeline: The PyTorch-trained model is exported via ONNX (static shape) and further optimized with TensorRT for layer/kernel fusion and memory planning. INT8 calibration employs 128 images from Grounding-20M.
No pruning or knowledge distillation is employed; efficiency arises strictly from model design and quantization. With these optimizations, Grounding DINO 1.5 Edge achieves high throughput, exemplified by 75.2 FPS on an A100 GPU (TensorRT, FP32), compared to 18.5 FPS in native PyTorch.
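A minimal export path consistent with this pipeline is sketched below; the stand-in model, file names, input shapes, and opset version are illustrative assumptions rather than the released tooling.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Stand-in for the trained detector, just to make the export runnable."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, stride=32)  # toy stride-32 features
        self.text_embed = nn.Embedding(30522, 8)                   # toy text encoder

    def forward(self, image, text_ids):
        feat = self.backbone(image).flatten(2).transpose(1, 2)  # (B, 400, 8) tokens
        txt = self.text_embed(text_ids)                         # (B, T, 8)
        logits = feat @ txt.transpose(1, 2)                     # query-token alignment
        boxes = torch.sigmoid(feat[..., :4])                    # dummy box head
        return logits, boxes

model = TinyDetector().eval()
dummy_image = torch.randn(1, 3, 640, 640)          # static shape, per the paper
dummy_text_ids = torch.randint(0, 30522, (1, 16))  # hypothetical tokenized prompt
torch.onnx.export(
    model,
    (dummy_image, dummy_text_ids),
    "gdino15_edge.onnx",
    input_names=["image", "text_ids"],
    output_names=["logits", "boxes"],
    opset_version=17,  # dynamic_axes left unset -> fully static graph
)

# The ONNX graph is then compiled to an INT8 TensorRT engine, e.g. with trtexec:
#   trtexec --onnx=gdino15_edge.onnx --saveEngine=gdino15_edge.engine --int8
# where INT8 calibration uses a cache built from 128 Grounding-20M images.
```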
4. Empirical Evaluation and Benchmarking
Evaluation is conducted in a zero-shot setting. Key benchmark results, using an input size of 800 (unless otherwise noted), include:
- COCO detection: AP = 45.0
- LVIS-minival: AP = 36.2
- LVIS-val: AP = 29.3
Inference speed benchmarks are reported as:
- A100 GPU: 18.5 FPS (PyTorch) / 75.2 FPS (TensorRT)
- NVIDIA Orin NX: 5.5 FPS
Comparative results with other leading edge-capable detectors:
| Model | Backbone | COCO AP | LVIS-minival AP | A100 FPS (PyTorch / TensorRT) | Orin NX FPS |
|---|---|---|---|---|---|
| OmDet-Turbo-T | Swin-T | 42.5 | 30.3 | 21.5 / 140.0 | – |
| YOLO-World v2-L | YOLOv8-L | – | 33.0 | 37.4 / – | – |
| Grounding DINO-T (base) | Swin-T | 48.4 | 27.4 | 9.4 / 42.6 | 1.1 |
| Grounding DINO 1.5 Edge (640) | EfficientViT-L1 | 42.9 | 33.5 | 21.7 / 111.6 | 10.7 |
| Grounding DINO 1.5 Edge (800) | EfficientViT-L1 | 45.0 | 36.2 | 18.5 / 75.2 | 5.5 |
Grounding DINO 1.5 Edge demonstrates a superior speed-accuracy trade-off, achieving zero-shot AP competitive with larger, slower models (Ren et al., 16 May 2024).
5. Accuracy–Efficiency Trade-offs and Applicability
Grounding DINO 1.5 Edge was engineered to balance detection accuracy with real-time performance demands on moderate compute. Compared to Grounding DINO 1.5 Pro (54.3 AP on COCO at roughly 1 FPS on an A100), Edge achieves 45.0 AP at 75.2 FPS under TensorRT: a 9.3-point decrease in average precision for a roughly 75× increase in throughput. This renders the model optimal for scenarios where low latency (under 20 ms per frame) is a critical operational threshold and AP in the 35-45 range is sufficient, such as closed-loop robotics, mobile/embedded vision, AR/VR, and visual search.
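The latency arithmetic behind that threshold follows directly from the reported throughputs:

$$\frac{1000\ \text{ms}}{75.2\ \text{FPS}} \approx 13.3\ \text{ms per frame (Edge)}, \qquad \frac{1000\ \text{ms}}{1\ \text{FPS}} \approx 1000\ \text{ms per frame (Pro)},$$

so only the Edge variant clears a 20 ms real-time budget.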
6. Influence and Extensions
Grounding DINO 1.5 Edge serves as the architectural substrate for subsequent advances in real-time open-vocabulary detection. Notably, "Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection" extends the backbone and decoder via a conditional-computation mixture-of-experts (MoE) scheme. The MoE adaptation keeps inference FLOPs constant while achieving higher zero-shot accuracy: Dynamic-DINO (16 experts, top-2 gating) scores AP 43.7 on COCO-val (vs. 42.6 for Edge) and AP 33.6 on LVIS-minival (vs. 31.1 for an Edge baseline trained on the same open data), demonstrating the adaptability and extensibility of the base Edge model (Lu et al., 23 Jul 2025).
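A minimal top-2 gating MoE feed-forward layer, of the kind such conditional-computation schemes swap in for the dense FFN, can be sketched as follows; the dimensions, expert sizes, and routing details are assumptions rather than Dynamic-DINO's exact design.

```python
import torch
import torch.nn as nn

class Top2MoEFFN(nn.Module):
    """Sketch of a conditional-computation FFN: route each token to its
    top-2 of 16 experts so per-token FLOPs stay near a dense FFN while
    total capacity grows. Sizes and gating details are assumptions."""

    def __init__(self, dim=256, hidden=1024, num_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.gate(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)          # normalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                routed = idx[:, slot] == e         # tokens whose slot chose expert e
                if routed.any():
                    w = weights[routed, slot].unsqueeze(1)
                    out[routed] += w * expert(x[routed])
        return out

# e.g. 900 decoder queries through the MoE FFN -> same shape out
moe = Top2MoEFFN()
print(moe(torch.randn(900, 256)).shape)  # torch.Size([900, 256])
```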
Grounding DINO 1.5 Edge’s design, combining the EfficientViT-L1 backbone, single-scale cross-attention fusion, and a transformer-based detection head, exemplifies a "sweet spot" for performant open-set detection on edge hardware, providing a scalable foundation for further research and application (Ren et al., 16 May 2024, Lu et al., 23 Jul 2025).