Grounding DINO 1.5 Edge

Updated 17 December 2025
  • The paper introduces Grounding DINO 1.5 Edge, a real-time open-set detector that balances detection accuracy with low-latency performance for edge deployment.
  • It employs a streamlined transformer architecture with an EfficientViT-L1 backbone and optimized single-scale cross-attention, reaching 75.2 FPS on an A100 GPU with TensorRT.
  • Extensive training on the Grounding-20M dataset, combined with quantization and an efficient inference pipeline, yields robust zero-shot detection suited to robotics, mobile, and embedded systems.

Grounding DINO 1.5 Edge is a real-time, open-set object detector developed by IDEA Research and introduced as a lightweight member of the Grounding DINO 1.5 family, optimized for deployment on edge platforms. It targets applications demanding both low-latency inference and robust zero-shot performance, combining architectural efficiency with expansive dataset pre-training to advance the state of open-set object detection suitable for robotics, mobile, and embedded systems (Ren et al., 16 May 2024).

1. Architecture and Model Design

Grounding DINO 1.5 Edge adopts a streamlined transformer-based architecture designed to optimize the trade-off between detection accuracy and computational efficiency. The backbone is EfficientViT-L1, which reduces the backbone parameter count to approximately 22 million while extracting three feature maps at spatial strides of 8, 16, and 32. The feature maps are denoted:

  • $F_3 \in \mathbb{R}^{48 \times \lfloor H/8 \rfloor \times \lfloor W/8 \rfloor}$
  • $F_4 \in \mathbb{R}^{96 \times \lfloor H/16 \rfloor \times \lfloor W/16 \rfloor}$
  • $F_5 \in \mathbb{R}^{192 \times \lfloor H/32 \rfloor \times \lfloor W/32 \rfloor}$

To minimize computational overhead, only the $F_5$ feature map participates in cross-attention with the language queries, reducing the cross-modal token count from 8,400 to 400 for a $640 \times 640$ input. Deformable self-attention is replaced with vanilla multi-head self-attention (8 heads, $d_\text{model} = 256$), and a lightweight cross-scale fusion module integrates $F_3$ and $F_4$ into the $F_5$ stream via addition and $1 \times 1$ convolutions, as sketched below.
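The single-scale design is straightforward to illustrate. The following PyTorch sketch shows one plausible wiring of the cross-scale fusion: $1 \times 1$ projections plus addition at the $F_5$ resolution, producing the 400 stride-32 tokens that alone attend to the text. The pooling-based downsampling, module name, and projection width are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Illustrative sketch: fold F3/F4 into the F5 stream via 1x1
    projections and addition (assumed wiring, not the released code)."""
    def __init__(self, c3=48, c4=96, c5=192, d_model=256):
        super().__init__()
        self.p3 = nn.Conv2d(c3, d_model, 1)
        self.p4 = nn.Conv2d(c4, d_model, 1)
        self.p5 = nn.Conv2d(c5, d_model, 1)

    def forward(self, f3, f4, f5):
        size = f5.shape[-2:]
        # Downsample F3/F4 to F5's stride-32 resolution before addition
        # (pooling is an assumption; a strided conv would also work).
        x3 = F.adaptive_avg_pool2d(self.p3(f3), size)
        x4 = F.adaptive_avg_pool2d(self.p4(f4), size)
        return self.p5(f5) + x4 + x3  # fused stride-32 map

# For a 640x640 input, F5 is 20x20 = 400 tokens -- the only tokens
# that cross-attend to the text (vs. 8,400 for multi-scale fusion).
fusion = CrossScaleFusion()
f3 = torch.randn(1, 48, 80, 80)
f4 = torch.randn(1, 96, 40, 40)
f5 = torch.randn(1, 192, 20, 20)
tokens = fusion(f3, f4, f5).flatten(2).transpose(1, 2)
print(tokens.shape)  # (1, 400, 256)
```

Restricting cross-attention to these 400 tokens cuts the cross-modal token count by roughly 21x relative to the 8,400-token multi-scale variant.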

The detection head replicates that of Grounding DINO, employing a stack of $N_\text{dec} = 6$ transformer decoder layers operating on $Q = 300$ learnable object queries ($d_\text{model} = 256$). Decoder outputs are dispatched to parallel classification and box regression heads (classification: $p_i \in \mathbb{R}^{C_\text{text}+1}$; box: $\hat{b}_i \in \mathbb{R}^4$), totaling roughly 8 million parameters for the head. The full pipeline comprises approximately 35 million parameters and 15.6 GFLOPs per $640 \times 640$ image.
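The head structure can be made concrete with a minimal PyTorch sketch. The feedforward width and the stand-in linear classification branch are assumptions for illustration; in the real model, class scores come from query-text similarity rather than a fixed linear layer.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of a Grounding DINO-style head: 6 decoder layers over
    300 learnable queries, then parallel class/box branches
    (assumed layer sizes; not the released code)."""
    def __init__(self, d_model=256, n_heads=8, n_dec=6, n_queries=300):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_dec)
        # Stand-in: the real model scores queries against text features.
        self.cls_head = nn.Linear(d_model, d_model)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model),
                                      nn.ReLU(),
                                      nn.Linear(d_model, 4))  # cx,cy,w,h

    def forward(self, memory):  # memory: (B, 400, 256) fused image tokens
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)
        return self.cls_head(h), self.box_head(h).sigmoid()

head = DetectionHead()
cls_emb, boxes = head(torch.randn(1, 400, 256))
print(cls_emb.shape, boxes.shape)  # (1, 300, 256) (1, 300, 4)
```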

2. Training Data, Objectives, and Procedure

Grounding DINO 1.5 Edge is trained on the "Grounding-20M" dataset, containing over 20 million images with grounding annotations assembled from public sources. Importantly, no COCO or LVIS images are included in pre-training to ensure genuine zero-shot evaluation. Training episodes comprise a random mix of detection and grounding tasks, and employ negative-prompt sampling to mitigate hallucinated detections.
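The negative-prompt idea can be illustrated with a short sketch: prompts are padded with category names known to be absent from the image, so the model learns to output nothing for them rather than hallucinate boxes. The sampling scheme, separator, and counts below are assumptions, not the paper's exact recipe.

```python
import random

def sample_prompts(pos_labels, vocabulary, n_negatives=10):
    """Illustrative negative-prompt sampling (assumed scheme): pad the
    text prompt with categories absent from the image so the model is
    penalized for matching boxes to them."""
    negatives = random.sample(
        [c for c in vocabulary if c not in pos_labels], n_negatives)
    prompt = list(pos_labels) + negatives
    random.shuffle(prompt)  # avoid positional bias toward positives
    return ". ".join(prompt)

print(sample_prompts({"dog", "frisbee"},
                     ["dog", "cat", "frisbee", "car", "tree", "person",
                      "bottle", "chair", "bird", "boat", "bus", "sofa"]))
```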

The optimization objective mirrors that of the original Grounding DINO:

$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_1 \mathcal{L}_{\text{box}} + \lambda_2 \mathcal{L}_{\text{align}}$$

where

  • $\mathcal{L}_{\text{cls}}$ is the focal loss for classification: $-\sum_i \alpha_t (1-\hat{p}_i)^\gamma \log \hat{p}_i$
  • $\mathcal{L}_{\text{box}}$ is the sum of $\ell_1$ and GIoU regression losses: $\sum_n \left[ \|b_n - \hat{b}_n\|_1 + \left(1 - \mathrm{GIoU}(b_n, \hat{b}_n)\right) \right]$
  • $\mathcal{L}_{\text{align}}$ is a contrastive loss for grounding: $-\sum_{i,j} y_{ij} \log \sigma(\langle f_I(x_i), f_T(t_j) \rangle / \tau)$ (a combined sketch follows this list)
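A minimal PyTorch sketch of the combined objective, assuming Hungarian matching has already paired each prediction with its target, and using torchvision's focal-loss and GIoU helpers. The $\lambda$ weights, box format, and the BCE form of the alignment term are placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou, sigmoid_focal_loss

def detection_loss(cls_logits, pred_boxes, tgt_onehot, tgt_boxes,
                   lambda1=1.0, lambda2=1.0,
                   img_feats=None, txt_feats=None, y=None, tau=0.07):
    """Combined objective on matched pairs (sketch; lambdas are
    placeholders, boxes are xyxy)."""
    # Focal classification loss over matched logits
    l_cls = sigmoid_focal_loss(cls_logits, tgt_onehot, reduction="mean")
    # l1 + GIoU regression; diag picks the matched pairs
    l_l1 = F.l1_loss(pred_boxes, tgt_boxes)
    giou = torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))
    l_box = l_l1 + (1.0 - giou).mean()
    # Region-text contrastive alignment; BCE-with-logits generalizes
    # the positive-only term shown above
    l_align = 0.0
    if img_feats is not None:
        logits = img_feats @ txt_feats.t() / tau  # (regions, tokens)
        l_align = F.binary_cross_entropy_with_logits(logits, y)
    return l_cls + lambda1 * l_box + lambda2 * l_align

# Toy matched pair (xyxy boxes, 80 text categories)
pred = torch.tensor([[0.10, 0.10, 0.60, 0.60]])
tgt = torch.tensor([[0.00, 0.00, 0.50, 0.50]])
print(detection_loss(torch.randn(1, 80), pred, torch.zeros(1, 80), tgt))
```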

3. Edge Deployment and Efficiency Optimization

The design of Grounding DINO 1.5 Edge incorporates several edge-specific optimizations:

  • Single-Scale Cross-Attention: Only $F_5$ is cross-fused with the text features, avoiding expensive multi-scale attention.
  • Standard Multi-Head Self-Attention: Vanilla attention increases compatibility with deployment toolchains such as TensorRT.
  • Quantization: The model is converted from FP32 to INT8 via TensorRT's calibration tool, significantly reducing memory and compute requirements without accuracy degradation.
  • Export and Inference Pipeline: The PyTorch-trained model is exported via ONNX (static $640 \times 640$ shape) and further optimized with TensorRT for layer/kernel fusion and memory planning; INT8 calibration employs 128 images from Grounding-20M (a sketch of this path follows the list).
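The export path can be sketched as follows; the stub module, file names, opset version, and trtexec invocation are illustrative assumptions standing in for the actual trained model and build settings.

```python
import torch
import torch.nn as nn

class EdgeDetectorStub(nn.Module):
    """Tiny stand-in for the trained detector so the export call runs;
    purely illustrative, not the real architecture."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 256, 32, stride=32)  # (B,256,20,20)
        self.cls = nn.Linear(256, 256)
        self.box = nn.Linear(256, 4)

    def forward(self, image):
        f = self.backbone(image).flatten(2).transpose(1, 2)  # (B,400,256)
        return self.cls(f), self.box(f).sigmoid()

model = EdgeDetectorStub().eval()
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "gdino15_edge.onnx",
                  input_names=["image"],
                  output_names=["logits", "boxes"],
                  opset_version=17)  # static 640x640: no dynamic_axes

# Then build an INT8 engine with TensorRT, e.g.:
#   trtexec --onnx=gdino15_edge.onnx --int8 --saveEngine=edge.plan
# INT8 calibration feeds ~128 Grounding-20M images through a
# calibrator (e.g. IInt8EntropyCalibrator2) to fix activation ranges.
```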

No pruning or knowledge distillation is employed; efficiency arises strictly from model design and quantization. With these optimizations, Grounding DINO 1.5 Edge achieves high throughput, exemplified by 75.2 FPS on an A100 GPU (TensorRT, FP32), compared to 18.5 FPS in native PyTorch.

4. Empirical Evaluation and Benchmarking

Evaluation is conducted in a zero-shot setting. Key benchmark results, using an input size of $800 \times 1333$ (unless noted), include:

  • COCO detection: AP$_\text{all}$ = 45.0
  • LVIS-minival: AP$_\text{all}$ = 36.2
  • LVIS-val: AP$_\text{all}$ = 29.3

Inference speed benchmarks are reported as:

  • A100 GPU: 18.5 FPS (PyTorch) / 75.2 FPS (TensorRT)
  • NVIDIA Orin NX: 5.5 FPS

Comparative results with other leading edge-capable detectors:

| Model | Backbone | COCO AP | LVIS-minival AP | FPS (A100, PyTorch / TensorRT) | FPS (Orin NX) |
|---|---|---|---|---|---|
| OmDet-Turbo-T | Swin-T | 42.5 | 30.3 | 21.5 / 140.0 | – |
| YOLO-World v2-L | YOLOv8-L | – | 33.0 | 37.4 / – | – |
| Grounding DINO-T (base) | Swin-T | 48.4 | 27.4 | 9.4 / 42.6 | 1.1 |
| Grounding DINO 1.5 Edge (640) | EfficientViT-L1 | 42.9 | 33.5 | 21.7 / 111.6 | 10.7 |
| Grounding DINO 1.5 Edge (800) | EfficientViT-L1 | 45.0 | 36.2 | 18.5 / 75.2 | 5.5 |

(– indicates a value not reported.)

Grounding DINO 1.5 Edge demonstrates a superior speed-accuracy trade-off, achieving zero-shot AP competitive with larger, slower models (Ren et al., 16 May 2024).

5. Accuracy–Efficiency Trade-offs and Applicability

Grounding DINO 1.5 Edge was engineered to balance detection accuracy with real-time performance demands on moderate compute. Compared to Grounding DINO 1.5 Pro (54.3 AP on COCO, <1 FPS on A100), Edge achieves ~45 AP at 75 FPS under TensorRT, giving up roughly 9 points of AP for a ~75x increase in throughput. At 75 FPS, each frame takes about 13 ms, so the model suits scenarios where low latency (<20 ms) is a critical operational threshold and AP in the 35-45 range is sufficient, such as closed-loop robotics, mobile/embedded vision, AR/VR, and visual search.

6. Influence and Extensions

Grounding DINO 1.5 Edge serves as the architectural substrate for subsequent advances in real-time open-vocabulary detection. Notably, "Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection" extends the backbone and decoder via a conditional-computation mixture-of-experts (MoE) scheme. The MoE adaptation keeps inference FLOPs constant but achieves higher zero-shot accuracy: Dynamic-DINO (16 experts, top-2 gating) scores AP$_\text{box}$ = 43.7 on COCO-val (vs. 42.6 for Edge) and AP$_\text{all}$ = 33.6 on LVIS-minival (vs. 31.1 for Edge, under identical open data), demonstrating the adaptability and extensibility of the base Edge model (Lu et al., 23 Jul 2025).
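The conditional-computation idea is easy to sketch: a router picks each token's top-2 of 16 expert FFNs, so parameter count grows while per-token FLOPs stay near those of a single dense FFN. The sketch below is a generic top-k MoE layer with assumed sizes, not Dynamic-DINO's implementation.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Generic top-k mixture-of-experts FFN (illustrative sizes)."""
    def __init__(self, d_model=256, d_ff=1024, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                   # x: (tokens, d_model)
        gates = self.router(x).softmax(-1)  # (tokens, n_experts)
        w, idx = gates.topk(self.top_k, dim=-1)
        w = w / w.sum(-1, keepdim=True)     # renormalize top-k weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                m = idx[:, k] == e          # tokens routed to expert e
                if m.any():
                    out[m] += w[m, k, None] * expert(x[m])
        return out

moe = MoEFFN()
print(moe(torch.randn(400, 256)).shape)  # (400, 256)
```

With top-2 of 16 experts, each token touches only 1/8 of the expert parameters per forward pass, which is how capacity rises without increasing inference FLOPs.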

Grounding DINO 1.5 Edge’s design, combining the EfficientViT-L1 backbone, single-scale cross-attention fusion, and transformer-based detection head, exemplifies the "sweet-spot" for performant open-set detection on edge hardware, providing a scalable foundation for further research and application (Ren et al., 16 May 2024, Lu et al., 23 Jul 2025).
