Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (2405.10300v2)

Published 16 May 2024 in cs.CV

Abstract: This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API

Authors (16)
  1. Tianhe Ren
  2. Qing Jiang
  3. Shilong Liu
  4. Zhaoyang Zeng
  5. Wenlong Liu
  6. Han Gao
  7. Hongjie Huang
  8. Zhengyu Ma
  9. Xiaoke Jiang
  10. Yihao Chen
  11. Yuda Xiong
  12. Hao Zhang
  13. Feng Li
  14. Peijun Tang
  15. Kent Yu
  16. Lei Zhang

Summary

  • The paper demonstrates that Grounding DINO 1.5 Pro sets new accuracy records (54.3 AP on COCO and 55.7 AP on LVIS-minival) through a larger vision backbone and deep early fusion.
  • The paper shows that Grounding DINO 1.5 Edge achieves real-time detection at 75.2 FPS while maintaining robust zero-shot performance on edge devices.
  • The paper emphasizes scalability and generalization by training on over 20 million images and optimizing deployment for diverse and unpredictable environments.

Grounding DINO 1.5: Advancements in Open-Set Object Detection

What is Grounding DINO 1.5?

Grounding DINO 1.5 is a suite of open-set object detection models developed by IDEA Research. It includes two models: Grounding DINO 1.5 Pro and Grounding DINO 1.5 Edge. The Pro model focuses on maximizing detection performance, while the Edge model is optimized for fast inference, making it suitable for real-time applications on edge devices.
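
Grounding DINO 1.5 itself is served through the authors' API (linked in the abstract) rather than as open weights, so the snippet below instead uses the open-source Grounding DINO (1.0) checkpoint available in Hugging Face Transformers to illustrate the same text-prompt convention: category names are written as a period-separated phrase, and the model returns boxes for whatever the prompt names. The model id, thresholds, and image here are incidental choices, not the paper's setup.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Open-source predecessor checkpoint; Grounding DINO 1.5 itself is API-only.
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Open-set prompt: the categories to detect, separated by periods.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],  # (height, width)
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(label, round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```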

Notable Results

  • Grounding DINO 1.5 Pro
    • Achieved a 54.3 AP (Average Precision; computed as in the evaluation sketch after this list) on the COCO zero-shot transfer benchmark.
    • Set new records with a 55.7 AP on the LVIS-minival zero-shot benchmark.
  • Grounding DINO 1.5 Edge
    • When optimized with TensorRT, it reaches a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark.
    • Optimized for edge devices, demonstrating its utility in real-time applications.
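
The AP numbers above are standard COCO-style Average Precision, averaged over IoU thresholds from 0.50 to 0.95. As a reference point, here is a minimal sketch of computing that score with pycocotools, assuming the model's detections have already been exported to COCO result-JSON format (both file names are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model detections in COCO JSON format
# (placeholder file names).
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line printed is AP @ IoU=0.50:0.95
```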

Key Improvements

Grounding DINO 1.5 Pro

  • Larger Vision Backbone: Uses the pre-trained ViT-L architecture for enhanced model performance.
  • Deep Early Fusion: Integrates language and image features through cross-attention before the decoding phase, yielding higher detection recall and more precise bounding boxes (see the sketch after this list).
  • Larger Dataset: Trained on over 20 million images with grounding annotations, significantly enriching semantic understanding.
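
The paper does not publish the fusion code, but the idea behind deep early fusion can be sketched as a stack of bidirectional cross-attention layers that let image tokens and text tokens update each other before the DETR-style decoder runs. Below is a simplified PyTorch sketch under those assumptions; the dimensions, layer count, and the omission of feed-forward and normalization blocks are illustrative simplifications, not the actual architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionLayer(nn.Module):
    """One bidirectional image-text fusion layer (simplified sketch).

    Real implementations (GLIP-style fusion in Grounding DINO) add
    feed-forward blocks, layer norms, and deformable image attention.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens attend to text: vision features become language-aware.
        img_upd, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        # Text tokens attend to image: phrase features become vision-aware.
        txt_upd, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        return img_tokens + img_upd, txt_tokens + txt_upd

# Stacking several such layers *before* the decoder is what makes the
# fusion both "deep" and "early".
fusion = nn.ModuleList(EarlyFusionLayer() for _ in range(6))
img = torch.randn(1, 1024, 256)  # flattened multi-scale image tokens
txt = torch.randn(1, 16, 256)    # text-prompt token embeddings
for layer in fusion:
    img, txt = layer(img, txt)
```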

Grounding DINO 1.5 Edge

  • Efficient Feature Enhancer: Limits cross-scale fusion to high-level image features, reducing computational demands while maintaining robust detection capabilities (sketched after this list).
  • Optimized for Edge Deployment: Uses EfficientViT-L1 as the backbone and achieves over 10 FPS inference speed on an NVIDIA Orin NX platform.
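
The efficiency idea can be illustrated as follows: rather than running transformer attention over every level of the feature pyramid, the enhancer applies attention only to the smallest, highest-level map and propagates the result to the larger maps with cheap convolutions. This is a simplified sketch of that idea, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientFeatureEnhancer(nn.Module):
    """Sketch: attention only on the highest-level (smallest) feature
    map; higher-resolution maps get cheap convolutional fusion."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.fuse = nn.Conv2d(dim * 2, dim, kernel_size=1)

    def forward(self, feats):
        # feats: pyramid from high to low resolution,
        # e.g. [B,C,80,80], [B,C,40,40], [B,C,20,20]
        top = feats[-1]
        b, c, h, w = top.shape
        # Attention cost is quadratic in h*w, so it stays cheap here.
        tokens = top.flatten(2).transpose(1, 2)            # B, h*w, C
        top = self.attn(tokens).transpose(1, 2).reshape(b, c, h, w)
        out = [top]
        # Propagate the enhanced context back up the pyramid.
        for f in reversed(feats[:-1]):
            up = F.interpolate(out[0], size=f.shape[-2:], mode="nearest")
            out.insert(0, self.fuse(torch.cat([f, up], dim=1)))
        return out

enhancer = EfficientFeatureEnhancer()
pyramid = [torch.randn(1, 256, s, s) for s in (80, 40, 20)]
enhanced = enhancer(pyramid)  # same shapes as the input pyramid
```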

Practical Implications

  1. Detection Accuracy: The Grounding DINO 1.5 Pro model sets new benchmarks in detection accuracy, making it valuable for applications requiring high precision.
  2. Real-time Applicability: The Grounding DINO 1.5 Edge model’s ability to maintain high detection performance while running efficiently on edge devices has significant implications for autonomous systems, medical imaging, and more (a minimal FPS-measurement sketch follows this list).
  3. Scalability: Training on a large, diverse dataset means these models can generalize better to unseen categories, a critical factor for applications operating in varied and unpredictable environments.
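
Frames-per-second figures such as the 75.2 FPS above come from TensorRT-optimized engines; the eager-mode sketch below only illustrates the usual measurement protocol (warm-up iterations, device synchronization, averaging over many runs). The stand-in model and input size are placeholders:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, example_input, warmup=10, iters=100):
    """Rough FPS measurement; optimized deployments (TensorRT, ONNX
    Runtime) would compile the model first instead of running eagerly."""
    model.eval()
    for _ in range(warmup):        # warm-up: allocator / cuDNN autotuning
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for queued GPU kernels
    start = time.perf_counter()
    for _ in range(iters):
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Stand-in model and input (placeholders for a real detector):
model = torch.nn.Conv2d(3, 16, kernel_size=3)
print(f"{measure_fps(model, torch.randn(1, 3, 640, 640)):.1f} FPS")
```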

Future Directions

  1. Enhanced Edge Deployment: Further research could focus on optimizing the models for even faster inference speeds and less resource-intensive deployments.
  2. Broader Dataset Integration: Incorporating even more diverse datasets could further improve the models' ability to generalize across different scenarios.
  3. Reducing Hallucinations: Continuing to refine early fusion techniques and sampling strategies could address the challenges of model hallucinations, enhancing the robustness and reliability of detections.

Conclusion

Grounding DINO 1.5 marks a significant step forward in the field of open-set object detection, with notable achievements both in terms of detection accuracy and real-time application feasibility. As AI continues to evolve, these advancements pave the way for more sophisticated and reliable computer vision systems that can operate seamlessly across a wide range of applications and environments.