An Open and Comprehensive Pipeline for Unified Object Grounding and Detection (2401.02361v2)

Published 4 Jan 2024 in cs.CV

Abstract: Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Codes and trained models are released at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.

Authors (7)
  1. Xiangyu Zhao
  2. Yicheng Chen
  3. Shilin Xu
  4. Xiangtai Li
  5. Xinjiang Wang
  6. Yining Li
  7. Haian Huang

Summary

  • The paper presents an open-source framework that unifies object detection, phrase grounding, and referring expression comprehension to advance multi-modal AI.
  • It employs diverse pre-training datasets together with modules such as language-guided query selection to enhance feature extraction and detection accuracy.
  • Experimental results on COCO and LVIS benchmarks show that even its Tiny configuration significantly outperforms baseline models.

Introduction to MM-Grounding-DINO

Grounding-DINO is widely recognized in the AI research community for open-set object detection: identifying objects in images and matching them to textual descriptions. This covers three sub-tasks: Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Despite the original model's impressive results, its training code is not publicly available, which makes it hard for researchers to reproduce or extend the model. To close this gap, MM-Grounding-DINO was developed as an open-source alternative built on the MMDetection toolbox. It improves upon Grounding-DINO with a wider array of pre-training datasets and fine-tuning strategies, further raising performance across various benchmarks.
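To make the distinction between the three sub-tasks concrete, they can be thought of as the same image paired with different kinds of text input. The examples below are purely illustrative (hypothetical file names and phrasing, not the model's actual input API):

```python
# Illustrative only: how the three sub-tasks differ in their text inputs.
# These dicts are hypothetical examples, not MM-Grounding-DINO's real API.

# Open-Vocabulary Detection (OVD): detect every instance of each named category.
ovd_input = {"image": "street.jpg", "text": "person . car . traffic light ."}

# Phrase Grounding (PG): localize each noun phrase within a full caption.
pg_input = {"image": "street.jpg",
            "text": "a man in a red jacket crossing the street"}

# Referring Expression Comprehension (REC): return the single region
# that a free-form expression refers to.
rec_input = {"image": "street.jpg",
             "text": "the man closest to the camera"}
```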

Enhancements and Evaluation

MM-Grounding-DINO inherits the architecture of Grounding-DINO, with modified initialization to accommodate a more diverse training regimen. A text backbone and an image backbone extract features that are then fused by a feature enhancer module. A language-guided query selection module and a cross-modality decoder then anchor textual descriptions to the corresponding visual elements in the image.
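A minimal sketch of this data flow in PyTorch-style pseudocode. The class and module names are placeholders rather than MMDetection's actual APIs; the query-selection step follows the mechanism described for Grounding-DINO (score image tokens against text tokens, keep the top-k as decoder queries):

```python
import torch
import torch.nn as nn

class GroundingPipelineSketch(nn.Module):
    """Illustrative data flow for an MM-Grounding-DINO-style model.

    All sub-modules are placeholders supplied by the caller; none of
    these names correspond to actual MMDetection classes.
    """

    def __init__(self, img_backbone, txt_backbone, enhancer, decoder,
                 num_queries=900):
        super().__init__()
        self.img_backbone = img_backbone   # e.g. a Swin Transformer
        self.txt_backbone = txt_backbone   # e.g. BERT
        self.enhancer = enhancer           # cross-modal feature enhancer
        self.decoder = decoder             # cross-modality decoder
        self.num_queries = num_queries

    def forward(self, images, token_ids):
        img_feats = self.img_backbone(images)     # (B, N_img, C) flattened visual features
        txt_feats = self.txt_backbone(token_ids)  # (B, N_txt, C) per-token text features
        img_feats, txt_feats = self.enhancer(img_feats, txt_feats)

        # Language-guided query selection: score each image token by its
        # best-matching text token and keep the top-k as decoder queries.
        sim = img_feats @ txt_feats.transpose(-1, -2)   # (B, N_img, N_txt)
        scores = sim.max(dim=-1).values                 # (B, N_img)
        topk = scores.topk(self.num_queries, dim=1).indices
        queries = torch.gather(
            img_feats, 1,
            topk.unsqueeze(-1).expand(-1, -1, img_feats.size(-1)))

        # The decoder refines queries against both modalities and predicts
        # boxes plus per-token alignment scores for classification.
        return self.decoder(queries, img_feats, txt_feats)
```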

Comprehensive experiments demonstrate that MM-Grounding-DINO, even in its Tiny configuration, outperforms the Grounding-DINO-Tiny baseline across multiple evaluation sets, including the challenging COCO and LVIS benchmarks. Fine-grained analysis shows that including additional datasets in the pre-training phase, such as GRIT and V3Det, contributes significantly to the performance uplift.

Datasets and Training

Effective open-set detection depends on diverse training data, and MM-Grounding-DINO addresses this directly. Fifteen distinct datasets are prepared and categorized by annotation type and task (OVD, PG, and REC), then integrated into a unified format that simplifies training across heterogeneous sources (a sketch of this conversion follows below). The training procedure further incorporates data augmentation strategies and input rules for text descriptions that improve the model's ability to discern and localize objects described in varied textual forms.
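One common way such unification works, following the convention Grounding-DINO-style models use for detection data, is to concatenate category names into a caption and supervise each box against its category's span in that caption. The sketch below assumes that convention; the separator, label shuffling, and span bookkeeping are illustrative assumptions, not the paper's exact recipe:

```python
import random

def detection_to_grounding(class_names, gt_labels, shuffle=True, sep=" . "):
    """Hypothetical converter from plain detection annotations to a
    grounding-style sample: a caption built from category names, plus the
    character span each ground-truth box should align with."""
    names = list(class_names)
    if shuffle:
        random.shuffle(names)  # vary label order across training iterations
    caption, spans = "", {}
    for name in names:
        start = len(caption)
        caption += name
        spans[name] = (start, len(caption))  # character span of this phrase
        caption += sep
    # Each box is supervised to align with its category's span in the caption.
    targets = [spans[class_names[label]] for label in gt_labels]
    return caption.rstrip(), targets

caption, targets = detection_to_grounding(["person", "car", "dog"], gt_labels=[0, 2])
print(caption)  # e.g. "dog . person . car ." (order varies when shuffled)
print(targets)  # character spans for the "person" and "dog" boxes
```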

Conclusion and Future Work

MM-Grounding-DINO marks a significant step forward for multi-modal AI: an open-source framework that matches and surpasses a well-established baseline on object detection tasks. By extending the benchmarks available for OVD, PG, and REC, the authors set a new standard for system evaluation. The model's strong performance and open-source availability invite further research, and its streamlined approach and broad benchmark suite should catalyze future innovations in integrating vision and language modalities.
