An Open and Comprehensive Pipeline for Unified Object Grounding and Detection (2401.02361v2)

Published 4 Jan 2024 in cs.CV

Abstract: Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Codes and trained models are released at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.

Authors (7)
  1. Xiangyu Zhao
  2. Yicheng Chen
  3. Shilin Xu
  4. Xiangtai Li
  5. Xinjiang Wang
  6. Yining Li
  7. Haian Huang

Summary

  • The paper presents an open-source framework that unifies object detection, phrase grounding, and referring expression comprehension to advance multi-modal AI.
  • It employs diverse pre-training datasets together with modules such as language-guided query selection to enhance feature extraction and detection accuracy.
  • Experimental results on COCO and LVIS benchmarks show that even its Tiny configuration significantly outperforms baseline models.

Introduction to MM-Grounding-DINO

Grounding-DINO is widely recognized in the AI research community for open-set object detection: identifying objects in images and matching them to textual descriptions. This covers three sub-tasks: Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Despite the original model's impressive results, its training code is not publicly available, which makes it hard for researchers to reproduce or extend the model. To close this gap, MM-Grounding-DINO was developed as an open-source alternative built on the MMDetection toolbox. It improves upon Grounding-DINO with a wider array of pre-training datasets and fine-tuning strategies, further raising performance across various benchmarks.
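To make the distinction between the three sub-tasks concrete, they can be thought of as the same image paired with different kinds of text input. The examples below are purely illustrative (hypothetical file names and phrasing, not the model's actual input API):

```python
# Illustrative only: how the three sub-tasks differ in their text inputs.
# These dicts are hypothetical examples, not MM-Grounding-DINO's real API.

# Open-Vocabulary Detection (OVD): detect every instance of each named category.
ovd_input = {"image": "street.jpg", "text": "person . car . traffic light ."}

# Phrase Grounding (PG): localize each noun phrase within a full caption.
pg_input = {"image": "street.jpg",
            "text": "a man in a red jacket crossing the street"}

# Referring Expression Comprehension (REC): return the single region
# that a free-form expression refers to.
rec_input = {"image": "street.jpg",
             "text": "the man closest to the camera"}
```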

Enhancements and Evaluation

MM-Grounding-DINO inherits the architecture of Grounding-DINO, with modified initialization to accommodate a more diverse training regimen. A text backbone and an image backbone extract features that are then fused by a feature enhancer module. A language-guided query selection module and a cross-modality decoder then anchor textual descriptions to the corresponding visual elements in the image.
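A minimal sketch of this data flow in PyTorch-style pseudocode. The class and module names are placeholders rather than MMDetection's actual APIs; the query-selection step follows the mechanism described for Grounding-DINO (score image tokens against text tokens, keep the top-k as decoder queries):

```python
import torch
import torch.nn as nn

class GroundingPipelineSketch(nn.Module):
    """Illustrative data flow for an MM-Grounding-DINO-style model.

    All sub-modules are placeholders supplied by the caller; none of
    these names correspond to actual MMDetection classes.
    """

    def __init__(self, img_backbone, txt_backbone, enhancer, decoder,
                 num_queries=900):
        super().__init__()
        self.img_backbone = img_backbone   # e.g. a Swin Transformer
        self.txt_backbone = txt_backbone   # e.g. BERT
        self.enhancer = enhancer           # cross-modal feature enhancer
        self.decoder = decoder             # cross-modality decoder
        self.num_queries = num_queries

    def forward(self, images, token_ids):
        img_feats = self.img_backbone(images)     # (B, N_img, C) flattened visual features
        txt_feats = self.txt_backbone(token_ids)  # (B, N_txt, C) per-token text features
        img_feats, txt_feats = self.enhancer(img_feats, txt_feats)

        # Language-guided query selection: score each image token by its
        # best-matching text token and keep the top-k as decoder queries.
        sim = img_feats @ txt_feats.transpose(-1, -2)   # (B, N_img, N_txt)
        scores = sim.max(dim=-1).values                 # (B, N_img)
        topk = scores.topk(self.num_queries, dim=1).indices
        queries = torch.gather(
            img_feats, 1,
            topk.unsqueeze(-1).expand(-1, -1, img_feats.size(-1)))

        # The decoder refines queries against both modalities and predicts
        # boxes plus per-token alignment scores for classification.
        return self.decoder(queries, img_feats, txt_feats)
```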

Comprehensive experiments demonstrate that MM-Grounding-DINO, even in its Tiny configuration, outperforms the Grounding-DINO-Tiny baseline across multiple evaluation sets, including the challenging COCO and LVIS benchmarks. Fine-grained analysis shows that including additional datasets in the pre-training phase, such as GRIT and V3Det, contributes significantly to the performance uplift.

Datasets and Training

Effective open-set detection depends on diverse training data, and MM-Grounding-DINO addresses this directly. Fifteen distinct datasets are prepared and categorized by annotation type and task (OVD, PG, and REC), then integrated into a unified format that simplifies training across heterogeneous sources (a sketch of this conversion follows below). The training procedure further incorporates data augmentation strategies and input rules for text descriptions that improve the model's ability to discern and localize objects described in varied textual forms.
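One common way such unification works, following the convention Grounding-DINO-style models use for detection data, is to concatenate category names into a caption and supervise each box against its category's span in that caption. The sketch below assumes that convention; the separator, label shuffling, and span bookkeeping are illustrative assumptions, not the paper's exact recipe:

```python
import random

def detection_to_grounding(class_names, gt_labels, shuffle=True, sep=" . "):
    """Hypothetical converter from plain detection annotations to a
    grounding-style sample: a caption built from category names, plus the
    character span each ground-truth box should align with."""
    names = list(class_names)
    if shuffle:
        random.shuffle(names)  # vary label order across training iterations
    caption, spans = "", {}
    for name in names:
        start = len(caption)
        caption += name
        spans[name] = (start, len(caption))  # character span of this phrase
        caption += sep
    # Each box is supervised to align with its category's span in the caption.
    targets = [spans[class_names[label]] for label in gt_labels]
    return caption.rstrip(), targets

caption, targets = detection_to_grounding(["person", "car", "dog"], gt_labels=[0, 2])
print(caption)  # e.g. "dog . person . car ." (order varies when shuffled)
print(targets)  # character spans for the "person" and "dog" boxes
```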

Conclusion and Future Work

MM-Grounding-DINO marks a significant step forward for multi-modal AI: an open-source framework that matches and surpasses a well-established baseline on object detection tasks. By extending the benchmarks available for OVD, PG, and REC, the authors set a new standard for system evaluation. The model's strong performance and open-source availability invite further research, and its streamlined approach and broad benchmark suite should catalyze future innovations in integrating vision and language modalities.
