An Open and Comprehensive Pipeline for Unified Object Grounding and Detection (2401.02361v2)
Abstract: Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to widespread adoption as a mainstream architecture for various downstream applications. Despite this significance, the original Grounding-DINO model lacks comprehensive public technical details because its training code is unavailable. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It uses a broad collection of vision datasets for pre-training and a range of detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. Extensive experiments on these benchmarks show that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. Code and trained models are released to the research community at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.
Authors:
- Xiangyu Zhao
- Yicheng Chen
- Shilin Xu
- Xiangtai Li
- Xinjiang Wang
- Yining Li
- Haian Huang