Review of "OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion"
The paper introduces OV-DINO, a model for open-vocabulary object detection (OVD), a challenging task because models must detect objects from class names not seen during training. Existing approaches rely primarily on pre-training over diverse datasets and on pseudo-labeling to improve zero-shot detection, but they face two significant challenges: noise introduced by pseudo-labels, and the effective use of language-aware capabilities for cross-modality fusion and alignment.
Key Components and Innovations
- Unified Data Integration (UniDI) Pipeline: The paper proposes a Unified Data Integration pipeline that streamlines pre-training by converting varied data sources (detection, grounding, and image-text data) into a single detection-centric format, using the notions of a "Caption Box" and a "Unified Prompt". This eliminates the need to generate pseudo-labels for image-text data and strengthens the model's semantic understanding; a minimal sketch of the conversion appears after this list.
- Language-Aware Selective Fusion (LASF) Module: The authors introduce a Language-Aware Selective Fusion module that strengthens cross-modality alignment through a language-aware query selection and fusion process. LASF selects the object embeddings most related to the text and injects them into the decoder queries, improving modality alignment and, in turn, detection precision; a sketch of this mechanism also follows the list.
- Training Framework: OV-DINO is pre-trained end-to-end on the large-scale unified data in a detection-centric framework that retains the training objectives of established detectors such as DINO, with targeted modifications to support open-vocabulary detection.
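The following Python sketch illustrates one way such a detection-centric conversion could look. It is a minimal illustration, not the paper's actual pipeline: the record fields (`texts`, `boxes`, `labels`) and the `to_unified_record` helper are assumptions introduced here. The key idea it captures is the "Caption Box": an image-text pair becomes a detection sample whose single box spans the whole image and whose category name is the caption, so no pseudo-labels are needed.

```python
def to_unified_record(sample, source):
    """Convert one sample into a detection-centric record (illustrative schema)."""
    if source == "detection":
        # Detection data already has category names and per-instance boxes.
        return {
            "image": sample["image"],
            "texts": sample["category_names"],   # e.g. ["dog", "frisbee"]
            "boxes": sample["boxes"],            # [x1, y1, x2, y2] per instance
            "labels": sample["label_ids"],
        }
    if source == "grounding":
        # Grounded phrases play the role of category names.
        return {
            "image": sample["image"],
            "texts": sample["phrases"],
            "boxes": sample["boxes"],
            "labels": list(range(len(sample["phrases"]))),
        }
    if source == "image_text":
        # Caption Box: the whole image is the region the caption describes,
        # and the caption itself acts as the class name.
        h, w = sample["height"], sample["width"]
        return {
            "image": sample["image"],
            "texts": [sample["caption"]],
            "boxes": [[0, 0, w, h]],
            "labels": [0],
        }
    raise ValueError(f"unknown source: {source}")
```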
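Similarly, a minimal PyTorch sketch of a LASF-style module is shown below. The module name, dimensions, top-k value, and the gated cross-attention fusion are illustrative assumptions; the paper's exact architecture may differ. The sketch captures the two steps the review describes: scoring object embeddings by their similarity to the text, then fusing the selected, text-related embeddings into the queries.

```python
import torch
import torch.nn as nn

class LanguageAwareSelectiveFusion(nn.Module):
    """Sketch of a LASF-style block (hyperparameters here are assumptions)."""

    def __init__(self, dim=256, num_heads=8, top_k=100):
        super().__init__()
        self.top_k = top_k
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: start as identity

    def forward(self, object_embs, text_embs, queries):
        # object_embs: (B, N, D) image-side object embeddings
        # text_embs:   (B, T, D) text embeddings for the class prompts
        # queries:     (B, Q, D) decoder content queries
        # 1) Language-aware selection: score each object embedding by its
        #    maximum similarity to any text token, then keep the top-k.
        sim = torch.einsum("bnd,btd->bnt", object_embs, text_embs)
        scores = sim.max(dim=-1).values                      # (B, N)
        k = min(self.top_k, object_embs.size(1))
        idx = scores.topk(k, dim=1).indices                  # (B, k)
        selected = torch.gather(
            object_embs, 1,
            idx.unsqueeze(-1).expand(-1, -1, object_embs.size(-1)),
        )                                                    # (B, k, D)
        # 2) Selective fusion: inject the selected embeddings into the
        #    queries via gated cross-attention.
        fused, _ = self.cross_attn(queries, selected, selected)
        return queries + torch.tanh(self.gate) * fused
```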
Numerical and Empirical Results
The paper reports strong results on standard benchmarks. Evaluated zero-shot, OV-DINO achieves state-of-the-art performance, with an Average Precision (AP) of 50.6% on COCO and 40.1% on LVIS. After fine-tuning on COCO, it reaches 58.4% AP, outperforming existing methods that use comparable backbone architectures.
Implications and Future Directions
Practically, the paper points toward more efficient and effective training regimes for open-vocabulary tasks: OV-DINO shows that unifying data handling and explicitly strengthening cross-modality fusion yields significant performance gains. Theoretically, LASF extends work on cross-modal interaction and motivates further research into selective, dynamic fusion strategies that exploit language structure for visual tasks.
Looking forward, scaling OV-DINO to larger backbones and more extensive datasets could yield further gains in accuracy and robustness. However, the computational demands of pre-training remain a limitation, and balancing model capacity against practical deployment constraints is an open direction for optimization.
Overall, OV-DINO offers a comprehensive framework that addresses core challenges in open-vocabulary detection and represents a substantial step toward robust, general-purpose object detection across diverse applications.