An Analytical Overview of DetCLIPv3: Advancements in Open-Vocabulary Object Detection and Generative Capabilities
The paper, "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection," introduces DetCLIPv3, a progressive open-vocabulary object detection (OVD) model with an integrated capability for generating hierarchical object labels. This framework constitutes a substantial leap over existing OVD techniques by alleviating dependency on predefined object categories, thereby enhancing usability across diverse application scenarios.
Core Contributions of DetCLIPv3
DetCLIPv3 is distinguished by three pivotal innovations: its versatile model architecture, a rich data pipeline, and an efficient multi-stage training strategy.
- Model Architecture: DetCLIPv3 is built on a robust open-vocabulary detection framework augmented with a generative object captioner. This design enables the model to capture a broad range of visual concepts, both recognizing objects by category name and generating detailed hierarchical labels for detected objects. The modular design makes it natural to pair the detection objective with a language-modeling training objective, allowing not just precise object localization but also coherent generation of hierarchical descriptions, mirroring human-like multi-level recognition (a minimal architectural sketch follows this list).
- High Information Density Data: The authors developed an auto-annotation pipeline that leverages visual large language models (VLLMs) to refine object captions from large-scale image-text pairs. The resulting dataset, GranuCap50M, enriches training with multi-granularity object labels, reinforcing both the detection and generative capabilities of DetCLIPv3 while mitigating common shortcomings of web-scale data, such as image-text misalignment and partial annotation (see the annotation sketch after this list).
- Efficient Training Strategy: To manage the cost of high-resolution input training, DetCLIPv3 employs a staged pre-training and fine-tuning strategy. Initial training on low-resolution inputs enables broad learning of visual concepts at low cost, and subsequent high-resolution fine-tuning sharpens detection performance. This schedule curtails training expense while boosting model performance (a training-schedule sketch appears after this list).
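To make the architecture description concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: the convolutional backbone, query decoder, GRU-based captioning head, and all dimensions are toy stand-ins, chosen only to show how a detection branch and a generative captioning head can share per-object features under a combined alignment and language-modeling objective.

```python
# Toy sketch (not DetCLIPv3's actual code): an open-vocabulary detector whose
# object features feed both a text-alignment classifier and a captioning head.
import torch
import torch.nn as nn

class OVDetectorWithCaptioner(nn.Module):
    def __init__(self, embed_dim=256, vocab_size=30522, num_queries=100):
        super().__init__()
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # stand-in image encoder
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))    # learnable object queries
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.bbox_head = nn.Linear(embed_dim, 4)                         # box regression
        # Placeholder captioning head: a real captioner would decode a full
        # token sequence autoregressively per object.
        self.captioner = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.lm_head = nn.Linear(embed_dim, vocab_size)                  # language-modeling logits

    def forward(self, images, text_embeds):
        # images: (B, 3, H, W); text_embeds: (num_classes, embed_dim) from a text encoder
        feats = self.backbone(images).flatten(2).transpose(1, 2)         # (B, N, D)
        q = self.queries.unsqueeze(0).repeat(images.size(0), 1, 1)       # (B, Q, D)
        obj = self.decoder(q, feats)                                     # per-object features
        boxes = self.bbox_head(obj).sigmoid()                            # (B, Q, 4)
        cls_logits = obj @ text_embeds.t()      # alignment with category-name embeddings
        cap_hidden, _ = self.captioner(obj)     # caption features conditioned on object features
        cap_logits = self.lm_head(cap_hidden)   # (B, Q, vocab) token logits
        return boxes, cls_logits, cap_logits

model = OVDetectorWithCaptioner()
images = torch.randn(2, 3, 224, 224)
text_embeds = torch.randn(80, 256)   # e.g., embeddings of 80 category names
boxes, cls_logits, cap_logits = model(images, text_embeds)
print(boxes.shape, cls_logits.shape, cap_logits.shape)
```

Because classification logits are computed against embeddings of arbitrary category names rather than a fixed classifier, the vocabulary can be changed at inference time, which is what makes the detector open-vocabulary.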
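The data pipeline can likewise be sketched at a high level. The snippet below is a hedged illustration of the general recipe (caption filtering, VLLM querying, parsing into multi-granularity labels), not the paper's actual pipeline; `query_vllm`, the prompt, and the filtering threshold are hypothetical stand-ins.

```python
# Hedged sketch of a VLLM-driven auto-annotation loop; all helpers and thresholds
# are illustrative assumptions, not the procedure used to build GranuCap50M.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HierarchicalLabel:
    category: str      # fine-grained name, e.g. "golden retriever"
    parent: str        # coarser concept, e.g. "dog" or "animal"
    description: str   # short free-form description of the object

def annotate_dataset(
    samples: List[dict],                    # each: {"image_path": ..., "caption": ...}
    query_vllm: Callable[[str, str], str],  # hypothetical: (image_path, prompt) -> text
    min_caption_words: int = 3,
) -> List[dict]:
    """Refine noisy web captions into multi-granularity object labels."""
    prompt = (
        "List every visible object as 'category | parent category | short description', "
        "one object per line."
    )
    annotated = []
    for sample in samples:
        # Skip captions too short to describe the image reliably.
        if len(sample["caption"].split()) < min_caption_words:
            continue
        raw = query_vllm(sample["image_path"], prompt)
        labels = []
        for line in raw.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:            # keep only well-formed responses
                labels.append(HierarchicalLabel(*parts))
        if labels:                         # drop images with no usable labels
            annotated.append({**sample, "labels": labels})
    return annotated
```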
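Finally, the staged training schedule can be summarized in a few lines. The resolutions, epoch counts, and learning rates below are placeholder assumptions, and `build_loader` / `train_one_epoch` are hypothetical helpers; the point is only the shape of the schedule: cheap low-resolution pre-training for concept coverage, followed by brief high-resolution fine-tuning for localization quality.

```python
# Illustrative two-stage schedule; hyperparameters are assumptions, not the paper's.
import torch

def run_training(model, dataset, build_loader, train_one_epoch):
    # Stage 1: low-resolution pre-training on large-scale image-text data.
    # Low per-step cost lets the model see a broad range of visual concepts.
    low_res_loader = build_loader(dataset, resolution=320, batch_size=64)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    for _ in range(12):
        train_one_epoch(model, low_res_loader, optimizer)

    # Stage 2: high-resolution fine-tuning on a smaller, well-annotated subset.
    # Higher resolution sharpens localization; a lower learning rate preserves
    # the concepts learned in stage 1.
    high_res_loader = build_loader(dataset, resolution=800, batch_size=16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)
    for _ in range(2):
        train_one_epoch(model, high_res_loader, optimizer)
```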
Performance Evaluation
DetCLIPv3 was rigorously evaluated across various benchmarks, showcasing its enhanced capabilities:
- Zero-Shot OVD Performance: On the LVIS minival benchmark, DetCLIPv3 with a Swin-T backbone achieved an impressive 47.0 zero-shot fixed AP, a significant improvement over competitors such as GLIPv2, DetCLIPv2, and Grounding DINO. The gain underscores the model's ability to recognize rare and novel categories through its enriched visual concept learning (a brief zero-shot inference sketch follows this list).
- Generative Object Detection and Dense Captioning: DetCLIPv3's object captioner demonstrated formidable generative abilities, achieving a 19.7 AP in the dense captioning task on the VG dataset. This highlights the model's potential in generating comprehensive object descriptions across a broader concept spectrum.
- Robustness and Transferability: The model's performance on COCO-O revealed its robust domain generalization, achieving a notable effective robustness gain. Furthermore, DetCLIPv3 showed superior transferability when fine-tuned on varied datasets, consistently outperforming other state-of-the-art methods.
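One reason zero-shot transfer to benchmarks like LVIS is possible at all is that detection scores come from similarity against text embeddings rather than a fixed classifier head. The fragment below illustrates that idea using the hypothetical model from the earlier architecture sketch and an assumed `embed_category_names` helper; it is not the paper's evaluation code.

```python
# Swapping the vocabulary at inference: embed the target benchmark's category names
# (e.g., the LVIS classes) and score object features against them. Hypothetical helpers.
def zero_shot_detect(model, images, category_names, embed_category_names):
    text_embeds = embed_category_names(category_names)    # (num_classes, embed_dim)
    boxes, cls_logits, _ = model(images, text_embeds)
    scores, labels = cls_logits.sigmoid().max(dim=-1)     # best category per query
    return boxes, scores, labels
```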
Implications and Future Directions
The implications of DetCLIPv3 extend across both practical applications and theoretical exploration. By reducing the reliance on predefined categories, the model is poised to adapt to diverse visual environments, offering solutions in areas where traditional OVD systems might falter. The generative component can also provide detailed object context, aiding applications such as autonomous systems or digital content creation where nuanced object interaction is crucial.
Future research may focus on further refining the generative capabilities. This includes developing metrics for evaluating generative results and integrating LLMs to enable instruction-controlled detection, potentially leading to fully autonomous system interaction. Moreover, improving the auto-annotation pipeline's scalability and efficiency could extend the model's utility to larger datasets and more varied domains.
In conclusion, DetCLIPv3 sets a new benchmark in open-vocabulary object detection with its dual focus on detection and generation, leveraging innovative methodologies and dataset curation strategies. Its holistic approach promises considerable impact across both research and industry applications, reinforcing the trajectory towards more versatile and intelligent object detection systems.