Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
The paper "Learning to Prompt for Open-Vocabulary Object Detection with Vision-LLM" presents a novel approach to open-vocabulary object detection (OVOD), leveraging pre-trained vision-LLMs. This work specifically addresses the challenges associated with designing prompt representations, pivotal in transferring the capabilities of vision-LLMs to object detection tasks for unobserved classes.
The essence of this research lies in circumventing the manual design of prompts, which is cumbersome and requires substantial domain expertise. The authors introduce Detection Prompt (DetPro), a method that automates the learning of prompt representations for object detection. Unlike prior prompt-learning methods that largely target image classification, DetPro is tailored to object detection and introduces two main mechanisms: a background interpretation scheme and a context grading scheme.
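To make the general recipe concrete, the sketch below illustrates learnable continuous prompts in the spirit of CoOp-style prompt learning, on which approaches like DetPro build: a small set of trainable context vectors is prepended to the frozen token embeddings of each class name, and only those context vectors receive gradients. The module and tensor names here are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable prompt context shared across classes (illustrative sketch)."""

    def __init__(self, num_context: int, embed_dim: int, class_token_embeds: torch.Tensor):
        super().__init__()
        # Trainable context vectors, shared by every class.
        self.context = nn.Parameter(torch.randn(num_context, embed_dim) * 0.02)
        # Frozen token embeddings of the class names: (num_classes, name_len, embed_dim).
        self.register_buffer("class_token_embeds", class_token_embeds)

    def forward(self) -> torch.Tensor:
        num_classes = self.class_token_embeds.shape[0]
        # Prepend the shared context to each class-name embedding sequence.
        context = self.context.unsqueeze(0).expand(num_classes, -1, -1)
        return torch.cat([context, self.class_token_embeds], dim=1)
```

A frozen text encoder (CLIP in the paper) then maps each prompted token sequence to a class embedding; because the encoder and the class-name embeddings stay fixed, only the context vectors are updated during training.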
DetPro's background interpretation scheme handles negative proposals: in object detection, many proposals cover background regions that do not correspond to any class label. DetPro optimizes the prompt so that the embeddings of these background proposals stay dissimilar to every class embedding, making detection more robust to clutter. The context grading scheme, in turn, grades positive proposals by their intersection over union (IoU) with the ground-truth boxes. Grading lets prompt learning capture the varying amounts of contextual information in loosely versus tightly cropped proposals, reflecting a more nuanced view of object localization.
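The following sketch shows one plausible way to implement these two ideas. The uniform-target background loss and the specific IoU thresholds are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def background_loss(bg_region_embeds: torch.Tensor,
                    class_text_embeds: torch.Tensor,
                    temperature: float = 0.01) -> torch.Tensor:
    """Push background proposals away from every class embedding.

    Assumed formulation: cross-entropy against a uniform target, so that
    no single class dominates the softmax for a background region.
    """
    logits = bg_region_embeds @ class_text_embeds.t() / temperature  # (B, C)
    log_probs = F.log_softmax(logits, dim=-1)
    # A uniform target assigns probability 1 / C to every class.
    return -log_probs.mean(dim=-1).mean()


def grade_by_iou(ious: torch.Tensor,
                 thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)) -> torch.Tensor:
    """Assign each positive proposal to an IoU level (hypothetical thresholds).

    Each level can then be paired with its own prompt representation,
    reflecting how loosely vs. tightly a proposal crops the object.
    """
    bounds = torch.as_tensor(thresholds, dtype=ious.dtype, device=ious.device)
    return torch.bucketize(ious, bounds)
```

The sketch only shows the grouping step; in the paper, prompts learned at different IoU levels are combined into the final class representation.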
The researchers integrate DetPro with ViLD, a state-of-the-art OVOD framework. In experiments on the LVIS dataset, and in transfer evaluations on Pascal VOC, COCO, and Objects365, DetPro delivers consistent gains over the baseline. Notably, it improves box AP by +3.4 and mask AP by +3.0 on the novel classes of LVIS, underscoring its efficacy.
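For orientation, the sketch below shows how prompt-derived class embeddings typically plug into a ViLD-style open-vocabulary classification head: region embeddings are scored against the text embeddings by cosine similarity, with a separate learned background embedding appended as an extra class. The function name and temperature value are assumptions, not ViLD's or DetPro's actual interfaces.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_embeds: torch.Tensor,
                     class_text_embeds: torch.Tensor,
                     bg_embed: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Score region proposals against text-derived class embeddings (sketch).

    Cosine similarity between L2-normalized region and class embeddings,
    plus one learned background embedding treated as an additional class.
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    all_embeds = torch.cat([class_text_embeds, bg_embed.unsqueeze(0)], dim=0)
    all_embeds = F.normalize(all_embeds, dim=-1)
    logits = region_embeds @ all_embeds.t() / temperature
    return logits.softmax(dim=-1)  # (num_regions, num_classes + 1)
```

Because only the source of the class embeddings changes, swapping hand-crafted prompt embeddings for learned ones would leave a head like this untouched, which helps explain why DetPro can be combined with ViLD directly.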
Practically, this advancement means that object detectors can more effectively identify and classify objects outside their initial training vocabulary. Learning prompt representations automatically removes the reliance on labor-intensive prompt engineering and broadens the range of detection scenarios to which such models can be applied. Theoretically, DetPro enriches the discussion around embedding and aligning cross-modal (image and text) data, further underlining the alignment capabilities latent in pre-trained models.
The DetPro approach exemplifies a step forward in using vision-language models for broader detection tasks. Its implications are significant: applications that must detect objects unobserved during training, such as autonomous driving or wildlife monitoring, can leverage DetPro for improved performance. Future work may extend the methodology to additional modalities or refine the grading of prompt representations, potentially increasing the precision and recall of OVOD techniques.
In summary, this paper contributes a principled method for leveraging vision-language models in open-vocabulary object detection, bridging the traditionally separate domains of vision and language in AI research.