Overview of "PromptDet: Towards Open-vocabulary Detection using Uncurated Images"
The paper "PromptDet: Towards Open-vocabulary Detection using Uncurated Images" presents a method for expanding an existing object detector to novel (unseen) categories with zero manual annotation of those categories. The authors propose PromptDet, a scalable pipeline that combines a two-stage open-vocabulary object detector, regional prompt learning, and self-training on uncurated web images.
Key Contributions
- Two-stage Open-vocabulary Object Detection: The methodology adopts a two-stage framework in which class-agnostic object proposals are classified using the text encoder of a pre-trained vision-language model such as CLIP. This approach builds on advances in vision-language pre-training and widens the detector's vocabulary without manual labeling effort (a minimal sketch of this classifier follows the list).
- Regional Prompt Learning: To better align regional visual features with the embeddings of the pre-trained text encoder, the authors introduce regional prompt learning. By optimizing a small set of learnable prompt vectors while keeping the text encoder frozen, the textual embedding space is steered toward object-centric visual features (see the second sketch after this list).
- Self-training with Uncurated Images: The paper also scales learning to a wider array of objects through a self-training framework that trains the detector on a large corpus of uncurated web images, exploiting noisy images that carry no ground-truth boxes (a pseudo-labeling sketch appears after this list).
- Evaluation and Performance: In extensive experiments on standard benchmarks such as LVIS and MS-COCO, PromptDet outperforms existing methods while using substantially fewer additional training images and no manual annotation of the novel categories.
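The second-stage classifier can be illustrated with a short PyTorch sketch. It assumes that visual embeddings for class-agnostic proposals and per-category text embeddings from a CLIP-style encoder are already computed; the function name and temperature value are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def open_vocab_logits(region_feats, class_text_embeds, temperature=0.01):
    """Classify class-agnostic region features against text embeddings.

    region_feats:      (N, D) visual embeddings of RPN proposals
    class_text_embeds: (C, D) text-encoder embeddings, one per category name
    """
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_embeds = F.normalize(class_text_embeds, dim=-1)
    # Cosine similarity scaled by a temperature, as in CLIP-style classifiers;
    # novel categories are added by appending rows to class_text_embeds.
    return region_feats @ class_text_embeds.t() / temperature
```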
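Regional prompt learning can be sketched in the same spirit. The module below assumes CoOp-style continuous prompts that are prepended to each category's token embeddings before a frozen text encoder; the number of prompt vectors, the embedding size, and the prepend-only placement are assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn as nn

class RegionalPrompts(nn.Module):
    """Learnable prompt vectors shared across all categories.

    The text encoder stays frozen; only `self.prompts` receives gradients,
    steering the textual embedding space toward object-centric features.
    """
    def __init__(self, num_prompts=4, embed_dim=512):
        super().__init__()
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, embed_dim))

    def forward(self, class_token_embeds):
        # class_token_embeds: (C, L, D) token embeddings of C category names
        num_classes = class_token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(num_classes, -1, -1)
        # (C, P + L, D): prompts prepended to each category's tokens,
        # ready to be passed through the frozen text encoder.
        return torch.cat([prompts, class_token_embeds], dim=1)
```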
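Self-training on uncurated images hinges on generating pseudo ground-truth boxes. The sketch below uses one simple heuristic, keeping the single most confident detection for the category an image was sourced for; the detector interface and the threshold are hypothetical, and the paper's exact pseudo-labeling rule may differ.

```python
def pseudo_label(detector, image, category_id, score_thresh=0.5):
    """Produce a pseudo box for an uncurated web image with no annotations.

    `detector` is assumed to return parallel lists of boxes, scores, and
    category labels; returns None when nothing confident is found.
    """
    boxes, scores, labels = detector(image)  # hypothetical interface
    candidates = [i for i, label in enumerate(labels) if label == category_id]
    if not candidates:
        return None
    best = max(candidates, key=lambda i: scores[i])
    return boxes[best] if scores[best] >= score_thresh else None
```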
Methodological Insights
- Open-vocabulary Classification: By employing the text encoder as a classifier generator (as sketched above), the detector recognizes categories it was never explicitly trained on. This circumvents the traditional limitation of operating over a closed set of categories.
- Iterative Image Sourcing and Self-training: A central component of the framework is an iterative loop that alternates regional prompt learning with sourcing external images. Each iteration refines the prompt vectors, which in turn retrieve progressively more relevant, object-centric candidate images from large web datasets such as LAION-400M (a retrieval sketch follows this list).
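The image-sourcing step can be approximated as nearest-neighbor retrieval in CLIP embedding space. The sketch below assumes pre-computed, L2-normalized image embeddings (such as those distributed alongside LAION-400M) and a prompt-conditioned text embedding for the target category; the function name and top-k value are illustrative.

```python
import numpy as np

def source_candidate_images(image_embeds, category_embed, top_k=300):
    """Rank web-image CLIP embeddings against one category embedding.

    image_embeds:   (M, D) L2-normalized CLIP image embeddings
    category_embed: (D,)   prompt-conditioned text embedding
    Returns indices of the top-k most similar images in the pool.
    """
    sims = image_embeds @ (category_embed / np.linalg.norm(category_embed))
    return np.argsort(-sims)[:top_k]
```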
Implications and Future Directions
The research offers a pathway towards removing the dependency on exhaustive manual annotation, which has been a bottleneck for scaling object detectors to operate on extensive vocabularies. By leveraging the inherent relationships between language and visual content, as captured by models like CLIP, the framework introduces a paradigm where object categories can be expanded effortlessly.
The results suggest that indirect supervision via text-based prompts, when aligned effectively with visual representations, can substantially advance object detection. Looking forward, these ideas may be refined and integrated with future architectures and training regimes, potentially incorporating context-aware learning or generative methods to enrich the training data. Deployment scenarios such as autonomous driving or augmented reality could also benefit from these innovations.
As the field advances, the principles introduced in this paper regarding how visual-language pre-training can mitigate traditional dataset constraints provide a solid foundation for future exploratory research in open-vocabulary detection and beyond.