Scaling Open-Vocabulary Object Detection: A Summary
The paper "Scaling Open-Vocabulary Object Detection" by Matthias Minderer et al. explores advancements in open-vocabulary object detection by leveraging large-scale self-training techniques. Object detection is a pivotal task in computer vision with numerous applications, yet extending models to support open-vocabulary settings poses significant challenges due to the limitations of available annotated detection datasets. This paper puts forward the OWLv2 model and an innovative OWL-ST self-training methodology to address these challenges.
Key Contributions and Findings
- Self-Training on Web-Scale Data: The authors introduce a self-training approach that uses an existing open-vocabulary detector to generate pseudo-box annotations for WebLI, a dataset of roughly 10 billion image-text pairs. This lets the model learn from the weak semantic supervision in these pairs, bypassing the size constraints of human-annotated detection datasets (a sketch of the pseudo-labeling loop appears after this list).
- Improved Model Architecture: The OWLv2 architecture is optimized for training efficiency, incorporating techniques such as token dropping and instance selection to reduce computation without sacrificing performance. These optimizations improve training throughput and FLOP efficiency by approximately 50% over the predecessor OWL-ViT while maintaining competitive accuracy across detection tasks (see the token-dropping sketch after this list).
- Scaling Training Data: With the OWL-ST recipe, the researchers scale self-training to over a billion examples, yielding substantial gains. For example, OWL-ST improves the Average Precision (AP) of an L/14 model on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6%, a 43% relative improvement. This brings web-scale training data to open-vocabulary detection, mirroring strategies that have driven progress in image classification and language modelling.
- Label Space and Filtering: The authors label pseudo-annotations using all possible word N-grams from each image's associated text as detection prompts, paired with only minimal confidence filtering of the resulting pseudo-labels. This maximizes the diversity of semantic contexts seen during training, further enhancing open-vocabulary performance (see the N-gram sketch after this list).
- Model Scaling and Performance: The results confirm that larger models benefit disproportionately from extensive self-training, echoing findings in other domains such as language modelling. The paper shows that open-vocabulary detection scales well with data and model size, paralleling the scaling laws observed for vision transformers and other neural networks.
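To make the self-training step concrete, here is a minimal Python sketch of the pseudo-labeling loop described above. The `detector.predict` interface, the `PseudoBox` container, and the threshold value are illustrative assumptions, not the paper's actual code; the detail taken from the paper is that pseudo-labels receive only minimal filtering.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class PseudoBox:
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized coords
    label: str                               # text query that matched this box
    score: float                             # detector confidence

def pseudo_annotate(detector, image, queries: Sequence[str],
                    score_threshold: float = 0.1) -> List[PseudoBox]:
    """Run an existing open-vocabulary detector on one web image and keep
    weakly filtered detections as pseudo-box annotations for self-training.

    `detector` is assumed (hypothetically) to expose a `predict(image, queries)`
    method returning parallel lists of boxes, labels, and scores.
    """
    boxes, labels, scores = detector.predict(image, queries)
    return [
        PseudoBox(box=b, label=l, score=s)
        for b, l, s in zip(boxes, labels, scores)
        if s >= score_threshold  # minimal filtering, per the paper's findings
    ]
```

The pseudo-annotated examples are then used as training targets for a new detector, exactly as human box annotations would be.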
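The token-dropping idea can likewise be sketched in a few lines. The paper drops image patches ("tokens") with low pixel variance, which tend to be uninformative background; the `keep_fraction` default and the function interface below are illustrative assumptions.

```python
import numpy as np

def drop_low_variance_tokens(patches: np.ndarray, keep_fraction: float = 0.5):
    """Keep only the image patches (tokens) with the highest pixel variance.

    patches: array of shape (num_patches, patch_pixels), one row per token.
    Returns the kept patches and their original indices, so positional
    embeddings can still be assigned correctly.
    """
    variances = patches.var(axis=1)                   # per-patch pixel variance
    num_keep = max(1, int(keep_fraction * len(patches)))
    keep_idx = np.argsort(variances)[-num_keep:]      # highest-variance tokens
    keep_idx = np.sort(keep_idx)                      # preserve spatial order
    return patches[keep_idx], keep_idx
```

Because dropped tokens are removed from the sequence entirely, the encoder's compute shrinks roughly in proportion to `keep_fraction`.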
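Finally, the N-gram label space is simple to reproduce. The sketch below expands an image's associated text into word N-grams to use as detection prompts; the `max_n` limit and the lowercase/whitespace tokenization are simplifying assumptions, as the paper applies additional text preprocessing.

```python
def ngram_queries(caption: str, max_n: int = 10) -> list:
    """Expand an image's associated text into all word N-grams,
    which serve as detection prompts during pseudo-labeling."""
    words = caption.lower().split()
    queries = set()  # deduplicate repeated N-grams
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            queries.add(" ".join(words[i:i + n]))
    return sorted(queries)

# Example: ngram_queries("a dog catching a frisbee") yields prompts such as
# "dog", "a dog", "dog catching", ..., "a dog catching a frisbee".
```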
Implications
- Theoretical Impacts: The work demonstrates that self-training on pseudo-labeled web-scale datasets provides a viable pathway for improving open-vocabulary object detection. It suggests that further scaling is both feasible and beneficial, presenting opportunities for future research in scaling strategies and model architectures tailored for open-vocabulary tasks.
- Practical Advances: Practically, this work points toward more robust detectors that perform well on rarely encountered or entirely novel object classes, opening the door to applications in diverse environments without exhaustive manual labeling.
- Future Directions: Future work may explore larger model capacities or alternative architectures that can exploit still larger datasets while balancing compute efficiency against model complexity. Improving the robustness and calibration of these models, both after fine-tuning and in open-world settings, also remains an open challenge.
In conclusion, this paper demonstrates substantial advances in self-training on web-scale image-text data for open-vocabulary object detection. The OWL-ST recipe and OWLv2 model represent significant steps forward, unlocking new potential for current applications and future exploration of vision-language integration.