Open Vocabulary Object Detection with Pseudo Bounding-Box Labels (2111.09452v3)

Published 18 Nov 2021 in cs.CV

Abstract: Despite great progress in object detection, most existing methods work only on a limited set of object categories, due to the tremendous human effort needed for bounding-box annotations of training data. To alleviate the problem, recent open vocabulary and zero-shot detection methods attempt to detect novel object categories beyond those seen during training. They achieve this goal by training on pre-defined base categories to induce generalization to novel objects. However, their potential is still constrained by the small set of base categories available for training. To enlarge the set of base classes, we propose a method to automatically generate pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs. Our method leverages the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels and then directly uses them for training object detectors. Experimental results show that our method outperforms the state-of-the-art open vocabulary detector by 8% AP on COCO novel categories, by 6.3% AP on PASCAL VOC, by 2.3% AP on Objects365 and by 2.8% AP on LVIS. Code is available at https://github.com/salesforce/PB-OVD.

Open Vocabulary Object Detection with Pseudo Bounding-Box Labels

The paper "Open Vocabulary Object Detection with Pseudo Bounding-Box Labels" addresses a limitation of existing object detectors: they recognize only a fixed set of categories, because bounding-box annotation is costly. The proposed method broadens detection to a more diverse set of objects by training on pseudo bounding-box annotations generated automatically from large-scale image-caption pairs.

Summary of Approach

The paper describes a two-component framework: a pseudo bounding-box label generator and an open vocabulary object detector. The label generator uses an existing pre-trained vision-language model to automatically produce bounding-box labels for the wide range of objects mentioned in image captions. It does so by exploiting the text-to-image alignment that the vision-language model has already learned, so the bounding-box annotations are produced without human intervention.

To obtain the pseudo labels, the generator uses Grad-CAM to compute an activation map for an object mentioned in the caption, based on the cross-attention between image regions and caption tokens in a pre-trained vision-language model. The box proposal that overlaps most with the activated region is selected as the pseudo label for that object. This removes the dependence on manual box annotation and lets training scale to many categories.
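The proposal-selection step can be illustrated with a minimal sketch. It assumes the Grad-CAM activation map has already been computed and is given as a 2D array, and that proposals are integer pixel boxes in `(x1, y1, x2, y2)` form. The function name `select_pseudo_box` and the scoring rule (activation summed inside the box, divided by the square root of the box area so that very large boxes are not always preferred) are assumptions of this sketch, not details confirmed by the summary above.

```python
import numpy as np

def select_pseudo_box(activation_map, proposals):
    """Pick the proposal that best covers the activated region.

    activation_map: 2D array (height x width), assumed to come from Grad-CAM.
    proposals: iterable of integer boxes (x1, y1, x2, y2), x2/y2 exclusive.
    Returns (best_box, best_score); best_box is None if no valid proposal.
    """
    act = np.asarray(activation_map, dtype=float)
    best_box, best_score = None, -np.inf
    for x1, y1, x2, y2 in proposals:
        area = (x2 - x1) * (y2 - y1)
        if area <= 0:
            continue  # skip degenerate boxes
        # Hypothetical overlap score: activation mass inside the box,
        # normalized by sqrt(area) to penalize oversized boxes.
        score = act[y1:y2, x1:x2].sum() / np.sqrt(area)
        if score > best_score:
            best_box, best_score = (x1, y1, x2, y2), score
    return best_box, best_score

# Toy example: a 3x3 activated patch inside an 8x8 map.
toy_map = np.zeros((8, 8))
toy_map[2:5, 3:6] = 1.0
toy_proposals = [(0, 0, 8, 8), (3, 2, 6, 5), (0, 0, 3, 3)]
box, score = select_pseudo_box(toy_map, toy_proposals)
print(box, score)
```

In the toy example, the tight box around the activated patch wins over the full-image box, which contains the same activation but is penalized for its larger area.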

Performance Evaluation

The method is evaluated on several benchmark datasets: COCO, PASCAL VOC, Objects365, and LVIS. The reported gains over the state-of-the-art open vocabulary detector are 8% AP on COCO novel categories when fine-tuned, 6.3% AP on PASCAL VOC, 2.3% AP on Objects365, and 2.8% AP on LVIS. The method also achieves notable results without any fine-tuning.

The paper also examines how well the detector generalizes to other datasets and reports better performance than existing methods. This supports the framework's use in real-world settings, where a detector must transfer robustly to data it was not trained on.

Implications and Future Directions

The proposed framework has both theoretical and practical implications. Theoretically, it shows that pseudo-label generation can extend what a detector learns beyond a fixed set of categories. Practically, it addresses a weakness of existing models in real-world applications, where objects that were not defined during training are commonly encountered.

Looking ahead, the paper suggests that the approach could be improved by experimenting with different vision-language models or by using better algorithms for generating object proposals. Either change could improve the accuracy and scalability of pseudo-label generation, and with it open vocabulary detection.

The work marks a shift away from manual box supervision and toward detectors trained on automatically generated labels drawn from large, diverse image-caption data.

Authors (7)
  1. Mingfei Gao
  2. Chen Xing
  3. Juan Carlos Niebles
  4. Junnan Li
  5. Ran Xu
  6. Wenhao Liu
  7. Caiming Xiong
Citations (75)