Open-Vocabulary Object Detection Using Captions (2011.10678v2)

Published 20 Nov 2020 in cs.CV, cs.AI, and cs.LG

Abstract: Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.

Open-Vocabulary Object Detection Using Captions: Overview and Insights

The paper "Open-Vocabulary Object Detection Using Captions" introduces a novel approach for tackling the object detection problem by proposing a framework termed as Open-Vocabulary Object Detection (OVD). This approach aims to mitigate the limitations posed by traditional methods that rely heavily on exhaustive supervision, particularly through bounding box annotations, which are both costly and labor-intensive.

Key Contributions

The authors present a methodology that diverges from conventional supervised models by leveraging image-caption pairs as a form of natural supervision, expanding the vocabulary of detectable objects without requiring explicit bounding box annotations for each object category. The method is positioned as more effective than previously explored weakly supervised (WS) and zero-shot detection (ZSD) approaches.
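As a rough illustration of how caption supervision can shape a shared visual-semantic space, the sketch below implements a generic contrastive grounding loss: each caption token is matched to its best-fitting image region, and matched image-caption pairs are pulled together across the batch. The function name, tensor shapes, and symmetric InfoNCE form are assumptions for illustration, not necessarily the paper's exact pretraining objective.

```python
import torch
import torch.nn.functional as F

def grounding_loss(region_feats, token_embeds, temperature=0.1):
    """Contrastive image-caption grounding over a batch of B matched pairs.

    region_feats: (B, R, D) L2-normalized region/grid features per image
    token_embeds: (B, T, D) L2-normalized caption token embeddings
    (Shapes and names are illustrative assumptions.)
    """
    # Score image i against caption j: each token takes its best-matching
    # region, then scores are averaged over tokens (soft grounding).
    sim = torch.einsum("ird,jtd->ijrt", region_feats, token_embeds)
    best_region = sim.max(dim=2).values      # (B, B, T)
    pair_scores = best_region.mean(dim=2)    # (B, B)

    # Symmetric InfoNCE: the matched pairs lie on the diagonal.
    logits = pair_scores / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```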

The core innovation of this work is a two-stage framework: the first stage constructs a visual-semantic space from low-cost image-caption pairs, and the second learns object detection on a base set of classes that do have bounding box annotations. The shared visual-semantic space lets the detector recognize object categories beyond the annotated set, enabling detection of classes that were never annotated during training.
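A minimal sketch of the second stage's classification idea follows, assuming a simple linear projection of detector features into a fixed word-embedding space: proposals are scored against class-name embeddings, so the label set is just a matrix of embeddings that can be extended at inference time. `OpenVocabHead` and all dimensions here are hypothetical, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabHead(nn.Module):
    """Classify region proposals by similarity to class-name embeddings."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # Project detector features into the caption-derived semantic space.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, proposal_feats, class_embeds):
        # proposal_feats: (N, feat_dim); class_embeds: (C, embed_dim)
        v = F.normalize(self.proj(proposal_feats), dim=-1)
        t = F.normalize(class_embeds, dim=-1)
        return v @ t.t()  # (N, C) cosine-similarity logits

# Train against embeddings of the base (annotated) class names only; at
# test time, append rows for novel class names to widen the vocabulary
# without retraining the detector.
head = OpenVocabHead(feat_dim=1024, embed_dim=300)
logits = head(torch.randn(5, 1024), torch.randn(48 + 17, 300))
```

Because classification reduces to similarity in the embedding space, swapping or growing the class-embedding matrix changes the detector's vocabulary without touching its weights.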

Experimental Results

The Open Vocabulary R-CNN (OVR-CNN) model exhibited superior performance compared to both ZSD and WS methods. The mAP for unseen classes reached 27%, a substantial improvement over the 10% achieved by state-of-the-art zero-shot methods. Moreover, under generalized zero-shot settings, OVR-CNN outperformed WS models with 40% mAP as opposed to 26% for existing methods.

These results highlight the model's capability to extrapolate object recognition through caption-based natural supervision: it recognizes classes for which no boxes were seen during training, while detecting annotated classes almost as accurately as fully supervised methods.

Implications and Future Directions

The implications of this research are significant for scalable object detection. The proposed framework holds potential for practical applications where manual annotation is infeasible, paving the way for more adaptable detection systems in real-world scenarios. Using freely available image-caption data reduces dependence on manual labeling, fostering progress in domains where extensive annotated datasets are unavailable.

From a theoretical standpoint, OVD presents a paradigm shift in disentangling object recognition from localization: recognition can be scaled dramatically using semantic understanding derived from captions, while localization is learned from a smaller set of annotated base classes. This decoupling opens avenues for applying similar principles to other computer vision tasks.

Future research could explore handling bias in the training data or addressing localization accuracy shortcomings, particularly for target classes that lack bounding box annotations. Additionally, since the approach relies on captions, examining quality variation across caption datasets could yield insight into the framework's robustness and adaptability.

In conclusion, "Open-Vocabulary Object Detection Using Captions" broadens the horizon of object detection research, offering a solution that balances supervised and naturally supervised learning paradigms to achieve scalability in object detection tasks.

Authors (4)
  1. Alireza Zareian (16 papers)
  2. Kevin Dela Rosa (6 papers)
  3. Derek Hao Hu (2 papers)
  4. Shih-Fu Chang (131 papers)
Citations (368)