CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection (2310.16667v1)

Published 25 Oct 2023 in cs.CV

Abstract: Deriving reliable region-word alignment from image-text pairs is critical to learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capability. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept should exhibit high co-occurrence within the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet achieves superior performance and compelling scalability in open-vocabulary detection; e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

Introduction

Object detection, a fundamental vision task, offers an object-centric understanding of visual content and is crucial for numerous applications. Traditional object detection systems, however, are limited to a predefined set of categories, in contrast with human visual intelligence, which can identify a myriad of objects. To bridge this gap, research focus has shifted towards open-vocabulary object detection: training detectors capable of recognizing objects of arbitrary categories. This paper introduces CoDet, an open-vocabulary detection framework that eschews the conventional reliance on pre-aligned vision-language models, instead employing cross-image visual cues to discover and align region-word pairs.

Methodology

CoDet's approach is based on the intuition that objects corresponding to the same concept usually exhibit similar visual features across different images, which offers a cue for identifying region-word correspondences. The method groups images that mention the same concept in their captions, inferring the presence of a shared object. It then computes similarities between regions across these images to locate the common objects and builds a 'prototype' from the identified regions. This prototype is aligned with the concept word to form a region-text pair that supervises the detector's training, as sketched below.
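To make the grouping-and-matching step concrete, here is a minimal PyTorch sketch of how co-occurring object discovery could work. The function name, the simple max-similarity selection rule, and the mean-pooled prototype are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def discover_co_occurring_regions(region_feats):
    """Per image in a concept group, pick the region that best matches
    regions in the other images; pool the picks into a prototype.

    region_feats: list of tensors, one per image in the group,
        each of shape (num_regions_i, feat_dim). Assumes >= 2 images.
    Returns the selected region index per image and the mean feature
    of the selected regions (the 'prototype').
    """
    feats = [F.normalize(f, dim=-1) for f in region_feats]
    selected = []
    for i, f_i in enumerate(feats):
        # Cosine similarity of every region in image i against all
        # regions in the other images of the group.
        others = torch.cat([f for j, f in enumerate(feats) if j != i])
        sim = f_i @ others.T                    # (n_i, n_others)
        score = sim.max(dim=1).values           # best cross-image match
        selected.append(int(score.argmax()))    # most 'co-occurring' region
    prototype = torch.stack(
        [feats[i][k] for i, k in enumerate(selected)]
    ).mean(dim=0)
    return selected, prototype
```

Pairing the resulting prototype with the concept's word embedding then yields a region-text pair that can supervise the detector's alignment loss.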

To enhance the accuracy of visual similarities, CoDet introduces text guidance, which makes the similarity estimation between region proposals concept-aware. This is achieved by re-weighting the similarity computation with the text embedding of the concept, thereby emphasizing feature dimensions most relevant to that concept. CoDet's methodology is thus distinct in avoiding dependence on any pre-aligned vision-language space.
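The re-weighting can be pictured as scaling each feature dimension by how strongly the concept's text embedding activates it. The sketch below illustrates this idea; the specific softmax-based weighting is an assumed stand-in for the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def concept_aware_similarity(regions_a, regions_b, text_emb):
    """Concept-aware cosine similarity between two images' region features.

    regions_a: (n_a, d), regions_b: (n_b, d), text_emb: (d,) for the
    shared concept. Dimensions the text embedding activates strongly
    are up-weighted, so matches are judged on concept-relevant features.
    """
    # Hypothetical weighting: softmax over |text_emb|, rescaled so the
    # weights average to 1 (reduces to plain cosine similarity when the
    # text embedding is uniform across dimensions).
    w = F.softmax(text_emb.abs(), dim=-1) * text_emb.numel()
    a = F.normalize(regions_a * w, dim=-1)
    b = F.normalize(regions_b * w, dim=-1)
    return a @ b.T  # (n_a, n_b) similarity matrix
```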

Performance and Scalability

The paper reports extensive experiments showcasing CoDet's performance on the challenging OV-LVIS benchmark and its transferability to other datasets such as COCO and Objects365. Notably, CoDet continues to improve when scaled up with stronger visual backbones, indicating that the method generalizes well and handles variance within visual data robustly. The reported metrics show a substantial improvement over previous state-of-the-art methods, confirming CoDet's efficacy.

Conclusion

In summary, CoDet opens a new path in open-vocabulary object detection by leveraging the visual similarities shared by images whose captions mention the same concept. It outperforms current methods by reformulating region-word alignment as a visual pattern recognition task, simplified through concept groups and refined with text guidance. CoDet's encouraging results suggest it is a significant step towards more human-like understanding and detection of objects in an open-world setting. The trained CoDet models and code are publicly available to the research community.

Authors (5)
  1. Chuofan Ma (8 papers)
  2. Yi Jiang (171 papers)
  3. Xin Wen (64 papers)
  4. Zehuan Yuan (65 papers)
  5. Xiaojuan Qi (133 papers)
Citations (34)