Introduction
Object detection, a fundamental vision task, offers an object-centric understanding of visual content and is crucial for numerous applications. Traditional object detection systems, however, are limited to a predefined set of categories, in contrast to human visual intelligence, which can identify a myriad of objects. To bridge this gap, the focus has shifted toward open-vocabulary object detection: training detectors capable of recognizing objects of arbitrary categories. This paper introduces CoDet, an open-vocabulary detection framework that eschews the conventional reliance on pre-aligned vision-language spaces, instead employing cross-image visual clues to discover and align region-word pairs.
Methodology
CoDet's approach is based on the intuition that objects corresponding to the same concept usually exhibit similar visual features across different images, offering clues for identifying region-word correspondences. The method groups images whose captions mention the same concept, on the assumption that a shared object is likely present across the group. It then computes similarities between region proposals across these images to locate the common object and aggregates the identified regions into a 'prototype'. This prototype is aligned with the concept word to form a region-text pair that supervises the detector's training.
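To make this concrete, the sketch below illustrates the general idea in PyTorch: within one concept group, each image's proposals are scored by how well they match proposals in the other images, and the best-supported regions are averaged into a prototype. The feature shapes and the select-then-average aggregation are illustrative simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def discover_prototype(region_feats):
    """Sketch of common-object discovery within one concept group.

    region_feats: list of [N_i, D] tensors, one per image whose caption
    mentions the concept (shapes are hypothetical). Returns a [D] prototype.
    """
    feats = [F.normalize(f, dim=-1) for f in region_feats]
    selected = []
    for i, f_i in enumerate(feats):
        # Score each proposal in image i by its best match in every other image.
        others = torch.cat([f for j, f in enumerate(feats) if j != i], dim=0)
        sim = f_i @ others.T                  # [N_i, sum of other proposals]
        support = sim.max(dim=-1).values      # cross-image support per proposal
        selected.append(f_i[support.argmax()])  # region most shared across images
    # Average the selected regions into a single prototype for the concept.
    return F.normalize(torch.stack(selected).mean(dim=0), dim=-1)

# Usage: three images in one concept group, random features for illustration.
group = [torch.randn(20, 256) for _ in range(3)]
prototype = discover_prototype(group)  # then aligned with the concept word
```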
To improve the reliability of these visual similarities, CoDet introduces text guidance, which makes the similarity estimation between region proposals concept-aware. This is achieved by re-weighting the similarity computation with the concept's text embedding, emphasizing the feature dimensions most relevant to that concept. CoDet's methodology is distinct in avoiding dependence on any pre-aligned vision-language space.
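One simple way to realize such re-weighting is shown below: feature channels that carry large magnitude in the concept's text embedding are up-weighted before computing region-to-region similarity. The softmax-based weighting and the feature shapes are assumptions for illustration; the paper's exact guidance mechanism may differ.

```python
import torch
import torch.nn.functional as F

def text_guided_similarity(regions_a, regions_b, text_emb):
    """Sketch of concept-aware similarity between two images' proposals.

    regions_a: [Na, D], regions_b: [Nb, D] region features (hypothetical shapes)
    text_emb:  [D] embedding of the concept word, e.g. from a text encoder
    """
    # Emphasize feature dimensions relevant to the concept: channels with
    # large magnitude in the text embedding receive higher weight.
    w = torch.softmax(text_emb.abs(), dim=-1) * text_emb.numel()
    a = F.normalize(regions_a * w, dim=-1)
    b = F.normalize(regions_b * w, dim=-1)
    return a @ b.T  # [Na, Nb] concept-aware similarity matrix

# Usage with illustrative dimensions:
sim = text_guided_similarity(torch.randn(10, 512),
                             torch.randn(12, 512),
                             torch.randn(512))
```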
Performance and Scalability
The paper reports extensive experiments showcasing CoDet's performance on the challenging OV-LVIS benchmark and across multiple datasets such as COCO and Objects365. Notably, CoDet continues to improve when scaled up with stronger visual backbones, suggesting that the approach generalizes well and benefits from additional model capacity. The reported metrics show a substantial improvement over previous state-of-the-art methods, supporting CoDet's efficacy.
Conclusion
In summary, CoDet charts a new path in open-vocabulary object detection by leveraging the visual similarities shared by images whose captions mention the same concepts. It outperforms current methods by reformulating region-word alignment as a visual pattern recognition task, simplified through concept groups and refined with text guidance. CoDet's encouraging results indicate that it could be a significant step toward more human-like understanding and detection of objects in an open-world setting. The trained CoDet models and code are made publicly available as a resource for the research community.