
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching (2303.13076v1)

Published 23 Mar 2023 in cs.CV and cs.AI

Abstract: Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.

CORA: An Advanced Framework for Open-Vocabulary Detection

The task of open-vocabulary detection (OVD) is a significant challenge in computer vision: the detector must recognize objects from novel categories without any annotations for those categories. The paper "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching" introduces an approach that leverages CLIP to address this problem. CLIP, a large-scale pre-trained vision-language model, has shown remarkable potential in visual recognition tasks by learning a joint embedding space for images and text.

Problem Identification

The authors identify two main obstacles when applying CLIP to the OVD task:

  1. Distribution Mismatch: CLIP is pre-trained on whole images, whereas OVD requires region-level recognition. This mismatch degrades classification accuracy when cropped regions are treated as if they were whole images (a minimal sketch of this naive setup follows the list).
  2. Localization of Unseen Classes: Localizing novel objects is constrained by region proposal networks (RPNs), which generalize poorly to classes beyond their training set.
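To make obstacle (1) concrete, a naive way to reuse CLIP for region recognition is to crop each proposal and classify the crop as if it were a full image; this is exactly the setup in which the whole-to-region distribution gap appears. Below is a minimal sketch of that baseline (not the CORA pipeline), assuming the OpenAI `clip` package, a PIL image, a hypothetical list of box proposals, and an illustrative vocabulary.

```python
# Naive crop-and-classify baseline (illustrative only, not the CORA pipeline).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

image = Image.open("example.jpg")                      # hypothetical input image
proposals = [(30, 40, 200, 220), (150, 60, 400, 300)]  # hypothetical region boxes
class_names = ["cat", "dog", "umbrella"]               # illustrative vocabulary

text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    for box in proposals:
        # Treat each cropped region as if it were a whole image --
        # this is where the whole-to-region distribution mismatch shows up.
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        region_feat = model.encode_image(crop)
        region_feat = region_feat / region_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * region_feat @ text_feat.T).softmax(dim=-1)
        print(box, class_names[probs.argmax().item()])
```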

Approach

The proposed CORA framework aims to mitigate these issues through two innovative strategies: Region Prompting and Anchor Pre-Matching.

  • Region Prompting: This method strengthens CLIP's region classification by applying learnable prompts to region features, rectifying the distribution gap between whole-image and region-level features (a rough sketch follows this list).
  • Anchor Pre-Matching: This mechanism improves object localization through a dynamic, class-aware matching scheme. By pre-matching anchor boxes with class-specific embeddings, the model handles class-aware query conditioning efficiently within a DETR-style architecture.
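The following is a rough sketch of the region-prompting idea, under the assumption that region features are pooled from an intermediate CLIP feature map with RoIAlign and that the prompt is a single learnable additive tensor shared by all regions; the exact parameterization and placement in the paper may differ.

```python
# Region prompting sketch: learnable prompts added to RoI-pooled CLIP features.
# The prompt shape and placement here are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RegionPromptedClassifier(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=1024, pool_size=7):
        super().__init__()
        # One learnable additive prompt shared by all regions.
        self.region_prompt = nn.Parameter(torch.zeros(feat_dim, pool_size, pool_size))
        self.pool_size = pool_size
        # Stand-in for CLIP's attention pooling / projection head.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feature_map, boxes, text_embeds, spatial_scale=1 / 32):
        # feature_map: [B, C, H, W] intermediate CLIP backbone features
        # boxes: list of [N_i, 4] tensors in image coordinates
        roi_feats = roi_align(feature_map, boxes, self.pool_size, spatial_scale)
        roi_feats = roi_feats + self.region_prompt        # rectify whole-to-region gap
        region_embeds = self.proj(roi_feats.mean(dim=(2, 3)))
        region_embeds = region_embeds / region_embeds.norm(dim=-1, keepdim=True)
        return region_embeds @ text_embeds.t()            # per-region class logits

# Usage with dummy tensors:
clf = RegionPromptedClassifier()
fmap = torch.randn(1, 2048, 25, 25)
boxes = [torch.tensor([[10.0, 10.0, 200.0, 200.0]])]
text_embeds = torch.nn.functional.normalize(torch.randn(3, 1024), dim=-1)
logits = clf(fmap, boxes, text_embeds)   # shape [1, 3]
```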

Results

The CORA framework is validated on the COCO and LVIS benchmarks, showing a notable improvement in detecting novel categories without additional training data. Specifically, CORA achieves 41.7 AP50 on COCO's novel categories, outperforming the previous state of the art by 2.4 AP50. When additional pseudo-labeled data is used (CORA+), performance further improves to 43.1 AP50 on COCO and 28.1 box APr on LVIS. This indicates the robustness and flexibility of the method across different settings.

Contributions and Implications

The paper makes several pivotal contributions:

  1. Technical Advancement: By adapting CLIP to region-level tasks, the work offers a practical way to bridge the gap between whole-image and region-level classification.
  2. Efficiency and Scalability: The anchor pre-matching approach removes the need for repeated per-class inference, allowing the system to handle large vocabularies efficiently (see the sketch after this list).
  3. SOTA Performance: State-of-the-art results on challenging benchmarks underline the method's potential for practical applications requiring generalized detection over diverse category sets.
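To illustrate contribution 2, here is a hedged sketch of how anchor pre-matching avoids per-class inference: every anchor's region feature is scored against all class embeddings once, each anchor is assigned its top-scoring class, and the corresponding decoder query is conditioned on that class embedding, so a single DETR-style forward pass covers the whole vocabulary. The tensor names and the exact way queries are conditioned are assumptions for illustration.

```python
# Anchor pre-matching sketch (names and query conditioning are assumptions).
import torch

num_anchors, num_classes, dim = 1000, 1203, 256   # e.g. an LVIS-sized vocabulary
anchor_feats = torch.nn.functional.normalize(torch.randn(num_anchors, dim), dim=-1)
text_embeds = torch.nn.functional.normalize(torch.randn(num_classes, dim), dim=-1)
anchor_pos_embed = torch.randn(num_anchors, dim)  # positional/box embedding of anchors

# Pre-match: one pass scores every anchor against every class...
logits = anchor_feats @ text_embeds.t()           # [num_anchors, num_classes]
anchor_labels = logits.argmax(dim=-1)             # one pre-matched class per anchor

# ...then each decoder query is conditioned on its anchor's class embedding,
# so the detector runs once for the full vocabulary instead of once per class.
queries = anchor_pos_embed + text_embeds[anchor_labels]   # [num_anchors, dim]
# During training, a query is only matched to ground-truth boxes of the same class.
```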

Future Directions

Given the promising results, future work can explore refinements to prompt design and matching strategies, as well as extensions to other visual recognition tasks. Integrating complementary modalities or additional semantic context is a further avenue for strengthening the open-vocabulary detection capabilities of models such as CORA.

The research enriches the discourse around applying large-scale vision-language models to complex detection tasks, underscoring the transformative potential of these models in advancing artificial intelligence applications.

Authors (4)
  1. Xiaoshi Wu (10 papers)
  2. Feng Zhu (138 papers)
  3. Rui Zhao (241 papers)
  4. Hongsheng Li (340 papers)
Citations (93)