Overview of Open-Vocabulary DETR with Conditional Matching
The paper "Open-Vocabulary DETR with Conditional Matching" presents a novel approach to open-vocabulary object detection: a flexible detection mechanism capable of identifying novel objects specified by either textual descriptions or exemplar images. The primary innovation is transforming the DETR architecture, a Transformer-based object detector, into an open-vocabulary variant by introducing conditional matching driven by vision-language models such as CLIP.
Problem Statement
Traditional object detection frameworks are inherently limited to the closed set of object classes pre-defined in their training datasets. Existing zero-shot or open-vocabulary detection methods often depend on external resources or specific classification paradigms, which restricts how well they generalize to unseen classes. This paper proposes to detect arbitrary objects without labeled training data for those classes, using a DETR-based framework that accepts open-vocabulary input.
Methodology
The key technical contribution is the conditional matching mechanism employed in OV-DETR. Unlike standard DETR, which requires ground-truth labels for closed-set classes to compute classification costs, OV-DETR bypasses this bottleneck through conditional matching: the Transformer decoder is conditioned on embeddings derived from class names or exemplar images using a pre-trained model such as CLIP.
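As a rough illustration, such conditional embeddings can be produced directly with a pre-trained CLIP model. The snippet below is a minimal sketch rather than the authors' code; the prompt template, model variant ("ViT-B/32"), and file name are illustrative assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text condition: embed a class name using a simple prompt template (assumed here).
text_tokens = clip.tokenize(["a photo of a zebra"]).to(device)
with torch.no_grad():
    text_embed = model.encode_text(text_tokens)       # (1, 512) for ViT-B/32

# Image condition: embed an exemplar crop of the target object (hypothetical file).
exemplar = preprocess(Image.open("exemplar_crop.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embed = model.encode_image(exemplar)        # (1, 512)

# Either embedding can serve as the conditional input to the detector,
# typically after L2-normalization.
text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
```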
- Conditional Inputs: During training, conditional embeddings extracted from textual class names and image regions with CLIP are added to the object query embeddings in DETR, enabling predictions for unfamiliar classes at inference time.
- Binary Matching Loss: Matching is redefined as a binary problem: each conditioned query is judged on whether its predicted box matches the conditioning class, rather than which of a fixed set of classes it belongs to (see the sketch after this list). Training on this binary match/no-match objective supports generalization to novel classes.
- Training Configuration: The Transformer decoder is optimized to process multiple conditional queries simultaneously, reducing computational overhead while maintaining high detection precision.
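The following sketch illustrates the conditioning and binary matching idea in PyTorch. It is a simplified toy example under assumed tensor shapes (100 queries, a 256-dimensional decoder, 512-dimensional CLIP embeddings) and is not the paper's implementation; the projection layer and prediction head are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_queries, d_model, clip_dim = 100, 256, 512

object_queries = torch.randn(num_queries, d_model)   # standard DETR object queries
cond_embed = torch.randn(1, clip_dim)                # CLIP text or image embedding (one class)
proj = nn.Linear(clip_dim, d_model)                  # hypothetical projection to decoder width

# Conditional queries: every object query receives the same conditional signal,
# so the decoder only has to answer "does this box match the condition?".
cond_queries = object_queries + proj(cond_embed)     # (num_queries, d_model)

# After decoding, a prediction head would emit one matching score per query
# instead of logits over a fixed closed set of classes.
match_logits = torch.randn(num_queries, 1)           # placeholder for head output

# Binary matching target: 1 for queries assigned (e.g. via Hungarian matching)
# to ground-truth boxes of the conditioned class, 0 otherwise.
targets = torch.zeros(num_queries, 1)
targets[:3] = 1.0                                     # pretend three queries were matched

matching_loss = F.binary_cross_entropy_with_logits(match_logits, targets)
print(matching_loss.item())
```

At inference, one condition is needed per candidate class, so the query set can be replicated across several conditions and decoded in parallel, which is the kind of batching the training-configuration point above refers to.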
Experimental Setup
Extensive experiments are conducted on the OV-LVIS and OV-COCO benchmarks, evaluating both detection accuracy and computational efficiency. On OV-LVIS, whose categories are grouped by annotation frequency (with rare categories serving as the novel classes), OV-DETR improves significantly over state-of-the-art methods such as ViLD. On OV-COCO, OV-DETR likewise surpasses prior work, illustrating the robustness of the conditional matching approach.
Numerical Results
OV-DETR achieves substantial empirical gains:
- On OV-LVIS, OV-DETR reaches a mask mAP of 17.4% on novel (rare) classes.
- On OV-COCO, it achieves a box mAP of 29.4% on novel categories.
These results indicate that OV-DETR not only performs well on unseen classes but does so without diminishing performance on known classes.
Implications and Future Directions
OV-DETR opens possibilities for more dynamic human-computer interaction systems, where detection tasks are driven by user-defined inputs. This flexibility is particularly advantageous for settings requiring rapid adaptation to new categories without explicit retraining. The method also implies potential advancements in semi-supervised and few-shot learning.
Future developments could benefit from exploring optimization techniques to further decrease inference times, making OV-DETR more practical for real-time applications. Further alignment with multimodal training paradigms and integration with advanced prompting techniques could empower OV-DETR to handle even wider vocabularies efficiently.
Conclusion
"Open-Vocabulary DETR with Conditional Matching" sets a precedent by enabling end-to-end open-vocabulary object detection through a novel conditional matching framework. The paper's methodology represents a significant stride toward scalable and adaptable detection frameworks, harnessing the power of vision-LLMs to overcome longstanding limitations in object detection paradigms.