Language-conditioned Detection Transformer

Published 29 Nov 2023 in cs.CV | (2311.17902v1)

Abstract: We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS, COCO, Objects365, and OpenImages. DECOLA outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes, architectures, and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA.

Summary

The paper introduces a language-conditioned training method that refines object detection through pseudo-label generation and subsequent open-vocabulary training.
It leverages vision-language models to overcome limitations of fixed-class detectors by embedding language semantics into the detection process.
Results demonstrate state-of-the-art zero-shot performance with significant gains, including a 17.1 AP_rare improvement on LVIS benchmarks.

Overview of "Language-conditioned Detection Transformer"

The paper, "Language-conditioned Detection Transformer," presents an open-vocabulary detection framework that integrates language semantics into both the training and the operation of object detectors. The framework targets a limitation of traditional object detectors, which can only recognize a fixed set of predefined classes. It draws on the generalization capabilities of vision-language models so that the set of detectable classes can be specified in language rather than fixed at training time.
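To make the contrast with fixed-class detectors concrete, the sketch below shows one common way open-vocabulary classifiers score a region: compare a region embedding against text embeddings of class names, so the vocabulary can be changed at test time. This is an illustration of the general idea, not DECOLA's implementation; the function names and the three-dimensional embeddings are invented for the example, and a real system would obtain the embeddings from a vision-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_region(region_embedding, text_embeddings):
    """Return the class name whose text embedding is closest to the region.

    `text_embeddings` maps class name -> vector. Both the mapping and the
    vectors are hypothetical stand-ins for the output of a text encoder.
    """
    return max(
        text_embeddings,
        key=lambda name: cosine(region_embedding, text_embeddings[name]),
    )


# Toy, hand-written embeddings (assumption: not produced by any real model).
vocabulary = {"cat": [1.0, 0.0, 0.1], "dog": [0.0, 1.0, 0.1]}
region = [0.1, 0.2, 0.9]

# With a fixed two-class vocabulary, the region is forced into "cat" or "dog".
label_before = classify_region(region, vocabulary)

# Adding a class only requires a new text embedding, not retraining.
vocabulary["zebra"] = [0.0, 0.0, 1.0]
label_after = classify_region(region, vocabulary)

print(label_before, label_after)
```

The point of the design is that the classifier weights are replaced by text embeddings, so extending the vocabulary amounts to adding an entry to the dictionary.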

Framework and Methodology

The framework operates in a three-step process:

Language-conditioned Training: A language-conditioned object detector is first trained on fully-supervised detection data. During training, the detector is told which ground-truth classes are present in each image and conditions its predictions on that set of classes.
Pseudo-label Generation: The conditioned detector is then used to pseudo-label images that carry only image-level labels. Because it is conditioned on the classes known to be present, it produces more accurate pseudo-labels than prior approaches.
Unconditioned Open-vocabulary Training: Finally, an unconditioned open-vocabulary detector is trained on the pseudo-annotated images. This detector is evaluated zero-shot on LVIS, COCO, Objects365, and OpenImages.
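The pseudo-labeling step can be sketched in a few lines. The code below is a minimal toy under stated assumptions, not the authors' implementation: `toy_conditioned_detector` is a hypothetical stand-in that simply filters precomputed candidate boxes by the conditioning classes, and keeping the single highest-scoring box per image-level label is an assumed selection rule chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    label: str
    score: float


def toy_conditioned_detector(image, present_classes):
    """Hypothetical stand-in for the step-1 detector.

    A real language-conditioned detector would run a network; here we only
    mimic its key property: it predicts boxes solely for the classes it is
    conditioned on.
    """
    return [d for d in image["candidates"] if d.label in present_classes]


def pseudo_label(image, image_level_labels, detector, min_score=0.0):
    """Step 2: turn image-level labels into box-level pseudo-labels.

    Assumed rule: keep the highest-scoring box for each image-level label.
    """
    detections = detector(image, set(image_level_labels))
    best = {}
    for det in detections:
        if det.score < min_score:
            continue
        if det.label not in best or det.score > best[det.label].score:
            best[det.label] = det
    # Preserve the order of the image-level labels; skip labels with no box.
    return [best[label] for label in image_level_labels if label in best]


# Toy image: candidate boxes a detector might propose (invented values).
image = {
    "candidates": [
        Detection((10, 10, 50, 50), "cat", 0.9),
        Detection((12, 14, 48, 52), "cat", 0.4),
        Detection((60, 20, 120, 90), "dog", 0.7),
        Detection((0, 0, 200, 100), "car", 0.95),  # not among the image labels
    ]
}

pseudo_boxes = pseudo_label(image, ["cat", "dog"], toy_conditioned_detector)
for det in pseudo_boxes:
    print(det)
```

In the third step, boxes such as `pseudo_boxes` would serve as training targets for the unconditioned open-vocabulary detector. Note that the high-scoring "car" candidate is never emitted, because the detector is conditioned only on the labels known to be present in the image.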

Results and Performance

The resulting detector, DECOLA (Detection Transformer Conditioned on Language), shows strong zero-shot performance on the open-vocabulary LVIS benchmark and achieves state-of-the-art results across model sizes, architectures, and datasets. It surpasses prior methods by 17.1 $\text{AP}_\text{rare}$ and 9.4 mAP on the zero-shot LVIS benchmark. These margins point to the benefit that language conditioning brings to pseudo-labeling and, through it, to the generalization of the final detector.

Implications and Future Directions

From a practical standpoint, DECOLA's framework is promising for scalable, adaptable object detection in real-world applications, particularly where collecting box-level annotations for every new concept is infeasible. Theoretically, the work sits at the intersection of natural language processing and computer vision and suggests further research on integrating language understanding tightly into core vision tasks.

Future work may refine language-conditioned learning mechanisms, leading to object detectors that can follow more nuanced instructions or descriptions. As models and datasets grow, computational scaling and efficiency will also matter for keeping such systems usable in large-scale or resource-limited settings.

In conclusion, the paper provides a valuable contribution to open-vocabulary object detection by exploiting language-conditioned learning, offering both a methodological innovation and a practical tool for advancing vision-language integration in AI systems.
