T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

Published 21 Mar 2024 in cs.CV | (2403.14610v1)

Abstract: We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection. Model API is now available at https://github.com/IDEA-Research/T-Rex.

Summary

The paper introduces a unified text-visual prompt framework that enhances zero-shot object detection in open-set environments.
The methodology integrates dual encoders with a DETR-based architecture and contrastive learning to align textual and visual cues effectively.
Experimental results on benchmarks like COCO and LVIS demonstrate improved accuracy, especially for rare and long-tailed object detection.

T-Rex2: Fusing Text and Visual Prompts for Enhanced Open-Set Object Detection

Introduction

Object detection has shifted from closed-set to open-set paradigms, driven largely by the variety and unpredictability of real-world scenes. Traditional detectors work well within their predefined categories but fall short on novel or rare objects. Recent work has therefore turned to text prompts for open-vocabulary object detection. These approaches are limited by the scarcity of data for long-tailed categories and by the difficulty of describing some objects in words. Visual prompts, by contrast, represent novel objects directly and intuitively through concrete examples, but convey abstract concepts less effectively than text. T-Rex2 combines text and visual prompts within a single model, drawing on the strengths of both to achieve strong zero-shot object detection across a wide range of scenarios.

Methodology

T-Rex2 builds on the DETR architecture, adding two prompt encoders, one for text and one for visual prompts, together with a unified box decoder for object detection. Text prompts are encoded with CLIP's text encoder. Visual prompts, given as boxes or points, are encoded by a visual prompt encoder that uses deformable attention. A key element of T-Rex2 is the use of contrastive learning to align text and visual prompts, so that each modality improves the other's representation. With this alignment in place, the model can switch between prompt modalities to suit the scenario at hand.
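To make the idea of aligning the two prompt modalities concrete, the sketch below computes a symmetric InfoNCE-style contrastive loss between paired visual and text prompt embeddings. This is an illustration under assumptions, not the paper's formulation: the summary above does not specify the loss, so the symmetric form, the temperature value, and the function name `contrastive_alignment_loss` are all hypothetical, and the embeddings are random toy vectors rather than outputs of CLIP or a visual prompt encoder.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_sum


def contrastive_alignment_loss(visual_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired embeddings (hypothetical helper).

    Row i of `visual_embs` and row i of `text_embs` are assumed to describe
    the same category; every other pairing is treated as a negative.
    """
    v = [normalize(e) for e in visual_embs]
    t = [normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity between every visual and text embedding, scaled.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: softmax over each row.
    v2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-visual direction: softmax over each column.
    t2v = -sum(
        log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy data: 4 categories with 16-dimensional embeddings (assumed sizes).
rng = random.Random(0)
text_embs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
aligned_visual = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in text_embs]
random_visual = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]

aligned_loss = contrastive_alignment_loss(aligned_visual, text_embs)
random_loss = contrastive_alignment_loss(random_visual, text_embs)
print(f"aligned: {aligned_loss:.4f}, unaligned: {random_loss:.4f}")
```

Visual embeddings that sit close to their paired text embeddings produce a lower loss than unrelated ones, which is the pressure a contrastive objective of this kind would apply during training.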

Experimental Results

Zero-shot evaluations on COCO, LVIS, ODinW, and Roboflow100 show where each prompt type is strongest. The model detects common objects well with text prompts and is notably proficient with visual prompts on long-tailed, rare objects. This adaptability is further illustrated through interactive and generic visual prompt workflows, in which T-Rex2 matches or exceeds prior results on these benchmarks.

Implications and Future Directions

Combining text and visual prompts in T-Rex2 is a significant step towards generic object detection. It shows that distinct yet complementary modalities can improve performance across varied detection scenarios, particularly for long-tailed object distributions. The results also motivate further multimodal integration and highlight the value of data synergy in advancing object detection. Future research may optimize the alignment between text and visual prompts and apply T-Rex2's approach to other areas of artificial intelligence and computer vision.

Concluding Remarks

T-Rex2 offers a practical and scalable approach to open-set object detection. By fusing text and visual prompts in a single model, it broadens what object detectors can handle and encourages a more integrated approach to real-world visual understanding.
