T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy (2403.14610v1)

Published 21 Mar 2024 in cs.CV

Abstract: We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection. Model API is now available at https://github.com/IDEA-Research/T-Rex.


Summary

  • The paper introduces a unified text-visual prompt framework that enhances zero-shot object detection in open-set environments.
  • The methodology integrates dual encoders with a DETR-based architecture and contrastive learning to align textual and visual cues effectively.
  • Experimental results on benchmarks like COCO and LVIS demonstrate improved accuracy, especially for rare and long-tailed object detection.

T-Rex2: Fusing Text and Visual Prompts for Enhanced Open-Set Object Detection

Introduction

Object detection in computer vision has shifted from closed-set to open-set paradigms, driven by the varied and unpredictable nature of real-world scenarios. Traditional methods, while effective within their predefined categories, fall short when encountering novel or rare objects. In response, recent work has leaned toward text prompts for open-vocabulary object detection. These approaches, however, are constrained by long-tailed data scarcity and the limits of textual description. Conversely, visual prompts offer a direct, intuitive representation of novel objects but convey abstract object concepts less effectively than text. T-Rex2 addresses this gap by synergizing text and visual prompts within a single framework, harnessing the strengths of both to achieve strong zero-shot object detection across a diverse range of scenarios.

Methodology

T-Rex2 builds on the DETR architecture, incorporating dual encoders for text and visual prompts and a unified box decoder for object detection. Text prompts are encoded with CLIP's text encoder, while a dedicated visual prompt encoder uses deformable attention to encode both boxes and points as prompts. A key innovation is the use of contrastive learning to align text and visual prompt embeddings in a shared space, fostering a synergy in which each modality strengthens the other's representation. Through this alignment, the model handles varied scenarios by switching between prompt modalities as needed.
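To make the alignment step concrete, below is a minimal sketch of a symmetric InfoNCE-style contrastive loss between paired text and visual prompt embeddings, written in PyTorch. The function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the T-Rex2 implementation.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_loss(text_emb: torch.Tensor,
                          visual_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched prompt pairs (illustrative sketch).

    text_emb, visual_emb: (num_categories, dim) embeddings for the same
    ordered set of categories; row i of each tensor forms a positive pair.
    """
    text_emb = F.normalize(text_emb, dim=-1)       # unit-norm for cosine similarity
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = text_emb @ visual_emb.t() / temperature   # (C, C) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matching text-visual pairs together in both directions;
    # push mismatched pairs apart.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```

Under this objective, the diagonal of the similarity matrix (matched category pairs) is driven up while off-diagonal entries are driven down, which is one standard way such cross-modal alignment is realized.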

Experimental Results

Zero-shot evaluations on COCO, LVIS, ODinW, and Roboflow100 underscore T-Rex2's capabilities. Text prompts prove strongest in common-object scenarios, while visual prompts excel in long-tailed, rare-object contexts. This adaptability is further demonstrated in interactive and generic visual prompt workflows, where T-Rex2 matches or surpasses established baselines, setting a new standard for open-set object detection.
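Because the contrastive alignment places both prompt types in one embedding space, a single scoring path can serve either modality at inference time. The sketch below illustrates that idea: decoded object queries are scored against whichever prompt embeddings are active. The names, shapes, and similarity-based scoring rule are assumptions for illustration, not the published T-Rex API.

```python
import torch
import torch.nn.functional as F

def score_queries(query_emb: torch.Tensor,    # (num_queries, dim) from the box decoder
                  prompt_emb: torch.Tensor,   # (num_prompts, dim) text OR visual prompts
                  temperature: float = 0.07) -> torch.Tensor:
    """Per-query class logits as similarities to the active prompts."""
    query_emb = F.normalize(query_emb, dim=-1)
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    return query_emb @ prompt_emb.t() / temperature  # (num_queries, num_prompts)

# Toy usage: in text mode prompt_emb would come from the CLIP text encoder,
# in visual mode from the visual prompt encoder; the scoring is identical.
queries = torch.randn(900, 256)   # e.g., 900 DETR-style object queries, 256-d
prompts = torch.randn(4, 256)     # 4 active prompt categories
print(score_queries(queries, prompts).shape)  # torch.Size([900, 4])
```

This modality-agnostic scoring is what lets a user switch from naming a common category to supplying a few example boxes for a rare one without changing the detection head.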

Implications and Future Directions

The confluence of text and visual prompts in T-Rex2 marks a significant stride towards achieving generic object detection. It underscores the potential of combining distinct yet complementary modalities to enhance model performance across varied detection scenarios, especially in addressing the challenges of long-tailed object distributions. The success of T-Rex2 paves the way for the exploration of further multimodal integrations and highlights the importance of data synergy in advancing object detection methodologies. Future research may delve into optimizing the alignment process between text and visual prompts and explore the application of T-Rex2’s methodologies to other domains within artificial intelligence and computer vision.

Concluding Remarks

T-Rex2 stands at the intersection of innovation and practicality, offering a scalable and dynamic solution to the ever-evolving challenges of open-set object detection. By elegantly fusing text and visual prompts, it not only broadens the horizon for object detection but also invites a reevaluation of current paradigms, encouraging a more integrated approach to tackling the complexities of real-world visual understanding.
