Language-conditioned Detection Transformer (2311.17902v1)

Published 29 Nov 2023 in cs.CV

Abstract: We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance in open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS, COCO, Object365, and OpenImages. DECOLA outperforms the prior arts by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes, architectures, and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA.

Overview of "Language-conditioned Detection Transformer"

The paper, "Language-conditioned Detection Transformer," presents a novel open-vocabulary detection framework that integrates language semantics into the training and operation of object detectors. The framework addresses a key limitation of traditional object detectors, which are constrained to a fixed set of predefined classes, by leveraging the generalization capabilities of vision-language models to support open-vocabulary detection.

Framework and Methodology

The framework operates in a three-step process:

  1. Language-conditioned Training: Initially, a language-conditioned object detector is trained on fully-supervised detection data. During training, the detector is told which ground-truth classes are present in each image and conditions its predictions on that set, focusing detection on the classes it is asked to find.
  2. Pseudo-label Generation: The conditioned detector is then employed to generate pseudo-labels for images annotated with only image-level labels. This step significantly benefits from the conditioning mechanism, enabling the framework to produce more accurate pseudo-labels than previous approaches.
  3. Unconditioned Open-vocabulary Training: Finally, an unconditioned open-vocabulary detector is trained on the pseudo-annotations generated in the previous step. This detector is capable of zero-shot detection across various benchmarks (LVIS, COCO, Object365, and OpenImages).
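The pipeline above can be sketched in a few lines. This is a minimal, illustrative toy, not the paper's implementation: the scoring interface, function names, and thresholding are assumptions made for clarity, and real detectors operate on boxes and features rather than flat class-score dictionaries.

```python
# Hypothetical sketch of DECOLA's first two steps (names and the toy
# scoring interface are illustrative, not the paper's implementation).

from dataclasses import dataclass


@dataclass
class PseudoLabel:
    label: str
    score: float


def conditioned_detect(raw_scores, present_classes, threshold=0.5):
    """Steps 1-2: a language-conditioned detector scores only the classes
    known to be present in the image. Restricting the label space this way
    is what makes its pseudo-labels cleaner than an unconditioned detector's.
    """
    return [
        PseudoLabel(label, score)
        for label, score in raw_scores.items()
        if label in present_classes and score >= threshold
    ]


def pseudo_label_dataset(images, detector, threshold=0.5):
    """Step 2: run the conditioned detector over weakly-labeled images
    (image-level tags only) to produce box-level pseudo-annotations.
    `images` maps an image id to (raw class scores, image-level tags).
    """
    return {
        name: detector(scores, tags, threshold)
        for name, (scores, tags) in images.items()
    }


# Step 3 (not shown): train an unconditioned open-vocabulary detector
# on the pseudo-annotated images produced above.
```

For instance, an image tagged only `{cat, car}` would keep high-scoring "cat" and "car" detections as pseudo-labels while discarding a spurious "dog" score, mirroring how conditioning filters out classes absent from the image-level labels.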

Results and Performance

The resulting detector, DECOLA, shows remarkable zero-shot gains on the open-vocabulary LVIS benchmark and achieves state-of-the-art results across diverse model sizes and architectures. Notably, DECOLA surpasses existing methods by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark. These numbers underscore the enhanced generalization that language conditioning introduces to the detector.

Implications and Future Directions

From a practical standpoint, DECOLA's framework holds significant potential for scalable, adaptable object detection systems in real-world applications, particularly where comprehensive labeled data for every new concept is infeasible to obtain. Theoretically, this work expands the intersection of natural language processing and computer vision, suggesting future research paths that tightly integrate language understanding into core vision tasks.

Moving forward, future developments in AI may see further refinements in language-conditioned learning mechanisms, contributing to more sophisticated object detection systems capable of understanding nuanced instructions or descriptions. Additionally, as models and datasets grow, addressing computational scaling and efficiency becomes crucial to maintaining applicability in large-scale, resource-limited settings.

In conclusion, the paper provides a valuable contribution to open-vocabulary object detection by exploiting language-conditioned learning, offering both a methodological innovation and a practical tool for advancing vision-language integration in AI systems.

Authors (2)
  1. Jang Hyun Cho (9 papers)
  2. Philipp Krähenbühl (55 papers)