DesCo: Learning Object Recognition with Rich Language Descriptions (2306.14060v1)

Published 24 Jun 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models' adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that include specifications of fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle these challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions, consisting of two major innovations: 1) we employ a large language model (LLM) as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model's ability to decipher the intricate nuances embedded within descriptions and force the model to focus on context rather than object names alone. On two novel-object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
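
The abstract's two innovations translate naturally into a tiny pipeline: prompt an LLM for an attribute-rich description of each object, then mask the object name in that description so the detector must read the context. The Python sketch below is a minimal, hypothetical illustration of that recipe; the prompt wording, the `generate` stub, and the masking scheme are assumptions made for this example, not the paper's actual implementation.

```python
# Hypothetical sketch of a DesCo-style recipe: (1) prompt an LLM to turn an
# object name plus the raw image caption into a rich description, and
# (2) build a context-sensitive query that hides the object name so a
# detector cannot shortcut by matching the name alone.

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned answer so the sketch runs.
    Swap in a real LLM client in practice."""
    return ("the mountain bike is a two-wheeled vehicle with a sturdy "
            "frame, knobby tires, and flat handlebars")

def describe_object(name: str, caption: str) -> str:
    """Ask the LLM for an attribute-level description of `name`,
    grounded in the raw image-text caption."""
    prompt = (
        f"Image caption: {caption}\n"
        f"Describe the appearance of a '{name}' in one sentence, covering "
        f"its shape, texture, attributes, and relations to nearby objects."
    )
    return generate(prompt)

def context_sensitive_query(name: str, description: str) -> str:
    """Mask the object name with a neutral token so the query's context,
    not the name, must carry the signal."""
    return description.replace(name, "object")

if __name__ == "__main__":
    caption = "a man riding a mountain bike down a rocky trail"
    desc = describe_object("mountain bike", caption)
    print("description:", desc)
    print("query:", context_sensitive_query("mountain bike", desc))
```

In practice, the canned `generate` stub would be replaced by a real LLM call, and the resulting name-masked queries would feed a grounding-based detector such as GLIP or FIBER.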

References (54)
  1. Zero-shot object detection. In ECCV, 2018.
  2. Language models are few-shot learners. In NeurIPS, 2020.
  3. X-DETR: A versatile architecture for instance-wise vision-language tasks. In ECCV, 2022.
  4. Dynamic Head: Unifying object detection heads with attentions. In CVPR, 2021.
  5. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  7. Coarse-to-fine vision-language pre-training with fusion in the backbone. In NeurIPS, 2022.
  8. Open-vocabulary image segmentation. In ECCV, 2022.
  9. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
  10. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  11. Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In CVPR, 2022.
  12. Graeme Hirst. Anaphora in natural language understanding: a survey. 1981.
  13. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  14. GQA: a new dataset for compositional question answering over real-world images. In CVPR, 2019.
  15. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In CVPR, 2022.
  16. MDETR: Modulated detection for end-to-end multi-modal understanding. In CVPR, 2021.
  17. Webly supervised concept expansion for general purpose vision models. In ECCV, 2022.
  18. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
  19. Learning multiple layers of features from tiny images. 2009.
  20. Language-driven semantic segmentation. In ICLR, 2022.
  21. ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS, 2022.
  22. Grounded language-image pre-training. In CVPR, 2022.
  23. Microsoft COCO: Common objects in context. In ECCV, 2014.
  24. Visual instruction tuning. arXiv preprint, 2023.
  25. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint, 2023.
  26. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint, 2019.
  27. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  28. Visual classification via description from large language models. In ICLR, 2023.
  29. Simple open-vocabulary object detection with vision transformers. In ECCV, 2022.
  30. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
  31. The world of an octopus: How reporting bias influences a language model’s perception of color. In EMNLP, 2021.
  32. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
  33. Learning transferable visual models from natural language supervision. In ICML, 2021.
  34. OmniLabel: A challenging benchmark for language-based object detection. arXiv preprint, 2023.
  35. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
  36. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  37. K-LITE: Learning transferable visual models with external knowledge. In NeurIPS, 2022.
  38. Zero-shot learning through cross-modal transfer. In NeurIPS, 2013.
  39. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.
  40. LLaMA: Open and efficient foundation language models. arXiv preprint, 2023.
  41. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint, 2023.
  42. GroupViT: Semantic segmentation emerges from text supervision. In CVPR, 2022.
  43. Language in a Bottle: Language model guided concept bottlenecks for interpretable image classification. In CVPR, 2023.
  44. DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In NeurIPS, 2022.
  45. Meta-learning without memorization. In ICLR, 2020.
  46. Modeling context in referring expressions. In ECCV, 2016.
  47. When and why vision-language models behave like bag-of-words models, and what to do about it? In ICLR, 2023.
  48. GLIPv2: Unifying localization and vision-language understanding. In NeurIPS, 2022.
  49. OmDet: Language-aware object detection with large-scale vision-language multi-dataset pre-training. arXiv preprint, 2022.
  50. RegionCLIP: Region-based language-image pretraining. In CVPR, 2022.
  51. Scene parsing through ADE20K dataset. In CVPR, 2017.
  52. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
  53. Generalized decoding for pixel, image, and language. In CVPR, 2023.
  54. Segment everything everywhere all at once. arXiv preprint, 2023.
Authors (4)
  1. Liunian Harold Li (19 papers)
  2. Zi-Yi Dou (33 papers)
  3. Nanyun Peng (205 papers)
  4. Kai-Wei Chang (292 papers)
Citations (15)