
Vision-by-Language for Training-Free Compositional Image Retrieval (2310.09291v2)

Published 13 Oct 2023 in cs.CV

Abstract: Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on annotating triplets (i.e., query image, textual modification, and target image), which is costly, recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with LLMs. By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without retraining, allowing us both to investigate scaling laws and bottlenecks for ZS-CIR and to scale up easily, in parts more than doubling previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
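The abstract describes a three-step, training-free pipeline: caption the reference image with a generative VLM, have an LLM rewrite the caption according to the modification text, and rank the gallery with CLIP text-to-image similarity. The sketch below illustrates that flow under stated assumptions; the specific checkpoints (BLIP-2, an OpenCLIP ViT-B/32) and the `llm` callable are illustrative choices, not the authors' released implementation.

```python
# Minimal training-free CIR sketch in the spirit of CIReVL:
# caption -> LLM recomposition -> CLIP retrieval.
# Model choices below are assumptions for illustration only.
import torch
import open_clip
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the reference image with a pre-trained generative VLM.
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True).strip()

# 2) Ask an LLM to recompose the caption given the target modification.
#    `llm` stands in for any text-in/text-out chat-completion call
#    (e.g., GPT-4); it is a hypothetical helper, not a real API here.
def recompose(cap: str, modification: str, llm) -> str:
    prompt = (f"Image description: '{cap}'. "
              f"Modify it as follows: '{modification}'. "
              f"Return only the new one-sentence description.")
    return llm(prompt)

# 3) Retrieve from the gallery via CLIP text-to-image similarity.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def retrieve(query_text: str, gallery: list[Image.Image], top_k: int = 5):
    text_feat = clip_model.encode_text(tokenizer([query_text]).to(device))
    imgs = torch.stack([clip_preprocess(im) for im in gallery]).to(device)
    img_feats = clip_model.encode_image(imgs)
    # Cosine similarity between the recomposed caption and each gallery image.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    sims = (img_feats @ text_feat.T).squeeze(-1)
    return sims.topk(min(top_k, len(gallery))).indices.tolist()
```

Because all reasoning happens in the language domain, the intermediate caption and the recomposed query are human-readable, which is what makes the pipeline inspectable and, as the abstract notes, intervenable: a failure case can be fixed post hoc by editing the recomposed text before retrieval.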

