Vision-by-Language for Training-Free Compositional Image Retrieval (2310.09291v2)
Abstract: Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly triplet annotations (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs) to perform Zero-Shot CIR (ZS-CIR). However, state-of-the-art ZS-CIR approaches still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image with a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification before retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, CIReVL achieves competitive and in part state-of-the-art performance, improving over supervised methods. Moreover, its modularity offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while in parts more than doubling previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text modularly in the language domain, which makes it intervenable and allows failure cases to be re-aligned post hoc. Code will be released upon acceptance.
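A minimal sketch of the three-stage pipeline described above (caption with a generative VLM, recompose with an LLM, retrieve with CLIP), assuming BLIP as the captioner and open_clip for retrieval; the checkpoints, the prompt wording, and the `caption`/`recompose`/`retrieve` helpers are illustrative assumptions, not the paper's released implementation:

```python
# Sketch of a CIReVL-style training-free CIR pipeline. Model choices and
# prompt wording below are assumptions for illustration, not the paper's code.
import torch
import open_clip
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: caption the reference image with a pre-trained generative VLM.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_proc(image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# Stage 2: ask an LLM to recompose the caption given the textual modification.
# `call_llm` stands in for any chat-completion endpoint (hypothetical hook).
def recompose(cap: str, modification: str, call_llm) -> str:
    prompt = (f"Image caption: '{cap}'. Modification: '{modification}'. "
              "Rewrite the caption so it describes the modified image. "
              "Answer with the new caption only.")
    return call_llm(prompt)

# Stage 3: rank gallery images by CLIP text-to-image similarity.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def retrieve(query_text: str, gallery: list[Image.Image], k: int = 5):
    text_feat = clip_model.encode_text(tokenizer([query_text]).to(device))
    imgs = torch.stack([preprocess(im) for im in gallery]).to(device)
    img_feats = clip_model.encode_image(imgs)
    sims = torch.nn.functional.cosine_similarity(text_feat, img_feats)
    return sims.topk(min(k, len(gallery))).indices.tolist()
```

Because every stage is a frozen, off-the-shelf model, any component can be swapped for a larger one without re-training, which is what enables the scaling study described in the abstract.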
- Vqa: Visual question answering. In ICCV, 2015.
- Compositional learning of image-text query for image retrieval. In WACV, 2021.
- Sentence-level prompts benefit composed image retrieval. In ICLR, 2024.
- A clip-hitchhiker’s guide to long video retrieval. arXiv preprint arXiv:2205.08508, 2022.
- Effective conditioned and composed image retrieval combining clip-based features. In CVPR Workshops, 2022.
- Zero-shot composed image retrieval with textual inversion. In ICCV, 2023.
- Towards language models that can see: Computer vision through the lens of natural language. arXiv preprint arXiv:2306.16410, 2023.
- Cross modal retrieval with querybank normalisation. In CVPR, 2022.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Language models are few-shot learners. NeurIPS, 2020.
- Broken neural scaling laws. In ICLR, 2023.
- Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. In SIGGRAPH, 2023.
- Learning joint visual semantic matching embeddings for language-guided retrieval. In ECCV, 2020.
- Image search with text feedback by visiolinguistic attention learning. In CVPR, 2020.
- Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations. In ECCV, 2022.
- ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity. In ICLR, 2022.
- A survey on in-context learning, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Make-a-scene: Scene-based text-to-image generation with human priors. In ECCV, 2022.
- An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
- Compodiff: Versatile composed image retrieval with latent diffusion. arXiv preprint arXiv:2303.11916, 2023.
- Automatic spatially-aware fashion concept discovery. In ICCV, 2017.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In NeurIPS, 2023.
- Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In ICCV, 2023.
- Towards reasoning in large language models: A survey. In ACL Findings, 2023.
- Openclip, 2021. URL https://doi.org/10.5281/zenodo.5143773.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Text encoders are performance bottlenecks in contrastive vision-language models. In EMNLP, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Kg-sp: Knowledge guided simple primitives for open world compositional zero-shot learning. In CVPR, 2022.
- If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection. arXiv preprint arXiv:2305.13308, 2023.
- Cosmo: Content-style modulation for image retrieval with text feedback. In CVPR, 2021.
- Chatting makes perfect: Chat-based image retrieval. arXiv preprint arXiv:2305.20062, 2023.
- Data roaming and early fusion for composed image retrieval. arXiv preprint arXiv:2303.09429, 2023.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- Microsoft coco: Common objects in context. In ECCV, 2014.
- Compositional visual generation with composable diffusion models. In ECCV, 2022.
- Zero-shot composed text-image retrieval. In BMVC, 2023.
- Image retrieval on real-life images with pre-trained vision-and-language models. In ICCV, 2021.
- Open world compositional zero-shot learning. In CVPR, 2021.
- Visual classification via description from large language models. In ICLR, 2023.
- From red wine to red tomato: Composition with context. In CVPR, 2017.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- Learning to predict visual attributes in the wild. In CVPR, 2021.
- What does a platypus look like? generating customized prompts for zero-shot image classification. In ICCV, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Integrating language guidance into vision-based deep metric learning. In CVPR, 2022a.
- Non-isotropy regularization for proxy-based deep metric learning. In CVPR, 2022b.
- Waffling around for performance: Visual classification with random words and broad concepts. In ICCV, 2023.
- Imagenet large scale visual recognition challenge. IJCV, 2015.
- Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In CVPR, 2023.
- Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- Flava: A foundational language and vision alignment model. In CVPR, 2022.
- Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.
- Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Sus-x: Training-free name-only transfer of vision-language models. In ICCV, 2023.
- Genecis: A benchmark for general conditional image similarity. In CVPR, 2023.
- Covr: Learning composed video retrieval from web video captions. In AAAI, 2024.
- Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE T-PAMI, 2016.
- Composing text and image for image retrieval-an empirical odyssey. In CVPR, 2019.
- The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback. In CVPR, 2021.
- Cap4video: What can auxiliary captions do for text-video retrieval? In CVPR, 2023.
- Coca: Contrastive captioners are image-text foundation models. TMLR, 2022.
- Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR, 2023.
- Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023.