Zero-shot Composed Text-Image Retrieval
Abstract: In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express what they are searching for. We make the following contributions: (i) we propose a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/