MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval (2410.23736v1)
Abstract: Composed Image Retrieval (CIR) is a challenging vision-language task that uses bi-modal (image + text) queries to retrieve target images. Despite the impressive performance of supervised CIR, its dependence on costly, manually labeled triplets limits scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) has been introduced, typically tackled with projection-based approaches. However, such methods face two major problems: a task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image + text $\rightarrow$ image), and a modality discrepancy. The latter affects approaches trained with text-only projection, since features must still be extracted from the reference image at inference time. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, in which large language models (LLMs) generate triplet data for fine-tuning and prompt learning is introduced in a multi-modal context to effectively alleviate both the modality and task discrepancies. Experimental results show that MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks while maintaining low training time and computational cost. The code will be released soon.
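To make the projection-based setup concrete, here is a minimal PyTorch sketch (not the authors' implementation) of a textual inversion network: a small MLP, here called `Phi`, maps a frozen CLIP image embedding to a pseudo-word token embedding that can be spliced into a prompt such as "a photo of $ that <modification text>". All module names, dimensions, and the prompt template are illustrative assumptions.

```python
# Illustrative sketch of a projection-based textual inversion network.
# Names (Phi), dimensions, and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

class Phi(nn.Module):
    """Maps a frozen image embedding to a pseudo-word token embedding."""
    def __init__(self, clip_dim: int = 768, token_dim: int = 768, hidden: int = 3072):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, clip_dim) global embedding from a frozen CLIP image encoder
        return self.mlp(image_feat)  # (B, token_dim) pseudo-word token embedding

if __name__ == "__main__":
    phi = Phi()
    fake_image_feat = torch.randn(4, 768)  # stand-in for CLIP image features
    pseudo_tokens = phi(fake_image_feat)
    print(pseudo_tokens.shape)  # torch.Size([4, 768])
```

In this kind of pipeline, the pseudo token typically replaces a placeholder in the text prompt, and the resulting text embedding is contrastively aligned with the target-image embedding; the second-stage fine-tuning described in the abstract would then adapt prompts on LLM-generated triplets rather than retraining the backbone.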