MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval (2410.23736v1)

Published 31 Oct 2024 in cs.CV and cs.IR

Abstract: Composed Image Retrieval (CIR) is a challenging vision-language task that uses bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, its dependence on costly, manually labeled triplets limits scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) has been proposed, along with projection-based approaches. However, such methods face two major problems: a task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and a modality discrepancy. The latter arises in approaches trained with text-only projection, since features must still be extracted from the reference image at inference time. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, in which large language models (LLMs) generate triplet data for fine-tuning, and prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. Experimental results show that MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks while maintaining low training time and computational cost. The code will be released soon.
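To make the pipeline the abstract describes concrete, below is a minimal PyTorch sketch of a projection-based ZS-CIR system: a textual inversion network maps a frozen image embedding to a pseudo-word token (stage 1), which is then composed with the modification text into a single text query at inference. All module names, dimensions, and the `encode_text_with_pseudo_token` helper are illustrative assumptions, not the authors' implementation (their code was unreleased at the time of writing).

```python
# Minimal sketch of a projection-based ZS-CIR pipeline like the one the
# abstract describes. All names, dimensions, and the MLP design are
# illustrative assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextualInversionNetwork(nn.Module):
    """Stage 1: map a frozen image embedding to a pseudo-word token
    embedding, pre-trained on large-scale caption data."""

    def __init__(self, img_dim: int = 768, token_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, img_dim),
            nn.GELU(),
            nn.Linear(img_dim, token_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # (B, img_dim) -> (B, token_dim)
        return self.mlp(image_emb)


def zs_cir_scores(ref_image_emb, modification_text, target_image_embs,
                  inversion_net, encode_text_with_pseudo_token):
    """Inference (image + text -> image): the reference image becomes a
    pseudo-word S*, composed with the modification text into one query.

    `encode_text_with_pseudo_token` is a hypothetical helper that runs a
    CLIP-style text encoder with the pseudo-token embedding spliced into
    a template such as "a photo of S* that <modification_text>".
    """
    pseudo_token = inversion_net(ref_image_emb)          # (B, token_dim)
    query = encode_text_with_pseudo_token(pseudo_token, modification_text)
    query = F.normalize(query, dim=-1)                   # (B, D)
    targets = F.normalize(target_image_embs, dim=-1)     # (N, D)
    return query @ targets.T                             # (B, N) cosine sims
```

In stage 2, MoTaDual would additionally fine-tune learnable prompt vectors on LLM-generated triplets to close the remaining modality and task gaps; the retrieval interface sketched above would stay unchanged.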

