iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval (2405.02951v1)
Abstract: Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
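The retrieval pipeline sketched in the abstract can be illustrated with a short, hedged code example: a textual inversion network maps the CLIP features of the reference image to a pseudo-word token embedding, which replaces a placeholder word inside the prompt "a photo of $ that {relative caption}"; the resulting text features then rank the gallery by cosine similarity. The snippet below is a minimal sketch built on the public OpenAI CLIP package, not the authors' released code: the mapping network `phi`, the "$" placeholder, and the prompt template are assumptions for illustration.

```python
# Minimal inference-time sketch of textual-inversion-based ZS-CIR with CLIP.
# Assumptions (not the paper's code): `phi` is a hypothetical pre-trained network
# mapping CLIP image features to a pseudo-word token embedding, and "$" is the
# placeholder word whose embedding gets replaced in the tokenized prompt.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def encode_text_with_pseudo_token(tokens, pseudo_embedding, placeholder_id):
    """Run CLIP's text encoder after swapping the placeholder token's
    embedding with the pseudo-word embedding predicted from the image."""
    x = model.token_embedding(tokens).type(model.dtype)   # [B, n_ctx, d_model]
    x[tokens == placeholder_id] = pseudo_embedding.type(model.dtype)
    x = x + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x).type(model.dtype)
    # Take the features at the EOT token (highest token id), as CLIP does.
    return x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection

@torch.no_grad()
def zs_cir_query(reference_image, relative_caption, index_features, phi):
    # 1) Map the reference image (a PIL image) to a pseudo-word token embedding.
    image_features = model.encode_image(preprocess(reference_image).unsqueeze(0).to(device))
    pseudo_embedding = phi(image_features.float())        # hypothetical mapping network

    # 2) Build the composed query "a photo of $ that <relative caption>".
    tokens = clip.tokenize(f"a photo of $ that {relative_caption}").to(device)
    placeholder_id = clip.tokenize("$")[0, 1].item()      # assumes "$" is a single BPE token
    query_features = encode_text_with_pseudo_token(tokens, pseudo_embedding, placeholder_id)

    # 3) Rank the gallery (precomputed CLIP image features) by cosine similarity.
    query_features = torch.nn.functional.normalize(query_features.float(), dim=-1)
    index_features = torch.nn.functional.normalize(index_features.float(), dim=-1)
    return (query_features @ index_features.T).squeeze(0).argsort(descending=True)
```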
- N. Vo, L. Jiang, C. Sun, K. Murphy, L.-J. Li, L. Fei-Fei, and J. Hays, “Composing text and image for image retrieval-an empirical odyssey,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6439–6448.
- Z. Liu, C. Rodriguez-Opazo, D. Teney, and S. Gould, “Image retrieval on real-life images with pre-trained vision-and-language models,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2125–2134.
- A. Baldrati, M. Bertini, T. Uricchio, and A. Del Bimbo, “Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4959–4968.
- ——, “Effective conditioned and composed image retrieval combining CLIP-based features,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21466–21474.
- ——, “Composed image retrieval using contrastive learning and task-oriented CLIP-based features,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 3, pp. 1–24, 2023.
- H. Wen, X. Zhang, X. Song, Y. Wei, and L. Nie, “Target-guided composed image retrieval,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 915–923.
- G. Delmas, R. S. Rezende, G. Csurka, and D. Larlus, “ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity,” in Proc. of International Conference on Learning Representations (ICLR), 2022.
- S. Lee, D. Kim, and B. Han, “CoSMo: Content-style modulation for image retrieval with text feedback,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 802–812.
- A. Baldrati, L. Agnolucci, M. Bertini, and A. Del Bimbo, “Zero-shot composed image retrieval with textual inversion,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15338–15347.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. of International Conference on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
- R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in Proc. of International Conference on Learning Representations (ICLR), 2023.
- T. L. Berg, A. C. Berg, and J. Shih, “Automatic attribute discovery and characterization from noisy web data,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2010, pp. 663–676.
- X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis, “Automatic spatially-aware fashion concept discovery,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017, pp. 1463–1471.
- H. Wu, Y. Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, and R. Feris, “Fashion IQ: A new dataset towards retrieving images by natural language feedback,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11307–11317.
- X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, and R. Feris, “Dialog-based interactive image retrieval,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018.
- M. Forbes, C. Kaeser-Chen, P. Sharma, and S. Belongie, “Neural naturalist: Generating fine-grained image comparisons,” in Proc. of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 708–717.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
- K. Saito, K. Sohn, X. Zhang, C.-L. Li, C.-Y. Lee, K. Saenko, and T. Pfister, “Pic2Word: Mapping pictures to words for zero-shot composed image retrieval,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19305–19314.
- V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 17612–17625.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2015, pp. 2425–2433.
- Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14974–14983.
- M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10578–10587.
- X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 17980–17989.
- M. Barraco, S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, “With a little help from your own past: Prototypical memory networks for image captioning,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3021–3031.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
- C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, R. Gontijo-Lopes, B. K. Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2022.
- C. Meng, R. Rombach, R. Gao, D. Kingma, S. Ermon, J. Ho, and T. Salimans, “On distillation of guided diffusion models,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14297–14306.
- M. Shin, Y. Cho, B. Ko, and G. Gu, “RTIC: Residual learning for text and image composition using graph convolutional network,” arXiv preprint arXiv:2104.03015, 2021.
- M. Levy, R. Ben-Ari, N. Darshan, and D. Lischinski, “Data roaming and early fusion for composed image retrieval,” arXiv preprint arXiv:2303.09429, 2023.
- L. Ventura, A. Yang, C. Schmid, and G. Varol, “CoVR: Learning composed video retrieval from web video captions,” arXiv preprint arXiv:2308.14746, 2023.
- Y. Liu, J. Yao, Y. Zhang, Y. Wang, and W. Xie, “Zero-shot composed text-image retrieval,” arXiv preprint arXiv:2306.07272, 2023.
- J. Chen and H. Lai, “Pretrain like you inference: Masked tuning improves zero-shot composed image retrieval,” arXiv preprint arXiv:2311.07622, 2023.
- W. Li, H. Fan, Y. Wong, M. Kankanhalli, and Y. Yang, “CAT-LLM: Context-aware training enhanced large language models for multi-modal contextual image retrieval,” 2024. [Online]. Available: https://openreview.net/forum?id=J88EKENxyF
- S. Karthik, K. Roth, M. Mancini, and Z. Akata, “Vision-by-language for training-free compositional image retrieval,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=EDPxCjXzSb
- S. Sun, F. Ye, and S. Gong, “Training-free zero-shot composed image retrieval with local concept reranking,” arXiv preprint arXiv:2312.08924, 2023.
- Y. Tang, J. Yu, K. Gai, J. Zhuang, G. Xiong, Y. Hu, and Q. Wu, “Context-I2W: Mapping images to context-dependent words for accurate zero-shot composed image retrieval,” arXiv preprint arXiv:2309.16137, 2023.
- G. Gu, S. Chun, W. Kim, Y. Kang, and S. Yun, “Language-only efficient training of zero-shot composed image retrieval,” arXiv preprint arXiv:2312.01998, 2023.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1877–1901.
- N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22500–22510.
- N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, “Multi-concept customization of text-to-image diffusion,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- G. Daras and A. Dimakis, “Multiresolution textual inversion,” in NeurIPS 2022 Workshop on Score-Based Methods, 2022.
- N. Cohen, R. Gal, E. A. Meirom, G. Chechik, and Y. Atzmon, “‘This is my unicorn, Fluffy’: Personalizing frozen vision-language representations,” in Proc. of the European Conference on Computer Vision (ECCV), 2022.
- B. Korbar and A. Zisserman, “Personalised CLIP or: how to find your vacation videos,” in Proc. of the British Machine Vision Conference (BMVC), 2022.
- P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
- A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
- L. Beyer, X. Zhai, A. Royer, L. Markeeva, R. Anil, and A. Kolesnikov, “Knowledge distillation: A good teacher is patient and consistent,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10925–10934.
- G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
- A. Chawla, H. Yin, P. Molchanov, and J. Alvarez, “Data-free knowledge distillation for object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3289–3298.
- A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” arXiv preprint arXiv:2311.17042, 2023.
- S. Gu, C. Clark, and A. Kembhavi, “I can’t believe there’s no images! Learning visual tasks using only language supervision,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2672–2683.
- O. Tov, Y. Alaluf, Y. Nitzan, O. Patashnik, and D. Cohen-Or, “Designing an encoder for stylegan image manipulation,” ACM Transactions on Graphics (TOG), vol. 40, no. 4, pp. 1–14, 2021.
- J. Zhu, Y. Shen, D. Zhao, and B. Zhou, “In-domain GAN inversion for real image editing,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2020, pp. 592–608.
- A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” International Journal of Computer Vision (IJCV), vol. 128, no. 7, pp. 1956–1981, 2020.
- D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. of International Conference on Machine Learning (ICML). PMLR, 2020, pp. 1597–1607.
- J. D. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” in Proc. of International Conference on Learning Representations (ICLR), 2021.
- Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 21798–21809.
- J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.
- R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Encoder-based domain tuning for fast personalization of text-to-image models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–13, 2023.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, pp. 211–252, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8340–8349.
- I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. of International Conference on Learning Representations (ICLR), 2019.
- J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “CoCa: Contrastive captioners are image-text foundation models,” Transactions on Machine Learning Research, 2022. [Online]. Available: https://arxiv.org/abs/2205.01917
- Z. Wang, A. Chen, F. Hu, and X. Li, “Learn to understand negation in video retrieval,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 434–443.
- M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou, “When and why vision-language models behave like bags-of-words, and what to do about it?” in The Eleventh International Conference on Learning Representations, 2022.
- G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 595–604.
- T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
- Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in Proc. of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.