
CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval (2405.19149v2)

Published 29 May 2024 in cs.CV, cs.AI, and cs.IR

Abstract: Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.
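
For orientation, the "query-target matching" formulation that the abstract describes as the current standard can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CaLa's implementation: the additive `compose_query` function, the in-batch contrastive loss, the temperature value, and all function names are hypothetical stand-ins, and the hinge-based cross-attention and twin attention-based compositor proposed in the paper are not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def compose_query(image_emb, text_emb):
    """Hypothetical compositor: sum the reference-image and text embeddings.

    Assumption for illustration only; real CIR models use learned fusion.
    """
    return l2_normalize(image_emb + text_emb)


def contrastive_loss(query_emb, target_emb, temperature=0.07):
    """In-batch contrastive loss: row i of query_emb should match row i of target_emb."""
    logits = query_emb @ target_emb.T / temperature
    # Subtract the row-wise max for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def recall_at_k(query_emb, gallery_emb, target_idx, k):
    """Fraction of queries whose true target is among the k most similar gallery items."""
    sims = query_emb @ gallery_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == target_idx[:, None]).any(axis=1)
    return float(hits.mean())
```

In this sketch, each triplet contributes a single constraint (composed query to target image). The abstract's argument is that the same triplet also supports additional constraints, such as aligning the query image with the target image through the text, and recovering the text from the image pair.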

Authors (6)
  1. Xintong Jiang
  2. Yaxiong Wang
  3. Mengjian Li
  4. Yujiao Wu
  5. Bingwen Hu
  6. Xueming Qian
