
Data Roaming and Quality Assessment for Composed Image Retrieval (2303.09429v2)

Published 16 Mar 2023 in cs.CV

Abstract: The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision-and-language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset ten times larger than existing ones. Pre-training on LaSCo yields a noteworthy improvement in performance, even in zero-shot settings. Furthermore, we propose a new approach for analyzing CoIR datasets and methods that detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline fuses modalities early via a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms current state-of-the-art methods on established benchmarks such as FashionIQ and CIRR.
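The core idea behind CASE is early fusion: rather than encoding the query image and the modification text separately and combining the resulting vectors, text tokens attend directly over image patch embeddings through cross-attention. The sketch below illustrates this fusion pattern with plain numpy; it is a minimal illustration only, with hypothetical shapes and a simple mean-pool, not the trained transformer encoder used in the actual CASE model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text_tokens, image_patches):
    """Early fusion: text tokens (queries) attend over image patches (keys/values).

    text_tokens:   (T, d) token embeddings of the modification text
    image_patches: (P, d) patch embeddings of the reference image
    Returns one fused query vector of shape (d,) for retrieval.
    """
    d = text_tokens.shape[1]
    scores = text_tokens @ image_patches.T / np.sqrt(d)  # (T, P) scaled dot-product
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    fused_tokens = attn @ image_patches                  # (T, d) image-conditioned tokens
    # Pool tokens into a single "shifted" embedding to match against target images.
    return fused_tokens.mean(axis=0)

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))     # 5 hypothetical text tokens, dim 16
patches = rng.normal(size=(9, 16))  # 9 hypothetical image patches, dim 16
fused = cross_attention_fusion(text, patches)
print(fused.shape)  # (16,)
```

In a real system the fused vector would be compared (e.g. by cosine similarity) against an index of candidate-image embeddings; here the point is only that fusion happens at the token level, before any pooling, which is what distinguishes early fusion from late-fusion baselines.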
