Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval (2404.15516v1)
Abstract: Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image). These triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be trained relatively easily with image-caption pairs, without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for a reference image and its related target images, and train our LLM-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on CIR benchmarks.
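To make the pseudo-triplet idea concrete, below is a minimal sketch (not the authors' code) of how related reference/target pairs could be mined from auxiliary data and passed to a visual-delta generator. All names and thresholds here (mine_pseudo_triplets, vdg_generate_delta, sim_low, sim_high) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of pseudo-triplet mining with a Visual Delta Generator.
# Assumptions: image embeddings come from a frozen vision encoder, and
# vdg_generate_delta is a hypothetical callable wrapping the LLM-based VDG.
import numpy as np

def mine_pseudo_triplets(image_ids, image_embs, vdg_generate_delta,
                         sim_low=0.6, sim_high=0.9):
    """For each auxiliary image, pick a visually related (but not identical)
    partner and ask the VDG to describe their visual difference.

    image_ids  : list of N image identifiers
    image_embs : (N, D) array of L2-normalized image embeddings
    vdg_generate_delta : callable(ref_id, tgt_id) -> str, producing the
                 visual-delta text for the pair (assumed interface)
    """
    sims = image_embs @ image_embs.T          # cosine similarity matrix
    np.fill_diagonal(sims, -1.0)              # exclude self-pairs
    triplets = []
    for i, ref_id in enumerate(image_ids):
        j = int(np.argmax(sims[i]))           # most similar other image
        # keep pairs that are related but not near-duplicates
        if sim_low <= sims[i, j] <= sim_high:
            tgt_id = image_ids[j]
            delta_text = vdg_generate_delta(ref_id, tgt_id)
            triplets.append((ref_id, delta_text, tgt_id))
    return triplets
```

In this reading, the resulting (reference, visual delta, target) triplets would be mixed with the labeled training data of any off-the-shelf CIR model, which is consistent with the paper's claim that the VDG is model agnostic.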