
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion (2303.11916v4)

Published 21 Mar 2023 in cs.CV and cs.IR

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff


Summary

  • The paper introduces a novel zero-shot composed image retrieval method that uses latent diffusion models with classifier-free guidance.
  • It employs a two-stage training process, pre-training on LAION-2B and fine-tuning on the synthetic SynthTriplets18M dataset, leading to significant benchmark improvements.
  • The approach enables flexible query control and efficient handling of varied conditions, paving the way for scalable and robust CIR systems.

CompoDiff: A Novel Approach to Composed Image Retrieval with Latent Diffusion

This paper introduces CompoDiff, a novel method for zero-shot Composed Image Retrieval (ZS-CIR) that leverages the capabilities of diffusion models in the latent space to generate versatile retrieval queries. The key contributions of this work include CompoDiff's adaptability to various query conditions, such as negative text or masked image features, and its ability to control query strength and inference speed. Additionally, the paper presents SynthTriplets18M, a large synthetic dataset that significantly enhances the generalization capability of CIR models.

Methodology

CompoDiff is built on the framework of latent diffusion models, but it denoises in the latent space of CLIP embeddings rather than in pixel space, which makes each denoising step far cheaper than image-space diffusion. Unlike conventional fusion-based CIR methods, CompoDiff employs a diffusion transformer with classifier-free guidance (CFG), allowing it to handle diverse conditions and to adjust the relative weight of each condition at inference time.
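
To make the guidance mechanism concrete, the sketch below shows multi-condition classifier-free guidance over CLIP latents. It is a minimal illustration, assuming a hypothetical denoiser model(z_t, t, img=..., txt=...) in which None stands for a learned null condition; the exact guidance decomposition and default weights in CompoDiff may differ.

```python
import torch

def cfg_denoise(model, z_t, t, img_cond, txt_cond, w_img=1.5, w_txt=7.5):
    """Multi-condition classifier-free guidance in CLIP embedding space.

    `model` is an assumed noise predictor; passing None for a condition
    plays the role of the learned null embedding.
    """
    eps_uncond = model(z_t, t, img=None, txt=None)        # fully unconditional
    eps_img = model(z_t, t, img=img_cond, txt=None)       # image condition only
    eps_full = model(z_t, t, img=img_cond, txt=txt_cond)  # image + text
    # Each weight controls how strongly its condition pulls the prediction:
    # raising w_txt emphasizes the text edit, raising w_img the reference image.
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_txt * (eps_full - eps_img))
```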

The training of CompoDiff is structured as a two-stage process (a minimal training-step sketch follows the list):

  1. The first stage involves pre-training a text-to-image diffusion model on the LAION-2B dataset, with a focus on learning robust image-text relationships.
  2. The second stage fine-tunes this model using SynthTriplets18M, a dataset specifically constructed for CIR tasks, enhancing its ability to handle complex and varied conditions.
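
As a rough sketch of what each stage optimizes, the denoising step below (with assumed names, not the released code's API) applies to both stages: stage 1 would call it with img_cond=None on LAION-derived caption/image embedding pairs, while stage 2 additionally conditions on the reference-image embedding from each SynthTriplets18M triplet.

```python
import torch
import torch.nn.functional as F

def training_step(model, z0, img_cond, txt_cond, alphas_cumprod, opt):
    """One epsilon-prediction step in CLIP embedding space (illustrative).

    z0: CLIP image embedding of the target image, shape (B, D).
    img_cond / txt_cond: condition embeddings (img_cond=None in stage 1).
    alphas_cumprod: 1-D tensor holding the cumulative noise schedule.
    """
    t = torch.randint(0, alphas_cumprod.numel(), (z0.size(0),), device=z0.device)
    a = alphas_cumprod[t].unsqueeze(-1)               # cumulative alpha_bar_t per sample
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise    # forward diffusion q(z_t | z0)
    pred = model(z_t, t, img=img_cond, txt=txt_cond)  # predict the added noise
    loss = F.mse_loss(pred, noise)                    # standard DDPM objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```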

SynthTriplets18M, composed of 18.8 million synthetic triplets, addresses the scaling limitations of current CIR datasets. These triplets are generated using a combination of keyword substitution in captions and fine-tuned LLMs, ensuring a diverse range of scenarios without the need for extensive human annotation.
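
To make the substitution idea concrete, here is a toy sketch of the caption side of triplet construction; the vocabulary, instruction template, and function name are invented for illustration, and the actual pipeline pairs such edited captions with generated images and relies on fine-tuned LLMs for more varied edits.

```python
import random

# Toy attribute vocabulary; the real pipeline covers objects and many
# attribute types, not just colors.
COLOR_WORDS = ["red", "blue", "green", "black", "white"]

def make_caption_triplet(caption):
    """Return (source caption, edit instruction, target caption) or None.

    A single keyword-substitution rule: swap the first color word found.
    """
    words = caption.split()
    for i, w in enumerate(words):
        if w in COLOR_WORDS:
            new = random.choice([c for c in COLOR_WORDS if c != w])
            target = words.copy()
            target[i] = new
            return caption, f"change the {w} one to {new}", " ".join(target)
    return None  # no editable keyword found

print(make_caption_triplet("a red car parked on the street"))
# e.g. ('a red car parked on the street', 'change the red one to blue',
#       'a blue car parked on the street')
```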

Experimental Results

The results demonstrate CompoDiff's superiority in zero-shot CIR across multiple benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, with significant performance improvements over existing methods such as Pic2Word and SEARLE. CompoDiff achieves state-of-the-art recall and mAP scores on these benchmarks, demonstrating its efficacy on realistic CIR tasks. Moreover, the paper highlights that training standard CIR models on SynthTriplets18M can elevate their performance to competitive levels, showcasing the dataset's value.
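
For reference, the headline retrieval metrics are straightforward to compute. The sketch below shows Recall@K over a query-gallery cosine-similarity matrix (variable names assumed); mAP is computed analogously when a query has multiple relevant targets, as in CIRCO.

```python
import numpy as np

def recall_at_k(sim, gt_index, k=10):
    """Recall@K: fraction of queries whose ground-truth target ranks
    among the top-K gallery items.

    sim: (num_queries, num_gallery) cosine similarities between composed
         query embeddings and gallery image embeddings.
    gt_index: (num_queries,) gallery index of each query's target.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of top-K scores
    hits = (topk == gt_index[:, None]).any(axis=1)  # target found in top-K?
    return float(hits.mean())

# Tiny usage example: 3 queries against a 5-item gallery.
sim = np.random.rand(3, 5)
print(recall_at_k(sim, np.array([0, 3, 4]), k=2))
```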

Implications and Future Directions

CompoDiff's approach points toward more adaptable and scalable CIR systems. Because the weights assigned to different query conditions can be adjusted at inference time, it offers a versatile tool for real-world applications in which the balance between visual and textual intent varies from query to query. SynthTriplets18M, owing to its synthetic nature and vast scale, opens avenues for training robust CIR systems without the conventional bottlenecks of manual data collection.
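
As a usage sketch of that controllability, the loop below reuses the cfg_denoise sketch from the methodology section; the stubs, step count, scheduler update, and weights are all illustrative stand-ins, not CompoDiff's actual settings. Fewer denoising steps buy inference speed at some cost in query quality, while the guidance weights shift emphasis between the reference image and the text edit.

```python
import torch

# Stubs so the sketch runs standalone; a real system would use the trained
# CompoDiff denoiser and a proper DDIM scheduler instead.
model = lambda z, t, img=None, txt=None: torch.zeros_like(z)
img_cond = torch.randn(1, 768)   # CLIP embedding of the reference image
txt_cond = torch.randn(1, 768)   # CLIP embedding of the text condition

def ddim_update(z, eps, t):      # placeholder deterministic update step
    return z - 0.1 * eps

z = torch.randn(1, 768)                      # start from noise in CLIP space
for t in reversed(range(0, 1000, 100)):      # 10 steps: faster, coarser queries
    eps = cfg_denoise(model, z, torch.full((1,), t), img_cond, txt_cond,
                      w_img=2.0,             # lean harder on the reference image
                      w_txt=5.0)             # relative strength of the text edit
    z = ddim_update(z, eps, t)
query = torch.nn.functional.normalize(z, dim=-1)  # retrieve by cosine similarity
```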

Looking forward, further work is warranted on refining the controllability mechanisms of diffusion models and on extending them beyond retrieval, for example to personalized recommendation and interactive search systems. Additionally, because CompoDiff outputs CLIP image embeddings, they are compatible with unCLIP-style generation, which could enrich user experience in digital media exploration, although ethical considerations around unintended usage must be vigilantly managed.

In summary, CompoDiff represents a substantial advancement in CIR methodology, combining latent diffusion with an expansive synthetic dataset to unlock new levels of retrieval versatility and efficacy.