
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion (2303.11916v4)

Published 21 Mar 2023 in cs.CV and cs.IR

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff


Summary

  • The paper introduces a novel zero-shot composed image retrieval method that uses latent diffusion models with classifier-free guidance.
  • It employs a two-stage training process, pre-training on LAION-2B and fine-tuning on the synthetic SynthTriplets18M dataset, leading to significant benchmark improvements.
  • The approach enables flexible query control and efficient handling of varied conditions, paving the way for scalable and robust CIR systems.

CompoDiff: A Novel Approach to Composed Image Retrieval with Latent Diffusion

This paper introduces CompoDiff, a novel method for zero-shot Composed Image Retrieval (ZS-CIR) that leverages the capabilities of diffusion models in the latent space to generate versatile retrieval queries. The key contributions of this work include CompoDiff's adaptability to various query conditions, such as negative text or masked image features, and its ability to control query strength and inference speed. Additionally, the paper presents SynthTriplets18M, a large synthetic dataset that significantly enhances the generalization capability of CIR models.

Methodology

CompoDiff is built on the framework of latent diffusion models, but it denoises in the latent space of CLIP embeddings rather than in pixel space, which makes each denoising step far cheaper than image-space diffusion. Unlike conventional fusion-based CIR methods, CompoDiff employs a diffusion transformer with classifier-free guidance (CFG), allowing it to handle diverse conditions and to adjust the relative weight of each condition at inference time.
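
To make the guidance mechanism concrete, the sketch below shows multi-condition classifier-free guidance over CLIP latents. It is a minimal illustration, assuming a hypothetical denoiser model(z_t, t, img=..., txt=...) in which None stands for a learned null condition; the exact guidance decomposition and default weights in CompoDiff may differ.

```python
import torch

def cfg_denoise(model, z_t, t, img_cond, txt_cond, w_img=1.5, w_txt=7.5):
    """Multi-condition classifier-free guidance in CLIP embedding space.

    `model` is an assumed noise predictor; passing None for a condition
    plays the role of the learned null embedding.
    """
    eps_uncond = model(z_t, t, img=None, txt=None)        # fully unconditional
    eps_img = model(z_t, t, img=img_cond, txt=None)       # image condition only
    eps_full = model(z_t, t, img=img_cond, txt=txt_cond)  # image + text
    # Each weight controls how strongly its condition pulls the prediction:
    # raising w_txt emphasizes the text edit, raising w_img the reference image.
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_txt * (eps_full - eps_img))
```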

The training of CompoDiff is structured as a two-stage process (a minimal training-step sketch follows the list):

  1. The first stage involves pre-training a text-to-image diffusion model on the LAION-2B dataset, with a focus on learning robust image-text relationships.
  2. The second stage fine-tunes this model using SynthTriplets18M, a dataset specifically constructed for CIR tasks, enhancing its ability to handle complex and varied conditions.
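
As a rough sketch of what each stage optimizes, the denoising step below (with assumed names, not the released code's API) applies to both stages: stage 1 would call it with img_cond=None on LAION-derived caption/image embedding pairs, while stage 2 additionally conditions on the reference-image embedding from each SynthTriplets18M triplet.

```python
import torch
import torch.nn.functional as F

def training_step(model, z0, img_cond, txt_cond, alphas_cumprod, opt):
    """One epsilon-prediction step in CLIP embedding space (illustrative).

    z0: CLIP image embedding of the target image, shape (B, D).
    img_cond / txt_cond: condition embeddings (img_cond=None in stage 1).
    alphas_cumprod: 1-D tensor holding the cumulative noise schedule.
    """
    t = torch.randint(0, alphas_cumprod.numel(), (z0.size(0),), device=z0.device)
    a = alphas_cumprod[t].unsqueeze(-1)               # cumulative alpha_bar_t per sample
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise    # forward diffusion q(z_t | z0)
    pred = model(z_t, t, img=img_cond, txt=txt_cond)  # predict the added noise
    loss = F.mse_loss(pred, noise)                    # standard DDPM objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```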

SynthTriplets18M, composed of 18.8 million synthetic triplets, addresses the scaling limitations of current CIR datasets. These triplets are generated using a combination of keyword substitution in captions and fine-tuned LLMs, ensuring a diverse range of scenarios without the need for extensive human annotation.
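
To make the substitution idea concrete, here is a toy sketch of the caption side of triplet construction; the vocabulary, instruction template, and function name are invented for illustration, and the actual pipeline pairs such edited captions with generated images and relies on fine-tuned LLMs for more varied edits.

```python
import random

# Toy attribute vocabulary; the real pipeline covers objects and many
# attribute types, not just colors.
COLOR_WORDS = ["red", "blue", "green", "black", "white"]

def make_caption_triplet(caption):
    """Return (source caption, edit instruction, target caption) or None.

    A single keyword-substitution rule: swap the first color word found.
    """
    words = caption.split()
    for i, w in enumerate(words):
        if w in COLOR_WORDS:
            new = random.choice([c for c in COLOR_WORDS if c != w])
            target = words.copy()
            target[i] = new
            return caption, f"change the {w} one to {new}", " ".join(target)
    return None  # no editable keyword found

print(make_caption_triplet("a red car parked on the street"))
# e.g. ('a red car parked on the street', 'change the red one to blue',
#       'a blue car parked on the street')
```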

Experimental Results

The results demonstrate CompoDiff's superiority in zero-shot CIR across multiple benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, with significant performance improvements over existing methods such as Pic2Word and SEARLE. CompoDiff achieves state-of-the-art recall and mAP scores on these benchmarks, demonstrating its efficacy on realistic CIR tasks. Moreover, the paper highlights that training standard CIR models on SynthTriplets18M can elevate their performance to competitive levels, showcasing the dataset's value.
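
For reference, the headline retrieval metrics are straightforward to compute. The sketch below shows Recall@K over a query-gallery cosine-similarity matrix (variable names assumed); mAP is computed analogously when a query has multiple relevant targets, as in CIRCO.

```python
import numpy as np

def recall_at_k(sim, gt_index, k=10):
    """Recall@K: fraction of queries whose ground-truth target ranks
    among the top-K gallery items.

    sim: (num_queries, num_gallery) cosine similarities between composed
         query embeddings and gallery image embeddings.
    gt_index: (num_queries,) gallery index of each query's target.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of top-K scores
    hits = (topk == gt_index[:, None]).any(axis=1)  # target found in top-K?
    return float(hits.mean())

# Tiny usage example: 3 queries against a 5-item gallery.
sim = np.random.rand(3, 5)
print(recall_at_k(sim, np.array([0, 3, 4]), k=2))
```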

Implications and Future Directions

CompoDiff's approach points toward more adaptable and scalable CIR systems. Because the weights assigned to different query conditions can be adjusted at inference time, it offers a versatile tool for real-world applications in which the balance between visual and textual intent varies from query to query. SynthTriplets18M, owing to its synthetic nature and vast scale, opens avenues for training robust CIR systems without the conventional bottlenecks of manual data collection.
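
As a usage sketch of that controllability, the loop below reuses the cfg_denoise sketch from the methodology section; the stubs, step count, scheduler update, and weights are all illustrative stand-ins, not CompoDiff's actual settings. Fewer denoising steps buy inference speed at some cost in query quality, while the guidance weights shift emphasis between the reference image and the text edit.

```python
import torch

# Stubs so the sketch runs standalone; a real system would use the trained
# CompoDiff denoiser and a proper DDIM scheduler instead.
model = lambda z, t, img=None, txt=None: torch.zeros_like(z)
img_cond = torch.randn(1, 768)   # CLIP embedding of the reference image
txt_cond = torch.randn(1, 768)   # CLIP embedding of the text condition

def ddim_update(z, eps, t):      # placeholder deterministic update step
    return z - 0.1 * eps

z = torch.randn(1, 768)                      # start from noise in CLIP space
for t in reversed(range(0, 1000, 100)):      # 10 steps: faster, coarser queries
    eps = cfg_denoise(model, z, torch.full((1,), t), img_cond, txt_cond,
                      w_img=2.0,             # lean harder on the reference image
                      w_txt=5.0)             # relative strength of the text edit
    z = ddim_update(z, eps, t)
query = torch.nn.functional.normalize(z, dim=-1)  # retrieve by cosine similarity
```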

Looking forward, further work is warranted on refining the controllability mechanisms of diffusion models and on extending them beyond retrieval, for example to personalized recommendation and interactive search systems. Additionally, because CompoDiff outputs CLIP image embeddings, they are compatible with unCLIP-style generation, which could enrich user experience in digital media exploration, although ethical considerations around unintended usage must be vigilantly managed.

In summary, CompoDiff represents a substantial advancement in CIR methodology, combining latent diffusion with an expansive synthetic dataset to unlock new levels of retrieval versatility and efficacy.