- The paper's main contribution is SEARLE, a novel framework for zero-shot composed image retrieval that eliminates the need for expensive labeled datasets.
- It uses a frozen CLIP model with optimization-based textual inversion (OTI) to map the features of a reference image into a pseudo-word token in CLIP's token embedding space, which is combined with a relative caption to form a purely textual retrieval query.
- Experimental evaluations on FashionIQ, CIRR, and the newly introduced CIRCO benchmark show that SEARLE outperforms existing baselines and zero-shot methods, and suggest the approach could extend to other vision-language tasks.
Zero-Shot Composed Image Retrieval with Textual Inversion: An Expert Review
The paper under review introduces an approach to Composed Image Retrieval (CIR) that focuses on the Zero-Shot Composed Image Retrieval (ZS-CIR) setting. The authors propose SEARLE (zero-Shot composEd imAge Retrieval with textuaL invErsion), a method that performs CIR without the labeled (reference image, relative caption, target image) triplets required by supervised approaches, whose annotation cost has traditionally limited the scale of CIR training data.
Methodology
The core of SEARLE hinges on textual inversion within the CLIP (Contrastive Language-Image Pre-training) framework. SEARLE uses a frozen pre-trained CLIP model, whose image and text encoders align images and text in a shared embedding space. The visual features of the reference image are mapped to a single pseudo-word token in CLIP's token embedding space; this pseudo-word is then concatenated with the relative caption to form a purely textual query that drives zero-shot retrieval.
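A minimal sketch of this query construction and retrieval step is given below, using the publicly available OpenAI CLIP package. The helper re-implements CLIP's `encode_text` forward pass so that the embedding of a placeholder token can be swapped for a learned pseudo-word vector; the `$` placeholder, the prompt template, and all function names are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: composing a retrieval query from a pseudo-word token and a relative
# caption with a frozen OpenAI CLIP model (https://github.com/openai/CLIP).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():          # CLIP stays frozen throughout
    p.requires_grad_(False)

# BPE id of the placeholder token (assumes "$" maps to a single BPE token)
placeholder_id = clip.tokenize("$")[0, 1].item()


def encode_text_with_pseudo_word(tokens: torch.Tensor,
                                 pseudo_embedding: torch.Tensor) -> torch.Tensor:
    """CLIP text-encoder forward pass with the placeholder embedding replaced."""
    x = model.token_embedding(tokens).type(model.dtype)            # [B, 77, d]
    mask = (tokens == placeholder_id).unsqueeze(-1)                # placeholder slots
    x = torch.where(mask, pseudo_embedding.type(model.dtype), x)   # out-of-place swap
    x = x + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)     # NLD <-> LND
    x = model.ln_final(x).type(model.dtype)
    # pooled feature at the EOT token (highest token id), projected to CLIP space
    return x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection


@torch.no_grad()
def retrieve(pseudo_embedding: torch.Tensor, relative_caption: str,
             index_features: torch.Tensor) -> torch.Tensor:
    """Rank a pre-computed gallery (N x d image features) for one composed query."""
    tokens = clip.tokenize(f"a photo of $ that {relative_caption}").to(device)
    q = F.normalize(encode_text_with_pseudo_word(tokens, pseudo_embedding), dim=-1)
    gallery = F.normalize(index_features, dim=-1)
    return (q @ gallery.T).argsort(dim=-1, descending=True)        # ranked indices
```

Since the query lives entirely on the text side, the gallery image features can be pre-computed once and reused for every query, which is what makes the zero-shot setting practical.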
An essential component of SEARLE is Optimization-based Textual Inversion (OTI), which iteratively optimizes a pseudo-word token for each image, regularized by a GPT-based loss. The regularization keeps the pseudo-word tokens close to the CLIP token embedding manifold, so that they can be combined meaningfully with ordinary vocabulary tokens in a caption. To make the approach efficient in practice, SEARLE additionally trains a textual inversion network, ϕ, that predicts pseudo-word tokens in a single forward pass, distilled from the tokens produced by OTI.
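The sketch below illustrates what such an optimization loop could look like, reusing the `encode_text_with_pseudo_word` helper above. Only the CLIP-space cosine alignment term is implemented; the paper's GPT-based regularization and its initialization scheme are richer and are only indicated by a comment. The template, hyperparameters, and random initialization are assumptions for illustration.

```python
# Sketch: optimization-based textual inversion (OTI) for a single image.
# `image` is assumed to be a tensor already preprocessed by CLIP's `preprocess`
# transform, with shape [1, 3, 224, 224].
def oti_invert(image: torch.Tensor, num_steps: int = 300, lr: float = 2e-2) -> torch.Tensor:
    with torch.no_grad():
        img_feat = F.normalize(model.encode_image(image.to(device)).float(), dim=-1)

    # learnable pseudo-word token (random init here; the paper's init differs)
    v_star = torch.randn(model.token_embedding.embedding_dim, device=device) * 0.02
    v_star.requires_grad_(True)
    optimizer = torch.optim.AdamW([v_star], lr=lr)

    tokens = clip.tokenize("a photo of $").to(device)   # single fixed template
    for _ in range(num_steps):
        txt_feat = F.normalize(
            encode_text_with_pseudo_word(tokens, v_star).float(), dim=-1)
        loss = 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()   # cosine alignment
        # + lambda_reg * gpt_regularization(v_star)   # omitted: GPT-based term
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return v_star.detach()
```

In the paper, the lightweight network ϕ is then distilled from the tokens that OTI produces, so that at inference time a single forward pass over the CLIP image feature replaces this per-image optimization.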
Experimental Evaluation
The empirical validation of SEARLE covers multiple datasets: FashionIQ, CIRR, and the newly introduced CIRCO dataset. The results show that SEARLE and its variant SEARLE-OTI outperform the considered baselines and existing zero-shot methods, including PALAVRA and Pic2Word, and perform well on both CIRR and CIRCO, indicating that the approach is robust across domains.
The CIRCO dataset, introduced to evaluate zero-shot CIR, annotates multiple ground-truth target images per query rather than a single one. This contrasts with existing single-ground-truth benchmarks and mitigates the false negatives that otherwise distort performance measurement.
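Having multiple ground truths per query changes how retrieval quality is scored, since rank-based metrics can credit every relevant target. Below is a minimal sketch of mean Average Precision at k (mAP@k), the kind of metric this annotation style enables, with a set of positives per query; function and variable names are illustrative and the official CIRCO evaluation code may differ in details.

```python
# Sketch: mAP@k when each query has a *set* of relevant gallery items.
from typing import Sequence, Set


def average_precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """AP@k for one query: average precision over the ranks of hits in the top k."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i            # precision at this hit
    return score / min(len(relevant), k) if relevant else 0.0


def map_at_k(all_ranked: Sequence[Sequence[int]],
             all_relevant: Sequence[Set[int]], k: int) -> float:
    """mAP@k averaged over all queries."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps)


# Example: two ground truths, retrieved at ranks 1 and 3 of the top-5 list.
print(map_at_k([[7, 2, 9, 4]], [{7, 9}], k=5))   # (1/1 + 2/3) / 2 ≈ 0.833
```

With a single ground truth, an unannotated but valid target counted as a miss would lower such a score; annotating all relevant targets removes that source of noise.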
Theoretical Implications and Future Directions
The SEARLE framework marks a significant step forward for CIR by removing the reliance on supervised methods that require expensive triplet annotations. The broader implication is a shift toward combining language models with frozen vision-language models in a zero-shot setting, potentially enabling applications beyond retrieval.
Moving forward, SEARLE's textual inversion could be extended to other vision-language tasks, such as personalized image synthesis, where leveraging pre-trained models without additional labeled data would be particularly valuable. Future work could also explore stronger language models and fine-tuning strategies to improve the interpretability and flexibility of the pseudo-word tokens, further improving retrieval performance and reducing domain-specific biases.
In conclusion, this paper presents a well-founded zero-shot formulation of CIR built on textual inversion over a frozen vision-language model. The introduction of the CIRCO dataset both underscores the practicality of the proposed method and contributes a valuable resource for subsequent research in the field.