- The paper introduces a novel approach that combines spherical linear interpolation (Slerp) and Text-Anchored-Tuning (TAT) to merge image and text embeddings for enhanced zero-shot composed image retrieval.
- It demonstrates improved performance on benchmarks like CIRR, CIRCO, and FashionIQ with significant gains even after limited training epochs.
- The method overcomes the limitations of supervised models by reducing the need for extensive manual annotations, offering greater scalability and efficiency.
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
The paper presents a novel approach to Zero-Shot Composed Image Retrieval (ZS-CIR) that utilizes Spherical Linear Interpolation (Slerp) and Text-Anchored-Tuning (TAT). This work addresses the limitations of supervised CIR methods, which suffer from high costs and poor scalability due to their reliance on manually annotated datasets. The proposed method aims to enhance retrieval performance without requiring extensive training datasets or labor-intensive annotation.
Introduction and Background
ZS-CIR leverages image-text pairs to perform retrieval tasks where a query consists of an image and a caption specifying desired modifications. Traditional supervised CIR methods rely heavily on annotated datasets, which are expensive and limit their generalizability across diverse domains. Previous attempts at ZS-CIR have utilized pseudo-word token-based methods, which convert images into text-like tokens. However, these methods have faced challenges such as the distortion of the original image representation and confinement of composed embeddings within the textual space.
Methodology
The authors introduce a novel approach that employs Slerp to merge image and text embeddings directly. Slerp is a geometric method that interpolates between two points on a hypersphere, providing an intermediate embedding that effectively represents the combination of image and text inputs.
Spherical Linear Interpolation (Slerp)
Given vision-language pre-trained (VLP) encoders trained with a cosine-similarity objective, Slerp is used to find an intermediate embedding c between the image embedding v and the text embedding w as follows: c = Slerp(v, w; α) = (sin((1−α)θ) / sin(θ)) · v + (sin(αθ) / sin(θ)) · w
Where:
- θ = cos⁻¹(v · w) is the angle between the L2-normalized embeddings v and w
- α ∈ [0, 1] balances the contributions of v and w (α = 0 recovers v; α = 1 recovers w)
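To make the formula concrete, here is a minimal NumPy sketch of Slerp as defined above; the function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def slerp(v: np.ndarray, w: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between embeddings v and w.

    alpha=0 returns v, alpha=1 returns w; intermediate values trace
    the great-circle arc between them on the unit hypersphere.
    """
    # Normalize so the dot product equals cos(theta).
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(v, w), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        # Near-parallel vectors: fall back to linear interpolation.
        return (1 - alpha) * v + alpha * w
    return (np.sin((1 - alpha) * theta) * v + np.sin(alpha * theta) * w) / np.sin(theta)

# Hypothetical usage with CLIP-style encoders:
# composed = slerp(image_encoder(image), text_encoder(caption), alpha=0.5)
```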
Text-Anchored-Tuning (TAT)
To enhance the effectiveness of Slerp, TAT is introduced to align image embeddings more closely with text embeddings, mitigating the modality gap. The text encoder is kept frozen as an anchor while the image encoder is fine-tuned with lightweight LoRA parameters, which preserves the original pre-trained knowledge while pulling image representations toward the text embedding space. Training minimizes a symmetric contrastive objective: L_cont = L_I2T + L_T2I, where L_I2T and L_T2I denote the image-to-text and text-to-image contrastive terms.
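As a rough illustration of this objective, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings in PyTorch. The function signature and the temperature value are assumptions; the frozen text encoder and LoRA-tuned image encoder that produce the embeddings are elided.

```python
import torch
import torch.nn.functional as F

def tat_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss L_cont = L_I2T + L_T2I for a batch of pairs.

    txt_emb: anchors from the frozen text encoder, shape (B, D).
    img_emb: outputs of the LoRA-tuned image encoder, shape (B, D).
    The temperature of 0.07 is a common CLIP-style default, not the
    paper's reported value.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> matching image
    return loss_i2t + loss_t2i
```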
Inference Process
The composed embedding c obtained through Slerp is used for retrieval by computing its similarity against a pre-computed gallery of image embeddings and returning the closest matches. Unlike pseudo-word token methods, Slerp applies the user's query text directly, without requiring task-specific prompts, which makes it a robust and adaptable solution.
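A minimal sketch of this retrieval step, assuming an L2-normalized gallery matrix and the slerp helper sketched earlier; all names are illustrative.

```python
import numpy as np

def retrieve(composed_emb: np.ndarray, gallery: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k gallery images by cosine similarity.

    composed_emb: (D,) Slerp-composed query embedding.
    gallery: (N, D) pre-computed, L2-normalized image embeddings.
    """
    composed_emb = composed_emb / np.linalg.norm(composed_emb)
    sims = gallery @ composed_emb          # cosine similarity to each image
    return np.argsort(-sims)[:k]           # indices of the k best matches

# Hypothetical usage:
# composed = slerp(image_emb, text_emb, alpha=0.5)
# top10 = retrieve(composed, gallery_embeddings)
```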
Experimental Results
CIRR, CIRCO, and FashionIQ Benchmarks
Extensive evaluations on natural and fashion image datasets show that the proposed method, especially when combined with TAT, outperforms prior ZS-CIR methods and image-only/text-only baselines on key metrics such as Recall and mean Average Precision (mAP). Notably, the method achieves significant improvements after as little as a single epoch of training, highlighting its efficiency.
Ablation Studies
The authors conduct thorough ablation studies to validate the impact of TAT across various dataset configurations and training setups. Results confirm the superiority of text-anchoring over alternative anchoring schemes and reinforce the efficiency of the TAT-trained VLP models.
Comparisons with Fine-Tuned Models
The integration of Slerp with fine-tuned models further confirms its utility, yielding competitive and even superior performance against state-of-the-art methods without additional training. This highlights the applicability of Slerp in enhancing existing retrieval systems.
Implications and Future Work
The proposed Slerp-based ZS-CIR and TAT methods present promising advancements for vision-language retrieval tasks. The work underscores the potential for more efficient and scalable retrieval systems that can generalize across diverse domains without heavy reliance on annotated datasets. Future research can explore the expansion of this approach to different retrieval types and more complex benchmarks to further validate and strengthen its applications in various real-world scenarios.
Conclusion
This paper marks a significant step forward in ZS-CIR by leveraging Slerp and TAT. Together, these methods enable efficient, high-performance retrieval without extensive training requirements, constituting a notable contribution to the field of composed image retrieval. The demonstrated effectiveness and versatility of the approach open new avenues for further research and application in AI and computer vision.