- The paper introduces a novel approach that combines spherical linear interpolation (Slerp) and Text-Anchored-Tuning (TAT) to merge image and text embeddings for enhanced zero-shot composed image retrieval.
- It demonstrates improved performance on benchmarks like CIRR, CIRCO, and FashionIQ with significant gains even after limited training epochs.
- The method overcomes the limitations of supervised models by reducing the need for extensive manual annotations, offering greater scalability and efficiency.
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
The paper presents a novel approach to Zero-Shot Composed Image Retrieval (ZS-CIR) that utilizes Spherical Linear Interpolation (Slerp) and Text-Anchored-Tuning (TAT). This work addresses the limitations of supervised CIR methods, which suffer from high costs and poor scalability due to their reliance on manually annotated datasets. The proposed method aims to enhance retrieval performance without requiring extensive training datasets or labor-intensive annotation.
Introduction and Background
ZS-CIR leverages image-text pairs to perform retrieval tasks where a query consists of an image and a caption specifying desired modifications. Traditional supervised CIR methods rely heavily on annotated datasets, which are expensive and limit their generalizability across diverse domains. Previous attempts at ZS-CIR have utilized pseudo-word token-based methods, which convert images into text-like tokens. However, these methods have faced challenges such as the distortion of the original image representation and confinement of composed embeddings within the textual space.
Methodology
The authors introduce a novel approach that employs Slerp to merge image and text embeddings directly. Slerp is a geometric method that interpolates between two points on a hypersphere, providing an intermediate embedding that effectively represents the combination of image and text inputs.
Spherical Linear Interpolation (Slerp)
Given vision-language pre-trained (VLP) encoders trained with a cosine-similarity objective, Slerp is used to find an intermediate embedding c between the image embedding v and the text embedding w as follows: c = Slerp(v, w; α) = (sin((1−α)θ) / sin(θ)) · v + (sin(αθ) / sin(θ)) · w
Where:
- θ = cos⁻¹(v · w) is the angle between the L2-normalized embeddings v and w
- α ∈ [0, 1] balances the contributions of v and w (α = 0 recovers v; α = 1 recovers w)
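To make the formula concrete, here is a minimal NumPy sketch of Slerp as defined above; the function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def slerp(v: np.ndarray, w: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between embeddings v and w.

    alpha=0 returns v, alpha=1 returns w; intermediate values trace
    the great-circle arc between them on the unit hypersphere.
    """
    # Normalize so the dot product equals cos(theta).
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(v, w), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        # Near-parallel vectors: fall back to linear interpolation.
        return (1 - alpha) * v + alpha * w
    return (np.sin((1 - alpha) * theta) * v + np.sin(alpha * theta) * w) / np.sin(theta)

# Hypothetical usage with CLIP-style encoders:
# composed = slerp(image_encoder(image), text_encoder(caption), alpha=0.5)
```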
Text-Anchored-Tuning (TAT)
To enhance the effectiveness of Slerp, TAT is introduced to align image embeddings more closely with text embeddings, mitigating the modality gap. The text encoder is kept frozen as an anchor while the image encoder is fine-tuned with lightweight LoRA parameters, which preserves the original pre-trained knowledge while pulling image representations toward the text embedding space. Training minimizes a symmetric contrastive objective: L_cont = L_I2T + L_T2I, where L_I2T and L_T2I denote the image-to-text and text-to-image contrastive terms.
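As a rough illustration of this objective, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings in PyTorch. The function signature and the temperature value are assumptions; the frozen text encoder and LoRA-tuned image encoder that produce the embeddings are elided.

```python
import torch
import torch.nn.functional as F

def tat_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss L_cont = L_I2T + L_T2I for a batch of pairs.

    txt_emb: anchors from the frozen text encoder, shape (B, D).
    img_emb: outputs of the LoRA-tuned image encoder, shape (B, D).
    The temperature of 0.07 is a common CLIP-style default, not the
    paper's reported value.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> matching image
    return loss_i2t + loss_t2i
```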
Inference Process
The composed embedding c obtained through Slerp is used for retrieval by computing its similarity against a pre-computed gallery of image embeddings and returning the closest matches. Unlike pseudo-word token methods, Slerp applies the user's query text directly, without requiring task-specific prompts, which makes it a robust and adaptable solution.
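A minimal sketch of this retrieval step, assuming an L2-normalized gallery matrix and the slerp helper sketched earlier; all names are illustrative.

```python
import numpy as np

def retrieve(composed_emb: np.ndarray, gallery: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k gallery images by cosine similarity.

    composed_emb: (D,) Slerp-composed query embedding.
    gallery: (N, D) pre-computed, L2-normalized image embeddings.
    """
    composed_emb = composed_emb / np.linalg.norm(composed_emb)
    sims = gallery @ composed_emb          # cosine similarity to each image
    return np.argsort(-sims)[:k]           # indices of the k best matches

# Hypothetical usage:
# composed = slerp(image_emb, text_emb, alpha=0.5)
# top10 = retrieve(composed, gallery_embeddings)
```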
Experimental Results
CIRR, CIRCO, and FashionIQ Benchmarks
Extensive evaluations on natural and fashion image datasets show that the proposed method, especially when combined with TAT, outperforms prior ZS-CIR methods and image-only/text-only baselines on key metrics such as Recall and mean Average Precision (mAP). Notably, the method achieves significant improvements after as little as a single epoch of training, highlighting its efficiency.
Ablation Studies
The authors conduct thorough ablation studies to validate the impact of TAT across various dataset configurations and training setups. Results confirm the superiority of text-anchoring over alternative anchoring schemes and reinforce the efficiency of the TAT-trained VLP models.
Comparisons with Fine-Tuned Models
The integration of Slerp with fine-tuned models further confirms its utility, yielding competitive and even superior performance against state-of-the-art methods without additional training. This highlights the applicability of Slerp in enhancing existing retrieval systems.
Implications and Future Work
The proposed Slerp-based ZS-CIR and TAT methods present promising advancements for vision-language retrieval tasks. The work underscores the potential for more efficient and scalable retrieval systems that can generalize across diverse domains without heavy reliance on annotated datasets. Future research can explore the expansion of this approach to different retrieval types and more complex benchmarks to further validate and strengthen its applications in various real-world scenarios.
Conclusion
This paper marks a significant step forward in ZS-CIR by leveraging Slerp and TAT. Together, these methods enable efficient, high-performance retrieval without extensive training requirements, constituting a notable contribution to the field of composed image retrieval. The demonstrated effectiveness and versatility of the approach open new avenues for further research and application in AI and computer vision.