- The paper introduces LinCIR, a framework that achieves efficient zero-shot composed image retrieval solely through language-only training.
- Its Self-Masking Projection technique replaces keywords in a caption with their projected latent embeddings, preserving semantic consistency while keeping training costs low.
- Experiments on benchmarks like CIRCO and FashionIQ highlight LinCIR's superior performance and scalability in complex vision-language tasks.
Language-only Efficient Training of Zero-shot Composed Image Retrieval
Introduction
The task of Composed Image Retrieval (CIR) aims to retrieve images that satisfy a query composed of both an image and a text input. Traditional CIR methods require a training dataset of (query image, query text, target image) triplets, which is expensive and labor-intensive to collect. Recent strategies have investigated the zero-shot composed image retrieval (ZS-CIR) paradigm, which eliminates the need for pre-collected triplets. These ZS-CIR approaches, however, have shown limited scalability and generalizability due to the lack of diverse textual inputs during training.
LinCIR (Language-only training for CIR) is introduced as a novel framework that trains on language data alone. It relies on a self-supervision technique called Self-Masking Projection (SMP), which replaces keywords in the original text with their projected embeddings and trains the projection so that the masked and original texts yield the same latent embedding. This strategy makes LinCIR both efficient and effective, delivering state-of-the-art zero-shot results on various CIR benchmarks.
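At inference, ZS-CIR methods of this family (Pic2Word, SEARLE, LinCIR) share a common recipe: the query image's embedding is projected into the token embedding space, spliced into the query text as a pseudo-token, and the resulting text embedding is matched against gallery image embeddings. A minimal NumPy sketch of that recipe, where `toy_text_encoder`, `W_proj`, and `compose_and_retrieve` are illustrative stand-ins (a real system would use a frozen CLIP text encoder and a trained projection module):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size; real CLIP uses e.g. 768

def toy_text_encoder(token_embs):
    """Stand-in for a frozen CLIP text encoder: mean-pool + L2-normalize."""
    z = token_embs.mean(axis=0)
    return z / np.linalg.norm(z)

def compose_and_retrieve(img_emb, text_tokens, pseudo_pos, W_proj, gallery):
    """Splice the projected image embedding into the text, then rank gallery."""
    tokens = text_tokens.copy()
    tokens[pseudo_pos] = img_emb @ W_proj      # pseudo-token from the image
    q = toy_text_encoder(tokens)               # composed query embedding
    sims = gallery @ q                         # cosine sims (gallery normalized)
    return np.argsort(-sims)                   # best match first

# toy usage: 5 gallery images, a 4-token query with a pseudo-token at slot 1
gallery = rng.standard_normal((5, DIM))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
img_emb = rng.standard_normal(DIM)
W_proj = rng.standard_normal((DIM, DIM)) * 0.1
text_tokens = rng.standard_normal((4, DIM))
ranking = compose_and_retrieve(img_emb, text_tokens, 1, W_proj, gallery)
```

LinCIR's contribution is how the projection is trained (language only), not this retrieval step, which is shared across the cited ZS-CIR methods.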
Methodology
Self-Masking Projection (SMP)
LinCIR utilizes a language-only self-supervision technique called Self-Masking Projection (SMP). Instead of the common practice of projecting image embeddings, SMP projects the text latent embedding into the token embedding space. During training, keywords in the textual input are replaced by this projected embedding, and the projection is trained by minimizing the mean squared error (MSE) between the latent embeddings of the original and masked texts. Keywords are defined as consecutive adjectives and nouns, so the primary semantic content of the text is retained in its masked form.
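Under this description, one SMP training step can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: `toy_text_encoder` stands in for a frozen CLIP text encoder, `W_phi` for the learnable projection module, and the keyword indices would come from a POS tagger in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size

def toy_text_encoder(token_embs):
    """Stand-in for a frozen CLIP text encoder: mean-pool + L2-normalize."""
    z = token_embs.mean(axis=0)
    return z / np.linalg.norm(z)

def smp_loss(token_embs, keyword_idx, W_phi):
    """Self-Masking Projection: replace keyword tokens with the projected
    latent of the full text, then pull the two latents together via MSE."""
    z = toy_text_encoder(token_embs)           # latent of the original text
    masked = token_embs.copy()
    masked[keyword_idx] = z @ W_phi            # projected embedding phi(z)
    z_hat = toy_text_encoder(masked)           # latent of the masked text
    return float(np.mean((z - z_hat) ** 2))    # minimized w.r.t. W_phi

tokens = rng.standard_normal((6, DIM))         # a 6-token toy caption
loss = smp_loss(tokens, keyword_idx=[2, 3],
                W_phi=rng.standard_normal((DIM, DIM)) * 0.1)
```

Because only text is encoded, no image encoder forward passes are needed during training, which is the source of LinCIR's efficiency.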
Random Noise Addition Strategy
To mitigate the modality gap between textual and visual embeddings, LinCIR introduces a noise addition strategy. Unlike simpler approaches that add Gaussian noise, LinCIR uses a noise distribution that ensures a diverse range of norm sizes, specifically N(0,1)×Unif(0,1). This strategy addresses the dimensionality issues and the insufficient diversity problem of simple Gaussian noise, enhancing the generalizability of the projection module to visual embeddings.
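The exact sampling procedure is specified in the paper; under one plausible reading of N(0,1)×Unif(0,1), where a Unif(0,1) scalar rescales a Gaussian vector so the noise norm spreads over a wide range rather than concentrating near √d (matching the stated motivation), the sampler looks like:

```python
import numpy as np

def sample_noise(dim, rng):
    """Noise with diverse norms: a Gaussian vector scaled by a Unif(0,1)
    scalar, so the norm spans (0, ~sqrt(dim)) instead of concentrating."""
    return rng.standard_normal(dim) * rng.uniform(0.0, 1.0)

rng = np.random.default_rng(0)
norms = [np.linalg.norm(sample_noise(512, rng)) for _ in range(1000)]
# pure N(0,1) noise in 512-d has norms concentrated near sqrt(512) ~ 22.6;
# the scaled version spreads them across (0, ~22.6)
```

An elementwise Gaussian-uniform product is another possible reading of the same notation; the key point either way is that the norm distribution is broadened relative to plain Gaussian noise.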
Results
LinCIR demonstrates exceptional performance across multiple benchmarks, outperforming other ZS-CIR strategies such as Pic2Word and SEARLE. Key numerical results include:
- On the CIRCO benchmark, LinCIR achieved leading scores across all metrics (e.g., mAP@5 of 19.71 with the ViT-G backbone).
- On GeneCIS, LinCIR outperformed other methods in R@K metrics, especially in tasks focused on attributes.
- On FashionIQ, LinCIR even surpassed state-of-the-art supervised methods.
Implications and Future Directions
LinCIR's framework presents significant implications for the field of image retrieval and vision-language models. By leveraging language-only training, LinCIR reduces the training dataset size and training time, markedly enhancing efficiency and scalability. This method also addresses the limitations of previous ZS-CIR models, showcasing superior adaptability to diverse and complex textual queries.
Future developments could explore further optimizations of the noise addition strategy or investigate additional self-supervision techniques. Moreover, given the versatility exhibited by LinCIR, integrating this framework with other vision-language models (such as BLIP) could further enhance cross-modal retrieval capabilities and expand research into more diverse applications and domains.
Conclusion
LinCIR establishes an efficient approach to zero-shot composed image retrieval by employing self-supervision through language-only data and introducing an innovative random noise addition strategy. The method's scalability and marked performance improvements across various benchmarks underline its potential as a robust framework for vision-language tasks.