- The paper introduces a novel zero-shot approach by mapping images to pseudo tokens, eliminating the need for extensive labeled triplets.
- It leverages a two-stage training method using CLIP and a mapping network optimized with contrastive loss to align visual and textual data.
- Experimental results across ImageNet, COCO, CIRR, and Fashion-IQ demonstrate significant recall improvements and enhanced generalization.
An Essay on Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
The research paper "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval" presents a novel approach to the challenge of Composed Image Retrieval (CIR). Unlike traditional methods, which require a substantial amount of labeled triplets, this work introduces Zero-Shot Composed Image Retrieval (ZS-CIR), substantially reducing the data-preparation burden.
Problem Statement
CIR is a retrieval task in which the query is composed of an image and accompanying descriptive text. Existing methods, such as late-fusion techniques, depend heavily on labeled triplets consisting of a reference image, a textual modification, and a target image. The cost and effort required to gather such explicitly labeled data are a significant bottleneck that hinders CIR's broader application. Moreover, these supervised models often specialize in specific use cases and generalize poorly across diverse CIR tasks.
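For concreteness, one supervised training example in existing CIR pipelines can be thought of as the following record; this is a minimal sketch, and the field names are illustrative rather than taken from any specific dataset:

```python
from dataclasses import dataclass

@dataclass
class CIRTriplet:
    """One supervised CIR training example (illustrative field names)."""
    reference_image: str    # path to the reference image
    modification_text: str  # e.g. "the same dress, but in red"
    target_image: str       # path to the image the query should retrieve
```

Collecting many such triplets requires an annotator to both write the modification text and find a matching target image, which is precisely the cost ZS-CIR aims to avoid.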
Contribution and Methodology
Addressing this gap, the authors propose ZS-CIR, aiming to build a versatile CIR model without relying on costly triplet annotations. Their method, Pic2Word, is trained only on weakly labeled image-caption pairs and unlabeled image datasets. By reframing the CIR task, Pic2Word maps visual content directly to pseudo language tokens, allowing flexible and seamless composition of image and text queries.
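At the prompt level, this reframing can be illustrated as follows. The snippet is a minimal sketch: the template wording is illustrative, and in the actual model the pseudo word is an embedding spliced into CLIP's token sequence rather than a literal string:

```python
def compose_query(modification_text: str, pseudo_word: str = "[*]") -> str:
    """Build a text prompt in which a pseudo word stands in for the reference image.

    Prompt-level illustration only; the real pseudo word is a learned token
    embedding inserted into the text encoder's input sequence.
    """
    return f"a photo of {pseudo_word} that {modification_text}"

print(compose_query("is drawn as a cartoon"))
# a photo of [*] that is drawn as a cartoon
```

The composed prompt is then encoded by the text encoder and compared against candidate images, so the image and text conditions are fused entirely on the language side.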
The workflow involves two primary stages:
- Pre-training on Image-Caption Data: The method starts from the Contrastive Language-Image Pretraining (CLIP) model, a standard two-tower architecture pre-trained to maximize the similarity between paired image and caption embeddings; both encoders remain frozen in the next stage.
- Training a Mapping Network: Unlike prior supervised methods, Pic2Word trains a lightweight mapping network that transforms the image embedding from the frozen vision encoder into a pseudo token embedding compatible with the language encoder. The network is optimized on unlabeled images with a contrastive loss, so that a prompt containing the generated pseudo token reproduces the corresponding visual embedding (a minimal training sketch follows this list).
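The sketch below illustrates the second stage, assuming a frozen CLIP backbone whose image and text embeddings share one dimensionality; the MLP sizes, the temperature, and the symmetric InfoNCE formulation are illustrative choices rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pic2WordMapper(nn.Module):
    """Lightweight MLP mapping a CLIP image embedding to a pseudo token embedding."""
    def __init__(self, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(image_embedding)

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE pulling each image embedding toward the text embedding
    of a prompt (e.g. "a photo of [*]") containing its own pseudo token."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

During training, each unlabeled image is encoded by the frozen vision encoder, mapped to a pseudo token, inserted into a generic prompt, and encoded by the frozen text encoder; only the mapper's parameters receive gradients.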
Experimental Results
The efficacy of Pic2Word was tested across various experimental setups:
- Domain Conversion: On the ImageNet and ImageNet-R datasets, Pic2Word outperformed zero-shot and supervised baselines, with relative improvements of roughly 10% to 100% on recall metrics (a generic Recall@K sketch follows this list).
- Object Composition: Tested on COCO, Pic2Word demonstrated superior performance against zero-shot baselines and was competitive with supervised methods trained on labeled datasets.
- Scene Manipulation: Using CIRR, Pic2Word outperformed several zero-shot and some supervised CIR approaches in capturing the modifications described in the text.
- Fashion Attribute Manipulation: On the Fashion-IQ dataset, Pic2Word showed notable retrieval improvements, indicating its robustness in handling complex attribute-based queries.
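The recall figures cited above are Recall@K-style metrics. The snippet below is a generic sketch of how such a metric is computed from pre-normalized query and gallery embeddings; it is not the authors' evaluation code:

```python
import numpy as np

def recall_at_k(query_embs: np.ndarray, gallery_embs: np.ndarray,
                target_indices: np.ndarray, k: int = 10) -> float:
    """Fraction of queries whose ground-truth target is among the top-k
    gallery items ranked by cosine similarity (embeddings pre-normalized)."""
    sims = query_embs @ gallery_embs.T            # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of the k best matches
    hits = (topk == target_indices[:, None]).any(axis=1)
    return float(hits.mean())
```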
Analysis and Implications
The results show that representing visual data as language tokens significantly enhances the model's ability to handle diverse CIR tasks without extensive labeled datasets. One striking observation is Pic2Word's strong performance in scenarios where the text alone is highly informative, indicating that it balances the contributions of image and text features well.
Theoretical and Practical Implications
From a theoretical standpoint, this research opens avenues to rethink data representation in vision-language models. Mapping images to word-like tokens seamlessly integrates the visual and language modalities, leveraging the robustness of pre-trained vision-language models in handling diverse concepts and attributes. Practically, Pic2Word paves the way for broader applicability of CIR in domains such as e-commerce and content creation, where labeled data is often a precious resource.
Future Directions
While Pic2Word presents a significant leap in CIR methodologies, future research could explore extending the mapping network to produce multiple word tokens per image, capturing more granular details. Furthermore, enhancing the robustness of token generation for diverse and unseen domains could increase generalization across even broader application scenarios.
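Purely as an illustration of the first direction, and not something the paper implements, a multi-token variant of the mapper might look like this:

```python
import torch
import torch.nn as nn

class MultiTokenMapper(nn.Module):
    """Hypothetical extension: map one image embedding to several pseudo tokens."""
    def __init__(self, embed_dim: int = 512, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(embed_dim, num_tokens * embed_dim)

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim) -> (batch, num_tokens, embed_dim)
        out = self.proj(image_embedding)
        return out.view(image_embedding.size(0), self.num_tokens, -1)
```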
Conclusion
The "Pic2Word" approach marks a progressive step towards zero-shot retrieval systems by minimizing dependency on costly labeled data and robustly generalizing across multiple CIR tasks. This paradigm shift towards flexible, tokenized visual representation facilitates smoother integration of image and text queries, advancing the frontier of image retrieval research. The work not only sets a new benchmark for zero-shot CIR but also underscores the transformative potential of advanced vision-LLMs.