Composed Image Retrieval (CIR)
- Composed Image Retrieval (CIR) is a multimodal retrieval paradigm that fuses reference images with text modifications to retrieve target images.
- It leverages advanced vision-language pretraining and cross-modal fusion techniques, including attention and teacher-student networks, for effective attribute matching.
- CIR offers practical benefits in e-commerce and creative fields by providing fine-grained control beyond traditional image or text-only search methods.
Composed Image Retrieval (CIR) is a multimodal retrieval paradigm in which the system retrieves a target image in response to a user query composed of a reference image and a natural language modification text. CIR enables more precise and interactive image search by supporting queries that express detailed modifications to the reference, addressing the limitations of single-modality (image- or text-only) retrieval. CIR research draws on and advances multi-modal representation learning, vision-language pretraining, and cross-modal fusion methodologies.
1. Task Definition and Motivation
CIR is formally defined as: Given a reference image $I_r$ and a modification text $T_m$ (also called the relative caption), retrieve the target image $I_t$ such that $I_t$ is visually similar to $I_r$ but also satisfies the modifications specified in $T_m$. This search protocol allows users to start with a specific instance (e.g., a photo of an item) and then articulate desired changes (e.g., “make this dress red and sleeveless”). CIR has direct applications in e-commerce product search, creative content retrieval, and interactive digital assistants.
The CIR formulation supports much finer control than traditional visual search, which typically retrieves visually similar images without incorporating user's editing intents, and goes beyond text-only retrieval, which often fails to capture the nuances of visual reference and grounded semantics.
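To make the retrieval protocol concrete, the following minimal sketch scores a composed query against a gallery of candidate embeddings. It assumes pre-computed encoder outputs and uses a simple additive composition as a stand-in for the learned fusion modules discussed later; the function names `compose_query` and `rank_gallery` are illustrative, not from any cited method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def compose_query(ref_emb, text_emb):
    # Placeholder composition: a normalized sum of the two modality embeddings.
    # Real CIR models learn this fusion step (see Sections 2-3).
    return l2_normalize(ref_emb + text_emb)

def rank_gallery(ref_emb, text_emb, gallery_embs, k=5):
    """Return the indices of the top-k gallery images for a composed query."""
    q = compose_query(ref_emb, text_emb)
    sims = l2_normalize(gallery_embs) @ q          # cosine similarity per candidate
    return np.argsort(-sims)[:k], sims

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
dim, n_gallery = 512, 1000
ref_emb = rng.normal(size=dim)    # embedding of the reference image I_r
text_emb = rng.normal(size=dim)   # embedding of the modification text T_m
gallery = rng.normal(size=(n_gallery, dim))
top_k, _ = rank_gallery(ref_emb, text_emb, gallery)
print(top_k)
```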
2. Methodological Taxonomy
CIR models are systematically categorized across several dichotomies and architectural axes (Song et al., 19 Feb 2025):
Supervised CIR: Models trained on annotated triplets $(I_r, T_m, I_t)$ pairing a reference image and modification text with the corresponding target image.
- Feature Extraction: Early approaches utilized separate CNNs/RNNs (Song et al., 19 Feb 2025). Recent methods adopt Vision-Language Pre-trained (VLP) models such as CLIP or BLIP for unified multimodal representation extraction.
- Image–Text Fusion: Fusion strategies are commonly classified as:
- Explicit combination-based: Linear or MLP-based transformation of feature concatenations; residual editing (Wen et al., 2023, Bai et al., 2023); a minimal sketch follows this list.
- Neural network–based: Cross-attention modules, self-attention, or graph-attention networks for deep fusion (Jiang et al., 29 May 2024, Park et al., 17 Jul 2025).
- Prototype image generation: A query prototype is synthesized in image space, turning retrieval into a matching problem between the prototype and candidates in embedding space.
- Query Composition: The challenge is to determine, attribute by attribute, which information from $I_r$ must be “kept” and which should be “replaced” according to $T_m$. Advanced models (e.g., TG-CIR) address this with explicit mask-based “keep-and-replace” mechanisms using teacher-student branches (target-aware guidance during training, target-free inference) and knowledge distillation (Wen et al., 2023).
- Target Matching: Contrastive objectives such as batch-based classification, soft triplet loss, or metric learning incorporating KL-divergence between similarity distributions are used to align composed queries with target image embeddings (Wen et al., 2023, Feng et al., 17 Apr 2024).
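As referenced above, the explicit combination-based strategy with residual editing can be sketched in PyTorch as follows. The module name, layer sizes, and the plain additive residual are illustrative assumptions, not the architecture of any specific cited model.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Illustrative explicit combination-based fusion: an MLP over the
    concatenated image/text features predicts an additive edit that is
    applied to the reference image embedding (residual editing)."""

    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.edit = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        delta = self.edit(torch.cat([img_feat, txt_feat], dim=-1))
        composed = img_feat + delta                       # residual edit of the reference
        return nn.functional.normalize(composed, dim=-1)  # stay on the unit hypersphere

# Toy usage with a batch of 8 CLIP-sized (512-d) features.
fusion = ResidualFusion()
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(fusion(img, txt).shape)  # torch.Size([8, 512])
```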
Zero-shot CIR (ZS-CIR): Methods that avoid reliance on annotated triplets, seeking to enable CIR with minimal or no manual labeling (Song et al., 19 Feb 2025, Agnolucci et al., 5 May 2024, Lin et al., 25 Mar 2025, Jang et al., 23 Apr 2024).
- Textual inversion approaches: The reference image is mapped to a pseudo-word token in the pre-trained language embedding space (e.g., CLIP), combined with the modification text, and processed by the text encoder; this reduces CIR to text→image retrieval (Agnolucci et al., 5 May 2024, Lin et al., 25 Mar 2025); see the sketch at the end of this section.
- Pseudo-triplet or synthetic data approaches: LLMs and image captioning models are leveraged to automatically generate triplets at scale, which are then used to finetune or pretrain CIR models, potentially with minimal supervision (Feng et al., 17 Apr 2024, Jang et al., 23 Apr 2024, Huynh et al., 25 Mar 2025, Li et al., 8 Jul 2025).
- Training-free and prompting methods: Query unification through raw-data fusion, prompt engineering, or weaving key attributes directly into input data for pre-trained VLP encoders, thereby fully exploiting the multimodal alignment without expensive fusion modules (Wen et al., 24 Apr 2024).
The design landscape also encompasses semi-supervised and few-shot approaches, where synthetic or pseudo-labeled triplets augment a small set of labeled data to improve scalability (Jang et al., 23 Apr 2024, Hou et al., 8 Jul 2024).
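The textual inversion idea referenced above can be sketched as follows: a small mapping network projects the frozen image embedding into the text encoder's token-embedding space, and the resulting pseudo-word is spliced into the prompt's token embeddings. The network shape, the prompt, and the splicing position are assumptions for illustration; actual ZS-CIR systems typically train such a mapping against a frozen VLP model.

```python
import torch
import torch.nn as nn

class TextualInversionNet(nn.Module):
    """Maps a reference-image embedding to a pseudo-word token embedding
    living in the text encoder's input space (illustrative sizes)."""

    def __init__(self, img_dim: int = 512, tok_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.GELU(), nn.Linear(hidden, tok_dim)
        )

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        return self.phi(img_emb)

# Toy usage: splice the pseudo-word into a prompt's token embeddings.
# In a real system, prefix/suffix would come from the frozen text encoder's
# token-embedding table for a prompt such as
# "a photo of <pseudo> that is red and sleeveless".
phi = TextualInversionNet()
img_emb = torch.randn(1, 512)                    # frozen image-encoder output
pseudo_tok = phi(img_emb)                        # (1, 512) pseudo-word embedding
prefix = torch.randn(1, 3, 512)                  # stand-in tokens before the placeholder
suffix = torch.randn(1, 5, 512)                  # stand-in tokens after the placeholder
prompt_embs = torch.cat([prefix, pseudo_tok.unsqueeze(1), suffix], dim=1)
# prompt_embs would now be fed to the frozen text encoder; retrieval then
# reduces to standard text-to-image matching in the shared embedding space.
print(prompt_embs.shape)  # torch.Size([1, 9, 512])
```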
3. Key Architectural Principles
The principal challenges in CIR model design are (1) effective multimodal fusion, (2) conflict/consistency modeling between vision and language, and (3) robust metric learning for fine-grained ranking.
Multimodal Fusion Strategies
- Raw-Data Level Fusion: Approaches such as DQU-CIR concatenate a caption generated from $I_r$ with $T_m$, or visually embed modification keywords in $I_r$ by writing key phrases onto the image, then encode each “unified query” with the respective VLP encoder, combining the outputs adaptively via a learnable linear weighting (Wen et al., 24 Apr 2024).
- Feature-Level Fusion: Most CIR systems concatenate or fuse representations of $I_r$ and $T_m$ at the feature level, using MLPs or attention mechanisms. Nonlinear fusion can introduce embedding drift away from the VLP-aligned semantic space; orthogonal regularization or attention-based cross-modal mapping is used to counteract this (Wen et al., 2023, Jiang et al., 29 May 2024); a cross-attention sketch follows this list.
- Cross-Attention and Knowledge Distillation: Dual-branch designs with teacher and student composition modules, as seen in TG-CIR, use contrastive learning not only for query-target matching but for mimicking target-based ideal fusion during training, thereby improving generalization at inference (Wen et al., 2023).
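A minimal sketch of the attention-based cross-modal fusion mentioned above: text tokens attend over image patch tokens via multi-head cross-attention, and the attended sequence is pooled into a single composed-query embedding. The dimensions, single attention layer, and mean pooling are illustrative simplifications rather than any particular cited design.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of text tokens and image patches."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, txt_tokens: torch.Tensor, img_tokens: torch.Tensor) -> torch.Tensor:
        # txt_tokens: (B, L_t, D); img_tokens: (B, L_i, D)
        attended, _ = self.cross_attn(query=txt_tokens, key=img_tokens, value=img_tokens)
        fused = self.norm(txt_tokens + attended)   # residual connection over text tokens
        composed = fused.mean(dim=1)               # pool into a single query vector
        return nn.functional.normalize(self.proj(composed), dim=-1)

# Toy usage: 8 queries with 16 text tokens and 50 image patch tokens of width 512.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(8, 16, 512), torch.randn(8, 50, 512))
print(out.shape)  # torch.Size([8, 512])
```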
Conflict Relationship and Attribute Selection
A fundamental problem is to resolve whether to “keep” (inherit from $I_r$) or “replace” (adopt from $T_m$) each attribute dimension in the composed query (e.g., color, pattern, sleeve shape). Models implement mask prediction for attribute selection (e.g., via MLPs), supervised via knowledge distillation against a teacher that uses the actual target image during training (Wen et al., 2023).
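A minimal sketch of the keep/replace idea, assuming a sigmoid mask predicted per feature dimension; TG-CIR's actual mask parameterization and teacher-student distillation are more involved and are only indicated in the comments.

```python
import torch
import torch.nn as nn

class KeepReplaceMask(nn.Module):
    """Illustrative mask-based attribute selection: an MLP predicts, per
    feature dimension, how much to keep from the reference image versus
    replace with information from the modification text."""

    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mask_mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        m = torch.sigmoid(self.mask_mlp(torch.cat([img_feat, txt_feat], dim=-1)))
        composed = m * img_feat + (1.0 - m) * txt_feat  # keep vs. replace per dimension
        return nn.functional.normalize(composed, dim=-1)

# In a TG-CIR-style setup, a teacher branch that also sees the target image
# would supervise this student mask via knowledge distillation during training.
student = KeepReplaceMask()
print(student(torch.randn(4, 512), torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```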
Adaptive Matching and Metric Learning
- Attribute-wise similarity: Attribute-specific cosine similarity is used to compute matching degrees between the composed query and candidates; these degrees are aggregated via summation in the teacher branch and mean pooling in the student branch (Wen et al., 2023).
- KL-divergence Regularization: Distributions of similarity between composed queries and a batch of candidate images are regularized against the visual similarity distribution between the target image and the same candidates, promoting adaptive ranking beyond simple hard positives/negatives (Wen et al., 2023); a sketch follows this list.
- Contrastive and Classification Losses: InfoNCE batch-based contrastive loss is ubiquitous for aligning composed queries with target features, sometimes combined with additional auxiliary alignment or reconstruction losses (Bai et al., 2023, Feng et al., 17 Apr 2024).
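The KL-divergence regularizer referenced above can be sketched as follows, assuming L2-normalized features and an in-batch candidate set; the variable names and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def kl_similarity_regularizer(composed_q, target_feat, candidate_feats, tau=0.07):
    """Illustrative KL regularizer: push the composed-query-to-candidate similarity
    distribution toward the target-image-to-candidate distribution.
    composed_q, target_feat: (B, D); candidate_feats: (N, D); all L2-normalized."""
    sim_q = composed_q @ candidate_feats.t() / tau   # (B, N) query-side logits
    sim_t = target_feat @ candidate_feats.t() / tau  # (B, N) target-side logits
    log_p_q = F.log_softmax(sim_q, dim=-1)
    p_t = F.softmax(sim_t, dim=-1)                   # treated as the "teacher" distribution
    return F.kl_div(log_p_q, p_t, reduction="batchmean")

# Toy usage with normalized random features.
q = F.normalize(torch.randn(8, 512), dim=-1)
t = F.normalize(torch.randn(8, 512), dim=-1)
cands = F.normalize(torch.randn(64, 512), dim=-1)
print(kl_similarity_regularizer(q, t, cands))
```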
4. Data Generation, Augmentation, and Benchmarking
Manual annotation of CIR triplets is costly and time-consuming, limiting the scalability of supervised CIR. Prominent augmentation and pseudo-labeling techniques include:
- LLM- and MLLM-driven Synthetic Triplet Generation: Automated pipelines use LLMs for difference captioning and text-to-image (T2I) generative models for image synthesis, with multimodal models filtering the results for consistency and quality. The CIRHS dataset represents a large-scale synthetic triplet collection produced in this manner (Li et al., 8 Jul 2025).
- Pseudo-triplet Guided Few-shot Learning: Methods generate triplets by masking and captioning unlabeled images or by captioning plausible target images, ranking candidate pairs by difficulty (“challenge”) and randomly sampling among the most informative for fine-tuning (Hou et al., 8 Jul 2024).
- Benchmark Dataset Curation: Dedicated CIR benchmarks include FashionIQ, Shoes, CIRR, and open-domain benchmarks such as CIRCO (which features multiple ground truths and semantic categories per query) (Agnolucci et al., 5 May 2024).
- Evaluation Metrics: Standard metrics include Recall@k, subset recall for challenging negatives, and mAP@k (for datasets with multiple ground truths). Loss functions used during training are commonly batch-based classification/contrastive variants, e.g.,

$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(s(q_i, t_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(s(q_i, t_j)/\tau\big)},$$

where $q_i$ is the composed query, $t_i$ is the (embedded) target image, $s(\cdot,\cdot)$ is the similarity (usually cosine), $\tau$ a temperature, and $B$ the batch size; a code sketch of this objective follows the dataset table below.
| Dataset | Domain | Distinctive Features |
|---|---|---|
| FashionIQ | Fashion | Attribute-rich, triplet labels |
| CIRR | Open domain | Fine-grained reasoning, single ground truth |
| CIRCO | Open domain | Multi-ground truth, semantic categories |
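The batch-based contrastive objective given above can be sketched as follows, assuming L2-normalized composed-query and target embeddings; this is the standard InfoNCE formulation rather than the exact loss of any single cited paper.

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(composed_q, target_feat, tau=0.07):
    """Batch-based classification (InfoNCE-style) loss: each composed query
    should match its own target over all other targets in the batch.
    composed_q, target_feat: (B, D), assumed L2-normalized."""
    logits = composed_q @ target_feat.t() / tau  # (B, B) cosine similarities / temperature
    labels = torch.arange(composed_q.size(0), device=composed_q.device)
    return F.cross_entropy(logits, labels)

# Toy usage.
q = F.normalize(torch.randn(16, 512), dim=-1)
t = F.normalize(torch.randn(16, 512), dim=-1)
print(batch_contrastive_loss(q, t))
```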
5. Recent Advances and Experimental Findings
TG-CIR (Wen et al., 2023) exemplifies best practices in contemporary CIR: it achieves strong Recall@k (e.g., a 29% improvement on Shoes R@1 over previous baselines) and validates the efficacy of integrating global and local attribute extraction, orthogonal regularization, and target-aware composition. Ablation studies confirm the contribution of each component, including the teacher-student strategy and KL-divergence regularization.
Sentence-level prompt augmentation using BLIP-2 and its Q-Former further improves CIR, especially for multi-object or complex modification queries (Bai et al., 2023), with improvements of up to 14 points in R@10 (FashionIQ Shirt).
Plug-and-play data scaling (of both positive and negative examples), as proposed in contrastive learning frameworks, raises the performance ceiling of multiple baselines, significantly improving R@k on both FashionIQ and CIRR (Feng et al., 17 Apr 2024).
A plausible implication is that the future of CIR will further blend large-scale synthetic data generation, prompt-driven modification capture, and adaptive fusion strategies—potentially closing the performance gap between zero-shot/few-shot and fully supervised systems.
6. Open Challenges and Research Directions
Several research challenges and frontiers in CIR remain open:
- Scalability and Annotation Efficiency: Despite synthetic triplet generation, ensuring diversity, naturalness, and coverage in modifications—without incurring annotation bias—is an ongoing concern (Feng et al., 17 Apr 2024, Li et al., 8 Jul 2025).
- Fine-grained Attribute and Conflict Resolution: Modeling nuanced attribute substitution (especially spatial attributes, view/angle changes, or subjective/non-salient modifications) is still not consistently addressed (Bai et al., 2023).
- Interpretability and User Control: Techniques that can explain or visualize attribute-level “keep/replace” decisions promise to enhance trust and usability, but are not yet widely adopted.
- Generalization across Domains and Modalities: Robustness to out-of-distribution queries, extension to video and sketch-and-text queries, and dialog-based iterative search remain under active development (Song et al., 19 Feb 2025).
- Evaluation Benchmark Expansion: Datasets with richer ground truth, more diverse concepts, and semantic labeling (as in CIRCO) are needed for reliable evaluation and to minimize false negatives (Agnolucci et al., 5 May 2024).
- LLM/VLM Embedding Quality and Fusion Strategy: How to optimally fuse image and text at the embedding or token level without incurring excessive semantic drift or modality bias is a rapidly evolving question, given the proliferation of new MLLMs and complex retrieval pipelines (Jiang et al., 29 May 2024, Huynh et al., 25 Mar 2025, Park et al., 17 Jul 2025).
In summary, CIR research has rapidly advanced in methodological sophistication and empirical capability, shifting from basic feature concatenations to attribute-aware, teacher-student, and prompt-augmented architectures, aided by automated data generation. Open challenges include achieving optimal fusion, robust generalization, data efficiency, and extending evaluation to richer, more realistic benchmarks.