Training-Free Zero-Shot CIR
- Training-free zero-shot composed image retrieval (ZS-CIR) is defined as retrieving target images by combining a reference image with textual modifications without task-specific training.
- It leverages pre-trained vision-language and language models within modular pipelines to enhance interpretability, generalization, and scalability across diverse domains.
- ZS-CIR methods achieve competitive performance on benchmarks by employing techniques such as token mapping, chain-of-thought reasoning, and weighted feature fusion.
Training-free zero-shot composed image retrieval (ZS-CIR) is a class of vision-language retrieval methods designed to identify target images in a database by integrating visual input from a reference image with compositional intent from an associated text modifier, without requiring any supervised, task-specific training (e.g., on annotated triplets). These approaches aim to maximize generalization, interpretability, and scalability by leveraging pre-trained vision-language models (VLMs), large language models (LLMs), and modular pipeline architectures. Recent progress in ZS-CIR demonstrates high effectiveness in diverse domains, including fashion, e-commerce, open-world scene search, and content creation.
1. Problem Definition and Objectives
ZS-CIR is defined as retrieving a target image $I_t$ from a large gallery $\mathcal{D}$, given a query comprising a reference image $I_r$ and a textual modification $T_m$. The retrieval goal is
$$I_t = \arg\max_{I \in \mathcal{D}} \; \mathrm{sim}\big(q(I_r, T_m),\, \phi(I)\big),$$
where $q(I_r, T_m)$ denotes the composed query representation and $\phi(\cdot)$ the feature extraction for gallery images.
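As a minimal illustration of this objective, the sketch below ranks a gallery by cosine similarity against a composed query embedding; it assumes the composed query $q(I_r, T_m)$ and the gallery features $\phi(I)$ have already been produced by some frozen encoder (e.g., CLIP), leaving the composition strategy itself abstract.

```python
import numpy as np

def rank_gallery(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k gallery images for a composed query.

    query_emb:    (d,) composed query embedding q(I_r, T_m)
    gallery_embs: (N, d) precomputed gallery embeddings phi(I)
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                      # (N,) similarity scores
    return np.argsort(-scores)[:k]      # indices of the k highest-scoring images
```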
Key objectives include:
- Eliminating the dependency on expensive annotated triplets (reference image, modification text, target image) for in-domain supervised training.
- Achieving strong generalization when deploying across new content domains, manipulation types, or languages.
- Enabling human-understandable, intervenable, and explainable retrieval workflows.
- Maximizing performance under the constraint of minimal, non-task-specific parameter learning.
2. Methodological Taxonomy
ZS-CIR methods are diverse and have evolved rapidly. The following table summarizes notable methodological axes:
Class | Core Operation | Notable Papers |
---|---|---|
Textual inversion/token mapping | Image mapped to pseudo-word(s) in CLIP space | Pic2Word (Saito et al., 2023), SEARLE (Baldrati et al., 2023), iSEARLE (Agnolucci et al., 5 May 2024), FTI4CIR (Lin et al., 25 Mar 2025) |
Language-centric modular pipeline | Captioning + LLM-based rewriting | CIReVL (Karthik et al., 2023), OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025) |
Multi-scale, CoT/LVLM reasoning | LVLM performs joint vision-language reasoning | CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025) |
Weighted feature fusion | Direct fusion of image and text representations | WeiMoCIR (Wu et al., 7 Sep 2024), SQUARE (Wu et al., 30 Sep 2025) |
Synthetic label/hybrid strategies | Pseudotriplets from VLM/LLM, hybrid pretext | HyCIR (Jiang et al., 8 Jul 2024), DeG (Chen et al., 7 Mar 2025), MoTaDual (Li et al., 31 Oct 2024) |
Plug-and-play embedding enhancements | Dynamic prompt/embedding adjustment | PDV (Tursun et al., 11 Feb 2025), Denoise-I2W (Tang et al., 22 Oct 2024) |
These variants can be further subdivided according to their use of pseudo-token design, reasoning depth (one-stage or multi-stage), regularization strategies, and fusion/aggregation mechanisms.
3. Core Principles and Architectures
Token Mapping and Textual Inversion
Pioneered by Pic2Word (Saito et al., 2023) and SEARLE (Baldrati et al., 2023), these methods employ a lightweight mapping network $f_\theta$ that projects a CLIP-encoded reference image $E_I(I_r)$ into a pseudo-word embedding $s^* = f_\theta(E_I(I_r))$. The corresponding pseudo-word token $[T^*]$ is concatenated with the text modifier $T_m$ into a prompt template (e.g., "a photo of $[T^*]$, with blue floral print"). This prompt is processed by the CLIP text encoder, facilitating early fusion and allowing pre-trained language compositionality to handle the interaction between $I_r$ and $T_m$. Optimization uses a contrastive loss between the text embeddings of the pseudo-token prompts and the original image embeddings.
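A schematic PyTorch-style sketch of this mapping is given below. The two-layer MLP mapper, the symmetric InfoNCE objective, and the assumption that `pseudo_text_emb` is the CLIP text embedding obtained after splicing $s^*$ into the prompt's token sequence are illustrative simplifications, not the exact published architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMapper(nn.Module):
    """Maps a CLIP image embedding to a pseudo-word token embedding s*."""
    def __init__(self, clip_dim: int = 768, token_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(),
            nn.Linear(clip_dim, token_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_emb)  # pseudo-word embedding s*

def contrastive_loss(pseudo_text_emb: torch.Tensor, image_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between text embeddings of 'a photo of [T*]' prompts and image embeddings."""
    t = F.normalize(pseudo_text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / tau
    labels = torch.arange(len(t), device=t.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

At inference, the frozen mapper produces $s^*$ for a new reference image, and the modifier text is appended to the prompt before text encoding.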
Subsequent works improve the token mapping by adding regularization on the token manifold (GPT-based losses (Baldrati et al., 2023, Agnolucci et al., 5 May 2024)), noise injection to mitigate the modality gap (Agnolucci et al., 5 May 2024), denoising and intent-aware mapping (Tang et al., 22 Oct 2024), and fine-grained decomposition (subject + attribute tokens in FTI4CIR (Lin et al., 25 Mar 2025)).
Modular Language-First Reasoning
CIReVL (Karthik et al., 2023) formalizes a modular pipeline: a generative VLM (e.g., BLIP-2, CoCa) captions the reference image, an LLM rewrites the caption with knowledge of the text modifier $T_m$, and CLIP retrieves images by matching the LLM output in text-to-image feature space. This approach is highly interpretable: all reasoning and composition occur in natural language and can be examined or intervened upon post-hoc.
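The language-first control flow can be summarized in a few lines; in the sketch below, `caption_image`, `rewrite_caption`, and the precomputed `gallery_embs` are hypothetical wrappers around a captioning VLM (e.g., BLIP-2), an LLM, and a CLIP image encoder, not the interfaces of any specific released implementation.

```python
from typing import Callable
import numpy as np

def language_first_retrieval(
    reference_image,                      # raw reference image I_r
    modifier_text: str,                   # text modification T_m
    gallery_embs: np.ndarray,             # (N, d) precomputed CLIP image embeddings
    caption_image: Callable,              # VLM: image -> caption (assumed wrapper)
    rewrite_caption: Callable,            # LLM: (caption, modifier) -> target caption (assumed wrapper)
    encode_text: Callable,                # CLIP text encoder: str -> (d,) embedding (assumed wrapper)
    k: int = 10,
):
    # 1) Describe the reference image in natural language.
    caption = caption_image(reference_image)
    # 2) Let the LLM compose the caption with the requested modification.
    target_caption = rewrite_caption(caption, modifier_text)
    # 3) Retrieve by text-to-image similarity in CLIP space.
    q = encode_text(target_caption)
    q = q / np.linalg.norm(q)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k], target_caption  # ranked indices + the interpretable query
```

Because the composed query is itself a natural-language caption, it can be inspected or edited before retrieval, which is the source of the pipeline's interpretability.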
OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025), and MCoT-RE (Park et al., 17 Jul 2025) further combine the visual and textual input using chain-of-thought (CoT) reasoning within an MLLM at inference, sometimes producing separate captions focused on explicit modification and on contextual preservation.
Multi-scale and Chain-of-Thought Reasoning
CoTMR (Sun et al., 28 Feb 2025) explicitly prompts a large VLM (e.g., Qwen2-VL) to reason at both global (image) and object scale using a CIRCoT protocol: it produces both a holistic target caption and explicit lists of must-have and must-not-have objects. A multi-grained scoring mechanism jointly considers these outputs in ranking candidates.
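The multi-grained scoring idea can be sketched as a weighted combination of a holistic caption match with object-level constraints; the particular weights and the use of CLIP similarity against short object prompts below are illustrative assumptions rather than the exact CoTMR scoring rule.

```python
import numpy as np

def multi_grained_score(
    cand_img_emb: np.ndarray,          # (d,) CLIP embedding of a candidate image
    target_caption_emb: np.ndarray,    # (d,) embedding of the reasoned holistic target caption
    must_have_embs: list,              # embeddings of prompts like "a photo containing a dog"
    must_not_embs: list,               # embeddings of prompts for objects that must be absent
    w_global: float = 1.0,
    w_obj: float = 0.5,
) -> float:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    global_score = cos(cand_img_emb, target_caption_emb)
    # Reward evidence for required objects, penalize evidence for excluded ones.
    obj_score = sum(cos(cand_img_emb, e) for e in must_have_embs)
    obj_score -= sum(cos(cand_img_emb, e) for e in must_not_embs)
    return w_global * global_score + w_obj * obj_score
```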
Weighted Fusion and Embedding Adjustments
Some approaches, e.g., WeiMoCIR (Wu et al., 7 Sep 2024), simply interpolate image and text embeddings for the query, $q = \alpha\, E_I(I_r) + (1-\alpha)\, E_T(T_m)$ with a modality weight $\alpha \in [0,1]$, and enhance candidate image representations by adding MLLM-generated captions to the scoring.
Plug-and-play enhancements such as Prompt Directional Vectors (PDV) (Tursun et al., 11 Feb 2025) capture the semantic shift induced by the modification text as a residual vector, $\Delta_p = \psi(\text{prompt} \oplus T_m) - \psi(\text{prompt})$ with $\psi$ the CLIP text encoder, which can be dynamically scaled and added to composed embeddings or combined with them through weighted fusion.
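Both styles of embedding arithmetic are easy to express directly; in the sketch below, the interpolation weight `alpha` and scaling factor `s` are tunable assumptions, and the prompt embeddings are presumed to come from a frozen CLIP text encoder.

```python
import numpy as np

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x)

def fused_query(img_emb: np.ndarray, txt_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """WeiMoCIR-style weighted interpolation of image and text embeddings."""
    return l2norm(alpha * l2norm(img_emb) + (1 - alpha) * l2norm(txt_emb))

def pdv_adjusted_query(img_emb: np.ndarray, base_prompt_emb: np.ndarray,
                       modified_prompt_emb: np.ndarray, s: float = 1.0) -> np.ndarray:
    """PDV-style adjustment: add the prompt-induced semantic shift as a scaled residual."""
    delta = modified_prompt_emb - base_prompt_emb   # directional vector from the text modification
    return l2norm(l2norm(img_emb) + s * delta)
```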
Hybrid and Data-efficient Strategies
HyCIR (Jiang et al., 8 Jul 2024) constructs synthetic triplets by mining visually similar image pairs, generating edit instructions via a VLM+LLM pipeline, and filtering for semantic coherence. Training then optimizes both a zero-shot image-to-text contrastive objective and a CIR triplet loss over the synthetic data.
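A hedged sketch of such a pseudo-triplet construction loop is shown below; `nearest_neighbors`, `generate_edit_instruction`, and `is_coherent` are hypothetical stand-ins for the similarity mining, VLM+LLM instruction generation, and semantic filtering stages, not the HyCIR implementation.

```python
from typing import Callable, Sequence

def build_pseudo_triplets(
    images: Sequence,                          # unlabeled image corpus
    nearest_neighbors: Callable,               # image -> visually similar images (assumed helper)
    generate_edit_instruction: Callable,       # (ref, tgt) -> edit text via VLM+LLM (assumed helper)
    is_coherent: Callable,                     # (ref, text, tgt) -> bool semantic filter (assumed helper)
    max_pairs_per_image: int = 2,
):
    triplets = []
    for ref in images:
        for tgt in nearest_neighbors(ref)[:max_pairs_per_image]:
            text = generate_edit_instruction(ref, tgt)
            # Keep only triplets whose instruction actually explains the ref -> tgt change.
            if is_coherent(ref, text, tgt):
                triplets.append((ref, text, tgt))
    return triplets
```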
DeG (Chen et al., 7 Mar 2025) addresses modality discrepancy (image vs text distribution) by supplementing image tokens with semantic information from captions and applies a selective "Semantic Set" batch mining strategy for data efficiency.
4. Datasets, Evaluation Protocols, and Performance
Several benchmarks have become standard for ZS-CIR evaluation:
Dataset | Domain | Query structure | Metrics |
---|---|---|---|
CIRR | Open-domain, natural scenes | Reference + relative caption | Recall@K, Recall_subset@K |
CIRCO | General objects from COCO, multiple ground truths | Reference + relative caption | mAP@K |
FashionIQ | Fine-grained, fashion apparel | Reference + attribute modification | Recall@10, @50 |
COCO Comp | Object composition/manipulation | Reference + compositional text | Recall@K |
Additional | e.g., GeneCIS, ImageNet domain conversion | Various | Task-specific |
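For reference, the two headline metrics can be computed per query as sketched below; the code assumes a ranked list of gallery indices and a set of ground-truth target indices (CIRCO-style multiple ground truths, reducing to a single target for CIRR and FashionIQ).

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """1.0 if any ground-truth target appears in the top-k, else 0.0 (averaged over queries upstream)."""
    return float(any(idx in targets for idx in ranked[:k]))

def average_precision_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """mAP@K-style average precision for a single query with possibly multiple ground truths."""
    hits, score = 0, 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in targets:
            hits += 1
            score += hits / rank          # precision at each hit position
    return score / min(len(targets), k) if targets else 0.0
```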
ZS-CIR methods are consistently compared against both zero-shot and fully supervised CIR baselines. Notably:
- Methods such as Pic2Word (Saito et al., 2023), SEARLE (Baldrati et al., 2023), iSEARLE (Agnolucci et al., 5 May 2024), and FTI4CIR (Lin et al., 25 Mar 2025) demonstrate that textual inversion with learned (pseudo-)token mapping networks can outperform several supervised baselines, especially on CIRR and FashionIQ.
- LLM-based, language-focused, and chain-of-thought variants (CIReVL (Karthik et al., 2023), OSrCIR (Tang et al., 15 Dec 2024), CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025)) report even higher recall and mAP scores, with OSrCIR achieving 23.87% mAP@5 on CIRCO, improving by 4.95% over the best two-stage LLM baseline.
- Plug-and-play residual and fusion enhancements (PDV (Tursun et al., 11 Feb 2025), Denoise-I2W (Tang et al., 22 Oct 2024)) provide consistent performance boosts (typically 1–4% recall or mAP improvements).
Performance gains are more pronounced where multi-modal reasoning is employed and where redundancy or the modality gap is explicitly addressed.
5. Interpretability, Scalability, and Real-world Applications
ZS-CIR’s training-free paradigm means models can directly leverage pre-trained foundation models, making them flexible and scalable:
- The language-centric pipeline (e.g., CIReVL, SQUARE, OSrCIR) is highly interpretable—every intermediate reasoning step and representation is readable or modifiable by a human.
- The explicit transformation of multimodal queries into natural language or pseudo-tokens enables transparency and potential for human-in-the-loop corrections.
- Methods only require frozen pre-trained weights, allowing application to new domains with no task-specific fine-tuning or annotation. This property is ideal for e-commerce, creative content search, design suggestion, and agile interactive systems.
Batch reranking with MLLMs (e.g., SQUARE (Wu et al., 30 Sep 2025)), local concept verification (LCR (Sun et al., 2023)), and chain-of-thought (CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025)) further enhance sample efficiency and retrieval specificity—key for extremely large or heterogeneous galleries.
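Such reranking can be expressed as a thin wrapper over the first-stage retrieval; in the sketch below, `mllm_match_score` is a hypothetical callable that asks an MLLM how well a candidate satisfies the composed intent, and the shortlist size `m` is an assumed budget parameter.

```python
from typing import Callable, Sequence

def rerank_with_mllm(
    ranked_indices: Sequence[int],      # first-stage ranking (e.g., from CLIP similarity)
    candidate_images: Sequence,         # gallery images addressable by index
    target_description: str,            # composed intent in natural language
    mllm_match_score: Callable,         # (image, description) -> float, assumed MLLM judge
    m: int = 20,                        # rerank only the top-m candidates to bound MLLM calls
):
    shortlist = list(ranked_indices[:m])
    scores = {i: mllm_match_score(candidate_images[i], target_description) for i in shortlist}
    reranked = sorted(shortlist, key=lambda i: scores[i], reverse=True)
    # Keep the remaining first-stage order after the reranked shortlist.
    return reranked + list(ranked_indices[m:])
```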
6. Limitations and Future Directions
Current limitations include:
- Coarse-grained pseudo-tokens in classical textual inversion can insufficiently capture fine local attributes, motivating research into multi-token and fine-grained decomposition (FTI4CIR (Lin et al., 25 Mar 2025)).
- Language-first pipelines may lose critical visual detail if the reference image’s semantic content is inadequately described or if prompt engineering is suboptimal.
- There remain open challenges in cross-domain robustness, modality gap (image vs text embedding), and scaling to highly complex manipulation instructions.
Open research directions include:
- Exploring richer, intent-aware multi-token mapping (see also Denoise-I2W (Tang et al., 22 Oct 2024)) and hierarchical chain-of-thought prompting for grounded inference.
- Data-efficient generalization (DeG (Chen et al., 7 Mar 2025)) via careful mining and selection within large web datasets.
- Integrating continuous prompt-parameter adaptation (PDV (Tursun et al., 11 Feb 2025)) and dynamic fusion strategies for context-aware user control.
- Expanding synthetic triplet pipelines to broader semantic domains and optimizing for both efficiency and compositionality (HyCIR (Jiang et al., 8 Jul 2024), MoTaDual (Li et al., 31 Oct 2024)).
- Improved evaluation protocols to address ambiguous or underspecified queries and to further reduce annotation biases (CIRCO design (Baldrati et al., 2023, Agnolucci et al., 5 May 2024)).
7. Representative Mathematical Formulations and Open-source Ecosystem
ZS-CIR methods are grounded in a set of core mathematical structures, primarily centered on vision-language contrastive objectives of the form
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\mathrm{sim}(u_i, v_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\mathrm{sim}(u_i, v_j)/\tau\big)},$$
where, for paired features $u_i, v_i$ (e.g., a pseudo-token text embedding and its corresponding image embedding), temperature $\tau$, and batch size $B$, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity.
Recent extensions introduce regularization losses (e.g., GPT-based, MSE), prompt scaling (PDV), and multi-component query fusion of the general form $q = \alpha\, E_I(I_r) + \beta\, E_T(\hat{T}) + \gamma\, \Delta_p$, with fusion weights $\alpha, \beta, \gamma$ and $\hat{T}$ a (rewritten) target text.
Open-source code and detailed evaluation protocols for nearly all major methods are publicly provided (e.g., github.com/google-research/composed_image_retrieval (Saito et al., 2023), github.com/miccunifi/SEARLE (Baldrati et al., 2023, Agnolucci et al., 5 May 2024), github.com/navervision/lincir (Gu et al., 2023), github.com/Pter61/denoise-i2w-tmm (Tang et al., 22 Oct 2024), github.com/Chen-Junyang-cn/PLI (Chen et al., 2023), github.com/whats2000/WeiMoCIR (Wu et al., 7 Sep 2024)).
ZS-CIR represents a maturing subfield of vision–language retrieval, with its training-free methods offering robust, scalable, and interpretable systems capable of competitive or superior performance to supervised models across several challenging benchmarks. Ongoing research is focused on higher granularity of tokenization, deeper compositional reasoning, improved cross-modal alignment, and sample efficiency—all with the goal of robustly capturing nuanced user intent at scale.