Training-Free Zero-Shot CIR
- Training-free zero-shot composed image retrieval (ZS-CIR) is defined as retrieving target images by combining a reference image with textual modifications without task-specific training.
- It leverages pre-trained vision-language and language models within modular pipelines to enhance interpretability, generalization, and scalability across diverse domains.
- ZS-CIR methods achieve competitive performance on benchmarks by employing techniques such as token mapping, chain-of-thought reasoning, and weighted feature fusion.
Training-free zero-shot composed image retrieval (ZS-CIR) is a class of vision-language retrieval methods designed to identify target images in a database by integrating visual input from a reference image with compositional intent from an associated text modifier, without requiring any supervised, task-specific training (e.g., on annotated triplets). These approaches aim to maximize generalization, interpretability, and scalability by leveraging pre-trained vision-language models (VLMs), large language models (LLMs), and modular pipeline architectures. Recent progress in ZS-CIR demonstrates high effectiveness in diverse domains, including fashion, e-commerce, open-world scene search, and content creation.
1. Problem Definition and Objectives
ZS-CIR is defined as retrieving a target image $I_t$ from a large gallery $\mathcal{D}$, given a query comprising a reference image $I_r$ and a textual modification $T_m$. The retrieval goal is
$$I_t = \arg\max_{I \in \mathcal{D}} \; \mathrm{sim}\big(q(I_r, T_m),\, \phi(I)\big),$$
where $q(I_r, T_m)$ denotes the composed query representation and $\phi(\cdot)$ the feature extraction for gallery images.
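As a minimal illustration of this objective, the sketch below ranks a gallery by cosine similarity against a composed query embedding; it assumes the composed query $q(I_r, T_m)$ and the gallery features $\phi(I)$ have already been produced by some frozen encoder (e.g., CLIP), leaving the composition strategy itself abstract.

```python
import numpy as np

def rank_gallery(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k gallery images for a composed query.

    query_emb:    (d,) composed query embedding q(I_r, T_m)
    gallery_embs: (N, d) precomputed gallery embeddings phi(I)
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                      # (N,) similarity scores
    return np.argsort(-scores)[:k]      # indices of the k highest-scoring images
```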
Key objectives include:
- Eliminating the dependency on expensive annotated triplets (reference image, modification text, target image) for in-domain supervised training.
- Achieving strong generalization when deploying across new content domains, manipulation types, or languages.
- Enabling human-understandable, intervenable, and explainable retrieval workflows.
- Maximizing performance under the constraint of minimal, non-task-specific parameter learning.
2. Methodological Taxonomy
ZS-CIR methods are diverse and have evolved rapidly. The following table summarizes notable methodological axes:
Class | Core Operation | Notable Papers |
---|---|---|
Textual inversion/token mapping | Image mapped to pseudo-word(s) in CLIP space | Pic2Word (Saito et al., 2023), SEARLE (Baldrati et al., 2023), iSEARLE (Agnolucci et al., 5 May 2024), FTI4CIR (Lin et al., 25 Mar 2025) |
Language-centric modular pipeline | Captioning + LLM-based rewriting | CIReVL (Karthik et al., 2023), OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025) |
Multi-scale, CoT/LVLM reasoning | LVLM performs joint vision-language reasoning | CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025) |
Weighted feature fusion | Direct fusion of image and text representations | WeiMoCIR (Wu et al., 7 Sep 2024), SQUARE (Wu et al., 30 Sep 2025) |
Synthetic label/hybrid strategies | Pseudotriplets from VLM/LLM, hybrid pretext | HyCIR (Jiang et al., 8 Jul 2024), DeG (Chen et al., 7 Mar 2025), MoTaDual (Li et al., 31 Oct 2024) |
Plug-and-play embedding enhancements | Dynamic prompt/embedding adjustment | PDV (Tursun et al., 11 Feb 2025), Denoise-I2W (Tang et al., 22 Oct 2024) |
These variants can be further subdivided according to their use of pseudo-token design, reasoning depth (one-stage or multi-stage), regularization strategies, and fusion/aggregation mechanisms.
3. Core Principles and Architectures
Token Mapping and Textual Inversion
Pioneered by Pic2Word (Saito et al., 2023) and SEARLE (Baldrati et al., 2023), these methods employ a lightweight mapping network $f_\theta$ that projects a CLIP-encoded reference image $E_I(I_r)$ into a pseudo-word embedding $s^* = f_\theta(E_I(I_r))$. The corresponding pseudo-word token $[T^*]$ is concatenated with the text modifier $T_m$ into a prompt template (e.g., "a photo of $[T^*]$, with blue floral print"). This prompt is processed by the CLIP text encoder, facilitating early fusion and allowing pre-trained language compositionality to handle the interaction between $I_r$ and $T_m$. Optimization uses a contrastive loss between the text embeddings of the pseudo-token prompts and the original image embeddings.
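A schematic PyTorch-style sketch of this mapping is given below. The two-layer MLP mapper, the symmetric InfoNCE objective, and the assumption that `pseudo_text_emb` is the CLIP text embedding obtained after splicing $s^*$ into the prompt's token sequence are illustrative simplifications, not the exact published architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMapper(nn.Module):
    """Maps a CLIP image embedding to a pseudo-word token embedding s*."""
    def __init__(self, clip_dim: int = 768, token_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(),
            nn.Linear(clip_dim, token_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_emb)  # pseudo-word embedding s*

def contrastive_loss(pseudo_text_emb: torch.Tensor, image_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between text embeddings of 'a photo of [T*]' prompts and image embeddings."""
    t = F.normalize(pseudo_text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / tau
    labels = torch.arange(len(t), device=t.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

At inference, the frozen mapper produces $s^*$ for a new reference image, and the modifier text is appended to the prompt before text encoding.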
Subsequent works improve the token mapping by adding regularization on the token manifold (GPT-based losses (Baldrati et al., 2023, Agnolucci et al., 5 May 2024)), noise injection to mitigate the modality gap (Agnolucci et al., 5 May 2024), denoising and intent-aware mapping (Tang et al., 22 Oct 2024), and fine-grained decomposition (subject + attribute tokens in FTI4CIR (Lin et al., 25 Mar 2025)).
Modular Language-First Reasoning
CIReVL (Karthik et al., 2023) formalizes a modular pipeline: a generative VLM (e.g., BLIP-2, CoCa) captions the reference image, an LLM rewrites the caption with knowledge of the text modifier $T_m$, and CLIP retrieves images by matching the LLM output in text-to-image feature space. This approach is highly interpretable: all reasoning and composition occur in natural language and can be examined or intervened upon post-hoc.
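The language-first control flow can be summarized in a few lines; in the sketch below, `caption_image`, `rewrite_caption`, and the precomputed `gallery_embs` are hypothetical wrappers around a captioning VLM (e.g., BLIP-2), an LLM, and a CLIP image encoder, not the interfaces of any specific released implementation.

```python
from typing import Callable
import numpy as np

def language_first_retrieval(
    reference_image,                      # raw reference image I_r
    modifier_text: str,                   # text modification T_m
    gallery_embs: np.ndarray,             # (N, d) precomputed CLIP image embeddings
    caption_image: Callable,              # VLM: image -> caption (assumed wrapper)
    rewrite_caption: Callable,            # LLM: (caption, modifier) -> target caption (assumed wrapper)
    encode_text: Callable,                # CLIP text encoder: str -> (d,) embedding (assumed wrapper)
    k: int = 10,
):
    # 1) Describe the reference image in natural language.
    caption = caption_image(reference_image)
    # 2) Let the LLM compose the caption with the requested modification.
    target_caption = rewrite_caption(caption, modifier_text)
    # 3) Retrieve by text-to-image similarity in CLIP space.
    q = encode_text(target_caption)
    q = q / np.linalg.norm(q)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k], target_caption  # ranked indices + the interpretable query
```

Because the composed query is itself a natural-language caption, it can be inspected or edited before retrieval, which is the source of the pipeline's interpretability.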
OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025), and MCoT-RE (Park et al., 17 Jul 2025) further combine the visual and textual input using chain-of-thought (CoT) reasoning within an MLLM at inference, sometimes producing separate captions focused on explicit modification and on contextual preservation.
Multi-scale and Chain-of-Thought Reasoning
CoTMR (Sun et al., 28 Feb 2025) explicitly prompts a large VLM (e.g., Qwen2-VL) to reason at both global (image) and object scale using a CIRCoT protocol: it produces both a holistic target caption and explicit lists of must-have and must-not-have objects. A multi-grained scoring mechanism jointly considers these outputs in ranking candidates.
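The multi-grained scoring idea can be sketched as a weighted combination of a holistic caption match with object-level constraints; the particular weights and the use of CLIP similarity against short object prompts below are illustrative assumptions rather than the exact CoTMR scoring rule.

```python
import numpy as np

def multi_grained_score(
    cand_img_emb: np.ndarray,          # (d,) CLIP embedding of a candidate image
    target_caption_emb: np.ndarray,    # (d,) embedding of the reasoned holistic target caption
    must_have_embs: list,              # embeddings of prompts like "a photo containing a dog"
    must_not_embs: list,               # embeddings of prompts for objects that must be absent
    w_global: float = 1.0,
    w_obj: float = 0.5,
) -> float:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    global_score = cos(cand_img_emb, target_caption_emb)
    # Reward evidence for required objects, penalize evidence for excluded ones.
    obj_score = sum(cos(cand_img_emb, e) for e in must_have_embs)
    obj_score -= sum(cos(cand_img_emb, e) for e in must_not_embs)
    return w_global * global_score + w_obj * obj_score
```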
Weighted Fusion and Embedding Adjustments
Some approaches, e.g., WeiMoCIR (Wu et al., 7 Sep 2024), simply interpolate image and text embeddings for the query, $q = \alpha\, E_I(I_r) + (1-\alpha)\, E_T(T_m)$ with a modality weight $\alpha \in [0,1]$, and enhance candidate image representations by adding MLLM-generated captions to the scoring.
Plug-and-play enhancements such as Prompt Directional Vectors (PDV) (Tursun et al., 11 Feb 2025) capture the semantic shift induced by the modification text as a residual vector, $\Delta_p = \psi(\text{prompt} \oplus T_m) - \psi(\text{prompt})$ with $\psi$ the CLIP text encoder, which can be dynamically scaled and added to composed embeddings or combined with them through weighted fusion.
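Both styles of embedding arithmetic are easy to express directly; in the sketch below, the interpolation weight `alpha` and scaling factor `s` are tunable assumptions, and the prompt embeddings are presumed to come from a frozen CLIP text encoder.

```python
import numpy as np

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x)

def fused_query(img_emb: np.ndarray, txt_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """WeiMoCIR-style weighted interpolation of image and text embeddings."""
    return l2norm(alpha * l2norm(img_emb) + (1 - alpha) * l2norm(txt_emb))

def pdv_adjusted_query(img_emb: np.ndarray, base_prompt_emb: np.ndarray,
                       modified_prompt_emb: np.ndarray, s: float = 1.0) -> np.ndarray:
    """PDV-style adjustment: add the prompt-induced semantic shift as a scaled residual."""
    delta = modified_prompt_emb - base_prompt_emb   # directional vector from the text modification
    return l2norm(l2norm(img_emb) + s * delta)
```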
Hybrid and Data-efficient Strategies
HyCIR (Jiang et al., 8 Jul 2024) constructs synthetic triplets by mining visually similar image pairs, generating edit instructions via a VLM+LLM pipeline, and filtering for semantic coherence. Training then optimizes both a zero-shot image-to-text contrastive objective and a CIR triplet loss over the synthetic data.
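A hedged sketch of such a pseudo-triplet construction loop is shown below; `nearest_neighbors`, `generate_edit_instruction`, and `is_coherent` are hypothetical stand-ins for the similarity mining, VLM+LLM instruction generation, and semantic filtering stages, not the HyCIR implementation.

```python
from typing import Callable, Sequence

def build_pseudo_triplets(
    images: Sequence,                          # unlabeled image corpus
    nearest_neighbors: Callable,               # image -> visually similar images (assumed helper)
    generate_edit_instruction: Callable,       # (ref, tgt) -> edit text via VLM+LLM (assumed helper)
    is_coherent: Callable,                     # (ref, text, tgt) -> bool semantic filter (assumed helper)
    max_pairs_per_image: int = 2,
):
    triplets = []
    for ref in images:
        for tgt in nearest_neighbors(ref)[:max_pairs_per_image]:
            text = generate_edit_instruction(ref, tgt)
            # Keep only triplets whose instruction actually explains the ref -> tgt change.
            if is_coherent(ref, text, tgt):
                triplets.append((ref, text, tgt))
    return triplets
```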
DeG (Chen et al., 7 Mar 2025) addresses modality discrepancy (image vs text distribution) by supplementing image tokens with semantic information from captions and applies a selective "Semantic Set" batch mining strategy for data efficiency.
4. Datasets, Evaluation Protocols, and Performance
Several benchmarks have become standard for ZS-CIR evaluation:
Dataset | Domain | Query structure | Metrics |
---|---|---|---|
CIRR | Open-domain, natural scenes | Reference + relative caption | Recall@K, Recall_subset@K |
CIRCO | General objects from COCO, multiple ground truths | Reference + relative caption | mAP@K |
FashionIQ | Fine-grained, fashion apparel | Reference + attribute modification | Recall@10, @50 |
COCO Comp | Object composition/manipulation | Reference + compositional text | Recall@K |
Additional | e.g., GeneCIS, ImageNet domain conversion | Various | Task-specific |
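For reference, the two headline metrics can be computed per query as sketched below; the code assumes a ranked list of gallery indices and a set of ground-truth target indices (CIRCO-style multiple ground truths, reducing to a single target for CIRR and FashionIQ).

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """1.0 if any ground-truth target appears in the top-k, else 0.0 (averaged over queries upstream)."""
    return float(any(idx in targets for idx in ranked[:k]))

def average_precision_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """mAP@K-style average precision for a single query with possibly multiple ground truths."""
    hits, score = 0, 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in targets:
            hits += 1
            score += hits / rank          # precision at each hit position
    return score / min(len(targets), k) if targets else 0.0
```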
ZS-CIR methods are consistently compared against both zero-shot and fully supervised CIR baselines. Notably:
- Methods such as Pic2Word (Saito et al., 2023), SEARLE (Baldrati et al., 2023), iSEARLE (Agnolucci et al., 5 May 2024), and FTI4CIR (Lin et al., 25 Mar 2025) demonstrate that textual inversion with learned (pseudo-)token mapping networks can outperform several supervised baselines, especially on CIRR and FashionIQ.
- LLM-based, language-focused, and chain-of-thought variants (CIReVL (Karthik et al., 2023), OSrCIR (Tang et al., 15 Dec 2024), CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025)) report even higher recall and mAP scores, with OSrCIR achieving 23.87% mAP@5 on CIRCO, improving by 4.95% over the best two-stage LLM baseline.
- Plug-and-play residual and fusion enhancements (PDV (Tursun et al., 11 Feb 2025), Denoise-I2W (Tang et al., 22 Oct 2024)) provide consistent performance boosts (typically 1–4% recall or mAP improvements).
Performance gains are more pronounced where multi-modal reasoning is employed and where redundancy or the modality gap is explicitly addressed.
5. Interpretability, Scalability, and Real-world Applications
ZS-CIR’s training-free paradigm means models can directly leverage pre-trained foundation models, making them flexible and scalable:
- The language-centric pipeline (e.g., CIReVL, SQUARE, OSrCIR) is highly interpretable—every intermediate reasoning step and representation is readable or modifiable by a human.
- The explicit transformation of multimodal queries into natural language or pseudo-tokens enables transparency and potential for human-in-the-loop corrections.
- Methods only require frozen pre-trained weights, allowing application to new domains with no task-specific fine-tuning or annotation. This property is ideal for e-commerce, creative content search, design suggestion, and agile interactive systems.
Batch reranking with MLLMs (e.g., SQUARE (Wu et al., 30 Sep 2025)), local concept verification (LCR (Sun et al., 2023)), and chain-of-thought (CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025)) further enhance sample efficiency and retrieval specificity—key for extremely large or heterogeneous galleries.
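Such reranking can be expressed as a thin wrapper over the first-stage retrieval; in the sketch below, `mllm_match_score` is a hypothetical callable that asks an MLLM how well a candidate satisfies the composed intent, and the shortlist size `m` is an assumed budget parameter.

```python
from typing import Callable, Sequence

def rerank_with_mllm(
    ranked_indices: Sequence[int],      # first-stage ranking (e.g., from CLIP similarity)
    candidate_images: Sequence,         # gallery images addressable by index
    target_description: str,            # composed intent in natural language
    mllm_match_score: Callable,         # (image, description) -> float, assumed MLLM judge
    m: int = 20,                        # rerank only the top-m candidates to bound MLLM calls
):
    shortlist = list(ranked_indices[:m])
    scores = {i: mllm_match_score(candidate_images[i], target_description) for i in shortlist}
    reranked = sorted(shortlist, key=lambda i: scores[i], reverse=True)
    # Keep the remaining first-stage order after the reranked shortlist.
    return reranked + list(ranked_indices[m:])
```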
6. Limitations and Future Directions
Current limitations include:
- Coarse-grained pseudo-tokens in classical textual inversion can insufficiently capture fine local attributes, motivating research into multi-token and fine-grained decomposition (FTI4CIR (Lin et al., 25 Mar 2025)).
- Language-first pipelines may lose critical visual detail if the reference image’s semantic content is inadequately described or if prompt engineering is suboptimal.
- There remain open challenges in cross-domain robustness, modality gap (image vs text embedding), and scaling to highly complex manipulation instructions.
Open research directions include:
- Exploring richer, intent-aware multi-token mapping (see also Denoise-I2W (Tang et al., 22 Oct 2024)) and hierarchical chain-of-thought prompting for grounded inference.
- Data-efficient generalization (DeG (Chen et al., 7 Mar 2025)) via careful mining and selection within large web datasets.
- Integrating continuous prompt-parameter adaptation (PDV (Tursun et al., 11 Feb 2025)) and dynamic fusion strategies for context-aware user control.
- Expanding synthetic triplet pipelines to broader semantic domains and optimizing for both efficiency and compositionality (HyCIR (Jiang et al., 8 Jul 2024), MoTaDual (Li et al., 31 Oct 2024)).
- Improved evaluation protocols to address ambiguous or underspecified queries and to further reduce annotation biases (CIRCO design (Baldrati et al., 2023, Agnolucci et al., 5 May 2024)).
7. Representative Mathematical Formulations and Open-source Ecosystem
ZS-CIR methods are grounded in a set of core mathematical structures, primarily centered on vision-language contrastive objectives of the form
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\mathrm{sim}(u_i, v_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\mathrm{sim}(u_i, v_j)/\tau\big)},$$
where, for paired features $u_i, v_i$ (e.g., a pseudo-token text embedding and its corresponding image embedding), temperature $\tau$, and batch size $B$, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity.
Recent extensions introduce regularization losses (e.g., GPT-based, MSE), prompt scaling (PDV), and multi-component query fusion of the general form $q = \alpha\, E_I(I_r) + \beta\, E_T(\hat{T}) + \gamma\, \Delta_p$, with fusion weights $\alpha, \beta, \gamma$ and $\hat{T}$ a (rewritten) target text.
Open-source code and detailed evaluation protocols for nearly all major methods are publicly provided (e.g., github.com/google-research/composed_image_retrieval (Saito et al., 2023), github.com/miccunifi/SEARLE (Baldrati et al., 2023, Agnolucci et al., 5 May 2024), github.com/navervision/lincir (Gu et al., 2023), github.com/Pter61/denoise-i2w-tmm (Tang et al., 22 Oct 2024), github.com/Chen-Junyang-cn/PLI (Chen et al., 2023), github.com/whats2000/WeiMoCIR (Wu et al., 7 Sep 2024)).
ZS-CIR represents a maturing subfield of vision–language retrieval, with its training-free methods offering robust, scalable, and interpretable systems capable of competitive or superior performance to supervised models across several challenging benchmarks. Ongoing research is focused on higher granularity of tokenization, deeper compositional reasoning, improved cross-modal alignment, and sample efficiency—all with the goal of robustly capturing nuanced user intent at scale.