
Training-Free Zero-Shot CIR

Updated 6 October 2025
  • Training-free zero-shot composed image retrieval (ZS-CIR) is defined as retrieving target images by combining a reference image with textual modifications without task-specific training.
  • It leverages pre-trained vision-language and language models within modular pipelines to enhance interpretability, generalization, and scalability across diverse domains.
  • ZS-CIR methods achieve competitive performance on benchmarks by employing techniques such as token mapping, chain-of-thought reasoning, and weighted feature fusion.

Training-free zero-shot composed image retrieval (ZS-CIR) is a class of vision-language retrieval methods designed to identify target images in a database by integrating visual input from a reference image and compositional intent from an associated text modifier—without requiring any supervised, task-specific training (e.g., annotated triplets). These approaches aim to maximize generalization, interpretability, and scalability by leveraging pre-trained vision-language models (VLMs), large language models (LLMs), and modular pipeline architectures. Recent progress in ZS-CIR demonstrates high effectiveness in diverse domains, including fashion, e-commerce, open-world scene search, and content creation.

1. Problem Definition and Objectives

ZS-CIR is defined as retrieving a target image $\tilde{I}$ from a large gallery $\mathcal{D}$, given a query comprising a reference image $I_r$ and a textual modification $T_m$. The retrieval goal is:

$$\tilde{I} = \arg\max_{I \in \mathcal{D}} \text{sim}\big(Q(I_r, T_m), F(I)\big)$$

where $Q$ denotes the composed query representation and $F$ the feature extraction for gallery images.
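Concretely, once the composed query embedding and gallery features are available, retrieval reduces to a nearest-neighbour search under cosine similarity. The following is a minimal sketch of that step; the names `query_emb` and `gallery_feats` are placeholders for precomputed embeddings, not quantities from any specific paper:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, gallery_feats: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank gallery images by cosine similarity to the composed query.

    query_emb:     (d,)   composed query embedding Q(I_r, T_m)
    gallery_feats: (N, d) pre-extracted features F(I) for the gallery D
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # sim(Q(I_r, T_m), F(I)) for every gallery image
    return np.argsort(-sims)[:k]      # indices of the top-k candidates
```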

Key objectives include:

  • Eliminating the dependency on expensive annotated triplets (reference image, modification text, target image) for in-domain supervised training.
  • Achieving strong generalization when deploying across new content domains, manipulation types, or languages.
  • Enabling human-understandable, intervenable, and explainable retrieval workflows.
  • Maximizing performance under the constraint of minimal, non-task-specific parameter learning.

2. Methodological Taxonomy

ZS-CIR methods are diverse and have evolved rapidly. The following table summarizes notable methodological axes:

| Class | Core Operation | Notable Papers |
|---|---|---|
| Textual inversion / token mapping | Image mapped to pseudo-word(s) in CLIP space | Pic2Word (Saito et al., 2023), SEARLE (Baldrati et al., 2023), iSEARLE (Agnolucci et al., 5 May 2024), FTI4CIR (Lin et al., 25 Mar 2025) |
| Language-centric modular pipeline | Captioning + LLM-based rewriting | CIReVL (Karthik et al., 2023), OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025) |
| Multi-scale, CoT/LVLM reasoning | LVLM performs joint vision-language reasoning | CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025) |
| Weighted feature fusion | Direct fusion of image and text representations | WeiMoCIR (Wu et al., 7 Sep 2024), SQUARE (Wu et al., 30 Sep 2025) |
| Synthetic label / hybrid strategies | Pseudo-triplets from VLM/LLM, hybrid pretext tasks | HyCIR (Jiang et al., 8 Jul 2024), DeG (Chen et al., 7 Mar 2025), MoTaDual (Li et al., 31 Oct 2024) |
| Plug-and-play embedding enhancements | Dynamic prompt/embedding adjustment | PDV (Tursun et al., 11 Feb 2025), Denoise-I2W (Tang et al., 22 Oct 2024) |

These variants can be further subdivided according to their use of pseudo-token design, reasoning depth (one-stage or multi-stage), regularization strategies, and fusion/aggregation mechanisms.

3. Core Principles and Architectures

Token Mapping and Textual Inversion

Pioneered by Pic2Word (Saito et al., 2023) and SEARLE (Baldrati et al., 2023), these methods employ a lightweight mapping network $f_M$ that projects a CLIP-encoded image $v = f_e(x)$ into a pseudo-word embedding $s = f_M(v)$. The pseudo-word is concatenated with a text modifier into a prompt template (e.g., "a photo of [T*], with blue floral print"). This prompt is processed by the CLIP text encoder, facilitating early fusion and allowing pre-trained language compositionality to handle the interaction between $I_r$ and $T_m$. Optimization uses a contrastive loss between projected pseudo-tokens and the original image embeddings.
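A minimal sketch of this idea, assuming frozen CLIP encoders whose outputs are provided to the functions below, is given here. The MLP architecture, dimensions, and the simplified contrastive objective are illustrative assumptions rather than the exact configurations of Pic2Word or SEARLE; in the real pipelines the pseudo-token is inserted at a reserved placeholder position within the tokenized prompt:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMapper(nn.Module):
    """Projects a CLIP image embedding v = f_e(x) to a pseudo-word embedding s = f_M(v)."""
    def __init__(self, img_dim: int = 512, tok_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.GELU(),
            nn.Linear(hidden, tok_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_emb)    # embedding that fills the [T*] slot in the prompt

def mapping_loss(prompted_text_emb: torch.Tensor, image_emb: torch.Tensor,
                 logit_scale: float = 100.0) -> torch.Tensor:
    """Simplified contrastive alignment between the prompt containing the pseudo-token
    (encoded by the frozen CLIP text encoder) and the original image embedding."""
    t = F.normalize(prompted_text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = logit_scale * (t @ v.T)                 # (B, B) similarity matrix
    labels = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, labels)
```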

Subsequent works improve the token mapping by adding regularization on the token manifold (GPT-based losses (Baldrati et al., 2023, Agnolucci et al., 5 May 2024)), noise injection to mitigate the modality gap (Agnolucci et al., 5 May 2024), denoising and intent-aware mapping (Tang et al., 22 Oct 2024), and fine-grained decomposition (subject + attribute tokens in FTI4CIR (Lin et al., 25 Mar 2025)).

Modular Language-First Reasoning

CIReVL (Karthik et al., 2023) formalizes a modular pipeline: a generative VLM (e.g., BLIP-2, CoCa) captions the reference image, an LLM rewrites the caption with knowledge of $T_m$, and CLIP retrieves images by matching the LLM output in text-to-image feature space. This approach is highly interpretable: all reasoning and composition occur in natural language and can be examined or intervened upon post hoc.
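The sketch below outlines this language-centric flow at inference time. The `caption_image`, `rewrite_caption`, and `encode_text` callables are hypothetical placeholders standing in for a captioning VLM, an LLM, and a frozen CLIP text encoder; they are assumptions for illustration, not the APIs used in the cited works:

```python
import numpy as np

def compose_query_text(reference_image, modifier_text, caption_image, rewrite_caption) -> str:
    """Caption the reference image, then let an LLM rewrite the caption so that it
    describes the desired target image after the requested modification."""
    caption = caption_image(reference_image)          # e.g. a BLIP-2 / CoCa caption
    prompt = (f"Image description: {caption}\n"
              f"Modification: {modifier_text}\n"
              f"Describe the target image after applying the modification.")
    return rewrite_caption(prompt)                    # human-readable target description

def retrieve_text_to_image(target_caption: str, gallery_feats: np.ndarray,
                           encode_text, k: int = 10) -> np.ndarray:
    """Match the rewritten caption against gallery image features in CLIP space."""
    q = encode_text(target_caption)
    q = q / np.linalg.norm(q)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]
```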

OSrCIR (Tang et al., 15 Dec 2024), SQUARE (Wu et al., 30 Sep 2025), and MCoT-RE (Park et al., 17 Jul 2025) further combine the visual and textual input using chain-of-thought (CoT) reasoning within an MLLM at inference, sometimes producing separate captions focused on explicit modification ($C_{\text{Modi}}$) and contextual preservation ($C_{\text{Integ}}$).

Multi-scale and Chain-of-Thought Reasoning

CoTMR (Sun et al., 28 Feb 2025) explicitly prompts a large VLM (e.g., Qwen2-VL) to reason at both global (image) and object scale using a CIRCoT protocol: it produces both a holistic target caption and explicit lists of must-have and must-not-have objects. A multi-grained scoring mechanism jointly considers these outputs in ranking candidates.
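As a hedged illustration of multi-grained scoring, the sketch below combines a holistic caption score with object-level evidence; the linear weighting and the use of text-to-image similarity as an object-presence proxy are assumptions for exposition and may differ from CoTMR's exact formulation:

```python
import numpy as np

def multi_grained_score(img_feat, target_caption_feat, must_have_feats, must_not_feats,
                        w_global: float = 1.0, w_pos: float = 0.5, w_neg: float = 0.5) -> float:
    """Score a candidate image using global and object-level cues.

    img_feat:            (d,)   normalized candidate image embedding
    target_caption_feat: (d,)   normalized embedding of the reasoned target caption
    must_have_feats:     (P, d) normalized text embeddings of required objects
    must_not_feats:      (N, d) normalized text embeddings of excluded objects
    """
    global_score = float(img_feat @ target_caption_feat)
    pos = float(np.mean(must_have_feats @ img_feat)) if len(must_have_feats) else 0.0
    neg = float(np.mean(must_not_feats @ img_feat)) if len(must_not_feats) else 0.0
    return w_global * global_score + w_pos * pos - w_neg * neg
```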

Weighted Fusion and Embedding Adjustments

Some approaches, e.g., WeiMoCIR (Wu et al., 7 Sep 2024), simply average or interpolate image and text embeddings for the query, $q = (1-\alpha) \cdot v + \alpha \cdot t$, and enhance candidate image representations by adding MLLM-generated captions to the scoring.
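In code this fusion is a one-line interpolation; a minimal sketch, assuming L2-normalized CLIP image and text embeddings, is:

```python
import numpy as np

def weighted_fusion(image_emb: np.ndarray, text_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """q = (1 - alpha) * v + alpha * t, re-normalized for cosine-similarity retrieval."""
    q = (1.0 - alpha) * image_emb + alpha * text_emb
    return q / np.linalg.norm(q)
```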

Plug-and-play enhancements such as Prompt Directional Vectors (PDV) (Tursun et al., 11 Feb 2025) capture the semantic shift induced by the prompt text as a residual vector and allow dynamic adjustment/scaling of composed embeddings, as well as weighted fusions:

$$\Phi_{\text{PDV-F}} = (1-\beta)\big(\Psi_I(I_r) + \alpha_I \Delta_{\text{PDV}}\big) + \beta\big(\Psi_T(\mathcal{F}(I_r)) + \alpha_T \Delta_{\text{PDV}}\big)$$
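The sketch below mirrors the weighted form above. It assumes the directional vector has already been computed (in PDV, as a shift derived from the prompt text) and that the scaling factors are hyperparameters; all names are illustrative, not taken from the PDV codebase:

```python
import numpy as np

def pdv_fused_query(image_emb: np.ndarray, caption_emb: np.ndarray, pdv: np.ndarray,
                    alpha_img: float = 1.0, alpha_txt: float = 1.0, beta: float = 0.5) -> np.ndarray:
    """Fuse image and caption embeddings, each shifted by the prompt directional vector."""
    shifted_img = image_emb + alpha_img * pdv      # Psi_I(I_r) + alpha_I * Delta_PDV
    shifted_txt = caption_emb + alpha_txt * pdv    # Psi_T(F(I_r)) + alpha_T * Delta_PDV
    q = (1.0 - beta) * shifted_img + beta * shifted_txt
    return q / np.linalg.norm(q)
```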

Hybrid and Data-efficient Strategies

HyCIR (Jiang et al., 8 Jul 2024) constructs synthetic triplets by extracting visually similar image pairs, generating edit instructions via VLM+LLM, and filtering for semantic coherence. Training optimizes both a zero-shot image-to-text objective and a CIR triplet loss.

DeG (Chen et al., 7 Mar 2025) addresses modality discrepancy (image vs text distribution) by supplementing image tokens with semantic information from captions and applies a selective "Semantic Set" batch mining strategy for data efficiency.

4. Datasets, Evaluation Protocols, and Performance

Several benchmarks have become standard for ZS-CIR evaluation:

| Dataset | Domain | Query structure | Metrics |
|---|---|---|---|
| CIRR | Open-domain, natural scenes | Reference + relative caption | Recall@K, mAP |
| CIRCO | General objects from COCO, multiple ground truths | Reference + relative caption | Recall@K, mAP |
| FashionIQ | Fine-grained, fashion apparel | Reference + attribute modification | Recall@10, Recall@50 |
| COCO Comp | Object composition/manipulation | Reference + compositional text | Recall@K |
| Additional (e.g., GeneCIS, ImageNet domain conversion) | Various | Various | Task-specific |

ZS-CIR methods are consistently compared against both zero-shot and fully supervised CIR baselines. Performance gains are most pronounced where multi-modal reasoning is used and where redundancy or the modality gap is explicitly addressed.

5. Interpretability, Scalability, and Real-world Applications

ZS-CIR’s training-free paradigm means models can directly leverage pre-trained foundation models, making them flexible and scalable:

  • The language-centric pipeline (e.g., CIReVL, SQUARE, OSrCIR) is highly interpretable—every intermediate reasoning step and representation is readable or modifiable by a human.
  • The explicit transformation of multimodal queries into natural language or pseudo-tokens enables transparency and potential for human-in-the-loop corrections.
  • Methods only require frozen pre-trained weights, allowing application to new domains with no task-specific fine-tuning or annotation. This property is ideal for e-commerce, creative content search, design suggestion, and agile interactive systems.

Batch reranking with MLLMs (e.g., SQUARE (Wu et al., 30 Sep 2025)), local concept verification (LCR (Sun et al., 2023)), and chain-of-thought (CoTMR (Sun et al., 28 Feb 2025), MCoT-RE (Park et al., 17 Jul 2025)) further enhance sample efficiency and retrieval specificity—key for extremely large or heterogeneous galleries.
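As an illustration of the retrieve-then-rerank pattern referenced above, the sketch below shortlists candidates with embedding similarity and then asks an MLLM to reorder the shortlist. The `score_with_mllm` callable is a hypothetical helper standing in for the MLLM scoring step, not an API from the cited papers:

```python
import numpy as np

def retrieve_then_rerank(query_emb, gallery_feats, gallery_images,
                         score_with_mllm, shortlist: int = 20, k: int = 10):
    """Stage 1: similarity-based shortlist. Stage 2: MLLM reranking of the shortlist."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    candidates = np.argsort(-(g @ q))[:shortlist]
    # The MLLM sees the (reference image, modifier, candidate) context and returns a score.
    scores = np.asarray([score_with_mllm(gallery_images[i]) for i in candidates])
    order = np.argsort(-scores)
    return [int(candidates[i]) for i in order[:k]]
```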

6. Limitations and Future Directions

Current limitations include:

  • Coarse-grained pseudo-tokens in classical textual inversion can insufficiently capture fine local attributes, motivating research into multi-token and fine-grained decomposition (FTI4CIR (Lin et al., 25 Mar 2025)).
  • Language-first pipelines may lose critical visual detail if the reference image’s semantic content is inadequately described or if prompt engineering is suboptimal.
  • There remain open challenges in cross-domain robustness, modality gap (image vs text embedding), and scaling to highly complex manipulation instructions.

Open research directions include finer-grained tokenization, deeper compositional reasoning, improved cross-modal alignment, and greater sample efficiency.

7. Representative Mathematical Formulations and Open-source Ecosystem

ZS-CIR methods are grounded in a set of core mathematical structures, primarily centered on vision-language contrastive objectives, $L_{\text{con}} = L_{t2i} + L_{i2t}$, where, for features $u, v$ and temperature $\tau$,

$$L_{t2i} = -\log \frac{\exp(\tau \cdot u_i^\top v_i)}{\sum_j \exp(\tau \cdot u_i^\top v_j)}$$
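For reference, a minimal PyTorch sketch of the symmetric objective above, treating $\tau$ as a logit scale multiplying the similarities as in the formula, is:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(u: torch.Tensor, v: torch.Tensor, tau: float = 100.0) -> torch.Tensor:
    """Symmetric contrastive loss L_con = L_t2i + L_i2t for paired features u (text) and v (image)."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = tau * (u @ v.T)                         # (B, B) similarity matrix
    labels = torch.arange(u.size(0), device=u.device)
    l_t2i = F.cross_entropy(logits, labels)          # text -> image direction
    l_i2t = F.cross_entropy(logits.T, labels)        # image -> text direction
    return l_t2i + l_i2t
```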

Recent extensions introduce regularization losses (e.g., GPT-based, MSE), prompt scaling (PDV), and multi-component query fusion:

$$q = (1 - \alpha) \cdot v + \alpha \cdot t; \quad q' = (1 - \beta) \cdot q + \beta \cdot \text{MLLM}(I_r, T_m)$$

Open-source code and detailed evaluation protocols for nearly all major methods are publicly provided (e.g., github.com/google-research/composed_image_retrieval (Saito et al., 2023), github.com/miccunifi/SEARLE (Baldrati et al., 2023, Agnolucci et al., 5 May 2024), github.com/navervision/lincir (Gu et al., 2023), github.com/Pter61/denoise-i2w-tmm (Tang et al., 22 Oct 2024), github.com/Chen-Junyang-cn/PLI (Chen et al., 2023), github.com/whats2000/WeiMoCIR (Wu et al., 7 Sep 2024)).


ZS-CIR represents a maturing subfield of vision–language retrieval, with training-free methods offering robust, scalable, and interpretable systems whose performance is competitive with, or superior to, supervised models across several challenging benchmarks. Ongoing research focuses on higher granularity of tokenization, deeper compositional reasoning, improved cross-modal alignment, and sample efficiency, all with the goal of robustly capturing nuanced user intent at scale.
