
Instance-Level Composed Image Retrieval

Published 29 Oct 2025 in cs.CV | (2510.25387v2)

Abstract: The progress of composed image retrieval (CIR), a popular research direction in image retrieval where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.

Summary

  • The paper introduces an instance-centric CIR evaluation dataset that resolves annotation ambiguities by focusing on physical object instances under diverse modifications.
  • The paper presents BASIC, a training-free baseline leveraging pretrained VLMs with bias removal, semantic subspace projection, and multiplicative fusion to achieve a macro-mAP of 31.6%.
  • The paper demonstrates that careful dataset curation and explicit modality treatment can yield competitive, scalable retrieval performance without the need for retraining.

Instance-Level Composed Image Retrieval: Methods and Benchmarking

Introduction and Motivation

The paper "Instance-Level Composed Image Retrieval" (2510.25387) addresses significant limitations in the evaluation and development of composed image retrieval (CIR), where the objective is to retrieve database images matching both a query image and an associated textual modification. Previous CIR datasets largely operate at a class/semantic level, suffer from ambiguous positive pairs, and are vulnerable to unimodal shortcuts. This work introduces an instance-centric CIR evaluation protocol and dataset, along with a strong empirical baseline named BASIC, a training-free method leveraging pretrained vision-language models (VLMs) with principled modality fusion.

Dataset Design: Instance-Level CIR

The central contribution is the construction of an instance-level CIR dataset, focusing on retrieval of images depicting the same physical instance of an object (e.g., a particular landmark or product), under diverse textual modifications (e.g., "at sunset", "from an aerial viewpoint"). The dataset curation process involves selection of meaningful modifications per instance, manual verification of positive images, and systematic mining of hard negatives: visual (matching the instance, not the modifier), textual (matching the modification, not the instance), and composed (almost—but not fully—matching both).
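The three negative types described above can be pictured as regions in a two-score space. Below is a minimal sketch of that taxonomy, assuming each candidate is scored for similarity to the query instance and to the textual modification; the threshold values and function name are illustrative assumptions, not the paper's curation pipeline.

```python
def negative_type(instance_sim, modifier_sim, t_hi=0.7, t_lo=0.3):
    """Classify a non-positive candidate by its two similarity scores.
    Thresholds t_hi/t_lo are illustrative, not taken from the paper."""
    if instance_sim >= t_hi and modifier_sim < t_lo:
        return "visual"    # right instance, wrong modification
    if modifier_sim >= t_hi and instance_sim < t_lo:
        return "textual"   # right modification, wrong instance
    if instance_sim >= t_hi and modifier_sim >= t_lo:
        return "composed"  # close to both, but not a verified positive
    return "easy"
```

In practice the paper's curation is semi-automated with manual verification; this sketch only conveys the intended partition of the candidate pool.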

The design facilitates precise evaluation by:

  • Preventing annotation ambiguity and false negatives,
  • Including multiple, challenging negative types,
  • Supporting photo-real and context-shifted retrieval scenarios,
  • Balancing compactness (average 3.7K images/database) and retrieval difficulty (equivalent to >40M random distractors in LAION, as demonstrated experimentally).
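Performance on the benchmark is reported as macro-mAP. A minimal sketch of that metric, assuming per-query average precision is first averaged within each instance category and then across categories (the grouping is our reading of "macro", not the paper's code):

```python
from collections import defaultdict

def average_precision(ranked_relevance):
    """AP for one query: mean precision at the rank of each relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def macro_map(queries):
    """queries: iterable of (category, ranked 0/1 relevance list) pairs."""
    per_cat = defaultdict(list)
    for cat, rel in queries:
        per_cat[cat].append(average_precision(rel))
    cat_means = [sum(aps) / len(aps) for aps in per_cat.values()]
    return sum(cat_means) / len(cat_means)
```

Macro averaging keeps categories with few queries from being drowned out by populous ones, which matters for the per-category robustness claims discussed later.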

BASIC: A Training-Free, Strong Baseline for CIR

BASIC is built to exploit pretrained VLMs such as CLIP. It dispenses with labor-intensive or synthetic triplet mining, instead computing and fusing modality-specific similarities (image-to-image and text-to-image) directly on frozen representations. The key technical innovations and system components are as follows:

  • Bias Removal via Centering: Mean feature subtraction mitigates modality-specific distributional bias, computed on large external datasets (e.g., LAION for vision, a curated concept corpus for text).
  • Semantic Subspace Projection: Image embeddings are projected onto a subspace optimized to retain object-specific (not style or context) information. This subspace is constructed from textual representations with positive and negative corpora, using top-k eigenvectors of a contrastive covariance matrix. This step enhances instance discrimination and modifier-invariance.
  • Textual Query Contextualization: To bridge CLIP's pretraining mismatch (full captions versus partial modifier text), queries are augmented using a subject corpus, forming caption-like queries for more robust text embedding.
  • Score Normalization and Fusion: Modal scores are min-normalized (to account for range imbalances), then multiplied (logical AND) and regularized with a Harris-corner-inspired criterion to penalize results that are strong in only a single modality.

    Figure 2: Overview of the BASIC pipeline, including centering, semantic projection, text contextualization, modality scoring, and multiplicative Harris-regularized fusion.

All these operations are efficient and query-side, requiring no re-embedding or backpropagation on the database side, and are thus suitable for large-scale, real-world image retrieval systems.
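The component list above can be sketched end to end. The following NumPy illustration is a hedged reconstruction from the description only: the function names, the exact penalty form, the normalization statistics, and all default values are assumptions, not the paper's implementation.

```python
import numpy as np

def center(feats, mean):
    """Bias removal: subtract a modality-specific mean (estimated on an
    external corpus), then re-normalize to unit length."""
    out = feats - mean
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

def semantic_subspace(pos_txt, neg_txt, k=64, alpha=1.0):
    """Top-k eigenvectors of a contrastive covariance built from positive
    (object) and negative (style/context) text embeddings."""
    cov = pos_txt.T @ pos_txt - alpha * (neg_txt.T @ neg_txt)
    _, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return vecs[:, -k:]             # (d, k) object-specific basis

def project(feats, basis):
    """Project embeddings onto the semantic subspace and re-normalize."""
    proj = feats @ basis @ basis.T
    return proj / np.linalg.norm(proj, axis=-1, keepdims=True)

def min_normalize(scores, score_min):
    """Shift a modality's scores by an estimated minimum so ranges match."""
    return np.clip(scores - score_min, 0.0, None)

def fuse(s_img, s_txt, lam=0.05):
    """Multiplicative (AND-like) fusion with a Harris-corner-style penalty,
    echoing the corner criterion det - lambda * trace**2."""
    return s_img * s_txt - lam * (s_img + s_txt) ** 2
```

The product term rewards candidates that match both queries, while the squared-sum penalty suppresses candidates whose score mass comes from a single modality, mirroring the logical-AND behavior described above.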

Experimental Results and Empirical Analysis

The authors provide extensive benchmarking across their instance-level CIR dataset and multiple class-level legacy datasets. BASIC consistently demonstrates superior or competitive performance compared to a suite of strong baselines and published state-of-the-art methods, including training-based, pseudo-triplet, and zero-shot approaches.

Key empirical findings include:

  • Substantial absolute performance gains: BASIC achieves a macro-mAP of 31.6% on the proposed instance-level task, notably outperforming all prior zero-shot and training-based CIR methods, including large-scale generative approaches and domain-specialized models.
  • Strong per-category robustness: BASIC outperforms specialized methods across most object and modifier categories, with particularly pronounced margins on highly compositional instances (fictional, product, technology).
  • Compact-yet-hard benchmark: as empirically demonstrated, the ~3.7K-image hard-negative databases are as challenging for strong baselines as 40M random images from LAION.
  • Genuine modality composition: Experiments sweeping the text/image fusion weight show the proposed dataset yields synergistic gains for multimodal fusion, whereas legacy datasets often favor text-only matching.
  • Transparent, cumulative ablation: Each BASIC component, from centering and projection to regularization and contextualization, is shown to incrementally and robustly enhance macro-mAP, affirming the necessity of multi-faceted representation treatment in practical CIR.
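The fusion-weight sweep mentioned above can be reproduced schematically. The sketch below assumes precomputed query-to-database similarity vectors for each modality and a binary relevance vector (all NumPy arrays); the function name and linear-fusion form are illustrative, not the paper's experimental code.

```python
import numpy as np

def weighted_fusion_sweep(sim_img, sim_txt, relevance, weights):
    """Return (weight, AP) pairs for a linear text/image late-fusion sweep.
    relevance is a 0/1 NumPy array aligned with the database order."""
    results = []
    for w in weights:
        fused = (1.0 - w) * sim_img + w * sim_txt
        order = np.argsort(-fused)          # rank database by fused score
        ranked = relevance[order]
        hits = np.cumsum(ranked)
        # precision at the rank of each relevant item
        precisions = hits[ranked == 1] / (np.flatnonzero(ranked) + 1)
        ap = float(precisions.mean()) if ranked.any() else 0.0
        results.append((w, ap))
    return results
```

On a benchmark with genuine modality composition, the AP curve should peak at an intermediate weight rather than at either unimodal endpoint.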

Hyperparameter Sensitivity

The paper presents an analysis of BASIC's robustness to its core hyperparameters: the contrastive scaling parameter α in semantic projection, the number of PCA components k, and the Harris regularization weight λ. Results indicate broad plateaus, supporting the method's transferability and low risk of overfitting.

Figure 4: Sensitivity of BASIC to hyperparameters: variation of mAP with respect to α, k, and λ across multiple datasets.
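Such a sensitivity analysis amounts to a grid sweep over the three hyperparameters. A minimal sketch, where `evaluate` is a hypothetical callback returning macro-mAP for one configuration and the grids are assumptions:

```python
from itertools import product

def sweep(evaluate, alphas, ks, lams):
    """Exhaustive grid sweep; returns the best (alpha, k, lam) and its score."""
    best = max(
        product(alphas, ks, lams),
        key=lambda cfg: evaluate(*cfg),
    )
    return best, evaluate(*best)
```

Broad plateaus in such a sweep mean many configurations score near the maximum, which is what the paper reports for α, k, and λ.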

Implications and Future Considerations

The contributions of this paper have implications for both practical application and theoretical development in multimodal retrieval:

  • Benchmark design: By addressing annotation ambiguity, hard-negative construction, and instance granularity, the dataset sets a new standard for CIR evaluation, discouraging unimodal shortcuts and supporting broader modifier diversity.
  • Training-free methods: BASIC's architecture, relying on analytically derived projection and centering/normalization, suggests that retraining or finetuning large VLMs is not strictly necessary to attain strong CIR accuracy, especially when modal representations are explicitly treated for bias and compositionality.
  • Transfer and customization: The use of external corpora for semantic subspace projection opens further avenues for domain adaptation and task-specific tailoring in retrieval pipelines without full retraining.

Potential developments include leveraging generative methods to expand the coverage of rare modifier-instance pairs, adapting projection and contextualization for video or temporally evolving object instances, and integrating learned fusion mechanisms that extend beyond AND-based logic, especially in more entangled textual-visual tasks.

Conclusion

"Instance-Level Composed Image Retrieval" (2510.25387) advances CIR by formalizing an instance-centric evaluation benchmark featuring challenging negatives and by proposing BASIC, a simple yet surprisingly effective, training-free CIR baseline that sets a new empirical standard. This work highlights the importance of data curation and explicit modality treatment for evaluating and developing VLM-based retrieval systems, and provides a principled foundation for compositional retrieval in both research and applied multimedia scenarios.

Knowledge Gaps, Limitations, and Open Questions

The paper makes notable contributions but leaves several concrete issues unresolved that future work can address:

  • [Dataset] Lack of a global, shared retrieval index: evaluation uses a separate database per instance, avoiding cross-instance distractors and not reflecting real-world, single-pool retrieval at scale. An open need is a unified benchmark where all instances share one large index to measure cross-instance confusion and scalability.
  • [Dataset] Potential CLIP-induced curation bias: hard negatives and candidate pools are mined with CLIP, which can bias the dataset toward CLIP’s embedding geometry and failure modes. It remains unclear how methods not aligned with CLIP are affected. A CLIP-agnostic negative-mining pipeline and bias analysis are needed.
  • [Dataset] Limited modification diversity per instance: each instance has only 1–5 text modifications (median 2), constraining linguistic and compositional variety. There is a gap in broader, fine-grained, and combinatorial modifications (e.g., multi-step edits, relational constraints, negations like “without people,” spatial constraints), and measuring performance under such richer queries.
  • [Dataset] Ambiguity and consistency of “same instance” labeling: the paper provides no inter-annotator agreement, adjudication protocol, or ambiguity analysis—especially for fictional characters or mass-produced products where “visually indistinguishable” is subjective. A formal definition, annotation guidelines, and reliability studies are missing.
  • [Dataset] False-negative risk from “unmarked = negative”: unselected images in the candidate pool are treated as negatives, which risks missing positives (or near-positives) and inflating precision. Quantifying and reducing residual false negatives is an open task.
  • [Dataset] Generalization beyond LAION: all images come from LAION, which is known to carry sampling, geographic, and content biases. The dataset’s representativeness, fairness, and robustness beyond LAION are not assessed. Cross-source validation and bias audits are needed.
  • [Dataset] Minimal coverage analysis: while the paper lists broad categories (landmarks, products, fictional, etc.), it lacks a principled analysis of category balance, geographic/cultural diversity, and long-tail coverage. Guidance on expansion to underrepresented instances and contexts is missing.
  • [Dataset] Overlap and leakage across instances: with per-instance databases, it is unclear whether near-duplicate images or visually indistinguishable depictions of different instances appear across separate indices and how this would impact evaluation in a shared index.
  • [Dataset] Claim of challenge comparable to 20–40M distractors is not fully specified for reproducibility (e.g., which distractors, how difficulty is calibrated). Standardized large-scale distractor sets and replication protocols are needed to validate this claim.
  • [Dataset] Reproducibility details of data release: the paper does not specify whether LAION IDs, selection scripts, annotation schemas, and seed queries/corpora are released. Without these, exact reproduction, extension, or auditing of the dataset is difficult.
  • [Evaluation] Single metric focus (macro-mAP): no reporting of top-k metrics (e.g., Recall@k), per-negative-type error rates (visual vs textual vs composed), or calibration/uncertainty metrics. A deeper diagnostic evaluation suite is needed to pinpoint failure modes.
  • [Evaluation] No multilingual assessment: queries and corpora appear to be English-only; the dataset and method’s cross-lingual generalization is untested. A multilingual benchmark split and evaluation protocols would enable broader applicability.
  • [Training] No standardized training split for instance-level CIR: while the dataset serves evaluation, many methods rely on training triplets. The lack of a high-quality, instance-level training set (or protocol for building one) continues to hinder learning-based approaches.
  • [Method: BASIC] Sensitivity to automatically generated corpora (C+ and C−): object/style corpora are produced via ChatGPT; the impact of corpus quality, size, and domain match on performance is unquantified. Guidance on constructing robust, domain-adaptive corpora is missing.
  • [Method: BASIC] Dependence on synthetic data for score normalization: min-based normalization uses statistics from prompts/images generated by Stable Diffusion. The sensitivity of results to the choice of generator, prompt distribution, and domain gap is not analyzed.
  • [Method: BASIC] Projection design choices are underexplored: the effect of projection dimensionality k, contrastive weighting α, corpus composition, and alternative subspace estimation methods (e.g., supervised or self-supervised directions) is not systematically studied.
  • [Method: BASIC] Fixed fusion with a Harris-like penalty: the multiplicative fusion and fixed λ may be suboptimal across queries or datasets. There is no adaptive or learned fusion mechanism, nor an analysis of per-query modality dominance detection and calibration.
  • [Method: BASIC] Query expansion harms instance-level CIR in this dataset, yet no alternative expansion or feedback strategies (e.g., modality-aware, modification-aware, or pseudo-relevance feedback for text) are explored to recover benefits without introducing drift.
  • [Method: BASIC] Limited modeling of text–image interactions: the approach uses independent similarities and late fusion, which may miss fine-grained cross-modal interactions that cross-attention or composition-specific encoders can capture. Exploring lightweight, training-free interaction mechanisms remains open.
  • [Method: BASIC] Robustness and OOD behavior: the method and dataset avoid implausible modifications. Generalization to rare/novel or contradictory edits, hard negations, compositional chains (multi-step text edits), and adversarial phrasing is not evaluated.
  • [Method: BASIC] Model generality across VLMs is untested: results are primarily with CLIP ViT-L/14; it is unknown how BASIC transfers to other VLMs (e.g., SigLIP, BLIP2, ALIGN) and whether components (centering, projection, normalization) require re-tuning.
  • [Scalability] Large-scale runtime and memory at web scale: while the paper argues linear/sub-linear complexity and refers to FAISS, there is no empirical evaluation on truly massive indices (>100M) or analysis of latency/throughput trade-offs when applying query-time projection/contextualization.
  • [User personalization] The paper notes projection can be applied on the query side to enable customization but provides no studies on user- or domain-specific priors, learned or rule-based personalization, or mechanisms to adapt projections on-the-fly.
  • [Ethics and safety] No discussion of privacy, copyright, or sensitive content in LAION-derived images; nor of fairness across demographics and cultures in instance selection and modifications. Ethical guidelines and audits are absent.
  • [Benchmark extensibility] No protocol for adding new instances, new modification types, or cross-modal tasks (e.g., video or multi-image queries), and how to maintain comparability over time. A clear governance and versioning plan is missing.
