Conditional Image Retrieval: CORE Benchmark

Updated 6 April 2026
  • Conditional Image Retrieval (CORE Benchmark) is a framework that retrieves images or specific regions based on reference exemplars combined with explicit text-based modifications.
  • It extends traditional retrieval by integrating region-level segmentation and open-set generalization, evaluated through metrics like Recall@K, Dice, and IoU.
  • The system employs advanced fusion of image and text encoders, efficient indexing structures, and automated triplet mining to achieve robust and scalable performance.

Conditional Image Retrieval (CORE Benchmark) designates a rigorous empirical and algorithmic framework for selecting images or objects from a large corpus, conditioned on both reference exemplars and explicit user-specified constraints or modifications. CORE benchmarks exemplify a new generation of retrieval systems that not only optimize for general similarity but also respond flexibly to compositional instructions—such as focusing, changing, or combining object-level and attribute-level cues—in both zero-shot and fine-grained scenarios. These benchmarks evolve prior “Composed Image Retrieval” (CIR) tasks into region- and transformation-specific evaluations, supporting open-set generalization, segmentation-based metrics, and robust protocol design (Vaze et al., 2023, Wang et al., 6 Aug 2025).

1. Formal Problem Definitions and Mathematical Framework

Conditional image retrieval extends traditional instance or content-based image retrieval by enforcing explicit conditioning. Given a retrieval database, a reference modality (image, region, or feature), and a structured constraint or modification, the goal is to identify (and in advanced variants, to segment) the image regions that satisfy the composed query.

General Conditional Image Similarity (GeneCIS):

The retrieval function is defined as:

f(I^T; I^R, c) \in \mathbb{R}

where I^T is a candidate (target) image, I^R is a reference image, and c is a text condition. The approach learns encoders \Phi: \text{Images} \to \mathbb{R}^d and \Psi: \text{Text} \to \mathbb{R}^d, and a combiner g(x^R, e) producing the joint query vector, with the final similarity score given by:

f(I^T; I^R, c) = g(x^R, e) \cdot x^T

GeneCIS focuses on scenarios where the set of possible conditions c is unbounded (“open-set”), forcing models to generalize beyond seen transformations or attribute-object combinations (Vaze et al., 2023).
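The scoring rule above can be sketched in a few lines of NumPy. This is a toy illustration, not the GeneCIS implementation: the encoders are replaced by random stand-in vectors, and the combiner g(x^R, e) is an assumed single linear layer over the concatenation of reference and condition embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension; real systems use e.g. 512-dim CLIP features

# Stand-in encoder outputs (random here): x^R = Phi(I^R), e = Psi(c),
# and one x^T embedding per candidate target image in the gallery.
x_ref = rng.normal(size=d)
e_cond = rng.normal(size=d)
gallery = rng.normal(size=(15, d))

# Hypothetical combiner weights: one linear map over the concatenation.
W = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)

def combiner(x_r, e):
    """g(x^R, e): fuse reference and condition into one unit query vector."""
    q = W @ np.concatenate([x_r, e])
    return q / np.linalg.norm(q)  # unit-normalise, as in cosine retrieval

query = combiner(x_ref, e_cond)
scores = gallery @ query       # f(I^T; I^R, c) = g(x^R, e) . x^T per candidate
best = int(np.argmax(scores))  # top-1 retrieved gallery index
```

The key design point is that ranking reduces to a single dot product per candidate, so the gallery embeddings can be precomputed and indexed.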

Composed Object Retrieval (COR) and CORE:

Let Q = (I^R, M^R, T) denote a composed query: I^R is a reference image, M^R a binary mask selecting the reference object o^R, and T a tuple of noun-free text transformations. Given a target image I^T, the system must output a mask M^T segmenting the object(s) in I^T that satisfy the composite query:

M^T = f(I^T; I^R, M^R, T)

The CORE benchmark requires region-level, attribute-conditioned, and transformation-specific accuracy, relying on both recognition and localization (Wang et al., 6 Aug 2025).

2. Benchmark Construction, Tasks, and Evaluation Protocols

GeneCIS (Vaze et al., 2023) operationalizes conditional similarity with four core tasks and open-set attribute or object conditions:

  • Focus on Attribute (“same color”)
  • Change Attribute (“but in black”)
  • Focus on Object (“focus on refrigerator in the scene”)
  • Change Object (“same scene but add a ceiling”)

These tasks utilize well-controlled galleries (typically 10–15 candidates), each with a single positive example and carefully selected distractors (sharing either the scene or the conditioned element, but not both). GeneCIS uses Recall@K as the quantitative retrieval metric, with small K for the controlled galleries and larger K for global retrieval.
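With exactly one positive per gallery, Recall@K reduces to checking whether the positive appears in the top-K ranked candidates. A minimal sketch (the score matrix here is made up for illustration):

```python
import numpy as np

def recall_at_k(scores, positive_idx, k):
    """Recall@K for galleries with exactly one positive per query.

    scores: (num_queries, gallery_size) similarity matrix.
    positive_idx: index of the single positive item in each gallery.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]  # top-K candidate indices
    hits = (topk == np.asarray(positive_idx)[:, None]).any(axis=1)
    return hits.mean()

# Toy check: 3 queries over 5-item galleries, positives at indices 0, 2, 4.
scores = np.array([
    [0.9, 0.1, 0.2, 0.3, 0.4],  # positive (idx 0) ranked 1st
    [0.5, 0.6, 0.4, 0.1, 0.2],  # positive (idx 2) ranked 3rd
    [0.3, 0.2, 0.1, 0.0, 0.9],  # positive (idx 4) ranked 1st
])
r1 = recall_at_k(scores, [0, 2, 4], k=1)  # 2 of 3 hits
r3 = recall_at_k(scores, [0, 2, 4], k=3)  # all 3 hits
```

Small galleries keep Recall@1 informative: with 10–15 candidates, chance performance sits near 7–10%, so gains above that are attributable to conditioning.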

CORE and COR127K (Wang et al., 6 Aug 2025) advance the protocol to region-level queries:

  • Each annotated triplet defines a reference region, a set of attribute transformations, and the matching target segment.
  • 408 categories, 127,166 annotated triplets, divided into base and novel splits to facilitate compositional generalization.
  • Tasks include color, shape, texture, pose, orientation, and spatial transformations.
  • Evaluation is by mask overlap metrics: Dice coefficient, Intersection-over-Union (IoU), mean Absolute Error (MAE), with both per-instance and per-category statistics (mDice, mIoU).

By using noun-free transformation queries, CORE/COR127K ensures relevance for open-vocabulary and unseen-category scenarios.
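The mask-overlap metrics above are standard and easy to state precisely. A self-contained sketch for binary masks (the 4×4 example masks are invented for illustration):

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou(pred, gt, eps=1e-7):
    """IoU = |P ∩ G| / |P ∪ G| on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def mae(pred, gt):
    """Mean absolute error between {0,1} masks = pixel disagreement rate."""
    return np.abs(pred.astype(float) - gt.astype(float)).mean()

gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True      # 4-pixel square
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True  # over-segmented
# inter = 4, |P| = 6, |G| = 4  ->  Dice = 8/10 = 0.8, IoU = 4/6
```

mDice and mIoU then average these per-instance values within each category before averaging across categories, so rare categories are not swamped by frequent ones.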

3. Model Architectures and Indexing Algorithms

GeneCIS: Models use separate image and text encoders (e.g., CLIP ViT), a Combiner network, and dot-product similarity. Key baselines:

  • Image Only: CLIP-image embedding nearest neighbor.
  • Text Only: CLIP-text embedding nearest neighbor using the condition text c alone.
  • Image + Text: Average of reference and text embeddings.
  • Combiner(CIRR): Trained on CIRR and evaluated zero-shot.
  • Combiner (CC3M), i.e., “Ours”: Trained using mined triplets from CC3M, further boosting performance. Triplet mining involves extracting (Subject, Predicate, Object) tuples from captions and constructing matching triplets for contrastive learning.
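The three training-free baselines differ only in which query vector is scored against the gallery. A minimal sketch with made-up stand-in embeddings (real baselines would use actual CLIP image/text features in the shared embedding space):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 8  # toy dimension standing in for CLIP's shared embedding space

x_ref = normalize(rng.normal(size=d))          # image embedding of I^R
e_cond = normalize(rng.normal(size=d))         # text embedding of condition c
gallery = normalize(rng.normal(size=(10, d)))  # candidate target embeddings

scores_image_only = gallery @ x_ref                      # "Image Only"
scores_text_only = gallery @ e_cond                      # "Text Only"
scores_image_text = gallery @ normalize(x_ref + e_cond)  # "Image + Text"
```

The trained Combiner variants replace the fixed average with a learned fusion network, which is what the benchmark numbers in Section 4 compare against.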

CORE Model (Wang et al., 6 Aug 2025) (region-level, object-centric):

  • Reference Region Embedding (RRE): Combines SAM-based image features, reference mask, and multi-stage convolutional enhancements.
  • Adaptive Vision–Text Interaction (AVTI): Jointly attends to region- and instruction-level features using attention and gating mechanisms, outputting a prompt for segmentation.
  • Region-level Contrastive Loss: Maximizes similarity between composed query and true foreground, minimizes with respect to background.
  • Decoder: SAM head segments the object region in the target image.
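The region-level contrastive objective can be illustrated with a toy NumPy sketch. This is an assumed simplification, not the CORE loss: foreground and background features are mask-pooled into single vectors, and a two-way softmax cross-entropy pulls the composed query toward the foreground.

```python
import numpy as np

def region_contrastive_loss(query, feats, fg_mask, tau=0.07):
    """Toy region-level contrastive loss: pull the composed query toward
    mask-pooled foreground features, push it away from background features."""
    q = query / np.linalg.norm(query)

    def pool(mask):
        f = feats[mask].mean(axis=0)  # average features inside the mask
        return f / np.linalg.norm(f)

    pos = q @ pool(fg_mask)           # similarity to foreground region
    neg = q @ pool(~fg_mask)          # similarity to background region
    logits = np.array([pos, neg]) / tau
    logits -= logits.max()            # numerically stable softmax
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])              # cross-entropy with positive label 0

rng = np.random.default_rng(2)
feats = rng.normal(size=(16, 8))      # per-location target-image features
fg = np.zeros(16, dtype=bool); fg[:4] = True
query = feats[:4].mean(axis=0)        # query aligned with the foreground
loss = region_contrastive_loss(query, feats, fg)
```

A query aligned with the foreground yields a low loss; flipping its sign raises it, which is the behavior the contrastive term rewards.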

Algorithmic Indexing (MosAIc):

  • Ball-trees, KD-trees, and RP trees enable sublinear conditional k-NN retrieval by maintaining an inverted index (“conditional pruning”).
  • Theoretical bounds formalize the efficiency and coverage guarantees for arbitrary conditioned subsets (Hamilton et al., 2020).
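The conditional-pruning idea can be sketched with an inverted index over condition labels. This toy version does exact brute-force search within the pruned partition; a MosAIc-style build would additionally place a ball-tree or KD-tree over each partition to get sublinear within-partition queries. The data here is random and purely illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
db = rng.normal(size=(1000, 8))         # database embeddings
labels = rng.integers(0, 5, size=1000)  # condition label per item

# Inverted index: condition -> row ids. A conditional query then searches
# only the matching partition ("conditional pruning").
index = defaultdict(list)
for i, lab in enumerate(labels):
    index[int(lab)].append(i)

def conditional_knn(query, condition, k=5):
    ids = np.array(index[condition])                 # pruned candidate set
    dists = np.linalg.norm(db[ids] - query, axis=1)  # exact search within it
    return ids[np.argsort(dists)[:k]]                # k nearest, condition-safe

neighbors = conditional_knn(rng.normal(size=8), condition=2, k=5)
```

Because pruning happens before distance computation, query cost scales with the partition size rather than the full database size.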

4. Performance Results and Comparative Analysis

GeneCIS (Recall@1)

| Task | Image Only | Text Only | Image+Text | Combiner (CIRR) | Ours (CC3M) |
|---|---|---|---|---|---|
| Focus Attribute | 17.7 | 10.2 | 15.6 | 15.1 | 19.0 |
| Change Attribute | 11.9 | 9.5 | 12.6 | 12.1 | 16.6 |
| Focus Object | 9.3 | 6.5 | 10.8 | 13.5 | 14.7 |
| Change Object | 7.2 | 6.2 | 11.3 | 15.4 | 16.8 |
| Avg R@1 | 11.5 | 8.1 | 12.6 | 14.0 | 16.8 |

On MIT-States (composition retrieval), the mined-triplet approach delivers significant gains over CLIP-only baselines:

  • Recall@1: 15.8% with Ours (CC3M) vs. 13.3% with Image+Text.
  • On CIRR (Recall@10): Ours achieves 71.1% (Vaze et al., 2023).

CORE/COR127K:

  • Segmentation-only benchmark: no Recall@K; accuracy is purely by mask overlap. Reported metrics: Dice, IoU, mDice, mIoU (Wang et al., 6 Aug 2025).

5. Methodological Innovations

  • Triplet Mining (GeneCIS): Scaling training via automated extraction from image-caption datasets (CC3M), leveraging scene graph parsing and visual concreteness filtering, enables broad coverage of similarity notions without manual annotation (Vaze et al., 2023).
  • Region-level Retrieval (CORE): Leverages mask annotation and explicit region-based embeddings and fusion, outperforming prior image-level methods in both base and held-out (novel) categories (Wang et al., 6 Aug 2025).
  • Plug-and-Play Retrieval Enhancements: Methods such as IP-CIR use an LLM to infer proxy layouts, synthesize proxy images conditioned on both query and text via diffusion models (e.g., MIGC++), and fuse these features for robust hybrid retrieval. Performance advances are obtained by balancing image- and text-derived similarities (Li et al., 2024).
  • Index Structures (MosAIc): Efficient pruning and sublinear query time on conditional subsets, enabling scalable interaction with massive database partitions (Hamilton et al., 2020).

6. Implications for Benchmark and Protocol Design

GeneCIS and subsequent work offer several methodological lessons for CORE and future conditional image/object retrieval benchmarks:

  • Zero-shot and Open-set Generalization: Evaluations must test compositional and attribute-level generalization to out-of-distribution conditions.
  • Factorial Task Design: Benchmarks should encompass multiple axes (e.g., focus/change × attribute/object × category/appearance).
  • Small, Balanced Galleries: Restrict gallery size (e.g., 10–15 candidates) for precise Recall@K evaluation; compose distractors to control for scene- or condition-overlap.
  • Standardized Metrics: Use Recall@K for ranking in gallery tasks, segmentation overlap metrics (Dice, IoU) for region-based retrieval.
  • Scalable Data Mining: Adopt automated mining routines (scene-graph parsing, high-concreteness triples, composition filtering) for extensible and reproducible training sets.
  • Compatibility with Advanced Fusion and Proxy Techniques: The retrieval pipeline should modularly support compositional enhancements (e.g., IP-CIR proxy images) and robust cross-modal fusion (Li et al., 2024).

By synthesizing these methodological foundations, CORE benchmarks allow controlled, scalable, and reproducible assessment of conditional image and object retrieval—enabling precise measurement of progress in generalization, compositionality, and region-level reasoning in visual search.
