Conditional Image Retrieval: CORE Benchmark
- Conditional Image Retrieval (CORE Benchmark) is a framework that retrieves images or specific regions based on reference exemplars combined with explicit text-based modifications.
- It extends traditional retrieval by integrating region-level segmentation and open-set generalization, evaluated through metrics like Recall@K, Dice, and IoU.
- The system employs advanced fusion of image and text encoders, efficient indexing structures, and automated triplet mining to achieve robust and scalable performance.
Conditional Image Retrieval (CORE Benchmark) designates a rigorous empirical and algorithmic framework for selecting images or objects from a large corpus, conditioned on both reference exemplars and explicit user-specified constraints or modifications. CORE benchmarks exemplify a new generation of retrieval systems that not only optimize for general similarity but also respond flexibly to compositional instructions—such as focusing, changing, or combining object-level and attribute-level cues—in both zero-shot and fine-grained scenarios. These benchmarks evolve prior “Composed Image Retrieval” (CIR) tasks into region- and transformation-specific evaluations, supporting open-set generalization, segmentation-based metrics, and robust protocol design (Vaze et al., 2023, Wang et al., 6 Aug 2025).
1. Formal Problem Definitions and Mathematical Framework
Conditional image retrieval extends traditional instance or content-based image retrieval by enforcing explicit conditioning. Given a retrieval database, a reference modality (image, region, or feature), and a structured constraint or modification, the goal is to identify (and in advanced variants, to segment) the image regions that satisfy the composed query.
General Conditional Image Similarity (GeneCIS):
The retrieval function is defined as a conditional similarity

$$s(x_T \mid x_R, c),$$

where $x_T$ is a candidate (target) image, $x_R$ is a reference image, and $c$ is a text condition. The approach learns an image encoder $f_I$, a text encoder $f_T$, and a combiner $g$ producing the joint query vector $q = g\big(f_I(x_R),\, f_T(c)\big)$, with the final similarity score given by the dot product:

$$s(x_T \mid x_R, c) = \big\langle\, q,\ f_I(x_T)\, \big\rangle.$$
GeneCIS focuses on scenarios where the set of possible conditions is unbounded (“open-set”), forcing models to generalize beyond seen transformations or attribute-object combinations (Vaze et al., 2023).
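A minimal sketch of this scoring pipeline, using toy stand-ins for the CLIP encoders and the learned Combiner (the averaging combiner below mimics the simple "Image + Text" baseline, not the paper's trained network; all function names are illustrative):

```python
import numpy as np

def combine(ref_emb, cond_emb):
    # Toy combiner g: average the L2-normalized reference-image and
    # condition-text embeddings, then renormalize. (The paper's Combiner
    # is a learned network; this mimics the "Image + Text" baseline.)
    q = ref_emb / np.linalg.norm(ref_emb) + cond_emb / np.linalg.norm(cond_emb)
    return q / np.linalg.norm(q)

def rank_gallery(ref_emb, cond_emb, gallery):
    # gallery: (N, d) matrix of candidate-image embeddings f_I(x_T).
    q = combine(ref_emb, cond_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                       # dot-product similarity s(x_T | x_R, c)
    return np.argsort(-scores), scores   # best candidate first

rng = np.random.default_rng(0)
gallery = rng.normal(size=(15, 8))       # 15 candidates, 8-dim toy embeddings
ref, cond = rng.normal(size=8), rng.normal(size=8)
order, scores = rank_gallery(ref, cond, gallery)
```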
Composed Object Retrieval (COR) and CORE:
Let $Q = (I_r, M_r, T)$ denote a composed expression: $I_r$ is a reference image, $M_r$ a binary mask selecting object $o_r$, and $T$ a tuple of noun-free text transformations. Given a target image $I_t$, the system must output a mask $M_t$ segmenting the object(s) $o_t$ satisfying the composite query:

$$M_t = F\big(I_t \mid I_r, M_r, T\big).$$
The CORE benchmark requires region-level, attribute-conditioned, and transformation-specific accuracy, relying on both recognition and localization (Wang et al., 6 Aug 2025).
2. Benchmark Construction, Tasks, and Evaluation Protocols
GeneCIS (Vaze et al., 2023) operationalizes conditional similarity with four core tasks and open-set attribute or object conditions:
- Focus on Attribute (“same color”)
- Change Attribute (“but in black”)
- Focus on Object (“focus on refrigerator in the scene”)
- Change Object (“same scene but add a ceiling”)
These tasks utilize well-controlled galleries (typically 10–15 candidates), each with a single positive example and carefully selected distractors (sharing either the scene or the conditioned element, but not both). GeneCIS uses Recall@K as the quantitative retrieval metric, with small $K$ (e.g., $K = 1$) for controlled galleries and larger $K$ for global retrieval.
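For a gallery with a single positive, Recall@K reduces to a hit indicator averaged over queries; a short sketch (names illustrative):

```python
import numpy as np

def recall_at_k(ranked_ids, positive_id, k):
    # Recall@K for a single-positive gallery: 1.0 if the ground-truth
    # target appears among the top-K ranked candidates, else 0.0.
    return float(positive_id in ranked_ids[:k])

def mean_recall_at_k(all_rankings, all_positives, k):
    # Average over queries, as reported in benchmark tables.
    return float(np.mean([recall_at_k(r, p, k)
                          for r, p in zip(all_rankings, all_positives)]))

# Three toy queries; each ranking lists candidate ids best-first.
rankings = [[3, 1, 0, 2], [0, 2, 1, 3], [2, 0, 3, 1]]
positives = [3, 1, 2]   # ground-truth target per query
```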
CORE and COR127K (Wang et al., 6 Aug 2025) advance the protocol to region-level queries:
- Each triplet defines a reference region, a set of attribute transformations, and the matching target segment.
- 408 categories, 127,166 annotated triplets, divided into base and novel splits to facilitate compositional generalization.
- Tasks include color, shape, texture, pose, orientation, and spatial transformations.
- Evaluation uses mask-overlap metrics: Dice coefficient, Intersection-over-Union (IoU), and Mean Absolute Error (MAE), with both per-instance and per-category statistics (mDice, mIoU).
By using noun-free transformation queries, CORE/COR127K remains relevant for open-vocabulary and unseen-category scenarios.
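The three overlap metrics are standard and easy to state on binary masks; a self-contained sketch:

```python
import numpy as np

def dice(pred, gt):
    # Dice = 2|P ∩ G| / (|P| + |G|) on binary masks.
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def iou(pred, gt):
    # IoU = |P ∩ G| / |P ∪ G|.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def mae(pred, gt):
    # Mean absolute error between the binary masks, per pixel.
    return np.abs(pred.astype(float) - gt.astype(float)).mean()

gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True     # 2x2 object
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True # over-segmented
```

(mDice/mIoU are then category-level means of the per-instance scores.)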
3. Model Architectures and Indexing Algorithms
GeneCIS: Models use separate image and text encoders (e.g., CLIP ViT), a Combiner network, and dot-product similarity. Key baselines:
- Image Only: CLIP-image embedding nearest neighbor.
- Text Only: CLIP-text embedding nearest neighbor using the condition text alone.
- Image + Text: Average of reference and text embeddings.
- Combiner(CIRR): Trained on CIRR and evaluated zero-shot.
- Combiner (CC3M), i.e., “Ours”: Trained using mined triplets from CC3M, further boosting performance. Triplet mining involves extracting (Subject, Predicate, Object) tuples from captions and constructing matching triplets for contrastive learning.
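The mining idea can be illustrated with a toy sketch: parse (Subject, Predicate, Object) tuples from captions, keep visually concrete subjects, and pair captions that share a subject but differ in attribute to form contrastive triplets. Real systems use a scene-graph parser and learned concreteness scores; both are hard-coded stand-ins here.

```python
# Illustrative concreteness scores (real pipelines use learned/lexicon values).
CONCRETENESS = {"dog": 4.8, "car": 4.7, "idea": 1.4}

def parse_spo(caption):
    # Fake (Subject, Predicate, Object/attribute) parser: assumes
    # "<attr> <subject> ..." captions purely for demonstration.
    attr, subject = caption.split()[:2]
    return subject, "is", attr

def mine_triplets(captions, min_concreteness=3.0):
    spo = [parse_spo(c) for c in captions]
    triplets = []
    for i, (s1, _, a1) in enumerate(spo):
        if CONCRETENESS.get(s1, 0.0) < min_concreteness:
            continue  # drop abstract, non-visual subjects
        for j, (s2, _, a2) in enumerate(spo):
            if i != j and s1 == s2 and a1 != a2:
                # (reference caption i, condition "but <a2>", target caption j)
                triplets.append((i, f"but {a2}", j))
    return triplets

caps = ["black dog running", "white dog sitting", "interesting idea here"]
```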
CORE Model (Wang et al., 6 Aug 2025) (region-level, object-centric):
- Reference Region Embedding (RRE): Combines SAM-based image features, reference mask, and multi-stage convolutional enhancements.
- Adaptive Vision–Text Interaction (AVTI): Jointly attends to region- and instruction-level features using attention and gating mechanisms, outputting a prompt for segmentation.
- Region-level Contrastive Loss: Maximizes similarity between composed query and true foreground, minimizes with respect to background.
- Decoder: SAM head segments the object region in the target image.
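The region-level contrastive objective can be sketched as an InfoNCE-style loss that pulls the composed-query embedding toward foreground features and pushes it from background features (a minimal numpy sketch under that assumption, not the paper's exact loss):

```python
import numpy as np

def region_contrastive_loss(query, fg_feats, bg_feats, tau=0.07):
    # InfoNCE-style region loss: maximize similarity of the composed query
    # to foreground (true-object) features, minimize it w.r.t. background.
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = norm(query)
    pos = norm(fg_feats) @ q / tau      # similarities to foreground features
    neg = norm(bg_feats) @ q / tau      # similarities to background features
    logits = np.concatenate([pos, neg])
    # stable log-sum-exp over all region features
    logZ = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    return -(pos - logZ).mean()         # cross-entropy mass on foreground

rng = np.random.default_rng(1)
q = rng.normal(size=16)
fg = q + 0.1 * rng.normal(size=(5, 16))  # foreground features near the query
bg = rng.normal(size=(20, 16))           # unrelated background features
loss_good = region_contrastive_loss(q, fg, bg)
loss_bad = region_contrastive_loss(q, bg[:5], bg[5:])
```

As expected, a query aligned with its foreground yields a much lower loss than one aligned with random regions.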
Algorithmic Indexing (MosAIc):
- Ball-trees, KD-trees, and RP trees enable sublinear conditional k-NN retrieval by maintaining an inverted index (“conditional pruning”).
- Theoretical bounds formalize the efficiency and coverage guarantees for arbitrary conditioned subsets (Hamilton et al., 2020).
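The tree structures themselves are beyond a short sketch, but the inverted-index idea is simple: map each condition label to its row ids and search only the matching partition. The linear scan below is a stand-in for the per-partition ball-tree/KD-tree search (class and variable names are illustrative):

```python
import numpy as np
from collections import defaultdict

class ConditionalIndex:
    # Inverted index from condition label -> row ids; each query searches
    # only the matching partition. A real system replaces the linear scan
    # in knn() with a ball-tree or KD-tree per partition for sublinear time.
    def __init__(self, embeddings, conditions):
        self.emb = np.asarray(embeddings, dtype=float)
        self.buckets = defaultdict(list)
        for i, c in enumerate(conditions):
            self.buckets[c].append(i)

    def knn(self, query, condition, k=3):
        ids = np.array(self.buckets[condition])
        d = np.linalg.norm(self.emb[ids] - query, axis=1)
        return ids[np.argsort(d)[:k]].tolist()

rng = np.random.default_rng(2)
emb = rng.normal(size=(100, 4))
conds = ["portrait" if i % 2 else "landscape" for i in range(100)]
index = ConditionalIndex(emb, conds)
hits = index.knn(emb[0], "landscape", k=3)
```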
4. Performance Results and Comparative Analysis
GeneCIS (Recall@1)
| Task | Image Only | Text Only | Image+Text | Combiner(CIRR) | Ours (CC3M) |
|---|---|---|---|---|---|
| Focus Attribute | 17.7 | 10.2 | 15.6 | 15.1 | 19.0 |
| Change Attribute | 11.9 | 9.5 | 12.6 | 12.1 | 16.6 |
| Focus Object | 9.3 | 6.5 | 10.8 | 13.5 | 14.7 |
| Change Object | 7.2 | 6.2 | 11.3 | 15.4 | 16.8 |
| Avg R@1 | 11.5 | 8.1 | 12.6 | 14.0 | 16.8 |
On MIT-States (composition retrieval), the mined-triplet approach delivers significant gains over CLIP-only baselines:
- Recall@1: 15.8% with Ours (CC3M) vs. 13.3% with Image+Text.
- On CIRR (Recall@10): Ours achieves 71.1% (Vaze et al., 2023).
CORE/COR127K:
- Segmentation-only benchmark: no Recall@K; accuracy is purely by mask overlap. Reported metrics: Dice, IoU, mDice, mIoU (Wang et al., 6 Aug 2025).
5. Methodological Innovations
- Triplet Mining (GeneCIS): Scaling training via automated extraction from image-caption datasets (CC3M), leveraging scene graph parsing and visual concreteness filtering, enables broad coverage of similarity notions without manual annotation (Vaze et al., 2023).
- Region-level Retrieval (CORE): Leverages mask annotation and explicit region-based embeddings and fusion, outperforming prior image-level methods in both base and held-out (novel) categories (Wang et al., 6 Aug 2025).
- Plug-and-Play Retrieval Enhancements: Methods such as IP-CIR use an LLM to infer proxy layouts, synthesize proxy images conditioned on both query and text via diffusion models (e.g., MIGC++), and fuse these features for robust hybrid retrieval. Performance advances are obtained by balancing image- and text-derived similarities (Li et al., 2024).
- Index Structures (MosAIc): Efficient pruning and sublinear query time on conditional subsets, enabling scalable interaction with massive database partitions (Hamilton et al., 2020).
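The image/text balancing mentioned for IP-CIR above amounts, in its simplest form, to a convex combination of the two similarity channels; a toy sketch (IP-CIR's actual fusion is more involved, and `alpha` here is an illustrative knob):

```python
import numpy as np

def fused_scores(img_sims, txt_sims, alpha=0.5):
    # Hybrid retrieval score: convex combination of image-derived and
    # text-derived similarities; alpha balances the two modalities.
    return alpha * img_sims + (1.0 - alpha) * txt_sims

img = np.array([0.9, 0.2, 0.5])  # toy image-channel similarities
txt = np.array([0.1, 0.8, 0.6])  # toy text-channel similarities
```

Sweeping `alpha` moves the top-ranked candidate from the image-preferred to the text-preferred item, with intermediate values favoring candidates that score well on both.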
6. Implications for Benchmark and Protocol Design
GeneCIS and subsequent work offer several methodological lessons for CORE and future conditional image/object retrieval benchmarks:
- Zero-shot and Open-set Generalization: Evaluations must test compositional and attribute-level generalization to out-of-distribution conditions.
- Factorial Task Design: Benchmarks should encompass multiple axes (e.g., focus/change × attribute/object × category/appearance).
- Small, Balanced Galleries: Restrict gallery size (e.g., to 10–15 items) for precise Recall@K evaluation; compose distractors to control for scene- or condition-overlap.
- Standardized Metrics: Use Recall@K for ranking in gallery tasks, segmentation overlap metrics (Dice, IoU) for region-based retrieval.
- Scalable Data Mining: Adopt automated mining routines (scene-graph parsing, high-concreteness triples, composition filtering) for extensible and reproducible training sets.
- Compatibility with Advanced Fusion and Proxy Techniques: The retrieval pipeline should modularly support compositional enhancements (e.g., IP-CIR proxy images) and robust cross-modal fusion (Li et al., 2024).
By synthesizing these methodological foundations, CORE benchmarks allow controlled, scalable, and reproducible assessment of conditional image and object retrieval—enabling precise measurement of progress in generalization, compositionality, and region-level reasoning in visual search.