CIRCO Dataset: Open-Domain CIR Benchmark
- CIRCO is an open-domain composed image retrieval dataset that supports zero-shot evaluation with multiple ground-truth targets per query.
- The dataset balances queries across COCO supercategories via CLIP-based labeling and pairs human-verified multi-target annotations with a semantic taxonomy for the relative captions.
- It incorporates rigorous evaluation protocols, including Recall@K and mAP@K, to benchmark model generalization across complex compositional queries.
CIRCO is an open-domain, zero-shot Composed Image Retrieval (CIR) benchmark dataset constructed to address the limitations of existing CIR datasets, particularly the lack of multiple ground truths and of open-domain coverage. It targets retrieval scenarios where a query is specified as a tuple—a reference image and a relative caption—expressing desired differences in the target image. CIRCO comprises human-annotated queries drawn from real-world images, incorporates a fine-grained semantic taxonomy, and enables robust multi-ground-truth evaluation for zero-shot benchmarks. It is publicly available and intended for evaluating zero-shot CIR systems without supervised training on triplets (Baldrati et al., 2023, Agnolucci et al., 2024).
1. Motivation and Scope
Composed Image Retrieval (CIR) seeks to identify a target image that is similar to a reference image but altered according to a natural language "relative caption." Existing datasets predominantly focus on narrow domains such as fashion or birds, or provide only a single ground-truth target per query, resulting in significant false-negative rates. CIRCO ("Composed Image Retrieval on Common Objects in Context") was constructed to:
- Provide an open-domain CIR benchmark based on the COCO 2017 unlabeled image split.
- Reduce false negatives by encoding multiple, valid target images per query.
- Introduce and annotate a semantic taxonomy for the linguistic variations in the relative caption.
- Enable rigorous zero-shot CIR (ZS-CIR) evaluation with a validation set for metrics and a held-out test server (Agnolucci et al., 2024).
2. Dataset Collection, Annotation, and Structure
Source Images and Supercategory Balancing
CIRCO draws from 123,403 images in the COCO 2017 unlabeled split. To maximize diversity and balance, reference images are automatically labeled via CLIP (ViT-L/14) into 12 COCO supercategories: person, animal, sports, vehicle, food, accessory, electronic, kitchenware, furniture, indoor, outdoor, appliance. Query selection is balanced so that each supercategory is approximately equally represented.
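The labeling step can be approximated with a standard CLIP zero-shot classification loop. The sketch below is illustrative rather than the authors' code: it assumes the openai `clip` package and a generic "a photo of a {supercategory}" prompt, which may differ from the prompts actually used.

```python
# Illustrative sketch of CLIP-based supercategory labeling (not the authors' code).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

SUPERCATEGORIES = [
    "person", "animal", "sports", "vehicle", "food", "accessory",
    "electronic", "kitchenware", "furniture", "indoor", "outdoor", "appliance",
]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# One text prompt per supercategory; the prompt template is an assumption.
prompts = clip.tokenize([f"a photo of a {c}" for c in SUPERCATEGORIES]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def predict_supercategory(image_path: str) -> str:
    """Assign the supercategory whose prompt is closest to the image in CLIP space."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ text_feats.T).squeeze(0)  # cosine similarities to the 12 prompts
    return SUPERCATEGORIES[int(sims.argmax())]
```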
Reference–Target Pairing and Captioning
For each randomly sampled reference image, its 50 nearest neighbors in CLIP feature space are retrieved, with near-duplicates above 0.92 cosine similarity discarded. Annotators either skip unsuitable references or select, as the primary target, a visually similar neighbor whose difference from the reference can be clearly described, and then specify a “shared concept” (e.g., “dog playing outside”). The relative caption, written against the template "Unlike the provided image, I want a photo of {shared concept} that...", must describe only the difference from the reference; absolute descriptions of the target and mentions redundant with the shared concept are disallowed.
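A minimal sketch of this neighbor-selection step, assuming precomputed, L2-normalized CLIP image features for the whole index; the function and variable names are illustrative:

```python
import numpy as np

def candidate_neighbors(ref_feat: np.ndarray, index_feats: np.ndarray,
                        k: int = 50, dup_threshold: float = 0.92) -> list[int]:
    """Return the indices of the k index images most similar to the reference,
    skipping near-duplicates whose cosine similarity exceeds dup_threshold.
    Both inputs are assumed to be L2-normalized CLIP features."""
    sims = index_feats @ ref_feat          # cosine similarity, shape (N,)
    order = np.argsort(-sims)              # most similar first
    keep = [int(i) for i in order if sims[i] < dup_threshold]
    return keep[:k]
```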
Multi-Ground-Truth Annotation
Annotators then review a candidate pool consisting of the top 100 images retrieved by SEARLE-XL and the 50 most visually similar index images, marking every image that satisfies the query. This process yields, on average, 4.53 ground-truth images per query (4,624 ground truths across 1,020 queries); the modal ground-truth count is 2, and the maximum for any query is 21.
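The pool shown to annotators can be viewed as the order-preserving union of two ranked lists. The helper below is a hypothetical illustration of that merge, not the actual annotation tooling:

```python
def build_annotation_pool(model_ranked_ids: list[int], visual_ranked_ids: list[int],
                          n_model: int = 100, n_visual: int = 50) -> list[int]:
    """Merge the top candidates retrieved by the CIR model (e.g., SEARLE-XL) with the
    most visually similar index images, dropping duplicates while preserving order."""
    pool, seen = [], set()
    for img_id in model_ranked_ids[:n_model] + visual_ranked_ids[:n_visual]:
        if img_id not in seen:
            seen.add(img_id)
            pool.append(img_id)
    return pool
```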
Example Query Structure
- reference_image_id: 324589
- relative_caption: "Unlike the provided image, I want a photo of one boat next to a lake."
- shared_concept: "boat"
- semantic_aspects: ["cardinality", "spatial_relations"]
- ground_truth_ids: List of image IDs (Agnolucci et al., 2024)
Dataset Organization
| Statistic | Value |
|---|---|
| Index images | 123,403 (COCO 2017 unlabeled split) |
| Queries | 1,020 |
| Validation queries | 220 |
| Test queries | 800 |
| Avg. ground truths/query | 4.53 |
| Min/Max GTs per query | 1 / 21 |
| Supercategories | 12 (balanced, ≈85 queries each) |
3. Linguistic Taxonomy and Semantic Annotation
Each query's relative caption is annotated with one or more of nine non-exclusive semantic aspects, following CIRR conventions:
- Cardinality: Specifies a number ("one cat")
- Addition: Adds objects/elements ("also includes a boat")
- Negation: Removes elements ("without windows")
- Direct Addressing: Commands action/attribute ("playing fetch")
- Compare / Change: Substitutes or replaces ("instead of X, show Y")
- Comparative Statement: Uses comparative adjectives/adverbs ("larger")
- Statement with Conjunction: Employs "and"/"or" ("dog and frisbee")
- Spatial Relations / Background: Specifies position/context ("next to", "in the background")
- Viewpoint: Camera/viewpoint specification ("side view")
A label is assigned when more than 50% of annotators select it. Comparative coverage statistics across CIRCO, CIRR, and FashionIQ show that CIRCO is relatively rich in conjunction, addition, and background-related queries.
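The majority-vote assignment can be expressed compactly; the snippet below is an illustrative sketch that assumes each annotator contributes a set of aspect names:

```python
from collections import Counter

def majority_aspects(annotator_labels: list[set[str]]) -> list[str]:
    """Keep only the semantic aspects selected by a strict majority (>50%) of annotators."""
    counts = Counter(aspect for labels in annotator_labels for aspect in labels)
    n_annotators = len(annotator_labels)
    return sorted(aspect for aspect, c in counts.items() if c > n_annotators / 2)

# "addition" is chosen by 2 of 3 annotators and kept; the other aspects are dropped.
print(majority_aspects([{"addition", "viewpoint"}, {"addition"}, {"cardinality"}]))
# -> ['addition']
```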
4. Evaluation Protocols and Metrics
CIRCO supports multiple ground-truth evaluation, enabling metrics that are not possible with single-target datasets. Key metrics include:
- Recall@K: Counts a query as correctly retrieved if any of its ground truths is found in the top K.
- mean Average Precision at K (mAP@K):

  $$\mathrm{mAP@K} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{\min(G_n,\,K)}\sum_{k=1}^{K} P_n(k)\,\mathrm{rel}_n(k)$$

  where $P_n(k)$ is the precision at rank $k$ for query $n$, $\mathrm{rel}_n(k)$ equals 1 if the image at rank $k$ is a ground truth for query $n$ and 0 otherwise, and $G_n$ is the number of valid ground truths for query $n$.
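A minimal reference implementation of both metrics in the multi-ground-truth setting, consistent with the formula above (this is a sketch, not the official evaluation script):

```python
import numpy as np

def recall_at_k(ranked_ids: list[int], gt_ids: list[int], k: int) -> float:
    """1.0 if any ground truth appears among the top-k retrieved IDs, else 0.0."""
    gt = set(gt_ids)
    return float(any(img_id in gt for img_id in ranked_ids[:k]))

def average_precision_at_k(ranked_ids: list[int], gt_ids: list[int], k: int) -> float:
    """AP@k with multiple ground truths, normalized by min(|GT|, k)."""
    gt = set(gt_ids)
    hits, score = 0, 0.0
    for rank, img_id in enumerate(ranked_ids[:k], start=1):
        if img_id in gt:
            hits += 1
            score += hits / rank  # precision at this rank, accumulated only at hits
    return score / min(len(gt), k)

def map_at_k(all_ranked: list[list[int]], all_gts: list[list[int]], k: int) -> float:
    """mAP@k averaged over all queries."""
    return float(np.mean([average_precision_at_k(r, g, k)
                          for r, g in zip(all_ranked, all_gts)]))
```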
Zero-shot CIR evaluation requires no training on CIRCO queries or ground truths; pre-training or use of the COCO index is permitted if no CIRCO triplets are used.
5. Data Format, Accessibility, and Usage
The dataset is distributed under an open research license and consists of the following main components:
- /images/COCO/: Index of COCO images
- /queries/validation.json: 220 annotated public queries with all ground-truth image IDs
- /queries/test.json: 800 queries without ground-truth IDs (evaluation-only)
- /ground_truth/validation.json: Query-to-ground-truth ID mapping
Query entries (JSON schema):
```json
{
  "query_id": int,
  "reference_image_id": int,
  "relative_caption": string,
  "shared_concept": string,
  "semantic_aspects": [list of aspect names],
  "ground_truth_ids": [list of ints]   // validation only
}
```
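A short sketch of loading and inspecting the validation queries; the file path follows the layout listed above and may differ in the actual release:

```python
import json

# Path follows the layout described above; adjust if the actual release differs.
with open("queries/validation.json") as f:
    queries = json.load(f)

for q in queries[:3]:
    print(q["query_id"], q["reference_image_id"], q["relative_caption"])
    print("  aspects:", q["semantic_aspects"])
    print("  ground truths:", q.get("ground_truth_ids", "hidden (test split only)"))
```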
Code, dataset, and official evaluation scripts are hosted at https://github.com/miccunifi/SEARLE, with server-based test evaluation at https://circo.micc.unifi.it/ (Baldrati et al., 2023, Agnolucci et al., 2024).
6. Applications and Benchmarking Role
CIRCO provides a rigorously annotated, open-domain testbed for CIR and zero-shot retrieval algorithms. The dataset enables the evaluation of:
- Generalization of zero-shot CIR architectures, including SEARLE (zero-Shot composEd imAge Retrieval with textuaL invErsion) and its iSEARLE variants; a schematic of this style of query composition is sketched after this list.
- Fine-grained retrieval settings involving object composition, spatial relationships, comparative modifications, and semantic transformations (Agnolucci et al., 2024).
- Analysis of model performance across semantic categories.
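For orientation, the pseudocode below sketches how a textual-inversion ZS-CIR method in the SEARLE family might compose a query embedding; every callable here is a placeholder, not the authors' implementation:

```python
def compose_query(ref_image_feat, relative_caption, textual_inversion, text_encoder):
    """Schematic only: map the reference image feature to a pseudo-word embedding
    (textual inversion), splice it into a prompt with the relative caption, and encode
    the result with a CLIP-style text encoder. `textual_inversion` and `text_encoder`
    are hypothetical callables standing in for the components such a method provides."""
    pseudo_word = textual_inversion(ref_image_feat)            # pseudo-token embedding
    prompt = f"a photo of $ that {relative_caption}"           # "$" marks the pseudo-word slot
    query_feat = text_encoder(prompt, pseudo_word)             # composed query embedding
    return query_feat / query_feat.norm(dim=-1, keepdim=True)  # unit-normalize for cosine retrieval
```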
A plausible implication is that CIRCO's structure, especially its multi-ground-truth annotation and semantic taxonomy, may lead to new evaluation standards for CIR beyond domain-specific or single-target paradigms.
7. Limitations and Future Directions
CIRCO, while offering broad coverage and multi-target annotation, provides ground truths publicly only for the validation set; test-set labels remain internal to prevent evaluation leakage. No training set is offered by design, in line with the zero-shot paradigm. All ground-truth annotations were verified by humans with support from state-of-the-art retrieval models, yet coverage (estimated at over 92%) remains slightly incomplete, so additional annotation effort or stronger retrieval methods could further improve the completeness of the ground truths (Baldrati et al., 2023, Agnolucci et al., 2024).