
CIRCO Dataset: Open-Domain CIR Benchmark

Updated 4 January 2026
  • CIRCO is an open-domain composed image retrieval dataset that supports zero-shot evaluation with multiple ground-truth targets per query.
  • The dataset leverages balanced COCO supercategories and CLIP-based labeling, coupled with human-verified semantic annotations for precise relative captions.
  • It incorporates rigorous evaluation protocols, including Recall@K and mAP@K, to benchmark model generalization across complex compositional queries.

CIRCO is an open-domain, zero-shot Composed Image Retrieval (CIR) benchmark dataset constructed to address the limitations of existing CIR datasets, particularly the lack of multiple ground truths and of open-domain coverage. It targets retrieval scenarios where a query is specified as a tuple of a reference image and a relative caption expressing the desired differences in the target image. CIRCO comprises annotated queries drawn from real-world images, incorporates a fine-grained semantic taxonomy, and supports robust multi-ground-truth evaluation (Recall@K and mAP@K) for zero-shot benchmarks. It is publicly available and intended for evaluating zero-shot CIR systems without supervised training on triplets (Baldrati et al., 2023, Agnolucci et al., 2024).

1. Motivation and Scope

Composed Image Retrieval (CIR) seeks to identify a target image that is similar to a reference image but altered according to a natural language "relative caption." Existing datasets predominantly focus on narrow domains such as fashion or birds, or provide only a single ground-truth target per query, resulting in significant false-negative rates. CIRCO ("Composed Image Retrieval on Common Objects in Context") was constructed to:

  • Provide an open-domain CIR benchmark based on the COCO 2017 unlabeled image split.
  • Reduce false negatives by encoding multiple, valid target images per query.
  • Introduce and annotate a semantic taxonomy for the linguistic variations in the relative caption.
  • Enable rigorous zero-shot CIR (ZS-CIR) evaluation with a validation set for metrics and a held-out test server (Agnolucci et al., 2024).

2. Dataset Collection, Annotation, and Structure

Source Images and Supercategory Balancing

CIRCO draws from 123,403 images in the COCO 2017 unlabeled split. To maximize diversity and balance, reference images are automatically labeled via CLIP (ViT-L/14) into 12 COCO supercategories: person, animal, sports, vehicle, food, accessory, electronic, kitchenware, furniture, indoor, outdoor, appliance. Query selection is balanced so that each supercategory is approximately equally represented.
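
As an illustration of this CLIP-based labeling step, the following sketch performs zero-shot classification of an image into the 12 supercategories; the prompt template, file handling, and use of the OpenAI clip package are assumptions, not the authors' exact pipeline:

# Sketch: zero-shot supercategory labeling with CLIP ViT-L/14 (illustrative only).
import clip
import torch
from PIL import Image

SUPERCATEGORIES = [
    "person", "animal", "sports", "vehicle", "food", "accessory",
    "electronic", "kitchenware", "furniture", "indoor", "outdoor", "appliance",
]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Simple prompt template; the prompts actually used for CIRCO are not specified here.
text_tokens = clip.tokenize([f"a photo of a {c}" for c in SUPERCATEGORIES]).to(device)

def predict_supercategory(image_path: str) -> str:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text_tokens)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarities
    return SUPERCATEGORIES[sims.argmax().item()]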

Reference–Target Pairing and Captioning

For each randomly sampled reference image, the 50 nearest neighbors (by cosine similarity in the CLIP feature space, threshold < 0.92) are retrieved. Annotators either skip unsuitable references or select a visually similar but semantically describable neighbor as the primary target and then specify a “shared concept” (e.g., “dog playing outside”). The relative caption, following the template "Unlike the provided image, I want a photo of {shared concept} that...", must detail only the difference, with absolute or redundant mentions disallowed.
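
A minimal sketch of this candidate-retrieval step, assuming precomputed, L2-normalized CLIP image features for the whole index; the 50-neighbor count and the 0.92 cosine-similarity cap follow the description above, everything else is illustrative:

# Sketch: retrieve up to 50 nearest neighbors of a reference image in CLIP space,
# discarding near-duplicates whose cosine similarity exceeds the 0.92 threshold.
# `features` is an (N, D) NumPy array of L2-normalized CLIP image embeddings.
import numpy as np

def candidate_targets(features: np.ndarray, ref_idx: int,
                      k: int = 50, max_sim: float = 0.92) -> list[int]:
    sims = features @ features[ref_idx]        # cosine similarity (normalized features)
    sims[ref_idx] = -np.inf                    # exclude the reference itself
    order = np.argsort(-sims)                  # most similar first
    kept = [int(i) for i in order if sims[i] < max_sim]  # drop near-duplicates
    return kept[:k]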

Multi-Ground-Truth Annotation

Annotators then use candidates retrieved by SEARLE-XL (the top 100 by the model, plus the top 50 by visual proximity) to verify all COCO index images that satisfy the query. This process yields on average 4.53 ground-truth images per query (4,624 ground truths across 1,020 queries); the modal ground-truth count is 2 and the maximum for any query is 21.
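
The verification pool shown to annotators can be assembled by merging the two rankings; the sketch below assumes hypothetical searle_ranking and visual_ranking callables that return image IDs ordered by SEARLE-XL score and by visual proximity, and is not the official annotation tooling:

# Sketch: merge the top-100 model candidates with the top-50 visual neighbors,
# deduplicating while preserving order, to form the pool verified by annotators.
def verification_pool(query, searle_ranking, visual_ranking) -> list[int]:
    model_candidates = searle_ranking(query)[:100]   # top 100 by the CIR model
    visual_candidates = visual_ranking(query)[:50]   # top 50 by visual proximity
    seen, pool = set(), []
    for image_id in model_candidates + visual_candidates:
        if image_id not in seen:
            seen.add(image_id)
            pool.append(image_id)
    return pool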

Example Query Structure

  • reference_image_id: 324589
  • relative_caption: "Unlike the provided image, I want a photo of one boat next to a lake."
  • shared_concept: "boat"
  • semantic_aspects: ["cardinality", "spatial_relations"]
  • ground_truth_ids: List of image IDs (Agnolucci et al., 2024)

Dataset Organization

  • Index images: 123,403 (COCO 2017 unlabeled split)
  • Total queries: 1,020
  • Validation queries: 220
  • Test queries: 800
  • Average ground truths per query: 4.53
  • Minimum / maximum ground truths per query: 1 / 21
  • Supercategories: 12, approximately balanced (~85 queries each)

3. Linguistic Taxonomy and Semantic Annotation

Each query's relative caption is annotated with one or more of nine non-exclusive semantic aspects, following CIRR conventions:

  1. Cardinality: Specifies a number ("one cat")
  2. Addition: Adds objects/elements ("also includes a boat")
  3. Negation: Removes elements ("without windows")
  4. Direct Addressing: Commands action/attribute ("playing fetch")
  5. Compare / Change: Substitutes or replaces ("instead of X, show Y")
  6. Comparative Statement: Uses comparative adjectives/adverbs ("larger")
  7. Statement with Conjunction: Employs "and"/"or" ("dog and frisbee")
  8. Spatial Relations / Background: Specifies position/context ("next to", "in the background")
  9. Viewpoint: Camera/viewpoint specification ("side view")

Labels are assigned when more than 50% of annotators agree on an aspect. Coverage statistics, compared across CIRCO, CIRR, and FashionIQ, show that CIRCO is relatively rich in conjunction, addition, and background-related queries.
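
A minimal sketch of this majority-vote rule, assuming each annotator contributes a set of aspect names for the query:

# Sketch: keep an aspect only if more than half of the annotators marked it.
from collections import Counter

def majority_aspects(annotations: list[set[str]]) -> set[str]:
    counts = Counter(aspect for labels in annotations for aspect in labels)
    return {aspect for aspect, c in counts.items() if c > len(annotations) / 2}

# Example: 2 of 3 annotators marked "addition", so it is assigned.
print(majority_aspects([{"addition"}, {"addition", "negation"}, {"spatial_relations"}]))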

4. Evaluation Protocols and Metrics

CIRCO supports multiple ground-truth evaluation, enabling metrics that are not possible with single-target datasets. Key metrics include:

  • Recall@K counts a query as correctly retrieved if any of its ground truths appears in the top K:

$$\text{Recall@}K = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[\exists\, k \leq K : \mathrm{rel}_{n,k} = 1\right]$$

  • mean Average Precision at K (mAP@K) also accounts for how many of the valid targets are ranked highly:

$$\mathrm{mAP@}K = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{\min(K, G_n)} \sum_{k=1}^{K} P_n(k) \times \mathrm{rel}_{n,k}$$

where $P_n(k)$ is the precision at rank $k$ for query $n$, $G_n$ is the number of valid ground truths for query $n$, and $\mathrm{rel}_{n,k} \in \{0,1\}$ indicates whether the image at rank $k$ is a ground truth. A minimal computation of both metrics is sketched below.
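
The sketch below computes both metrics from ranked retrieval lists; rankings[n] is the ranked list of retrieved image IDs for query n and ground_truths[n] is the set of valid target IDs. It is illustrative only and not the official evaluation script:

# Sketch: Recall@K and mAP@K over N queries with multiple ground truths per query.
def recall_at_k(rankings, ground_truths, k: int) -> float:
    hits = sum(any(img in gts for img in ranked[:k])
               for ranked, gts in zip(rankings, ground_truths))
    return hits / len(rankings)

def map_at_k(rankings, ground_truths, k: int) -> float:
    total = 0.0
    for ranked, gts in zip(rankings, ground_truths):
        num_hits, ap = 0, 0.0
        for rank, img in enumerate(ranked[:k], start=1):
            if img in gts:                       # rel_{n,k} = 1
                num_hits += 1
                ap += num_hits / rank            # precision P_n(k) at this rank
        total += ap / min(k, len(gts))           # normalize by min(K, G_n)
    return total / len(rankings)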

Zero-shot CIR evaluation requires no training on CIRCO queries or ground truths; pre-training or use of the COCO index is permitted if no CIRCO triplets are used.

5. Data Format, Accessibility, and Usage

The dataset is distributed under an open research license and consists of the following main components:

  • /images/COCO/: Index of COCO images
  • /queries/validation.json: 220 annotated public queries with all ground-truth image IDs
  • /queries/test.json: 800 queries without ground-truth IDs (evaluation-only)
  • /ground_truth/validation.json: Query-to-ground-truth ID mapping

Query entries (JSON schema):

{
  "query_id": int,
  "reference_image_id": int,
  "relative_caption": string,
  "shared_concept": string,
  "semantic_aspects": [list of aspect names],
  "ground_truth_ids": [list of ints] // validation only
}
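
A minimal loading sketch, assuming the file layout and JSON schema above (paths are relative to the dataset root, and the key type of the ground-truth mapping is an assumption):

# Sketch: load validation queries and their ground-truth image IDs.
import json

with open("queries/validation.json") as f:
    queries = json.load(f)                       # list of query dicts (schema above)

with open("ground_truth/validation.json") as f:
    gt_map = json.load(f)                        # query_id -> list of ground-truth IDs

for q in queries[:3]:
    # Validation queries also carry "ground_truth_ids" directly; fall back to the map.
    gts = q.get("ground_truth_ids") or gt_map.get(str(q["query_id"]), [])
    print(q["query_id"], q["relative_caption"], "->", len(gts), "ground truths")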

Code, dataset, and official evaluation scripts are hosted at https://github.com/miccunifi/SEARLE, with server-based test evaluation at https://circo.micc.unifi.it/ (Baldrati et al., 2023, Agnolucci et al., 2024).

6. Applications and Benchmarking Role

CIRCO provides a rigorously annotated, open-domain testbed for CIR and zero-shot retrieval algorithms. The dataset enables the evaluation of:

  • Generalization of zero-shot CIR architectures, including SEARLE (zero-Shot composEd imAge Retrieval with textuaL invErsion) and iSEARLE variants.
  • Fine-grained retrieval settings involving object composition, spatial relationships, comparative modifications, and semantic transformations (Agnolucci et al., 2024).
  • Analysis of model performance across semantic categories.

A plausible implication is that CIRCO's structure, especially its multi-ground-truth annotation and semantic taxonomy, may lead to new evaluation standards for CIR beyond domain-specific or single-target paradigms.

7. Limitations and Future Directions

CIRCO, while offering broad coverage and multi-target annotation, releases ground truths publicly only for the validation set; test-set labels remain internal to prevent evaluation leakage. No explicit training set is provided, by design, in line with the zero-shot paradigm. All ground-truth annotations have undergone systematic human verification supported by state-of-the-art retrieval models, but coverage, estimated at above 92%, remains slightly incomplete. This suggests that further gains in recall may be achievable with additional annotation effort or improved retrieval methods (Baldrati et al., 2023, Agnolucci et al., 2024).
