Crossmodal-3600: A Multilingual Captioning Benchmark
- Crossmodal-3600 is a large-scale, multilingual image-captioning benchmark offering 3,600 images with human-written captions across 36 languages.
- The dataset employs a dual-phase annotation protocol that ensures natural language quality and minimizes translation artifacts, with over 90% high-quality ratings in most languages.
- Rigorous evaluation using CIDEr and correlation metrics confirms strong alignment between automated scores and human judgments for reliable model ranking.
Crossmodal-3600 (XM3600) is a large-scale, geographically and linguistically diverse image-captioning evaluation benchmark designed for massively multilingual research. Released under the CC-BY 4.0 license, XM3600 delivers high-quality, human-written reference captions for 3,600 images in 36 languages, addressing a critical bottleneck—namely, the historical lack of rigorous evaluation resources for multilingual image captioning. The benchmark defines annotation protocols, establishes formal evaluation metrics, and empirically demonstrates high correlation with human judgments, positioning it as the recommended “gold” standard for both development and comparative assessment in the multilingual, multimodal domain (Thapliyal et al., 2022).
1. Dataset Composition and Multilingual Scope
XM3600 consists of 3,600 images, each annotated with captions in up to 36 languages. The annotation schema targets a minimum of 2 captions per image per language, except for Bengali (replication of 1) and Māori (average replication of approximately 1.3), yielding 261,375 captions across the entire dataset. Images are systematically sourced for geographic diversity, selected via embedded GPS EXIF metadata from the Open Images dataset. A greedy sampling process prioritizes images from regions where a given language is spoken, escalating through official-country, continental, and global fallbacks as needed to reach the per-language quota. The region-level provenance of each image is preserved for downstream subsetting and geographic analysis.
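The greedy, region-first selection with escalating fallbacks can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' actual pipeline: the image records, tier structure, and quota below are hypothetical, while the real selection draws on GPS EXIF metadata from Open Images.

```python
# Hypothetical sketch of greedy region-first image sampling with fallbacks.
# Each language gets an ordered list of fallback tiers (spoken regions,
# official-country, continent, global); images are drawn from the earliest
# tier that can still satisfy the quota.

def sample_images(images, language_tiers, quota):
    """Pick up to `quota` distinct images per language, preferring native regions.

    images:         list of (image_id, region) tuples (region from GPS metadata).
    language_tiers: language -> ordered fallback tiers; each tier is a set of
                    region codes, or None meaning the global fallback.
    """
    selected = {}
    used = set()  # images are not reused across languages
    for lang, tiers in language_tiers.items():
        picks = []
        for tier in tiers:
            for img_id, region in images:
                if len(picks) >= quota:
                    break
                if img_id in used:
                    continue
                if tier is None or region in tier:
                    picks.append((img_id, region))  # keep region-level provenance
                    used.add(img_id)
        selected[lang] = picks
    return selected
```

The `None` tier at the end of each list plays the role of the global fallback described above, guaranteeing that the per-language quota is met even when region-matched images run out.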
Coverage comprises a spectrum of genealogical language families and regions:
- Afro-Asiatic (Semitic): Arabic, Hebrew
- Austronesian (Malay–Polynesian): Indonesian, Filipino, Māori
- Niger-Congo (Bantu): Swahili
- Austroasiatic: Vietnamese
- Dravidian: Telugu
- Tai-Kadai: Thai
- Quechuan: Cusco Quechua
- Sino-Tibetan: Chinese (Simplified)
- Japonic: Japanese
- Koreanic: Korean
- Turkic: Turkish
- Uralic: Finnish, Hungarian
- Indo-European: Encompasses Germanic, Romance, Slavic, and Indo-Iranian branches
Regional coverage includes Europe, Asia, Africa, Oceania, and the Americas. This ensures the data supports evaluation for a diverse array of language communities.
2. Annotation Protocol and Quality Assurance
The annotation methodology is specifically designed to maximize linguistic naturalness and minimize translation artifacts. Captions are written natively in each target language by annotators who are native or fluent speakers of that language and also proficient in English. The workflow comprises two in-house phases per batch of images:
- Quality-Rating Phase: Annotators rate English auto-generated captions (using an mT5+ViT baseline) on a 5-point scale (Excellent–Not-Enough-Info), referencing a detailed style guide.
- Generation Phase: Annotators independently compose captions in the target language, drawing only on the visual content, not on the previously seen English text.
This batch process ensures “memory overwrite,” reducing literal translation effects. Annotation guidelines emphasize “Visible” descriptions—i.e., surface-level, denotative sentences faithful to observable content, precluding inference or storytelling.
For quality control, pilot rounds (≈150 images per language) precede the main annotation. Spot checks and iterative guideline refinement drive down low-quality rates. The main phase assigns two independent annotators per image (except for Bengali and Māori). Verification is performed on 600 randomly selected captions per language, each re-rated by three annotators on the same scale, with quality aggregated by median. In most languages, the proportion of “Good+” (Good or Excellent) ratings exceeds 90%.
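The median-of-three aggregation and the “Good+” rate can be sketched as below. The numeric encoding and the intermediate label names (“Fair”, “Poor”) are assumptions for illustration; the source specifies only the scale endpoints and the Good/Excellent bucket.

```python
from statistics import median

# 5-point rating scale encoded numerically. Endpoint names follow the text;
# intermediate labels are assumed for illustration.
SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Not-Enough-Info": 1}

def caption_quality(ratings):
    """Aggregate three independent re-ratings of one caption by median."""
    return median(SCALE[r] for r in ratings)

def good_plus_rate(all_ratings):
    """Fraction of captions whose median rating is Good or Excellent."""
    scores = [caption_quality(r) for r in all_ratings]
    return sum(s >= SCALE["Good"] for s in scores) / len(scores)
```

With this aggregation, a single dissenting rater cannot pull a caption out of (or into) the Good+ bucket, which is the point of taking the median rather than the mean.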
3. Evaluation Metrics and Formal Definitions
XM3600 enables benchmarking with multiple established automatic metrics for image captioning:
- BLEU-N: Computes geometric mean of clipped n-gram precisions with a brevity penalty.
- METEOR: Aligns unigrams using exact, stem, and synonym matches, combining precision and recall into a weighted F-score with a fragmentation penalty.
- CIDEr: Applies TF–IDF n-gram weighting to compute cosine similarity between candidate and human reference captions.
- SPICE: Converts captions into scene graphs, scoring based on F1 agreement over object/attribute/relation tuples.
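The core CIDEr idea, TF–IDF-weighted n-gram vectors compared by cosine similarity, can be sketched as follows. This is a simplification for a single n-gram order with add-one smoothing on document frequencies; the full metric averages over n = 1..4, scales scores by 10, and (in the CIDEr-D variant) adds a length penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """TF-IDF weights over a caption's n-grams (simplified CIDEr term)."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {
        g: (c / total) * math.log(num_docs / (1 + doc_freq.get(g, 0)))
        for g, c in counts.items()
    }

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, corpus_refs, n=2):
    """Average cosine similarity between the candidate and each human
    reference, with IDF statistics drawn from the whole reference corpus."""
    doc_freq = Counter()
    for doc in corpus_refs:
        doc_freq.update(set(ngrams(doc, n)))
    num_docs = len(corpus_refs)
    cand_vec = tfidf_vector(candidate, n, doc_freq, num_docs)
    scores = [cosine(cand_vec, tfidf_vector(r, n, doc_freq, num_docs))
              for r in references]
    return sum(scores) / len(scores)
```

The IDF term is what makes CIDEr informative across languages: n-grams that appear in nearly every reference caption (e.g. articles) are down-weighted, so agreement on distinctive content phrases dominates the score.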
Correlation with human judgments is quantified via:
- Pearson’s r
- Spearman’s ρ
- Kendall’s τ

Each coefficient is computed over paired differences: the difference in model metric scores (ΔCIDEr) and the corresponding difference in human preference (Δhuman).
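The three correlation measures over paired score differences can be sketched in pure Python as below; in practice one would typically use `scipy.stats`, and the sample inputs are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank(v):
    """Ranks with ties averaged (1-based)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson on the ranked data."""
    return pearson(rank(x), rank(y))

def _sign(v):
    return (v > 0) - (v < 0)

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / all pairs, no tie correction."""
    n = len(x)
    s = sum(_sign(x[i] - x[j]) * _sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    return s / (n * (n - 1) / 2)
```

Here `x` would hold ΔCIDEr values for model pairs and `y` the corresponding Δhuman preference scores; a high coefficient across all three measures indicates that the metric preserves the human ranking of models.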
4. Benchmarking Results and Metric Correlation
The principal benchmarking experiment trains four transformer-based multilingual captioners (variants of mT5 with different ViT backbones) on large-scale, machine-translated datasets (CC3M-35L, COCO-35L), followed by evaluation on XM3600. Human preference is assessed on a 600-image subset (XM₆₀₀), using a 7-point comparative scale.
Per-language CIDEr scores for the best model (BB+CC) range from 0.175 (Hungarian) to 0.584 (English), with a mean around 0.34. A selection of results is tabulated below:
| Language (ISO) | CIDEr |
|---|---|
| en | 0.584 |
| es | 0.425 |
| fr | 0.410 |
| nl | 0.441 |
| zh | 0.202 |
| ru | 0.194 |
| hi | 0.197 |
Correlation between ΔCIDEr and human preference (Δhuman) is high across the Pearson, Spearman, and Kendall measures (130 comparison points spanning all languages). In contrast, metrics computed on “silver” (machine-translated) references yield substantially weaker, and in some instances negative, correlations, indicating that model rankings can be reversed when these lower-quality references are used.
The metric comparison analysis identifies CIDEr with gold-standard (XM3600) references as the most reliable automatic proxy for human evaluation across languages and models. BLEU, METEOR, and SPICE were not empirically benchmarked in the work, but prior results suggest similar sensitivity to reference quality; CIDEr on XM3600 is therefore recommended as the primary evaluation metric.
5. Applications, Subsetting, and Pipeline Integration
XM3600 is intended to serve as the definitive dev/test “gold” reference for massively multilingual image captioning. Recommended integration strategies include:
- Evaluation Pipelines: Compute CIDEr scores per language against human reference captions for both dev and test sets.
- Geographic Subsetting: Utilize “region-level” metadata to evaluate models on specific locales, enabling fine-grained performance analyses (e.g., Africa, Latin America).
- Model Selection: Replace costly human A/B evaluations with ΔCIDEr on XM3600 references as a reliable model-ranking tool during hyperparameter tuning and architecture search.
- Continuous Expansion: Extend XM3600 methodology to new languages or domains by following its annotation protocol, including style induction, rating phase, and controlled caption generation.
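A per-language evaluation loop with optional geographic subsetting can be sketched as below. The record layout and the `score_fn` interface are assumptions for illustration; any caption metric (e.g. a CIDEr implementation) can be plugged in.

```python
from collections import defaultdict

def evaluate(predictions, references, score_fn, region_filter=None):
    """Per-language mean metric over caption records, with optional subsetting.

    predictions:   dict (image_id, lang) -> candidate caption tokens.
    references:    list of dicts with keys image_id, lang, region, tokens.
    score_fn:      callable(candidate, list_of_reference_token_lists) -> float.
    region_filter: optional region code to restrict evaluation geographically.
    """
    by_key = defaultdict(list)
    for ref in references:
        if region_filter and ref["region"] != region_filter:
            continue  # geographic subsetting via region-level provenance
        by_key[(ref["image_id"], ref["lang"])].append(ref["tokens"])
    per_lang = defaultdict(list)
    for (img, lang), refs in by_key.items():
        cand = predictions.get((img, lang))
        if cand is not None:
            per_lang[lang].append(score_fn(cand, refs))
    return {lang: sum(s) / len(s) for lang, s in per_lang.items()}
```

Grouping references by (image, language) before scoring mirrors the benchmark's multi-reference design: each candidate is scored against all human captions for that image in that language, and results are then averaged per language.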
6. Limitations and Recommendations
Despite encompassing 36 languages, XM3600 does not exhaustively represent the world's linguistic diversity; ongoing annotation in additional languages is needed for broader global relevance. The current 3,600-image corpus offers wide regional sampling but may not fully capture underrepresented environments (e.g., highly rural, medical, or industrial domains). Metric-based evaluation is sensitive to optimization for the metric itself, necessitating periodic human recalibration for very small model differences. The dataset is based on Open Images/Flickr data; domain shift may induce different model behaviors in other contexts.
A plausible implication is that best practices entail periodically supplementing automated metric evaluations with targeted human checks, especially during significant domain adaptation or for languages and settings not covered in the original dataset.
7. Impact and Future Directions
XM3600 establishes a methodological and empirical foundation for rigorous, reproducible evaluation in massively multilingual, multimodal captioning. The benchmark’s combination of native-language, “Visible” annotation, strong quality control, and validated human–metric alignment positions it as the current reference standard for the field. Ongoing adoption, broadening of language and image coverage, and adaptation to novel domains are anticipated future trajectories. The methodology enables continuous evolution of evaluation resources as the multilingual, multimodal research landscape develops further (Thapliyal et al., 2022).