NeuCLIRBench: Unified IR Evaluation Suite
- NeuCLIRBench is a unified evaluation suite for information retrieval, covering monolingual, cross-language, and multilingual tasks with extensive newswire and technical corpora.
- The benchmark includes sparse and dense retrieval baselines along with neural rerankers and fusion runs, whose strongest pipelines achieve high nDCG and ARGUE scores across the retrieval and report-generation tasks.
- NeuCLIRBench supports reproducible and extensible research, fostering innovation in neural architectures, dynamic fusion strategies, and report generation techniques.
NeuCLIRBench is a unified, large-scale evaluation suite for monolingual, cross-language, and multilingual information retrieval, derived from the TREC NeuCLIR tracks of 2022–2024. It encompasses newswire and technical document collections in Chinese, Persian, and Russian (with parallel English translations), deep relevance judgments for several hundred topics, and robust neural and hybrid baseline systems. NeuCLIRBench supports reproducible research spanning monolingual IR, cross-language IR (CLIR), multilingual IR (MLIR), and retrieval-augmented report generation, forming a durable platform for advancing retrieval techniques in the era of neural architectures and LLMs (Lawrie et al., 18 Nov 2025, Lawrie et al., 17 Sep 2025, Lawrie et al., 11 Apr 2024, Lin et al., 2023).
1. Corpus Composition and Scope
NeuCLIRBench’s test collections comprise three major newswire corpora and a technical abstracts set. The newswire data are sourced from CommonCrawl News (August 2016–July 2021) and are processed for language identification, deduplication, and length filtering, yielding approximately:
| Collection | Language(s) | Domain | # Documents |
|---|---|---|---|
| Chinese News | Chinese | Newswire | ~3.18M |
| Persian News | Persian | Newswire | ~2.23M |
| Russian News | Russian | Newswire | ~4.63M |
| Technical Abstracts | Chinese | Scientific / Tech | 396,209 |
Each non-English document has a corresponding English translation, generated by a Sockeye 2 Transformer MT system trained on large-scale public bitext (e.g., 127M Chinese–English sentence pairs), targeting BLEU ~33–40 on FLORES/NTREX. The aggregate corpus spans approximately 10 million unique documents, providing scale comparable to production IR scenarios (Lawrie et al., 18 Nov 2025, Lawrie et al., 17 Sep 2025).
2. Task Taxonomy and Input/Output Protocols
NeuCLIRBench defines four principal task types:
- News CLIR: English queries retrieve documents in a single foreign language (Chinese, Persian, or Russian); also supports monolingual retrieval using human-translated queries.
- News MLIR: English queries retrieve a ranked list of documents from the pooled, multilingual union (Chinese ∪ Persian ∪ Russian news), outputting a single cross-language ranking.
- Report Generation: Given a structured “report request” (problem statement and background in English), systems return an English summary with inline citations to supporting news documents from a specified language collection; summaries are capped at a character budget and evaluated for factual support.
- Technical Documents CLIR: English queries target Chinese scientific abstracts, with both ad-hoc CLIR (using English queries) and monolingual settings (using translated queries).
The standard output for retrieval tasks is a ranked list of up to 1,000 document identifiers per topic. For report generation, systems output an English report with sentences containing up to two cited document IDs, all drawn from a single news dataset (Lawrie et al., 17 Sep 2025, Lawrie et al., 11 Apr 2024).
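Ranked-list outputs follow the standard six-column TREC run format (topic ID, Q0, document ID, rank, score, run tag). The snippet below is a minimal sketch of serializing per-topic rankings in that format; the topic and document identifiers are placeholders rather than actual NeuCLIRBench IDs.

```python
# Minimal sketch: write ranked retrieval results in the standard TREC run format
# (topic_id, Q0, doc_id, rank, score, run_tag). Identifiers are placeholders.

def write_trec_run(rankings, run_tag, path, depth=1000):
    """rankings: dict mapping topic_id -> list of (doc_id, score), best first."""
    with open(path, "w", encoding="utf-8") as f:
        for topic_id, scored_docs in rankings.items():
            for rank, (doc_id, score) in enumerate(scored_docs[:depth], start=1):
                f.write(f"{topic_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}\n")

# Example with placeholder identifiers
example = {"101": [("doc_zh_000123", 12.7), ("doc_zh_004567", 11.9)]}
write_trec_run(example, run_tag="my_clir_run", path="run.txt")
```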
3. Query and Relevance-Judgment Construction
NeuCLIRBench’s topics are derived from TREC NeuCLIR 2022–2024, ensuring high quality and diverse coverage:
- News and MLIR topics: Authored either by bilingual NIST assessor pairs (paired design) or by English-only coordinators supported by CLIR tools. Topics are validated to guarantee sufficient relevant documents (≥3 per query/language or language union for MLIR).
- Technical topics: Authored by STEM‐literate, Chinese‐fluent graduate students, emphasizing terminological specificity; all topics are reviewed and translated.
- Relevance judgments: Documents are pooled from prioritized team runs and judged by multiple assessors on a four-point scale (3: Very Valuable, 2: Somewhat Valuable, 1: Marginal, 0: Non-Relevant), which is mapped to a three-grade qrel scale ({3→3, 2→1, 1→0, 0→0}) for evaluation.
- Pooling procedures: Pools are constructed from top submissions and baseline runs to maintain coverage and fairness. Topic-drop rules prune uninformative or undersupported queries.
For report generation, “nugget” annotation protocols are applied: assessors enumerate atomic factoid requirements per request, annotate gold reports, and link each retrieved sentence/citation to nuggets with Full/Partial/No support judgments. This infrastructure enables discriminative, fine-grained report evaluation (Lawrie et al., 17 Sep 2025, Lawrie et al., 18 Nov 2025).
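To make the grade remapping described above concrete, here is a minimal sketch of collapsing the original four-point assessments into the three-grade qrel scale used for evaluation. The file names and the standard four-column TREC qrels layout (topic, iteration, doc ID, grade) are assumptions for illustration.

```python
# Minimal sketch: collapse the original four-point assessments
# (3: Very Valuable, 2: Somewhat Valuable, 1: Marginal, 0: Non-Relevant)
# into the three-grade qrel scale {3->3, 2->1, 1->0, 0->0}.
# Assumes the standard TREC qrels layout: topic_id iteration doc_id grade.

GRADE_MAP = {3: 3, 2: 1, 1: 0, 0: 0}

def remap_qrels(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            topic_id, iteration, doc_id, grade = line.split()
            fout.write(f"{topic_id} {iteration} {doc_id} {GRADE_MAP[int(grade)]}\n")

remap_qrels("qrels.raw.txt", "qrels.mapped.txt")  # placeholder file names
```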
4. Baseline Architectures and Participant Systems
NeuCLIRBench includes robust, multi-stage neural pipelines and baseline fusion runs:
- Sparse retrieval: BM25 (with and without RM3 query expansion), PATAPSCO, and SPLADE++ (learned sparse), in both query-translation (QT) and document-translation (DT) variants.
- Dense retrieval: ColBERT-X, PLAID-X, Qwen3-8B Embed, and xDPR (XLM-R based bi-encoders) encode text for cross-lingual semantic matching.
- Fusion: Reciprocal Rank Fusion (RRF, k=60) combines diverse first-stage runs to exploit complementary strengths.
- Reranking: Cross-encoder and generative rerankers (e.g., monoT5-large, mT5-XXL, GPT-4, Claude) re-score candidate lists, with final-stage LLM reranking yielding the best overall nDCG@20 and AP; a minimal reranking sketch follows this list.
- Report Generation: Best runs employ GPT-4 for summary composition, augmented with explicit citation filtering and in-sentence nugget coverage enforcement (Lawrie et al., 17 Sep 2025, Lin et al., 2023, Lawrie et al., 18 Nov 2025).
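As a lightweight stand-in for the large monoT5/mT5 and LLM rerankers used in top submissions, the sketch below re-scores a candidate list with an off-the-shelf cross-encoder over English document translations (the DT setting). The checkpoint name and candidate texts are illustrative assumptions, not the track's actual configuration.

```python
# Minimal reranking sketch over English document translations (DT setting).
# The checkpoint is an illustrative stand-in for the larger monoT5/mT5/LLM
# rerankers used in top NeuCLIR submissions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed English-only checkpoint

def rerank(query, candidates):
    """candidates: list of (doc_id, translated_text) pairs from a first-stage run."""
    scores = reranker.predict([(query, text) for _, text in candidates])
    return sorted(zip([doc_id for doc_id, _ in candidates], scores),
                  key=lambda pair: pair[1], reverse=True)

reranked = rerank("economic impact of sanctions on energy exports",
                  [("doc_ru_001", "..."), ("doc_ru_002", "...")])
```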
The included RRF-fusion starting point ensures that future reranker research is not limited by sparse first-stage retrieval (e.g., BM25), reflecting the shift to neural/hybrid retrieval (Lawrie et al., 18 Nov 2025).
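Reciprocal Rank Fusion itself is straightforward to reproduce; below is a minimal sketch using the conventional k=60 noted above. The run contents and document IDs are placeholders.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF) with k=60:
# each document scores sum(1 / (k + rank)) over the runs that retrieved it.
from collections import defaultdict

def rrf_fuse(runs, k=60, depth=1000):
    """runs: list of ranked doc_id lists (best first) for one topic."""
    fused = defaultdict(float)
    for ranking in runs:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:depth]

# Placeholder runs, e.g., a SPLADE++ DT run and a PLAID-X run for the same topic
fused = rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```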
5. Evaluation Protocols and Metrics
Standard IR and report-specific metrics are supported:
| Metric | Formula and Details |
|---|---|
| nDCG@k | $\mathrm{nDCG@k} = \mathrm{DCG@k} / \mathrm{IDCG@k}$, with $\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$ and IDCG@k the DCG of the ideal ranking |
| MAP | $\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{R_q} \sum_{k=1}^{n_q} P(k) \cdot \mathrm{rel}(k)$, where $R_q$ is the number of relevant documents for query $q$ |
| Precision@k | $\mathrm{P@k} = \frac{1}{k} \sum_{i=1}^{k} rel_i$ (fraction of the top $k$ that is relevant) |
| Recall@k | Fraction of relevant docs retrieved in the top $k$: $\frac{1}{R_q} \sum_{i=1}^{k} rel_i$ |
| RBP | $\mathrm{RBP} = (1 - p) \sum_{i=1}^{d} rel_i \, p^{\,i-1}$, $0 < p < 1$ (truncated at evaluation depth $d$) |
| ARGUE (Report Gen.) | Mean of Citation Precision, Nugget Recall, Nugget Support, Sentence Support ([0,1] scale) |
Citation Precision, Nugget Recall, Nugget Support, and Sentence Support metrics are supported for report-generation (ARGUE protocol), rewarding both retrieval faithfulness and nugget coverage (Lawrie et al., 17 Sep 2025).
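Since ARGUE is reported as the mean of its four [0,1]-valued components, the final aggregation step is trivial to reproduce; the sketch below illustrates it with placeholder component values and is not the official scorer, whose component values come from the nugget and citation judgments.

```python
# Minimal sketch of the final ARGUE aggregation: the mean of four components,
# each already computed on a [0, 1] scale from nugget/citation judgments.
# Placeholder values; not the official ARGUE scorer.

def argue_score(citation_precision, nugget_recall, nugget_support, sentence_support):
    components = (citation_precision, nugget_recall, nugget_support, sentence_support)
    return sum(components) / len(components)

print(argue_score(0.918, 0.85, 0.83, 0.89))  # illustrative numbers only
```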
6. Empirical Findings and Lessons
Analysis of 2022–2024 submissions shows:
- Top performance is achieved by pipelines combining dense+sparse fusion, large neural rerankers (mT5, monoT5), and LLM reranking (e.g., GPT-4), with nDCG@20 reaching 0.664 for Chinese CLIR and 0.698 for Persian CLIR. The best report-generation runs yield ARGUE scores up to 0.872 (Persian), with Citation Precision up to 0.918.
- Neural reranking consistently outperforms pure sparse or first-stage dense approaches.
- Fusing both original and translated representations (query and/or document side) provides more robust cross-language coverage, with some top runs bypassing translation for fully cross-lingual models.
- MLIR remains fundamentally more challenging than single-language CLIR; cross-language score calibration is essential, motivating learned fusion models.
- Relevance pools constructed from neural submissions demonstrate high reusability and stability (leave-one-run/team-out τ > 0.97 for nDCG@20).
- Technical-document CLIR requires dense retrieval tuned to technical domains, but LLM rerankers substantially close the gap to sparse models.
- ARGUE-based report evaluation is feasible and discriminative, but Citation Precision is still moderate (≈0.3 for some settings), indicating headroom for improvement in retrieval faithfulness (Lawrie et al., 17 Sep 2025, Lawrie et al., 11 Apr 2024, Lin et al., 2023).
7. Availability, Extensibility, and Future Directions
NeuCLIRBench is openly distributed via Hugging Face Datasets, including all documents (native and English translations), topics (with translations and descriptions), qrels, and baseline/fusion runs. Evaluation may be performed with trec_eval or pytrec_eval using provided qrels, and judged@k metrics help validate the coverage of system outputs.
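For local scoring against the distributed qrels, a minimal pytrec_eval sketch is shown below; the file names are placeholders, and the same run and qrels can equally be fed to trec_eval on the command line.

```python
# Minimal sketch: score a TREC-format run against the provided qrels with pytrec_eval.
# File names are placeholders; trec_eval on the command line gives the same numbers.
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            topic_id, _, doc_id, grade = line.split()
            qrels.setdefault(topic_id, {})[doc_id] = int(grade)
    return qrels

def read_run(path):
    run = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            topic_id, _, doc_id, _, score, _ = line.split()
            run.setdefault(topic_id, {})[doc_id] = float(score)
    return run

qrels, run = read_qrels("qrels.mapped.txt"), read_run("run.txt")
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg_cut"})
results = evaluator.evaluate(run)
mean_ndcg20 = sum(r["ndcg_cut_20"] for r in results.values()) / len(results)
mean_map = sum(r["map"] for r in results.values()) / len(results)
print(f"nDCG@20={mean_ndcg20:.3f}  MAP={mean_map:.3f}")
```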
The benchmark codifies a multi-stage architecture supporting extension to new retrieval, fusion, and reranking strategies. Forthcoming directions include:
- Integration of bi-encoder mT5 retrieval, contrastive bilingual pretraining, and dynamic hybrid pseudo-relevance feedback.
- Expansion of technical CLIR scenarios and generative report evaluation (e.g., with richer MLIR tasks and fairness-aware metrics).
- Full reproducibility with open-source indexing and retrieval scripts, facilitating rigorous system comparison (Lawrie et al., 18 Nov 2025, Lin et al., 2023, Lawrie et al., 17 Sep 2025).
NeuCLIRBench sets a new standard for cross-language and multilingual retrieval experimentation, catalyzing progress in neural and fusion-based IR, multilingual report generation, and retrieval-augmented language modeling.