
TREC NeuCLIR 2024 Collection

Updated 26 January 2026
  • TREC NeuCLIR 2024 Collection is a vast benchmark suite that evaluates neural cross-language and multilingual retrieval using extensive news and technical corpora.
  • It employs rigorous deep pooling and active learning for relevance judgments, yielding high statistical power and robust evaluation metrics.
  • The collection underpins fact-grounded report generation with nugget-based architectures, advancing retrieval-augmented generation research.

The TREC NeuCLIR 2024 Collection is the preeminent benchmark suite for evaluating neural cross-language information retrieval (CLIR), multilingual information retrieval (MLIR), and report generation systems over large-scale, multi-source document corpora. Building on the HC4 test collections and expanded for the TREC Neural CLIR track, this suite provides rigorously constructed datasets, in-depth graded relevance judgments, and comprehensive task protocols supporting both ad hoc search and retrieval-augmented generation. It encompasses native and machine-translated news archives in Chinese, Persian, and Russian, as well as a technical abstracts corpus, with thousands of queries and over a quarter million deep relevance judgments. The collection is foundational for methodologically robust CLIR and MLIR experimentation and now underpins RAG architectures leveraging explicit nugget-oriented factual evaluation.

1. Collection Composition and Origins

The TREC NeuCLIR 2024 Collection is anchored in news and technical-domain corpora:

  • News Corpora (“NeuCLIR-1”):
    • Chinese: ~3.18M documents (Common Crawl News, 2016–2021)
    • Persian: ~2.23M documents
    • Russian: ~4.63M documents (down-sampled for corpus balance)
    • Each corpus is natively authored and paired with machine translations into English using a high-quality Sockeye 2 Transformer model (FLORES-200 BLEU: 32.7–39.4; NTREX-128 BLEU: 29.3–30.5).
  • Technical Abstracts:
    • 396K Chinese academic abstracts (CSL dataset, 67 disciplines).
  • Judged Topics:
    • 56 Chinese, 68 Persian, 64 Russian topics for news CLIR.
    • 52 MLIR topics (requiring ≥2 relevant docs across languages).
    • 71 technical CLIR topics after pool-based pruning.
  • Topic Formulation:
    • Each topic has a “title + description + narrative” structure; runtime retrieval uses “title + description.”
    • Topics are manually authored in English and professionally translated.
    • Topic creation is tightly coupled to corpus presence—iterative pilot searches ensure each query covers a viable scope and content availability (Lawrie et al., 17 Sep 2025, Lawrie et al., 18 Nov 2025).

2. Relevance Assessment and Annotation Methodologies

Relevance judgments underpin the discriminatory power and depth of the collection:

  • Judgment Scales:
    • News: Four-point (0–3), collapsed to three gain values for evaluation (3→3, 2→1, 1→0, 0→0).
    • Technical: Three-valued (“very”, “somewhat”, “not” valuable).
  • Pooling and Topic Pruning:
    • Pools constructed via the top 100 docs per team for each topic (news), top 35 (technical).
    • Deep pooling ensures nearly complete coverage for nDCG@20 and strong statistical power: ~1,500–1,700 judgments per topic.
    • Topics with <3 or >20% relevant docs were excluded to stabilize evaluation.
  • Annotation Strategies:
    • HiCAL (built on Continuous Active Learning): an iterative classifier selects high-probability relevant documents for manual judgment, stopping after 20 consecutive non-relevant assessments or when preset per-topic relevance criteria are met (Lawrie et al., 2022).
    • All relevance assignments are done by bilingual assessors on the native-language text.
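The label collapse and topic-pruning rules above can be sketched in a few lines (a minimal illustration, not the track's official tooling; the minimum of three relevant documents and the 20% cap are taken from the pruning criteria described earlier):

```python
# Collapse the four-point news judgments (0-3) to the evaluation
# gains described above: 3 -> 3, 2 -> 1, and both 1 and 0 -> 0.
GAIN_MAP = {3: 3, 2: 1, 1: 0, 0: 0}

def to_gain(judgment: int) -> int:
    """Map a raw assessor label to its nDCG gain value."""
    return GAIN_MAP[judgment]

def keep_topic(judgments: list[int]) -> bool:
    """A topic is retained only if it has at least 3 relevant documents
    and at most 20% of its judged pool is relevant."""
    relevant = sum(1 for j in judgments if to_gain(j) > 0)
    return relevant >= 3 and relevant / len(judgments) <= 0.20

# 3 relevant documents out of 25 judged (12%) -> topic is kept.
print(keep_topic([3, 3, 2, 0, 0] + [0] * 20))
```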

3. Tasks, Retrieval Scenarios, and Evaluation Protocols

The collection supports a broad array of retrieval and generative tasks:

| Task | Source Language | Target Document(s) | Retrieval Output |
|------|-----------------|--------------------|------------------|
| News CLIR | English | Chinese, Persian, or Russian | Ranked list (1,000 docs) |
| News MLIR | English | Chinese ∪ Persian ∪ Russian | Single unified list (1,000 docs) |
| Technical CLIR | English | Chinese abstracts | Ranked list (1,000 docs) |
| Report Generation (pilot) | English | News docs (target language) | Citation-grounded report |
| Monolingual Retrieval | C/P/R/E | Native language | (Baseline, for pool depth) |
  • Protocols:
    • “Constrained” condition: no use of external bitext beyond defined corpora
    • “Open” condition: external MT/dictionary resources permitted
    • Submissions: up to six per language pair, supporting system diversity
  • Retrieval Scenarios:
    • Monolingual, cross-language (English→foreign), and MLIR are all explicitly supported.
    • Queries written in English or translated manually to maximize real-world comparability.
    • Multilingual ad-hoc ranking presents unique calibration challenges due to score normalization across languages (Lawrie et al., 17 Sep 2025, Lawrie et al., 18 Nov 2025).
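The calibration challenge for MLIR merging can be illustrated with per-language min-max normalization before pooling; this is one simple approach for exposition, not a track-prescribed method, and all document IDs and scores below are hypothetical:

```python
def minmax_normalize(scores: dict) -> dict:
    """Rescale one language's retrieval scores to [0, 1] so that lists
    produced on different score scales become comparable before merging."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a degenerate constant run
    return {doc: (s - lo) / span for doc, s in scores.items()}

def merge_mlir(per_language_runs: list, k: int = 1000) -> list:
    """Merge per-language {doc_id: score} runs into one unified ranked list."""
    pooled = {}
    for run in per_language_runs:
        pooled.update(minmax_normalize(run))
    return sorted(pooled, key=pooled.get, reverse=True)[:k]

# Hypothetical runs on very different score scales:
zho = {"zho_1": 14.2, "zho_2": 9.1}
rus = {"rus_1": 0.93, "rus_2": 0.41}
print(merge_mlir([zho, rus], k=3))
```

Without the normalization step, every Chinese document would outrank every Russian one purely because of the raw score scale, which is exactly the disparity the track's MLIR task surfaces.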

4. Baseline Systems and Neural Paradigms

TREC NeuCLIR 2024 catalyzed the transition from sparse, lexical CLIR baselines to dense, neural and generative-retrieval pipelines:

  • Sparse Baselines:
    • BM25 (Patapsco) in QGT (query translation), DT (document translation), and HT (human translation) configurations, all with/without RM3 expansion.
    • Probabilistic Structured Queries (HMM-based PSQ).
  • Dense and Hybrid Models:
    • PLAID-X: Twin-encoder, multi-vector dense retrieval, XLM-RoBERTa Large distilled from mT5.
    • MILCO: Learned-sparse retrieval, XLM-RoBERTa 0.6B-based, crosslingual expansion.
    • Qwen3-Embeddings: Single-vector dense.
    • Fusion: Reciprocal Rank Fusion (RRF) of PLAID-X, MILCO, and Qwen3 yields nDCG@20 ≈ 0.59 (cross-language), ≈0.47 (multilingual) (Lawrie et al., 18 Nov 2025).
  • Neural Rerankers:
    • Pointwise and listwise LLM rerankers (Qwen3, mT5, Rank1, Rank-K) over dense fusion outputs.
  • Dominant Pipelines:
    • Two-stage retrieval: dense (Orig+DT) first stage, then LLM-based cross-lingual reranking (MonoT5, GPT-4, Claude 2), with hybrid first-stage scoring score(d, q) = λ·s_dense(d, q) + (1 − λ)·s_sparse(d, q).
    • MLIR uses score fusion and generative reranking to temper cross-language disparity.
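A minimal sketch of the two fusion steps named above: Reciprocal Rank Fusion over per-system rankings, and linear interpolation of dense and sparse scores. The RRF constant k = 60 is the conventional default, assumed here rather than specified by the track, and the document IDs are hypothetical:

```python
from collections import defaultdict

def rrf_fuse(runs: list, k: int = 60, depth: int = 1000) -> list:
    """Reciprocal Rank Fusion: each run contributes 1/(k + rank) per
    document; documents ranked highly by several systems rise to the top."""
    scores = defaultdict(float)
    for ranked_docs in runs:
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:depth]

def interpolate(dense: dict, sparse: dict, lam: float = 0.5) -> dict:
    """Hybrid first-stage score: lambda * s_dense + (1 - lambda) * s_sparse."""
    docs = dense.keys() | sparse.keys()
    return {d: lam * dense.get(d, 0.0) + (1 - lam) * sparse.get(d, 0.0)
            for d in docs}

# Hypothetical rankings from three first-stage systems:
plaid = ["d1", "d2", "d3"]
milco = ["d2", "d1", "d4"]
qwen = ["d2", "d3", "d1"]
fused = rrf_fuse([plaid, milco, qwen])
```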

5. Evaluation Metrics and Statistical Power

Robust evaluation is enabled by deep pooling, granular labeling, and standardized metrics:

  • Primary Metrics:
    • nDCG@k, where DCG@k = Σ_{i=1}^{k} (2^{gain_i} − 1) / log₂(i + 1), normalized by the ideal DCG.
    • MAP@k, AP, Recall@k, Precision@k
    • Rank-Biased Precision (RBP), Mean Reciprocal Rank (MRR) (secondary)
  • Report Generation:
    • Auto-Argue toolkit for nugget-based evaluation: combines citation precision, nugget recall, nugget support, and sentence support (Dietz et al., 19 Jan 2026).
    • Explicit metrics:
    • Nugget recall: R_n = |Covered(G)| / |G|
    • Nugget density: D_n = |Covered(G)| / |C|
    • Citation grounding: G_c = |Supported(H)| / |H|
  • Pooling and Topic Stability:
    • Deep pools (~1,500–1,700 judgments per topic) and pruning of topics with fewer than 3 or more than 20% relevant documents keep per-topic metric estimates stable.
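The primary metric follows directly from the gain formula above (a minimal sketch; actual evaluation uses standard tooling such as trec_eval):

```python
import math

def dcg_at_k(gains: list, k: int) -> float:
    """DCG@k with the exponential gain form:
    sum over i of (2^gain_i - 1) / log2(i + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains: list, k: int) -> float:
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Gains from the collapsed news scale {0, 1, 3}:
print(ndcg_at_k([3, 0, 1, 3], k=4))
```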

6. Report Generation: Nugget-Based RAG and Empirical Insights

The 2024 pilot contains an explicit report-writing benchmark with granular gold-standard “nugget” annotations:

  • Example System (“Crucible”):
    • Initial retrieval via PLAID-X and alternatives (Qwen3, MILCO).
    • Q&A nugget bank generated per topic with LLMs; paraphrase deduplication using pairwise LLM calls.
    • SVC-based nugget ranking, sentence extraction, and per-nugget sentence assembly with explicit citation.
    • Optional LLM-judge verification mitigates citation drift.
  • Reported Performance (Rₙ: nugget recall, Dₙ: density, G_c: citation grounding):
| System | Rₙ | Dₙ | G_c |
|--------|-----|-----|-----|
| Crucible | 0.429 | 0.448 | 0.902 |
| Crucible + Verify | 0.438 | 0.457 | 0.961 |
| GINGER (GPT-4o) | 0.177 | 0.131 | 0.571 |
| BulletPoints | 0.508 | 0.340 | 0.835 |
  • Key Implications:
    • Nugget-centric architectures ensure systematic coverage and robust citation traceability.
    • Explicit per-fact sentence extraction and verification outperform cluster-based or purely latent methods for grounded generation.
    • Even the best runs reach only about 0.5 nugget coverage, leaving recall an open challenge, while citation grounding approaches or exceeds 0.9 only with explicit verification (Dietz et al., 19 Jan 2026).
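The three report metrics reduce to simple set ratios over gold nuggets (G), report content units (C), and cited claims (H); a toy sketch with hypothetical identifiers:

```python
def nugget_recall(gold: set, covered: set) -> float:
    """R_n = |Covered(G)| / |G|: fraction of gold nuggets the report covers."""
    return len(gold & covered) / len(gold)

def nugget_density(gold: set, covered: set, report_units: set) -> float:
    """D_n = |Covered(G)| / |C|: covered nuggets per report content unit."""
    return len(gold & covered) / len(report_units)

def citation_grounding(claims: set, supported: set) -> float:
    """G_c = |Supported(H)| / |H|: fraction of cited claims whose
    citations actually support them."""
    return len(claims & supported) / len(claims)

# Hypothetical toy report: 4 gold nuggets, 3 covered, 5 sentences,
# 5 cited claims of which 4 are supported by their citations.
gold = {"n1", "n2", "n3", "n4"}
covered = {"n1", "n2", "n3"}
sentences = {"s1", "s2", "s3", "s4", "s5"}
claims = {"c1", "c2", "c3", "c4", "c5"}
supported = {"c1", "c2", "c3", "c4"}
```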

7. Impact, Accessibility, and Continued Benchmarking

The TREC NeuCLIR 2024 Collection, as embodied in NeuCLIRBench, sets the standard for controlled research in CLIR and MLIR:

  • Statistical Discrimination:
    • 250,128 relevance labels over ca. 150 topics for mono-/cross-language retrieval and 100 topics for MLIR provide strong significance power.
  • Breaks Lexical Baseline Dependence:
    • Dense fusion and neural reranking now define the performance frontier, supporting method development outside the traditional BM25 paradigm.
  • Public Access and Ongoing Use:
    • Datasets and code are openly available at neuclir.github.io and Hugging Face (Lawrie et al., 18 Nov 2025).
    • Collections inform emerging tracks, such as RAGTIME 2025, dedicated to multi-lingual RAG and summarization evaluation.
  • A plausible implication is that, given its deep pools and multilingual topic diversity, NeuCLIR 2024 now serves as the reference collection for both system development and model-based evaluation, especially for fact-grounded RAG in low-resource and cross-domain scenarios.
