Synthetic Test Collections in IR

Updated 2 January 2026
  • Synthetic test collections are systematically generated datasets designed to evaluate information retrieval systems by simulating human relevance judgments.
  • They employ advanced LLM prompting methods, including zero-shot, few-shot, chain-of-thought, and ensemble techniques to generate synthetic queries and labels.
  • These collections provide scalable and cost-effective IR evaluation while raising challenges in bias control, calibration, and avoiding circular evaluation.

Synthetic test collections are systematically constructed datasets used for the evaluation of information retrieval (IR), ranking, and related systems, in which a substantial proportion of queries, documents, and/or relevance judgments are generated or labeled automatically rather than solely through traditional human annotation. These collections enable large-scale, cost-effective, and replicable evaluation, leveraging generative models—especially LLMs and their multimodal counterparts—across diverse domains, including but not limited to ad hoc passage, web, e-commerce, multilingual, multimedia, legal, and specialized scenario search.

1. Foundations and Evolution of Synthetic Test Collections

Synthetic test collections originated as an extension of the Cranfield paradigm, which has underpinned IR evaluation since the 1960s by pairing standardized sets of queries with human-curated relevance assessments. Historically, bottlenecks in corpus construction arose from the need for expensive manual labeling of large numbers of query-document pairs. Advancements in deep generative models, particularly LLMs, have enabled automatic synthesis of both queries and judgments (Rahmani et al., 2024). Early efforts in the area included fully synthetic document generation for relevance evaluation (Lioma et al., 2016) and initial data-driven labeling schemes that interpolated graded judgments numerically rather than through semantic understanding (Moniz et al., 2016). The current landscape includes large, openly released synthetic qrels sets as well as methodologies for both text and multimodal (e.g., image–text) evaluation tasks (Yang et al., 2024).

2. Methodologies for Synthetic Relevance Judgment

2.1 LLM-based Labeling Pipelines

Contemporary synthetic test collections rely primarily on LLM-based assessment, in which a model is prompted to assign a graded relevance label—typically on a 0–3 or 0–4 ordinal scale—to each query–document pair. Pipeline architectures include:

  • Zero-shot/few-shot prompting: Presenting the LLM with explicit task definitions and, in the few-shot case, exemplar (Q, D, label) triples (Jesus et al., 2024, Rahmani et al., 2024); a minimal zero-shot labeling sketch follows this list.
  • Chain-of-thought and criteria-based prompts: Decomposing the notion of relevance into subcriteria, such as exactness, topicality, coverage, and contextual fit, either prompting the LLM to score each separately and aggregate, or using multi-phase rubric-based elicitation (Farzi et al., 13 Jul 2025).
  • Fine-tuned open-source judging models: Training smaller, topic-specific classifiers (e.g., monoT5+LoRA) to mimic a given assessor, with significant gains in system ranking reliability (Gienapp et al., 6 Oct 2025).
  • Ensemble or blending methods: Aggregating multiple LLMs and/or prompt variants via majority or average voting to stabilize label variance and control intra-LM biases (Rahmani et al., 2024).
  • Multimodal extension: Employing instruction-tuned vision–language models (VLMs) such as GPT-4V, LLaVA, and CLIP derivatives, calibrated for label mapping post hoc or through quantile thresholding (Yang et al., 2024).
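
As a concrete illustration of the zero-shot setting referenced above, the following is a minimal labeling sketch in Python. It assumes an OpenAI-style chat-completions client; the prompt wording, the 0–3 scale description, the model name, and the parsing logic are illustrative assumptions rather than any cited pipeline's exact template.

```python
import re

from openai import OpenAI  # assumes an OpenAI-style chat-completions client

client = OpenAI()

PROMPT = """You are a relevance assessor. Given a query and a passage, assign a relevance grade:
0 = irrelevant, 1 = related but does not answer, 2 = partially answers, 3 = fully answers.
Reply with the grade only.

Query: {query}
Passage: {passage}
Grade:"""


def judge(query: str, passage: str, model: str = "gpt-4o") -> int:
    """Zero-shot graded relevance judgment for one query-passage pair."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # deterministic decoding, as recommended for judging
    )
    match = re.search(r"[0-3]", resp.choices[0].message.content)
    return int(match.group()) if match else 0  # fall back to "irrelevant" if parsing fails
```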

2.2 Synthetic Query and Document Generation

Synthetic test collections frequently require the generation of new test queries or even synthetic “pseudo-relevant” documents:

  • Synthetic query generation (QGen): Using LLMs in label-conditioned or pairwise prompt schemes to produce queries for a given document and relevance level, with recent work showing that jointly conditioned pairwise prompting outperforms absolute generation for zero-shot downstream ranking (Chaudhary et al., 2023); a query-generation sketch follows this list.
  • Document synthesis: Generating synthetic “ideal” relevant documents per query (e.g., char-level LSTM generation) to serve as ground-truth or hybrid evaluation objects (Lioma et al., 2016).
  • Semi-supervised pipelines: Employing dual-model frameworks for query synthesis and relevance estimation in data-scarce or domain-specific settings, including mechanisms to balance class distributions and generate fine-grained label coverage (Li et al., 20 Sep 2025).
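
A minimal sketch of label-conditioned query generation in the spirit of the QGen line of work above. The grade descriptions, prompt text, and model name are assumptions for illustration, not the published templates.

```python
from openai import OpenAI  # assumes an OpenAI-style chat-completions client

client = OpenAI()

# Illustrative mapping from target relevance grade to the kind of query requested.
GRADE_DESCRIPTIONS = {
    3: "a query that this passage answers completely",
    2: "a query that this passage partially answers",
    1: "a query that is topically related but not answered by this passage",
    0: "a query on a different topic that this passage does not address",
}


def generate_query(passage: str, target_grade: int, model: str = "gpt-4o") -> str:
    """Generate one synthetic query conditioned on a target relevance grade."""
    prompt = (
        f"Passage:\n{passage}\n\n"
        f"Write {GRADE_DESCRIPTIONS[target_grade]}. "
        "Return only the query text."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some diversity is usually desirable for query synthesis
    )
    return resp.choices[0].message.content.strip()
```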

2.3 Data-driven Approaches and Label Function Estimation

An alternative to semantic LLM labeling involves mapping item-level real-valued signals (e.g., user engagement, popularity) into continuous or piecewise relevance grades using monotonic interpolation functions, then evaluating using nDCG or analogous metrics (Moniz et al., 2016).
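
A minimal sketch of this data-driven alternative, assuming a single engagement signal per document: a piecewise-linear, monotonic label function maps the signal onto a 0–3 grade, which is then scored with nDCG (linear-gain variant). The anchor points are illustrative assumptions.

```python
import numpy as np

# Illustrative anchor points: a normalized engagement signal (e.g., click-through rate)
# mapped monotonically onto the 0-3 relevance scale.
SIGNAL_ANCHORS = [0.0, 0.05, 0.20, 1.0]
GRADE_ANCHORS = [0.0, 1.0, 2.0, 3.0]


def signal_to_grade(signal: float) -> float:
    """Piecewise-linear, monotonic mapping from engagement signal to relevance grade."""
    return float(np.interp(signal, SIGNAL_ANCHORS, GRADE_ANCHORS))


def dcg(grades: list[float]) -> float:
    # Linear-gain DCG: grade / log2(rank + 1), ranks starting at 1.
    return sum(g / np.log2(i + 2) for i, g in enumerate(grades))


def ndcg(ranked_signals: list[float], k: int = 10) -> float:
    """nDCG@k over grades derived from the interpolated label function."""
    grades = [signal_to_grade(s) for s in ranked_signals[:k]]
    ideal = sorted((signal_to_grade(s) for s in ranked_signals), reverse=True)[:k]
    return dcg(grades) / dcg(ideal) if dcg(ideal) > 0 else 0.0


# Example: per-document engagement signals in the order a system ranked them.
print(ndcg([0.30, 0.02, 0.12, 0.0, 0.08]))
```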

3. Evaluation Protocols and Metrics

Synthetic test collections are evaluated by comparing synthetic labels, and the system leaderboard orderings derived from them, to human-annotated standards.

  • Pairwise agreement: Cohen’s κ and Krippendorff’s α on the (Q,D) label matrix (per-item agreement).
  • System ranking fidelity: Kendall’s τ and Spearman’s ρ correlations between system run orderings under “real” and “synthetic” qrels (leaderboard fidelity).
  • Statistical significance: Empirical validation is typically via observed concordance in orderings; formal p-values or bootstrap tests remain infrequently reported.
  • Additional distributional metrics: Kullback–Leibler divergence between label distributions, mean absolute score inflation, and Bland–Altman analysis of bias (Rahmani et al., 12 Jun 2025).
  • Key reported values: κ ≈ 0.24–0.28 (synthetic vs. human, 4-point scale; Rahmani et al., 2024, Jesus et al., 2024); τ ≈ 0.82–0.95 for system orderings (Rahmani et al., 2024, Farzi et al., 13 Jul 2025).

Table: Typical Synthetic Test Collection Agreement Metrics

Evaluation              Metric          Typical Value
Label agreement         Cohen’s κ       0.24–0.28
System ranking corr.    Kendall’s τ     0.82–0.95
System ranking corr.    Spearman’s ρ    0.85–0.99

Absolute score agreement remains substantially lower than system-level ranking agreement, supporting the use of synthetic collections for comparative evaluation with caveats on absolute performance inflation (Rahmani et al., 12 Jun 2025).
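
The two levels of agreement described above can be computed with standard libraries; the sketch below uses scikit-learn and SciPy on toy qrels and run scores (the numbers are placeholders, not reported results).

```python
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import cohen_kappa_score

# Label-level agreement: human vs. synthetic grades on the same (query, doc) pairs.
human_labels = [0, 2, 3, 1, 0, 2, 1, 3]
synthetic_labels = [1, 2, 3, 1, 1, 2, 2, 3]
kappa = cohen_kappa_score(human_labels, synthetic_labels)

# System-level agreement: mean effectiveness of each run under human vs. synthetic qrels.
runs = ["bm25", "dense", "hybrid", "rerank"]  # run names aligned with the score lists
score_human = [0.41, 0.52, 0.55, 0.60]
score_synthetic = [0.48, 0.58, 0.63, 0.66]  # typically inflated, but order-preserving
tau, _ = kendalltau(score_human, score_synthetic)
rho, _ = spearmanr(score_human, score_synthetic)

print(f"Cohen's kappa = {kappa:.2f}, Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```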

4. Biases, Circularity, and Reliability Concerns

4.1 Bias and Narcissism

  • Absolute-score inflation: Synthetic labels tend to be more lenient, systematically elevating nDCG and MAP scores per run (Rahmani et al., 12 Jun 2025).
  • Model “narcissism”: Synthetic judge LLMs may favor retrieval systems built with similar architectures or corpora, e.g., GPT-4 judges providing statistically significant boosts to GPT-based ranking runs (Rahmani et al., 12 Jun 2025).
  • Evaluation bias in multimodal assessment: VLM-based qrels (e.g., CLIPScore) can strongly favor CLIP-based retrievers; LLM-based VLMs reduce but do not eliminate bias (Yang et al., 2024).

4.2 Circularity and Measurement Collapse

  • Circular evaluation: If the same (or a similar) LLM is used both to produce synthetic relevance judgments and as a component in a retrieval system, this creates a methodological “performance ceiling” and risks model collapse—the evaluation is only as good as the bias and coverage of the synthetic judge (Soboroff, 2024, Rahmani et al., 12 Jun 2025).
  • Proposed mitigations: Judge–retriever separation, ensembling diverse LLMs (JudgeBlender), calibration against human qrels, and active spot-checking (Rahmani et al., 2024).

4.3 Label Distributions and Overestimation

Synthetic judges tend to assign fewer “irrelevant” labels and more intermediate-grade labels, e.g., shifting the modal label from 0 to 1 with respect to human annotators; measures such as balanced label sampling and adversarial filtering have been tested to counter these effects (Rahmani et al., 12 Jun 2025, Li et al., 20 Sep 2025).
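
A small sketch of this distributional check, comparing label histograms from human and synthetic judges via KL divergence; the label lists are toy data chosen to show the modal shift from 0 to 1.

```python
import numpy as np
from scipy.stats import entropy


GRADES = [0, 1, 2, 3]


def label_distribution(labels, grades=GRADES, smoothing=1e-9):
    """Normalized label histogram with a small smoothing term to avoid zero bins."""
    counts = np.array([labels.count(g) for g in grades], dtype=float) + smoothing
    return counts / counts.sum()


human = [0, 0, 0, 1, 2, 0, 3, 1, 0, 2]       # modal label 0
synthetic = [1, 1, 0, 1, 2, 1, 3, 2, 1, 2]   # modal label shifted to 1

p, q = label_distribution(human), label_distribution(synthetic)
print("KL(human || synthetic) =", entropy(p, q))  # scipy's entropy(p, q) is the KL divergence
```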

5. Best Practices, Strengths, and Limitations

5.1 Best Practices

  • Prompt design: Explicit scale definitions, thorough coverage of edge cases in few-shot exemplars, criterion decomposition, and explicit system messages tuned for determinism (temperature=0) (Rahmani et al., 2024, Farzi et al., 13 Jul 2025).
  • Validation: Hold back a gold human-labeled subset (~25 queries) for calibration, both for label-level (Cohen’s κ) and system-level (Kendall’s τ) agreement (Rahmani et al., 2024).
  • Pool construction and coverage: Pool with sufficient depth (e.g., k=10 over all top candidate runs) to sample “borderline” cases, and apply minimal but rigorous filtering of poor synthetic queries or passages; a depth-k pooling sketch follows this list.
  • Bias assessment: Report both absolute and relative (system ordering) measures, check for per-system-type bias, and whenever possible, use multi-LLM ensemble methods (Gienapp et al., 6 Oct 2025, Rahmani et al., 2024).
  • Efficiency: Model selection tradeoffs—open-source 7–13B LLMs with prompt or ensemble aggregation methods can rival GPT-4 system-ranking reliability at a fraction of cost and with full reproducibility (Rahmani et al., 2024).
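
A minimal depth-k pooling sketch (referenced in the pool-construction item above), assuming each run is a per-query ranked list of document IDs; only the pooled (query, document) pairs would be sent to the synthetic judge.

```python
from collections import defaultdict


def build_pool(runs: dict[str, dict[str, list[str]]], depth: int = 10) -> dict[str, set[str]]:
    """Union of the top-`depth` documents from every run, per query.

    `runs` maps run_name -> {query_id -> ranked list of doc_ids}.
    """
    pool: dict[str, set[str]] = defaultdict(set)
    for ranking_by_query in runs.values():
        for qid, ranked_docs in ranking_by_query.items():
            pool[qid].update(ranked_docs[:depth])
    return pool


# Example with two toy runs; only the pooled (query, doc) pairs get judged.
runs = {
    "bm25": {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]},
    "dense": {"q1": ["d1", "d4", "d8"], "q2": ["d9", "d5"]},
}
for qid, docs in build_pool(runs, depth=2).items():
    print(qid, sorted(docs))
```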

5.2 Strengths

  • Cost and scalability: Drastic reduction in annotation cost and time, enabling orders-of-magnitude expansion in evaluation set size (Rahmani et al., 2024).
  • Coverage in low-resource or new domains: Synthetic judges can be rapidly applied in low-resource languages (Jesus et al., 2024), multimedia retrieval (Yang et al., 2024), and other emerging domains.
  • Leaderboard fidelity: High system-order preservation under synthetic labels supports their use for comparative benchmarking and offline model selection.

5.3 Limitations

  • Absolute calibration: Synthetic collections do not preserve absolute relevance scale; run scores should not be directly compared to those from human-labeled benchmarks (Rahmani et al., 12 Jun 2025).
  • Sustained reliability: Model drift (LLM updates), lack of transparency in closed-source models, and ongoing exposure to adversarial or difficult queries pose risks of distributional misalignment (Soboroff, 2024).
  • Generalization: Topic-specific judge adapters are highly effective for one topic–assessor pair but do not generalize; cross-topic or multi-dimensional relevance assessment remains open (Gienapp et al., 6 Oct 2025).
  • Evaluation in interactive, evolving, or longitudinal settings: Synthetic test collections primarily support offline leaderboard evaluation.

6. Domain-Specific and Advanced Synthetic Methods

6.1 Multicriteria and Rubric-Based Judging

Decomposing relevance into multiple axes (exactness, topicality, coverage, context) and either aggregating via an LLM prompt or a deterministic mapping improves robustness and allows for auditable, interpretable rationales. Such schemes, as in Multi-Criteria (Farzi et al., 13 Jul 2025), outperform traditional direct-labeling in leaderboard agreement across TREC DL and LLMJudge datasets.
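
A sketch of the deterministic-aggregation variant of criteria-based judging: per-criterion 0–3 scores (however they are elicited from the LLM) are combined into a single grade. The weights and thresholds below are assumptions for illustration, not the published Multi-Criteria mapping.

```python
CRITERIA = ("exactness", "topicality", "coverage", "contextual_fit")

# Assumed weights; the published Multi-Criteria mapping may differ.
WEIGHTS = {"exactness": 0.4, "topicality": 0.3, "coverage": 0.2, "contextual_fit": 0.1}


def aggregate(scores: dict[str, int]) -> int:
    """Map per-criterion 0-3 scores to a single 0-3 relevance grade."""
    weighted = sum(WEIGHTS[c] * scores[c] for c in CRITERIA)
    # Simple thresholding of the weighted average back onto the ordinal scale.
    if weighted >= 2.5:
        return 3
    if weighted >= 1.5:
        return 2
    if weighted >= 0.5:
        return 1
    return 0


print(aggregate({"exactness": 3, "topicality": 3, "coverage": 2, "contextual_fit": 1}))  # -> 3
```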

6.2 Ensemble and Hybrid Approaches

JudgeBlender aggregates multiple open-source models and prompt strategies (LLMBlender, PromptBlender), smoothing individual model biases, with system-ranking τ up to 0.961, outperforming single GPT-4 methods (Rahmani et al., 2024).
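
A sketch of the blending idea at the label level: grades from several judge models or prompt variants are aggregated by majority vote, with ties broken toward the lower grade to counter leniency. This aggregation rule is an illustrative assumption, not JudgeBlender's exact scheme.

```python
from collections import Counter


def blend(grades: list[int]) -> int:
    """Aggregate grades from multiple judges; majority vote, ties broken downward."""
    counts = Counter(grades)
    best = max(counts.values())
    # Among the most frequent grades, prefer the lowest to counter score inflation.
    return min(g for g, c in counts.items() if c == best)


# Example: three judge models / prompt variants for one (query, doc) pair.
print(blend([2, 3, 2]))  # -> 2
print(blend([1, 3, 2]))  # -> 1 (three-way tie broken toward the lower grade)
```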

6.3 Task-Specific Pipelines

  • Legal retrieval: Stepwise LLM pipelines model human-like reasoning (material/legal facts) to produce interpretable and expert-aligned annotations, improving domain-specific model performance (Ma et al., 2024).
  • Semi-supervised balancing in multimedia retrieval: Two-stage query and scoring models achieve balanced label generation for short-video retrieval, boosting both nDCG and online user metrics (Li et al., 20 Sep 2025).

7. Future Directions and Open Challenges

Synthetic test collections represent a pivotal shift in IR evaluation methodology, supporting unprecedented data scale and coverage. Key future research areas include:

  • Dynamic, continuously updated synthetic evaluation: Addressing evolving corpora, label drift, and adversarial adaptation.
  • Hybrid human–LLM workflows: Integrating human oversight for edge cases, adjudication, and quality assurance (Soboroff, 2024).
  • Statistical validation and robust significance testing: Bootstrap, permutation, or random-effects models to quantify leaderboard stability under synthetic qrels (Rahmani et al., 12 Jun 2025).
  • Continuous bias assessment and mitigation: Multi-LLM or cross-prompter ensemble generation, within- and cross-system bias auditing, and adversarial test case development.
  • Expansion to new domains and modalities: Systematic support of document, query, and label synthesis for under-served languages, domains (legal, clinical, e-commerce), and modalities (image, video, multimodal).

Synthetic test collections have thus become central to contemporary IR evaluation, combining LLM efficiency with statistical rigor—yet require ongoing methodological vigilance to ensure transparency, reliability, and fair assessment across diverse retrieval scenarios.
