
Synthetic Relevance Judgment

Updated 2 January 2026
  • Synthetic relevance judgment is a method that uses LLMs to automatically assign relevance labels to query–document pairs, reducing reliance on manual annotation.
  • It employs candidate retrieval and careful prompt engineering with models like GPT-4 and LLaMA to scale and extend IR test collections effectively.
  • Evaluation utilizes metrics such as Cohen's κ and Kendall's τ to assess label fidelity, while challenges like bias and prompt transparency warrant further research.

Synthetic relevance judgment refers to the automatic generation of relevance labels for query–document pairs using computational models, typically LLMs, instead of relying on manual annotation by human assessors. This practice has become prominent in the evaluation of information retrieval (IR) systems due to the escalating annotation costs associated with modern, large-scale test collections. Synthetic judgments are most commonly used in constructing and extending Cranfield-style IR test collections, where each query–passage pair must be assessed for retrieval system evaluation, ranking, and tuning. The 4-point relevance scale—{0 = irrelevant, 1 = related, 2 = highly relevant, 3 = perfectly relevant}—is now a standard in such synthetic pipelines, as exemplified by the LLMJudge shared task based on the TREC Deep Learning 2023 passage retrieval task (Rahmani et al., 2024).

1. Motivation and Problem Definition

The primary motivation for synthetic relevance judgments is the exceptional resource intensity of human annotation for large (or rapidly changing) IR test collections. Constructing a TREC-style collection with dozens of topics may require thousands of judgments, expert training, adjudication, and weeks of effort by multiple contractors. Automated judgment with LLMs offers:

  • Reduction of manual labeling costs and time.
  • Increased scalability in both query and topic coverage.
  • Potential for expanding collections into new domains or languages lacking expert annotators.

The typical use case is the assignment of a relevance label r ∈ {0, 1, 2, 3} to each (query, passage) pair. The intent is to maximize the utility and comparability of IR evaluation benchmarks while minimizing human annotation bottlenecks (Rahmani et al., 2024).
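
For concreteness, the scale can be written down as a plain mapping from integer label to label name. This is a trivial Python sketch using only the names given above, not an artifact of the shared task.

```python
# Canonical 4-point relevance scale (label -> name), as used in LLMJudge-style pipelines.
RELEVANCE_SCALE = {
    0: "irrelevant",
    1: "related",
    2: "highly relevant",
    3: "perfectly relevant",
}
```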

2. Methodologies for Generating Synthetic Judgments

The procedural workflow for synthetic labeling is as follows:

  1. Candidate Retrieval: For each query in the evaluation set (e.g., from TREC-DL 2023), retrieve or sample a pool of candidate passages. The retrieval strategy is participant-dependent and may use classical or neural methods.
  2. Prompting an LLM: The (query, passage) pair is provided as input to an LLM. The LLM is prompted—using either zero-shot or few-shot prompting, with prompt design varying by group—to output a scalar relevance label on the canonical 0–3 scale. The LLM can be a commercial model (e.g., GPT-4), a fine-tuned open-source model, or a combination of the two.
  3. Label Collection: The synthetic labels (qrels) for all candidate pairs are returned as the system's relevance judgments. These can fill "holes" in the test collection (unlabeled pairs) or be used to re-annotate the entire pool for consistency.

Key variations include prompt engineering (few-shot vs. zero-shot; rubric-based vs. direct scoring) and choice of model (closed-source vs. open-source; fine-tuning status). The pipeline has been operationalized in community challenges such as LLMJudge, which standardizes evaluation on the TREC-DL 2023 data (Rahmani et al., 2024). Exact prompt templates and code are hosted in repositories such as github.com/llm4eval/LLMJudge.
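
A minimal sketch of this pointwise judging loop appears below. It assumes the OpenAI Python client for step 2 and a placeholder `retrieve_candidates` function standing in for step 1; it illustrates the workflow described above and is not the LLMJudge reference implementation (see the repository for the actual prompts and code).

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment

client = OpenAI()

JUDGE_PROMPT = """You are a relevance assessor. Given a query and a passage, output a single
integer label: 0 = irrelevant, 1 = related, 2 = highly relevant, 3 = perfectly relevant.
Query: {query}
Passage: {passage}
Label:"""

def judge_pair(query: str, passage: str, model: str = "gpt-4") -> int:
    """Pointwise synthetic judgment for one (query, passage) pair (zero-shot, temperature 0)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic sampling, as in typical closed-source submissions
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, passage=passage)}],
    )
    text = response.choices[0].message.content.strip()
    digits = [c for c in text if c in "0123"]
    return int(digits[0]) if digits else 0  # fall back to 0 if the model output is malformed

def build_qrels(queries: dict, retrieve_candidates) -> list:
    """Steps 1-3 end to end: retrieve candidates, judge each pair, collect synthetic qrels."""
    qrels = []
    for qid, query in queries.items():
        for docid, passage in retrieve_candidates(query):  # step 1: candidate retrieval (any retriever)
            label = judge_pair(query, passage)              # step 2: prompt the LLM
            qrels.append((qid, docid, label))               # step 3: collect the synthetic label
    return qrels
```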

3. Model and Prompt Choices

Although the LLMJudge challenge (Rahmani et al., 2024) does not enumerate all models or hyperparameters used, it establishes two dominant categories:

  • Closed-source LLMs: Example—GPT-4, accessed via API, typically using zero- or few-shot prompting with deterministic sampling (temperature = 0).
  • Fine-tuned Open-source LLMs: Example—LLaMA or similar models, fine-tuned on task-relevant (or even human-labeled) pairs and queried with customized prompts.

Prompt engineering is critical to judgment accuracy. Options include minimal instruction, elaborate few-shot demonstrations of correct labels, rubrics describing the relevance levels, and chain-of-thought (CoT) rationales. The challenge data points to prompt design and model selection as the primary determinants of label quality.
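
As one concrete illustration of these options, a few-shot, rubric-style prompt simply prepends the rubric and worked demonstrations to the pair being judged. The demonstrations below are invented for illustration and are not drawn from TREC-DL 2023 or the LLMJudge repository.

```python
# Illustrative few-shot judging prompt; the demonstration pairs are made up for this sketch.
FEW_SHOT_JUDGE_PROMPT = """Assess the relevance of the passage to the query on a 0-3 scale:
0 = irrelevant, 1 = related, 2 = highly relevant, 3 = perfectly relevant.

Query: how tall is the eiffel tower
Passage: The Eiffel Tower is about 330 metres tall, roughly the height of an 81-storey building.
Label: 3

Query: how tall is the eiffel tower
Passage: The Eiffel Tower was completed in 1889 as the entrance arch to the World's Fair in Paris.
Label: 1

Query: {query}
Passage: {passage}
Label:"""
```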

4. Evaluation Frameworks and Metrics

Evaluation of synthetic relevance judgments proceeds at two levels, each quantified by a canonical metric:

  • Point-wise Agreement: The extent to which synthetic labels agree with human labels for individual (query, passage) pairs. This is measured by Cohen's κ, where values observed for LLM-based labelers cluster in the 0.3–0.6 range—considered fair-to-moderate agreement.
  • System-level Ordering: The degree to which retrieval systems are ranked in the same order when evaluated under synthetic vs. human judgments. This is assessed by Kendall's τ, with values tightly clustered around τ ≈ 0.8. Notably, system ranking agreement demonstrates lower variance than the pointwise κ, indicating synthetic judgments can preserve leaderboard fidelity even when label-level agreement is imperfect (Rahmani et al., 2024).

Repeated findings across studies confirm that synthetic judgment pools can robustly reproduce system orderings but tend to inflate absolute system effectiveness scores and display variable labelwise alignment (Rahmani et al., 2024).
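
Both metrics are available in standard Python libraries. The sketch below uses toy numbers purely to show the computation; note that scikit-learn's `cohen_kappa_score` also supports a quadratic-weighted variant that is sometimes preferred for ordinal scales.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Point-wise agreement: per-pair labels from the LLM vs. the human reference (toy values).
human_labels     = [3, 0, 2, 1, 0, 2, 3, 1]
synthetic_labels = [3, 1, 2, 1, 0, 3, 2, 1]
kappa = cohen_kappa_score(human_labels, synthetic_labels)

# System-level ordering: effectiveness (e.g., NDCG@10) of each retrieval system under the two
# label sets; Kendall's tau compares the induced leaderboards (again, toy values).
scores_under_human     = [0.61, 0.55, 0.48, 0.44, 0.39]
scores_under_synthetic = [0.70, 0.66, 0.52, 0.55, 0.45]
tau, p_value = kendalltau(scores_under_human, scores_under_synthetic)

print(f"Cohen's kappa = {kappa:.2f}, Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```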

5. Empirical Findings and Observed Biases

In the LLMJudge challenge, 39 systems from seven international groups provided synthetic labels, which were benchmarked against a held-out human-labeled test set. Key findings:

  • System ranking by Kendall's τ: Most synthetic labelers produced system orderings highly consistent with those derived from human assessments. Variance in τ was low, signifying stability of competitive result orderings even as label-level agreement fluctuated.
  • Label-level correlation: Cohen's κ for individual judgments ranged more widely, indicating diversity in how LLMs mapped fine-grained relevance decisions vis-à-vis the human reference.
  • No detailed bias or leakage analysis: The main challenge report (Rahmani et al., 2024) does not present results on topical bias, length bias, or leakage from training corpora into labeling decisions. Nor does it conduct statistical tests beyond κ and τ.
  • Prompt and model dependence: The organizers emphasize that choice of prompt and LLM substantially affects outcome quality, with neither open- nor closed-source models clearly dominating absent prompt-specific tuning.

6. Current Limitations and Open Questions

Critical limitations and open technical challenges remain:

  • Prompt transparency: The exact prompt templates contributing most to judgment fidelity are provided only in external repositories; they are not standardized in the literature.
  • Model list and hyperparameters: No comprehensive summary exists of all models (architectures, sizes) used in benchmark submissions, nor of their training regimens.
  • Bias and leakage: Robust quantification of systematic biases in synthetic labels—such as model in-group bias, over-relevance, or leakage from training data—remains an open research agenda for future benchmarking.
  • Lessons learned: The challenge report offers no formal best-practice recommendations but strongly cautions practitioners to (a) experiment with multiple prompts and model classes, and (b) conduct label calibration to ensure synthetic pools do not distort evaluation outcomes.
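
The challenge report does not prescribe a calibration procedure. One deliberately simple, purely illustrative scheme is to learn a label-to-label mapping on a small human-judged calibration subset, for example by remapping each synthetic label to the human label it most often co-occurs with:

```python
from collections import Counter, defaultdict

def fit_label_calibration(synthetic, human):
    """Map each synthetic label to the human label it most often co-occurs with on a
    small human-judged calibration subset (a deliberately naive scheme for illustration)."""
    by_synthetic = defaultdict(Counter)
    for s, h in zip(synthetic, human):
        by_synthetic[s][h] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in by_synthetic.items()}

def apply_calibration(mapping, labels):
    return [mapping.get(label, label) for label in labels]  # leave unseen labels unchanged

# Toy example: the LLM tends to assign 2 where human assessors assign 1.
mapping = fit_label_calibration(synthetic=[2, 2, 3, 0, 2], human=[1, 1, 3, 0, 2])
print(apply_calibration(mapping, [2, 3, 0]))  # -> [1, 3, 0] under this toy calibration set
```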

7. Practical Guidance for Deployment

For practitioners intending to utilize or benchmark synthetic relevance judgments:

  1. Clone community repositories (e.g., LLMJudge) for access to exemplar prompts and code.
  2. Select or construct a retrieval pipeline to surface document candidates for each query.
  3. Choose multiple LLMs, considering fine-tuning and prompt engineering, and iteratively optimize prompt quality.
  4. Score all (query, passage) pairs under evaluation, adhering strictly to the target relevance scale.
  5. Validate label fidelity using Cohen's κ and system-level Kendall's τ against available human references.
  6. Remain vigilant for model-induced biases, calibration drift, or prompt leakage, especially in high-stakes or comparative evaluation tasks.
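
If the resulting labels are to be consumed by standard IR evaluation tooling such as trec_eval, writing them in the conventional TREC qrels format is a natural final step. A minimal sketch with toy identifiers:

```python
def write_trec_qrels(qrels, path):
    """Write synthetic judgments in the standard TREC qrels format:
    `<query_id> 0 <doc_id> <relevance>` (the second column is an unused iteration field)."""
    with open(path, "w") as f:
        for qid, docid, label in qrels:
            f.write(f"{qid} 0 {docid} {label}\n")

# Toy identifiers for illustration; a real run would use TREC-DL 2023 query and passage IDs.
write_trec_qrels([("q1", "doc1", 2), ("q1", "doc2", 0)], "synthetic.qrels")
```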

In summary, the generation of synthetic relevance judgments using LLMs is now a viable methodology for constructing and extending IR test collections. While system-level rankings can be stably reproduced, practitioners must carefully manage prompt engineering, model selection, and validation if synthetic judgments are to supplement or replace human annotation (Rahmani et al., 2024).
