Sci2Pol-Bench: Science-to-Policy Benchmark

Updated 2 October 2025
  • Sci2Pol-Bench is a benchmark framework and corpus that evaluates LLMs on science-to-policy translation using a detailed five-stage taxonomy.
  • It integrates tasks such as autocompletion, summarization, generation, and verification to measure clarity, factual accuracy, and content coverage.
  • The framework supports controlled fine-tuning, enabling compact models to outperform larger generalists in generating robust policy briefs.

Sci2Pol-Bench is a targeted benchmark framework and corpus designed specifically for the evaluation and fine-tuning of LLMs on the task of policy brief generation from scientific papers. Developed to address the gap in systematic, fine-grained measurement and improvement of science-to-policy translation capabilities in LLMs, Sci2Pol-Bench comprises a multifaceted benchmark suite and a meticulously curated dataset, supporting empirical study of model performance, controllable text generation, and effective adaptation for policy communication tasks (Wu et al., 25 Sep 2025).

1. Taxonomy and Task Design

Sci2Pol-Bench is underpinned by a five-stage taxonomy that emulates the human process of writing policy briefs, thereby enabling disaggregated and holistic evaluation of LLM abilities. The stages are:

  1. Autocompletion: Tasks require models to complete passages from scientific papers or policy briefs, such as next-sentence prediction or ordering, assessing local coherence and language fluency.
  2. Understanding: Models are prompted to classify sentences into categories (e.g., “Policy Problem,” “Methods,” “Policy Implications”) or to answer factual multiple-choice questions, testing deep comprehension and information extraction.
  3. Summarization: Condensation of long texts into targeted summaries (policy problem, research findings, methods, implications), emphasizing clarity and coverage.
  4. Generation: Section-by-section and whole-brief generation tasks evaluate the ability to synthesize actionable content, separated from mere factual summarization to better measure the generation of coherent, policy-relevant exposition.
  5. Verification: Fact-checking tasks entail comparison between candidate policy brief claims and the originating scientific paper, incentivizing factual accuracy and minimization of unsupported content.

A visual summary of these interrelated stages is provided in the paper as the “Sci2Pol-Taxonomy” (Fig. 1) (Wu et al., 25 Sep 2025).
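The stage-and-task organization can be viewed as a simple mapping from stages to task types. The sketch below is purely illustrative: the stage names follow the paper, but the task labels and formats are hypothetical placeholders, not the benchmark's exact task identifiers.

```python
# Illustrative view of the five-stage taxonomy as a data structure.
# Stage names follow the paper; the task labels and formats below are
# hypothetical placeholders, not the benchmark's exact task identifiers.
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    fmt: str  # "multiple_choice" or "open_ended"

TAXONOMY = {
    "Autocompletion": [Task("next_sentence_prediction", "multiple_choice"),
                       Task("sentence_ordering", "multiple_choice")],
    "Understanding":  [Task("sentence_classification", "multiple_choice"),
                       Task("factual_qa", "multiple_choice")],
    "Summarization":  [Task("policy_problem_summary", "open_ended"),
                       Task("research_findings_summary", "open_ended")],
    "Generation":     [Task("section_generation", "open_ended"),
                       Task("whole_brief_generation", "open_ended")],
    "Verification":   [Task("claim_fact_check", "multiple_choice")],
}

# Evaluation code can aggregate scores stage by stage over this mapping.
for stage, tasks in TAXONOMY.items():
    print(f"{stage}: {[t.name for t in tasks]}")
```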

2. Benchmark Tasks and Evaluation Metrics

The benchmark features 18 tasks, spanning multiple-choice and open-ended formats across the taxonomy stages. For the Generation stage, the paper identifies critical limitations with standard metrics:

  • BERTScore fails to adequately penalize missing content: because its token-level similarity matching still rewards whatever content remains, scores stay inappropriately high even as essential information is deleted.
  • ROUGE is overly sensitive to word order and paraphrase, penalizing meaning-preserving rewrites (a computational sketch of both metrics follows this list).
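The following sketch probes these behaviors directly. It assumes the open-source `rouge_score` and `bert_score` Python packages and uses invented example sentences; the specific score values depend on the underlying embedding model and are not taken from the paper.

```python
# Compute ROUGE-L and BERTScore for a meaning-preserving paraphrase and a
# truncated candidate against the same reference. The example sentences are
# invented; the paper's observation is that ROUGE penalizes the paraphrase
# while BERTScore stays inappropriately high despite the deleted content.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = ("Rising sea levels threaten coastal infrastructure; "
             "the study recommends updated zoning rules by 2030.")
paraphrase = ("Coastal infrastructure is at risk from sea-level rise, and the "
              "authors advise revising zoning rules before 2030.")
truncated = "Rising sea levels threaten coastal infrastructure."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L (paraphrase):", scorer.score(reference, paraphrase)["rougeL"].fmeasure)
print("ROUGE-L (truncated): ", scorer.score(reference, truncated)["rougeL"].fmeasure)

_, _, f1 = bert_score([paraphrase, truncated], [reference, reference], lang="en")
print("BERTScore F1 (paraphrase, truncated):", [round(x, 3) for x in f1.tolist()])
```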

To overcome these issues, an LLM-based evaluation metric is introduced. Gemini-2.5-Pro serves as an automatic, paper-grounded judge, scoring outputs along dimensions such as clarity, factual accuracy, coverage, and overall quality using structured rubrics. For example, in “Policy Problem Generation” (Task 11), the judge evaluates completeness across five components (background, problem, consequences, attention, and supporting detail), with a raw score defined as:

S_\mathrm{raw} = \sum_{c \in C} \mathrm{score}(c), \qquad C = \{\mathrm{background},\ \mathrm{problem},\ \mathrm{consequence},\ \mathrm{attention},\ \mathrm{detail}\}

The final output is then normalized to conform to the task’s scoring protocol. Similar rubric-driven scoring is applied to the other generation and summarization tasks.
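As a concrete illustration of the rubric-and-normalization recipe, the sketch below scores a generated Policy Problem section. The `call_judge` wrapper, the 0–2 per-component scale, and the normalization to [0, 1] are assumptions made here for illustration; the benchmark's actual judge prompts and scoring protocol may differ.

```python
# Rubric-based judge scoring for Task 11 (Policy Problem Generation), as a
# sketch. call_judge is a hypothetical wrapper around the judge model
# (Gemini-2.5-Pro in the paper); the 0-2 component scale and the [0, 1]
# normalization are illustrative assumptions, not the exact protocol.
COMPONENTS = ["background", "problem", "consequence", "attention", "detail"]
MAX_PER_COMPONENT = 2  # assumed per-component scale

def call_judge(paper: str, section: str, component: str) -> int:
    """Hypothetical judge call: prompt the model with the full paper, the
    generated section, and a rubric for one component; return an integer score."""
    raise NotImplementedError("connect the judge-model API here")

def score_policy_problem(paper: str, section: str) -> float:
    raw = sum(call_judge(paper, section, c) for c in COMPONENTS)  # S_raw
    return raw / (MAX_PER_COMPONENT * len(COMPONENTS))            # normalized score
```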

3. Sci2Pol-Corpus: Dataset Construction

Sci2Pol-Corpus is the corresponding training set for fine-tuning LLMs on policy brief production. The construction pipeline is as follows (Wu et al., 25 Sep 2025); a compressed sketch of the filtering logic appears after the list:

  • Candidate Extraction: 5.6 million policy documents were indexed, and candidate pairs were extracted by matching scientific papers to policy documents that cite at most three references, yielding 140,000 potential pairs.
  • LLM-Based Filtering: In a coarse filtering stage, document abstracts (from SciSciNet) and policy brief text were compared via LLM prompts (using GPT-o3), discarding pairs with low content relevance. A follow-up fine filtering stage analyzed alignment between the full scientific paper and the policy brief, with heuristics for excessive length or summary substitution.
  • In-context Polishing: The highest-quality retained policy briefs were further “polished” by prompting GPT-o3 with three expert-written reference pairs, yielding uniformity in tone, structure, and clarity in the final corpus.
  • Result: The final dataset comprises 639 expert-aligned, high-fidelity scientific paper–policy brief pairs, suitable for supervised fine-tuning.
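The sketch below compresses this pipeline into code. The `llm_relevance`, `llm_alignment`, and `llm_polish` helpers stand in for prompts to the filtering/polishing model (GPT-o3 in the paper), and the threshold and length values are placeholders rather than the paper's settings.

```python
# Compressed sketch of the Sci2Pol-Corpus construction pipeline. The three
# llm_* helpers are hypothetical wrappers around the filtering/polishing
# model; the relevance threshold and word-count heuristic are placeholders.
from typing import List, Tuple

def llm_relevance(abstract: str, brief: str) -> float:
    """Coarse filter: rate abstract-brief topical relevance on [0, 1]."""
    raise NotImplementedError

def llm_alignment(full_paper: str, brief: str) -> bool:
    """Fine filter: judge whether the brief is grounded in the full paper."""
    raise NotImplementedError

def llm_polish(brief: str, examples: List[Tuple[str, str]]) -> str:
    """Polish the brief in-context, guided by expert-written reference pairs."""
    raise NotImplementedError

def build_corpus(candidates: List[Tuple[dict, dict]],
                 references: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    corpus = []
    for paper, brief in candidates:
        if llm_relevance(paper["abstract"], brief["text"]) < 0.5:   # coarse filter
            continue
        if not llm_alignment(paper["full_text"], brief["text"]):    # fine filter
            continue
        if len(brief["text"].split()) > 2000:                       # length heuristic
            continue
        polished = llm_polish(brief["text"], examples=references[:3])
        corpus.append((paper["full_text"], polished))
    return corpus
```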

4. Fine-Tuning and Comparative Model Evaluation

Using Sci2Pol-Corpus, three leading open-source architectures were fine-tuned:

Model                   | Parameter Count | Origin
LLaMA-3.1-8B-Instruct   | 8B              | Meta
Gemma-12B-Instruct      | 12B             | Google
Gemma-27B-Instruct      | 27B             | Google

The fine-tuning utilized contemporary parameter-efficient adaptation protocols (e.g., LoRA) and domain-specific hyperparameters (a cosine learning-rate schedule, multiple epochs). Results were benchmarked across the full taxonomy against 13 leading open-source and commercial LLMs, including GPT-4o and DeepSeek-V3 (671B).
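A minimal sketch of such an adaptation setup, using Hugging Face transformers with peft-style LoRA, is shown below; the specific rank, learning rate, batch size, and epoch count are illustrative placeholders, not the hyperparameters reported in the paper.

```python
# LoRA fine-tuning sketch for one of the three models (LLaMA-3.1-8B-Instruct).
# Hyperparameter values are illustrative placeholders, not the paper's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="sci2pol-llama3.1-8b",
    num_train_epochs=3,             # "multiple epochs"
    learning_rate=2e-4,
    lr_scheduler_type="cosine",     # cosine learning-rate schedule
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)
# A Trainer (e.g., trl's SFTTrainer) would then be run over the 639
# paper-brief pairs in Sci2Pol-Corpus.
```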

Findings:

  • After adaptation on Sci2Pol-Corpus, all three fine-tuned models achieved consistent gains of roughly +2 to +3 points in stage-wise averages compared to their base versions.
  • Notably, Gemma-27B-Instruct surpassed both GPT-4o and DeepSeek-V3 on Sci2Pol-Bench’s generation tasks, despite being an order-of-magnitude smaller, illustrating the advantage of domain-specific supervision over scale in this specialized setting.
  • State-of-the-art generalist models struggled, particularly in the generation and verification stages, revealing substantial headroom for science-to-policy specialization.

5. Scientific and Policy Implications

Sci2Pol-Bench and Sci2Pol-Corpus extend beyond mere benchmarking; together, they provide a quantitatively controlled research substrate for understanding, improving, and ultimately automating the science-to-policy translation process:

  • Granular Diagnostic Capability: The five-stage structure allows identification of subtask-specific LLM limitations (e.g., hallucination in generation but not summarization, or comprehension failures in understanding tasks).
  • Metric Alignment with Policy Needs: LLM-based judgment overcomes the shortcomings of lexical overlap metrics, mapping more directly to clarity, factual accuracy, and actionable insight, which are requisite for policy application.
  • Bridging Model Size and Specialization: The demonstration that compact, fine-tuned models outperform vastly larger generalists underlines the leverage of targeted adaptation for domain- and task-specific policy communication.
  • Potential for Rapid Response Policy Support: By automating the cognition, condensation, and communication stages, Sci2Pol-Bench provides the machinery for timely, robust, and accurate policy brief generation from evolving scientific evidence.

6. Relation to Broader Benchmarking Initiatives and Methodological Significance

Sci2Pol-Bench shares methodological DNA with wider scientific benchmarking efforts, such as protocol control in nonequilibrium statistical mechanics (e.g., NESTbench25 (Whitelam et al., 18 Jun 2025)) and robust dataset partitioning for ML (e.g., BenchMake (Barnard, 29 Jun 2025)). Key points of intersection include:

  • Systematic Partitioning and Evaluation: The emphasis on rigorous, reproducible benchmarking splits and evaluation metrics aligns with approaches introduced in BenchMake, although Sci2Pol-Bench focuses on textual and policy modalities rather than multimodal edge case coverage.
  • Grounding in Real-World Application: Like NESTbench25, which offers interpretable targets and analytic optima against which algorithmic performance may be evaluated, Sci2Pol-Bench provides expert references and judge models for the policy brief task, ensuring real-world relevance and actionable validation.
  • Modularity and Extensibility: The task and metric taxonomy of Sci2Pol-Bench provides a scalable template for future benchmarks targeting translation of other specialized scientific outputs into actionable forms beyond policy (e.g., regulatory, clinical, or technical briefs).

7. Outlook and Impact

Sci2Pol-Bench represents the first systematic effort to operationalize the evaluation and targeted improvement of LLMs for science-to-policy translation at scale (Wu et al., 25 Sep 2025):

  • It enables controlled, replicable progress in the fidelity and utility of LLM-generated policy outputs, opening new research directions in controllable generation, factuality, and evidence-based communication.
  • The innovations in automated evaluation, dataset construction, and fine-tuning protocol serve as reference points for future benchmarks at the science–society interface.
  • By demonstrating that policy-grade outputs can emerge from targeted adaptation rather than sheer parameter scaling, Sci2Pol-Bench provides a paradigm for efficient, practicable deployment of AI in evidence-based decision processes.

A plausible implication is the emergence of semi-automated, expert-validated pipelines for critical science-to-policy tasks across domains—enabling rapid, reliable knowledge transfer in contexts of urgent societal need.
