
CLAIM-BENCH: Scientific Claim-Evidence Benchmark

Updated 6 July 2025
  • CLAIM-BENCH is a benchmark framework that maps complex claim–evidence relationships in full-length research papers to assess LLM performance.
  • It employs detailed annotation schemas and evaluation metrics such as precision, recall, F1-score, and a unique sentence_gap measure.
  • Comparative experiments using iterative prompting strategies reveal trade-offs between recall, precision, and computational efficiency in scientific reasoning.

CLAIM-BENCH is a comprehensive benchmark and evaluation framework for scientific claim–evidence extraction and validation, designed to test and compare LLMs on their ability to process complex scientific documents and reason over explicit claim–evidence relationships. It introduces detailed annotation procedures, multiple evaluation protocols, and a set of diagnostic metrics which together address the multifaceted challenges inherent in scientific argument comprehension (2506.08235).

1. Benchmark Architecture and Annotation Schema

CLAIM-BENCH operates over a dataset of full-length research papers, each annotated by domain experts with a one-to-many mapping between claims—explicit scientific assertions, hypotheses, or findings—and their corresponding supporting or contradicting evidence. The annotation process employs an interactive tool for marking text spans of both claims and evidential fragments, exporting outputs in structured formats (e.g., JSON; a sketch of one such record follows the list) that record:

  • The claim text and its location/type.
  • Evidence text spans, which may be non-contiguous and often widely dispersed across the paper.
  • Metadata enabling the computation of contextual distances (such as the “sentence_gap” metric).
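To make the export format concrete, the following is a minimal sketch of what one annotation record might look like. The field names are illustrative assumptions, not the schema of the released dataset:

```python
# Hypothetical illustration of one exported CLAIM-BENCH annotation record.
# All field names below are assumptions for illustration; the released schema may differ.
annotation_record = {
    "paper_id": "example-2024-001",
    "claim": {
        "text": "Method X improves accuracy over the baseline.",
        "location": {"section": "Results", "sentence_index": 142},
        "type": "finding",
    },
    "evidence": [
        {
            "text": "Table 2 shows a 4.1 point gain over the baseline.",
            "span": {"start_sentence": 158, "end_sentence": 158},
            "stance": "supporting",
        },
        {
            "text": "On the out-of-domain split the gain shrinks to 0.3 points.",
            "span": {"start_sentence": 201, "end_sentence": 201},
            "stance": "contradicting",
        },
    ],
}
```

The sentence indices in such a record are what make distance-based diagnostics like sentence_gap computable directly from the exported annotations.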

This structured approach ensures that CLAIM-BENCH moves beyond classification and sentence extraction tasks, embedding full-document scientific argument mapping into its evaluation protocol.

2. Evaluation Protocols and Metrics

CLAIM-BENCH adopts several metrics to quantify system performance in scientific comprehension (a computational sketch follows the definitions):

  • Precision ($P$): $P = \frac{TP}{TP + FP}$, where $TP$ is the number of true positives (correctly matched pairs) and $FP$ is the number of false positives.
  • Recall ($R$): $R = \frac{TP}{TP + FN}$, with $FN$ the number of false negatives.
  • F1-score: $F_1 = \frac{2 \times P \times R}{P + R}$, the harmonic mean of precision and recall.
  • Sentence_gap: A novel evaluation metric quantifying the average sentence distance between extracted claims and their corresponding evidence:

$$\text{sentence\_gap} = \frac{1}{|M|} \sum_{(p,g) \in M} |s(p) - s(g)|$$

where $M$ is the set of matched claim–evidence pairs (as defined by intersection-over-union span matching) and $s(\cdot)$ is the sentence index.
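To make these definitions concrete, here is a minimal sketch of how precision, recall, F1, and sentence_gap could be computed from predicted and gold claim–evidence pairs. It assumes a simple greedy IoU-based span matcher and measures the gap between the start sentences of matched spans; both choices are illustrative assumptions, not the benchmark's reference implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pair:
    claim_span: tuple[int, int]     # (start_sentence, end_sentence) of the claim
    evidence_span: tuple[int, int]  # (start_sentence, end_sentence) of the evidence

def iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection-over-union of two inclusive sentence-index spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union if union else 0.0

def evaluate(pred: list[Pair], gold: list[Pair], iou_thresh: float = 0.5) -> dict[str, float]:
    """Greedily match predicted to gold pairs, then compute P, R, F1, and sentence_gap."""
    matched, used_gold = [], set()
    for p in pred:
        for i, g in enumerate(gold):
            if i in used_gold:
                continue
            if (iou(p.claim_span, g.claim_span) >= iou_thresh
                    and iou(p.evidence_span, g.evidence_span) >= iou_thresh):
                matched.append((p, g))
                used_gold.add(i)
                break
    tp = len(matched)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # sentence_gap: average distance between a matched claim and its evidence,
    # here taken between the start sentences of the two spans (an assumed convention).
    sentence_gap = (
        sum(abs(p.claim_span[0] - p.evidence_span[0]) for p, _ in matched) / tp if tp else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1, "sentence_gap": sentence_gap}

# Example usage:
# evaluate([Pair((12, 12), (40, 41))], [Pair((12, 12), (40, 41))])
# -> {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'sentence_gap': 28.0}
```

Note that under this sketch sentence_gap characterizes how far apart linked claims and evidence sit in the document rather than correctness in itself; the greedy matcher could be replaced by an optimal assignment without changing the metric definitions.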

Secondary metrics include execution time and recall as a function of document size, providing additional insight into the scalability and efficiency of system strategies.

3. Prompting Strategies and Experimental Design

Experiments within CLAIM-BENCH systematically compare three distinct prompting strategies, each designed to decompose the scientific reasoning task in a different manner (a schematic sketch follows the list):

  • Single-Pass Prompting: A model is provided the entire paper and tasked with extracting claims, evidence links, and conclusions in a single prompt. This approach is computationally efficient but exhibits declining recall as document length increases.
  • Three-Pass Prompting: The extraction process is divided into three sequenced prompts: (1) extraction of all claims, (2) provision of these claims as input to evidence extraction, (3) conclusion generation. This structure improves recall and facilitates focused reasoning for each sub-task.
  • One-by-One Prompting: Each claim is handled in isolation. The model receives one claim per prompt to extract supporting evidence and later to generate a conclusion. Although this method maximizes recall and accuracy in evidence linkage, it incurs significantly greater computational cost due to repeated document traversal.
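The three strategies can be summarized schematically as follows. Here `ask_llm` is a hypothetical helper standing in for whatever chat-completion call a given model exposes, and the prompt wording is illustrative rather than the benchmark's exact prompts:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to the model under test."""
    raise NotImplementedError

def single_pass(paper: str) -> str:
    # One prompt over the whole document: claims, evidence links, and conclusions at once.
    return ask_llm(f"Extract all claims, their evidence, and conclusions:\n\n{paper}")

def three_pass(paper: str) -> tuple[str, str, str]:
    # Pass 1: claims; Pass 2: evidence for those claims; Pass 3: conclusions.
    claims = ask_llm(f"List every claim made in this paper:\n\n{paper}")
    evidence = ask_llm(
        "For each claim below, find supporting or contradicting evidence in the paper.\n\n"
        f"Claims:\n{claims}\n\nPaper:\n{paper}"
    )
    conclusions = ask_llm(f"Given these claim-evidence links, state the conclusions:\n\n{evidence}")
    return claims, evidence, conclusions

def one_by_one(paper: str, claims: list[str]) -> list[tuple[str, str, str]]:
    # Each claim is processed in isolation, re-traversing the full document every time,
    # which is why this strategy is the most computationally expensive of the three.
    results = []
    for claim in claims:
        evidence = ask_llm(f"Find evidence for this claim.\n\nClaim: {claim}\n\nPaper:\n{paper}")
        conclusion = ask_llm(
            f"Given the claim and evidence, state a conclusion.\n\nClaim: {claim}\nEvidence: {evidence}"
        )
        results.append((claim, evidence, conclusion))
    return results
```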

These prompting protocols are designed to expose the strengths and weaknesses of both model architectures and extraction strategies under conditions that reflect real-world scientific reasoning.

4. Comparative Analysis of Model Performance

Six state-of-the-art LLMs are benchmarked under these prompting protocols, spanning closed-source (e.g., GPT-4, Claude, Gemini) and open-source (e.g., LLaMA-70B, Ministral-8B, Phi-3.5-MoE) families. Empirical findings from the benchmark include:

  • GPT-4-Turbo yields high precision (0.68) and recall (0.81) for claim extraction; Claude offers the highest recall (0.83) with moderate precision (0.61), often generating a wider set of linked claim–evidence pairs (increasing recall at potential cost to specificity).
  • Open-source LLaMA-70B can match closed-source models on recall but at the expense of lower average precision (due to more frequent false positives), while Ministral-8B is conservative, favoring precision over recall and potentially missing less salient evidence.
  • Performance drops significantly in Single-Pass prompting as document length (token count) increases, while iterative approaches (Three-Pass and especially One-by-One) preserve high recall and provide more consistent sentence_gap results.

Across all cases, closed-source LLMs outperform open-source counterparts in balanced F1, long-context handling, and reduction of error in evidence linkage.

5. Scientific and Technical Implications

CLAIM-BENCH sets a robust standard for AI systems intended to support scientific literature review, peer review automation, and advanced scientific question answering by integrating long-context reasoning with explicit claim–evidence validation capabilities. Key implications include:

  • Iterative Reasoning Gains: Three-Pass and One-by-One prompting strategies mitigate context-length-induced recall drops and improve fine-grained claim–evidence association, suggesting future systems should adopt similar staged or recursive processing where possible.
  • Long-Context Understanding: The ability to link claims and evidence over large sentence gaps or widely separated rhetorical zones is essential for scientific comprehension, emphasizing the need for LLM architectures optimized for long-range dependencies.
  • Precision–Recall Trade-off: While some models excel in recall or precision, few achieve both. Tuning extraction strategies, model size, and evaluation objectives is necessary for optimal scientific utility.
  • Computational Considerations: The enhanced accuracy of iterative prompting comes at substantial cost, clearly motivating further research into more efficient, context-aware decomposition and compression methods.

6. Limitations and Research Opportunities

CLAIM-BENCH highlights several outstanding challenges:

  • The data and benchmark, curated from non-math-intensive, English-language papers from 2024, may have limited generality for other disciplines or languages.
  • Annotation agreement remains modest (e.g., Cohen's $\kappa$ of 0.66 for claim spans and 0.30 for evidence), reflective of inherent subjectivity in mapping scientific argumentation.
  • Linking evidence across thousands of sentences, as sometimes quantified by extreme sentence_gap values, remains difficult and is a substantive area for advancement.
  • The trade-offs observed between high recall (with noisy links) and conservative precision (with missed evidence) highlight the need for improved calibration and possibly hierarchical or hybrid modeling solutions.

A plausible implication is that further development in model calibration, document decomposition, and task-aware long-context handling is necessary for truly scalable scientific validation using LLMs.

7. Outlook and Diagnostic Uses

By providing a rigorous task formulation, validated metrics, and an extensible annotation schema, CLAIM-BENCH offers both a critical diagnostic tool for model evaluation and a concrete pathway toward the development of LLMs better aligned with the needs of scientific reasoning. It enables measurement not only of extraction accuracy but also of diagnostic behaviors such as evidence-linking strategy, error localization, and computational efficiency. These dimensions enable systematic comparative research, principled model improvements, and direct benchmarking for downstream applications where trustworthy scientific understanding is required.

In summary, CLAIM-BENCH constitutes an advanced evaluation suite for claim–evidence reasoning in scientific documents, elucidating both the capabilities and limitations of leading LLMs and setting a principled research agenda for the next generation of automated scientific comprehension systems (2506.08235).
