
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research (2510.11985v1)

Published 13 Oct 2025 in cs.AI

Abstract: Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative LMs can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.

Summary

  • The paper introduces a benchmark, CGBench, that evaluates LLM scientific reasoning in clinical genetics for evidence curation tasks.
  • It builds on variant and gene curation tasks governed by rigorous clinical guidelines, with top results of 0.420 Precision@5 for evidence scoring and approximately 0.63 F1 for evidence verification.
  • The study reveals limitations in nuanced evidence extraction and underscores the need for advanced prompt engineering and domain adaptation.

CGBench: Benchmarking LLM Scientific Reasoning for Clinical Genetics Research

Introduction and Motivation

CGBench introduces a comprehensive benchmark for evaluating the scientific reasoning capabilities of LLMs in the context of clinical genetics. The benchmark is constructed from expert-curated annotations in the ClinGen Evidence Repository (ERepo), focusing on two central tasks in translational genomics: variant curation and gene curation. These tasks are critical for personalized medicine, requiring precise extraction, scoring, and interpretation of experimental evidence from scientific literature. Existing benchmarks in biomedical NLP are limited by their narrow scope and lack of real-world complexity; CGBench addresses this gap by formulating tasks that closely mirror the nuanced workflows of clinical genetics research.

Benchmark Design and Task Formulation

CGBench comprises three primary tasks:

  1. VCI Evidence Scoring (E-Score): Multiclass classification of evidence codes for variant-disease pairs in the ClinGen Variant Curation Interface (VCI), requiring models to assign codes based on the type and strength of evidence in a publication. The evidence code hierarchy (primary, secondary, tertiary) reflects increasing granularity and complexity.
  2. VCI Evidence Verification (E-Ver): Binary classification to determine whether a specific evidence code is "met" or "not met" for a given variant-disease-paper tuple.
  3. GCI Experimental Evidence Extraction: Structured extraction of experimental evidence relevant to gene-disease validity in the ClinGen Gene Curation Interface (GCI), including category assignment, explanation, score, and reasoning for score adjustment.

Each task is grounded in rigorous clinical guidelines (ACMG/AMP for the VCI, ClinGen Standard Operating Procedures for the GCI), and the benchmark covers a diverse set of diseases, variants, genes, and evidence codes. The dataset is filtered to include only open-access full-text publications, ensuring reproducibility and transparency.

Figure 1: Overview of CGBench tasks and results, illustrating the variant and gene curation workflows and comparative model performance across evidence scoring, verification, and extraction.
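
To make the task formulation concrete, below is a minimal sketch of what a CGBench-style task instance could look like. The field names and types are illustrative assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical task-instance schema (illustrative assumptions, not CGBench's real format).
from dataclasses import dataclass, field

@dataclass
class VCIEvidenceInstance:
    """One variant-disease-paper tuple used for E-Score and E-Ver."""
    variant_hgvs: str      # an HGVS-style variant description
    disease: str           # disease entity paired with the variant
    paper_full_text: str   # open-access full text of the cited publication
    evidence_code: str     # ACMG/AMP evidence code under consideration
    label_met: bool        # E-Ver target: was the code judged "met"?

@dataclass
class GCIExtractionInstance:
    """One gene-disease-paper tuple used for experimental evidence extraction."""
    gene: str
    disease: str
    paper_full_text: str
    # Gold curation: category, explanation, score, and score-adjustment reasoning.
    evidence_items: list = field(default_factory=list)
```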

Evaluation Methodology

CGBench evaluates eight LLMs, including both closed-source (GPT-4o, Claude Sonnet 3.7, o4-mini, o3-mini) and open-weight models (Qwen2.5 72B, Llama 4, Deepseek R1). Models are tested under zero-shot and in-context learning (ICL) regimes, with prompts designed to maximize task fidelity (chain-of-thought, role-playing, structured context). For evidence scoring and extraction, pass@5 sampling is used to account for stochasticity in model outputs.
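
To illustrate the prompting regimes and pass@5 sampling, the sketch below assembles a hypothetical role-playing, chain-of-thought prompt (optionally with in-context demonstrations) and draws five samples. The prompt wording and the `call_model` function are placeholders, not the paper's actual prompts or API.

```python
from __future__ import annotations

def build_prompt(paper_text: str, variant: str, disease: str,
                 demonstrations: list[tuple[str, str]] | None = None) -> str:
    """Assemble a role-playing, chain-of-thought prompt, optionally with k-shot demos."""
    parts = [
        "You are an expert ClinGen variant curator.",
        "Follow the ACMG/AMP guidelines and reason step by step before answering.",
    ]
    for demo_input, demo_answer in (demonstrations or []):
        parts += [f"Example:\n{demo_input}", f"Answer: {demo_answer}"]
    parts += [
        f"Publication:\n{paper_text}",
        f"Variant: {variant}\nDisease: {disease}",
        "Which ACMG/AMP evidence codes apply to this variant-disease pair?",
    ]
    return "\n\n".join(parts)

def sample_k(prompt: str, call_model, k: int = 5) -> list[str]:
    """Draw k independent samples for pass@5-style scoring (placeholder model call)."""
    return [call_model(prompt, temperature=1.0) for _ in range(k)]
```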

Metrics include the following (one possible computation of the set-based metrics is sketched after the list):

  • Precision@5 and Recall@5 for multiclass evidence code prediction.
  • True Positive Rate (TPR), True Negative Rate (TNR), and F1 for binary verification.
  • Category Matching (precision/recall), Structure Adherence, Normalized MAE, and ΔStrength for evidence extraction.
  • LM-as-a-judge agreement for explanation correspondence, calibrated against manual expert review.
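
The following is one plausible reading of how the pooled-sample metrics could be computed: the five sampled code sets are pooled into a single predicted set for Precision@5/Recall@5, and MAE on evidence scores is normalized by the maximum possible score. Both choices are assumptions for illustration, not the paper's exact definitions.

```python
def precision_recall_at_5(samples: list[set[str]], gold: set[str]) -> tuple[float, float]:
    """Set precision/recall of the pooled 5-sample prediction against expert codes."""
    predicted = set().union(*samples) if samples else set()
    if not predicted or not gold:
        return 0.0, 0.0
    hits = len(predicted & gold)
    return hits / len(predicted), hits / len(gold)

def normalized_mae(pred_scores: list[float], gold_scores: list[float], max_score: float) -> float:
    """Mean absolute error on evidence scores, normalized by the maximum possible score."""
    errors = [abs(p - g) / max_score for p, g in zip(pred_scores, gold_scores)]
    return sum(errors) / len(errors)
```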

Empirical Results

VCI Evidence Scoring

  • Larger and reasoning-focused models (o4-mini, Deepseek R1) outperform smaller and non-reasoning models on tertiary code prediction, with o4-mini achieving 0.420 Precision@5.
  • In-context learning improves performance, especially for fine-grained code prediction, with diminishing returns beyond 20-shot ICL.
  • Model predictions for primary and secondary codes are less sensitive to paper content, suggesting reliance on prior knowledge or variant-disease associations.

    Figure 2: Prediction correlation for VCI E-Score across zero-shot models, highlighting systematic differences in model outputs.

    Figure 3: Correctness correlation for VCI E-Score across zero-shot models, indicating high agreement on correct predictions but divergent errors.

VCI Evidence Verification

  • GPT-4o and Deepseek R1 achieve the highest F1 scores (~0.63), but overall performance is modest, with models tending to overpredict "met" status.
  • In-context learning does not consistently improve verification performance, and description-only ICL is less effective than full-text demonstrations.

GCI Experimental Evidence Extraction

  • o4-mini achieves the best overall extraction performance (0.186 normalized MAE, 0.445 ΔStrength), but all models over-extract evidence relative to the ground truth.
  • Structure adherence is high (>95%) except for Deepseek R1, which fails to consistently produce parseable outputs (a structure-check sketch follows this list).
  • Score adjustment reasoning is poorly captured, with less than 50% agreement on strength changes, indicating a gap in nuanced evidence interpretation.
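
As one way to picture the structure-adherence check mentioned above, the sketch below validates a model's extraction output against a required JSON shape. The field names are assumptions for illustration, not the benchmark's actual output format.

```python
import json

# Hypothetical required keys for one extracted evidence item (illustrative only).
REQUIRED_KEYS = {"category", "explanation", "score", "score_adjustment_reasoning"}

def adheres_to_structure(raw_output: str) -> bool:
    """Return True if the output parses as JSON and every evidence item carries
    the expected fields; a crude proxy for 'structure adherence'."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(items, list):
        return False
    return all(isinstance(item, dict) and REQUIRED_KEYS.issubset(item) for item in items)
```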

LM-as-a-Judge Explanation Evaluation

  • Task-aware LM judge prompts yield the highest agreement with manual expert review (0.744 F1).
  • In-context learning substantially improves explanation correspondence, with GPT-4o agreement rising from 0.486 (zero-shot) to 0.704 (30-shot).
  • Models tend to match explanation style rather than substantive reasoning, suggesting that prompt engineering can bias outputs toward human-like explanations.

Implementation Considerations

  • Prompt Engineering: Task-specific, structured prompts are essential for eliciting high-fidelity outputs. Chain-of-thought and role-playing strategies improve reasoning, but context length constraints limit the number of in-context examples.
  • Data Filtering: Restricting to open-access full-text papers ensures reproducibility but reduces coverage; only ~20% of cited papers are available for scraping (Figure 4).
  • Model Selection: Reasoning models excel at fine-grained evidence extraction but may overfit to explanation style. Non-reasoning models perform better on high-level classification.
  • Evaluation Robustness: Majority voting and positional bias correction are necessary for LM-as-a-judge reliability; manual calibration is required to validate automated explanation scoring (a sketch of these measures follows the figure below).

    Figure 4: Results of attempting to scrape full-text articles from the VCI and GCI, demonstrating limited open-access coverage.
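
A minimal sketch of the LM-as-a-judge reliability measures noted under Evaluation Robustness, assuming a generic `judge` callable that compares a model explanation with a human one: judgments are repeated with the order of the two explanations swapped to cancel positional bias, then aggregated by majority vote.

```python
def debiased_judgment(model_expl: str, human_expl: str, judge, votes: int = 3) -> bool:
    """Aggregate repeated, order-swapped judge calls into one agreement decision."""
    agreements = []
    for i in range(votes):
        # Alternate which explanation appears first to counteract positional bias.
        if i % 2 == 0:
            agreements.append(judge(first=model_expl, second=human_expl))
        else:
            agreements.append(judge(first=human_expl, second=model_expl))
    # Majority vote over the repeated judgments.
    return sum(agreements) > votes / 2
```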

Limitations and Future Directions

CGBench omits supplementary and multimodal data (tables, figures), which are often critical for variant interpretation. Multi-document reasoning is not addressed, and evidence prioritization remains an open challenge. The benchmark is specific to clinical genetics, and nomenclature inconsistencies may confound model evaluation. Future work should explore agentic systems for evidence synthesis, prompt-tuning for guideline adaptation, and multimodal integration for comprehensive literature interpretation.

Implications and Outlook

CGBench sets a new standard for evaluating LLMs in real-world scientific reasoning tasks, revealing substantial gaps in literature interpretation, especially for fine-grained, protocol-driven instructions. The benchmark highlights the need for advanced prompt engineering, domain adaptation, and explanation alignment to bridge the gap between automated and expert curation. As LLMs are increasingly integrated into biomedical research pipelines, rigorous benchmarks like CGBench will be essential for ensuring reliability, transparency, and clinical utility.

Conclusion

CGBench provides a robust framework for benchmarking LLM scientific reasoning in clinical genetics, exposing strengths and weaknesses across evidence extraction, scoring, and explanation. The results demonstrate that while current models show promise, significant improvements are needed for precise, guideline-driven interpretation of scientific literature. CGBench will facilitate future research in AI for translational biomedicine, driving the development of models and agents capable of expert-level evidence synthesis and interpretation.
