
LLMs Outperform Experts on Challenging Biology Benchmarks (2505.06108v3)

Published 9 May 2025 in cs.LG, cs.AI, and q-bio.QM

Abstract: This study systematically evaluates 27 frontier LLMs on eight biology benchmarks spanning molecular biology, genetics, cloning, virology, and biosecurity. Models from major AI developers released between November 2022 and April 2025 were assessed through ten independent runs per benchmark. The findings reveal dramatic improvements in biological capabilities. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test over the study period, with OpenAI's o3 now performing twice as well as expert virologists. Several models now match or exceed expert-level performance on other challenging benchmarks, including the biology subsets of GPQA and WMDP and LAB-Bench CloningScenarios. Contrary to expectations, chain-of-thought did not substantially improve performance over zero-shot evaluation, while extended reasoning features in o3-mini and Claude 3.7 Sonnet typically improved performance as predicted by inference scaling. Benchmarks such as PubMedQA and the MMLU and WMDP biology subsets exhibited performance plateaus well below 100%, suggesting benchmark saturation and errors in the underlying benchmark data. The analysis highlights the need for more sophisticated evaluation methodologies as AI systems continue to advance.

Summary

  • The paper demonstrates that frontier LLMs now match or exceed expert-level performance across diverse biological tasks using a zero-shot evaluation framework.
  • It reveals dramatic improvements—up to 4-fold increases on tasks like virology and molecular cloning—while also identifying benchmark saturation in some assessments.
  • The study underscores the efficiency of zero-shot prompting and calls for more advanced evaluation methodologies to accurately assess complex biological reasoning.

This paper (2505.06108) presents a systematic evaluation of 27 LLMs across eight diverse biology benchmarks, assessing the progress of frontier models in biological knowledge and reasoning from late 2022 to early 2025 and comparing their performance against human experts. The evaluation highlights dramatic improvements in LLM capabilities in biology, with several models now matching or exceeding expert-level performance on challenging tasks, while also pointing out limitations in current benchmarking methodologies.

Evaluation Setup

Evaluations were run with the Inspect AI framework, with each model-benchmark combination run ten times using zero-shot prompting as the primary method (a minimal harness sketch follows the benchmark list below). The selected benchmarks cover a range of biological domains and question formats:

  • PubMedQA (1909.06146): 500 multiple-choice (MC) questions testing reasoning over biomedical research abstracts, primarily focusing on quantitative results.
  • MMLU-Bio (2009.03300): 1300 MC questions across seven biology-related subdisciplines (anatomy, college biology, college medicine, high school biology, medical genetics, professional medicine, virology) assessing factual knowledge.
  • GPQA-Bio (2311.12022): 78 MC questions from the GPQA_main set tagged as 'Biology', designed to be "Google-proof" and test graduate-level molecular biology and genetics knowledge.
  • WMDP-Bio (2403.03218): 1273 MC questions from a predefined biology subset, assessing knowledge in potentially sensitive biosecurity domains like bioweapons, reverse genetics, and enhanced pathogens.
  • LAB-Bench LitQA2 (2407.10362): 199 MC questions requiring information retrieval and reasoning from recent scientific literature (2021-2024). Evaluated without tool use in this paper.
  • LAB-Bench CloningScenarios (2407.10362): 33 MC questions on complex molecular cloning workflows, designed for tool-assisted agents but evaluated without tools.
  • LAB-Bench ProtocolQA (2407.10362): 108 MC questions presenting biological protocols with errors and asking to identify corrections. Evaluated without tool use.
  • VCT-Text (2504.16137): 101 text-only questions from the Virology Capabilities Test, focusing on practical virology knowledge and experimental troubleshooting in a multiple-response (MR) format.
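The harness described above can be approximated with a short task definition, sketched below assuming a recent version of the Inspect AI Python API (`Task`, `multiple_choice`, `choice`, `eval`). The sample question, the `gpqa_bio` task name, and the model identifier are placeholders, and the real benchmarks are loaded from their published datasets rather than hard-coded as here.

```python
# Hypothetical sketch of the evaluation harness described above, using Inspect AI.
# Dataset contents, task name, and model ID are illustrative placeholders,
# not the paper's actual configuration.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample, MemoryDataset
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice

@task
def gpqa_bio() -> Task:
    # Placeholder item standing in for the 78 'Biology'-tagged GPQA questions.
    samples = [
        Sample(
            input="Which enzyme unwinds the DNA double helix during replication?",
            choices=["Helicase", "Ligase", "Primase", "Topoisomerase"],
            target="A",
        ),
    ]
    return Task(
        dataset=MemoryDataset(samples),
        solver=multiple_choice(),  # zero-shot multiple-choice solver
        scorer=choice(),           # scores the selected letter against the target
    )

if __name__ == "__main__":
    # Ten independent runs per model-benchmark combination, as in the paper.
    eval(gpqa_bio(), model="openai/gpt-4o", epochs=10)
```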

Human performance baselines were included for comparison, though the paper notes significant variability and limitations in the methodology used for some benchmarks, making direct comparisons challenging in certain cases (e.g., MMLU, PubMedQA).

Twenty-seven models from major AI organizations (Anthropic, Google DeepMind, Meta AI, Mistral, OpenAI, DeepSeek, xAI) were evaluated. These included proprietary models accessed via API and open-source models via TogetherAI. The selection aimed to cover the evolution of frontier models from late 2022 to early 2025.

Standardized prompt templates were used for zero-shot evaluations, instructing models on the required answer format. For a subset of models, five-shot and chain-of-thought (CoT) prompting strategies were also tested to assess their impact.
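The paper's exact templates are not reproduced here; the snippet below only illustrates the general pattern of format-constrained zero-shot and CoT prompts, and the wording of both templates is an assumption for illustration.

```python
# Illustrative zero-shot and chain-of-thought prompt templates in the spirit of
# the study's setup; the strings below are assumptions, not the paper's templates.
ZERO_SHOT_TEMPLATE = """Answer the following multiple-choice question.
Respond with only the letter of the correct option.

Question: {question}

Options:
{options}

Answer:"""

COT_TEMPLATE = """Answer the following multiple-choice question.
Think through the problem step by step, then give your final answer
on the last line in the form "ANSWER: <letter>".

Question: {question}

Options:
{options}
"""

def render(template: str, question: str, options: list[str]) -> str:
    """Fill a template with a question and lettered answer options."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return template.format(question=question, options=lettered)
```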

Key Findings

  1. Dramatic Performance Improvements: The paper revealed significant performance gains on several challenging benchmarks over the evaluation period. VCT-Text showed a 4-fold increase in top model performance, GPQA-Bio a 2-fold increase, ProtocolQA a 2.3-fold increase, and CloningScenarios a 2.6-fold increase. This indicates a substantial improvement in models' ability to handle complex biological reasoning tasks.
  2. Expert-Level Performance Achieved: Several frontier models now match or exceed documented expert performance. The top model (o3) scored 46.1% on VCT-Text, more than double the expert virologist baseline of 22.6%. Top models also surpassed expert baselines on GPQA-Bio and WMDP-Bio and matched the expert baseline on CloningScenarios. The paper notes that experts often had advantages like internet access and specialized software during their evaluations, suggesting the models' performance relative to experts might be even more impressive when evaluated under identical conditions.
  3. Benchmark Saturation: Performance on PubMedQA, MMLU-Bio, and WMDP-Bio appeared to plateau well below 100% accuracy (around 75-80% for PubMedQA, 85-90% for MMLU-Bio, 80-85% for WMDP-Bio). The paper interprets this not as a halt in AI progress but as benchmark saturation and potential errors or ambiguities in the benchmark data itself, suggesting these benchmarks are less effective at differentiating the most capable models.
  4. Limited Impact of Prompting Strategies: Zero-shot evaluation provided a reliable baseline, as five-shot and CoT prompting generally yielded minimal performance improvements across most benchmark-model combinations. CoT, in particular, did not consistently boost accuracy and significantly increased token usage (averaging 75x more output tokens). Exceptions included VCT-Text, where five-shot and CoT showed some benefits for certain models (Llama 3.1-405B, GPT-4o), possibly due to the benchmark's difficulty and MR format.
  5. Inference Scaling: Models with explicit reasoning parameters (o3-mini, Claude 3.7 Sonnet) generally showed improved performance on challenging benchmarks (GPQA-Bio, CloningScenarios, ProtocolQA) with increased reasoning effort (indicated by higher mean output tokens per question); a sketch of how these parameters are exposed follows this list. However, an unexpected decrease in performance was observed for Claude 3.7 Sonnet on VCT-Text with increased reasoning, suggesting potential over-analysis or overcomplication on some tasks, especially those with multiple correct answers.
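The reasoning-effort comparison in finding 5 maps onto parameters the two providers expose directly: a discrete `reasoning_effort` setting for o3-mini and an extended-thinking token budget for Claude 3.7 Sonnet. The sketch below shows one plausible way to vary them; the prompt, model IDs, and token budgets are assumptions, and the paper's exact settings may differ.

```python
# Hypothetical sketch of varying "reasoning effort" for the two model families
# discussed above; the prompt, model IDs, and budgets are illustrative only.
from openai import OpenAI
from anthropic import Anthropic

QUESTION = "Which restriction enzyme leaves blunt ends?"  # placeholder question

# o3-mini exposes a discrete reasoning_effort setting.
openai_client = OpenAI()
for effort in ("low", "medium", "high"):
    response = openai_client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(effort, response.usage.completion_tokens)

# Claude 3.7 Sonnet exposes an extended-thinking token budget.
anthropic_client = Anthropic()
for budget in (1024, 4096, 16384):
    message = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=budget + 1024,  # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(budget, message.usage.output_tokens)
```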

Practical Implications for Implementation

The findings have several practical implications for developers and practitioners building AI applications in biology:

  • Model Selection: The paper provides empirical data on which frontier models perform best across different types of biological tasks. For applications requiring graduate-level molecular biology/genetics or practical virology troubleshooting, models like o3, Claude 3.7 Sonnet, and Gemini 2.5 Pro demonstrate expert-level capabilities. For tasks involving analysis of biomedical literature, models like Llama 3.1-405B and o3-mini show strong performance.
  • Prompting Strategy: The finding that zero-shot performance is often comparable to few-shot or CoT performance on many benchmarks is highly relevant. Zero-shot prompting is simpler and far more token-efficient (roughly 75× fewer output tokens on average than CoT), which translates into lower computational costs and faster inference. For applications where cost and latency are critical, zero-shot should be the default starting point, with few-shot prompting or increased reasoning parameters reserved for demonstrably harder sub-tasks (such as the complex troubleshooting in VCT); a back-of-the-envelope cost comparison follows this list.
  • Evaluation Design: The paper strongly advocates for more sophisticated evaluation methodologies. Relying solely on simple MCQs or benchmarks showing saturation is insufficient for assessing the capabilities of frontier models. Practitioners need to consider using or developing:
    • Benchmarks that assess agentic capabilities (tool use, literature retrieval, data analysis) rather than just factual recall, such as components of LAB-Bench, BioPlanner (2310.10632), or BixBench (2503.00096).
    • Benchmarks with higher difficulty ceilings and formats less susceptible to guessing, like multiple-response (VCT) or exact match (FrontierMath (2411.04872)).
    • Benchmarks with more rigorous and transparent human baselines to provide meaningful context for model performance.
    • Potentially, predictive biology benchmarks based on experimental outcomes, which could test capabilities beyond current human expert limits and provide unambiguous ground truth, mitigating the issue of expert-model disagreement on difficult questions.
  • Interpreting Benchmarks: Understand that high scores on saturated benchmarks (like MMLU-Bio or PubMedQA) do not necessarily mean a model is state-of-the-art for complex tasks. Focus on performance on challenging, unsaturated benchmarks (like GPQA-Bio, CloningScenarios, ProtocolQA, VCT-Text) to gauge true frontier capabilities in biological reasoning and problem-solving.
  • Computational Requirements: While the paper doesn't detail specific hardware, the analysis of training compute and inference scaling implies that deploying frontier models for expert-level biology tasks requires significant computational resources, both for training and inference (especially if using CoT or high reasoning settings). Cost-performance trade-offs should be carefully considered, and the efficiency of zero-shot prompting is a key factor here.
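As a rough illustration of the token-efficiency point above, the snippet below compares zero-shot and CoT output costs using the paper's ~75× output-token multiplier. The per-token price and the zero-shot output length are placeholder assumptions, not figures from the paper; only the multiplier and the WMDP-Bio question count come from the study.

```python
# Back-of-the-envelope cost comparison for zero-shot vs. CoT prompting.
# The price and zero-shot output length are assumptions for illustration;
# the ~75x output-token multiplier and question count come from the paper.
PRICE_PER_M_OUTPUT_TOKENS = 10.00   # USD, hypothetical frontier-model pricing
ZERO_SHOT_OUTPUT_TOKENS = 20        # e.g. an answer letter plus formatting
COT_MULTIPLIER = 75                 # average output-token increase for CoT
N_QUESTIONS = 1273                  # size of the WMDP-Bio subset, for scale

def output_cost(tokens_per_question: int, n_questions: int) -> float:
    """Output-token cost in USD for one pass over a benchmark."""
    return tokens_per_question * n_questions * PRICE_PER_M_OUTPUT_TOKENS / 1_000_000

zero_shot = output_cost(ZERO_SHOT_OUTPUT_TOKENS, N_QUESTIONS)
cot = output_cost(ZERO_SHOT_OUTPUT_TOKENS * COT_MULTIPLIER, N_QUESTIONS)
print(f"zero-shot: ${zero_shot:.2f}  CoT: ${cot:.2f}  ratio: {cot / zero_shot:.0f}x")
# Under these assumptions: zero-shot ~ $0.25, CoT ~ $19.10 per full pass.
```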

In summary, LLMs have made remarkable strides in understanding and reasoning about complex biological topics. Implementing these models for real-world biological applications is increasingly feasible, but practitioners must be mindful of selecting appropriate models, optimizing prompting strategies based on empirical performance and efficiency, and recognizing the current limitations of standard benchmarks for fully evaluating cutting-edge AI capabilities in this domain. Future implementations should likely move towards integrating LLMs with biological tools and databases to leverage their knowledge in more agentic workflows.