Translation of LLM benchmark performance to novice physical laboratory performance

Determine whether strong performance by frontier large language models on biological knowledge and protocol benchmarks (e.g., the Virology Capabilities Test and LAB-Bench) translates into improved performance by novice humans executing hands-on procedures in physical biology laboratories, including multi-step workflows that model viral reverse genetics.

Background

The paper motivates its randomized controlled trial by noting that, although LLMs perform strongly on biological benchmarks, their real-world utility for novices in physical laboratory settings is uncertain. Benchmarks typically assess factual knowledge and short-horizon tasks in digital environments; they do not capture the tacit knowledge or feasibility constraints involved in executing procedures in wet labs.

This open problem underpins the study’s design: evaluating whether access to frontier LLMs materially improves novice execution of tasks modeling a viral reverse genetics workflow. The authors’ findings suggest modest uplift on some tasks but no significant improvement in end-to-end workflow completion within the study timeframe, indicating that the broader question remains unresolved and warrants further investigation.

References

“Yet, whether this translates to improved human performance in the physical laboratory remains unclear.”

Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology  (2602.16703 - Hong et al., 18 Feb 2026) in Abstract