Translation of AI R&D benchmark performance to real‑world productivity

Ascertain how directly performance on AI research‑and‑development capability benchmarks (such as SWE‑Bench, MLE‑Bench, RE‑Bench, and PaperBench) translates into productivity boosts in real‑world AI R&D workflows, accounting for integration frictions.

Background

The authors note rapid improvements on AI R&D‑relevant benchmarks and growing adoption of coding and research assistants, but emphasize that benchmark scores may not reflect actual workplace gains because of integration frictions. This gap makes it hard to use benchmark progress as a leading indicator of automation impacts.

Resolving this uncertainty would calibrate expectations for the effect of AI R&D automation (AIRDA) on output, inform investment in tools and processes, and help interpret evaluation results in operational contexts.

References

Rapidly improving benchmark results indicate at least some progress \citep{jimenez_swebench_2023,chan_mlebench_2025,starace_paperbench_2025,wijk_rebench_2025,anthropic_system_2026},\footnote{At the time of writing, the most advanced model had an 80\% success rate on tasks that take human expert coders 1 hour and 10 minutes to complete \citep{metr_measuring_2025}.} but it is unclear how directly such results translate to productivity boosts given real-world integration frictions \citep{becker_measuring_2025,dellacqua_navigating_2023,brynjolfsson_generative_2023,noy_experimental_2023,narayanan_ai_2025,becker_we_2026}.

Measuring AI R&D Automation (2603.03992 - Chan et al., 4 Mar 2026) in Section 1 (Introduction)