Time-Budget Performance Scaling of LLMs on Simple Long-Horizon Tasks
Determine whether large language models exhibit disproportionately small performance gains from increased time budgets relative to humans on simple, long-horizon tasks, as opposed to the complex AI R&D tasks where METR has reported this effect.
References
METR, an AI safety organization focused on evaluating LLMs, found that LLMs gain far less in performance from increased time budgets than humans do. METR's investigation focused on very complex tasks (specifically, AI R&D), but it is not clear whether this trend holds for simpler tasks.
— Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
(2502.15840 - Backlund et al., 20 Feb 2025) in Section 1 (Introduction)