Does the time-budget performance trend extend to simpler tasks?
Determine whether the reduced performance gains from increased time budgets observed for complex AI R&D tasks also hold for simpler, long-running tasks, in order to isolate and assess long-term coherence capabilities independently of task complexity.
References
METR's investigation focused on very complex tasks (specifically, AI R&D), but it is not clear if this trend holds for more simple tasks.
— Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
(Backlund et al., 20 Feb 2025) in Introduction