Time-Budget Performance Scaling of LLMs on Simple Long-Horizon Tasks
Determine whether large language models exhibit disproportionately small performance gains from increased time budgets relative to humans on simple, long-horizon tasks, as opposed to the complex AI R&D tasks where METR has reported this effect.
References
METR, an AI safety organization focused on evaluating LLMs, found that LLMs gain far less in performance from increased time budgets than humans do. METR's investigation focused on very complex tasks (specifically, AI R&D), but it is not clear whether this trend holds for simpler tasks.
— Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
(2502.15840 - Backlund et al., 20 Feb 2025) in Section 1 (Introduction)