Measuring AI Ability to Complete Long Tasks (2503.14499v2)

Published 18 Mar 2025 in cs.AI and cs.LG

Abstract: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.

Summary

  • The paper introduces the 50%-task-completion time horizon metric to quantify AI’s ability to complete long tasks, revealing a consistent ~7-month doubling time in this horizon since 2019.
  • It employs a four-part methodology, including a 170-task suite and human baselining, to robustly evaluate 13 frontier AI models using logistic regression.
  • Key insights highlight a significant gap between moderate and high reliability, with advancements in reasoning and tool use driving exponential improvements.

The paper "Measuring AI Ability to Complete Long Tasks" (2503.14499) proposes a metric, the 50%-task-completion time horizon, designed to quantify the capability of AI systems to perform complex, time-consuming tasks by relating their success rate to the time typically required by skilled humans for the same tasks. This metric specifically identifies the task duration (measured by typical human completion time) for which a given AI model achieves a 50% success rate. The research evaluates numerous frontier AI models using a diverse task suite and establishes a historical trend for this metric.

Methodology

The paper's methodology involves four key components: task suite development, human performance baselining, AI agent evaluation, and metric calculation.

Task Suite Construction:

A benchmark comprising 170 tasks was assembled to cover a wide spectrum of difficulties, ranging from tasks taking seconds to those requiring approximately 30 hours for humans. This suite integrated tasks from existing benchmarks with a novel set:

  • HCAST: 97 tasks involving software development, machine learning engineering, cybersecurity challenges, and complex reasoning problems, with human completion times ranging from 1 minute to 30 hours.
  • RE-Bench: 7 challenging ML research engineering tasks, each estimated to take around 8 hours for a human expert.
  • Software Atomic Actions (SWAA): A newly introduced set of 66 short, single-step software tasks (e.g., file selection, code completion, debugging simple errors), typically taking less than 1 minute for humans. SWAA was designed to provide finer granularity at the lower end of the difficulty spectrum and enable evaluation of less capable or earlier models.

Tasks were selected to be automatically scorable and were often structured into "families" (e.g., variations of crossword puzzles) to test generalization within related problem types. The tasks generally required less contextual information than typical large-scale real-world projects.

Human Baselining:

To establish the "length" or difficulty of each task, experienced professionals (with an average of five years of experience in software engineering, ML research, or cybersecurity) were timed while completing the tasks within the same evaluation platform (Vivaria) used for the AI agents. Over 800 human attempts were recorded, totaling 2,529 hours, yielding successful baselines for most tasks. The task length ($t_{task}$) was defined as the geometric mean of the completion times across successful human attempts for that task. For the 21 tasks with no successful human baseline, manual estimates were used. Incentives were provided to encourage both success and speed among human participants.
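
As a concrete illustration of the baselining arithmetic, the sketch below (Python, not from the paper) computes a task's length as the geometric mean of its successful human completion times; the example times are hypothetical.

```python
import math

def task_length_minutes(successful_times_minutes):
    """Geometric mean of successful human completion times for one task."""
    logs = [math.log(t) for t in successful_times_minutes]
    return math.exp(sum(logs) / len(logs))

# Hypothetical baselines for one task: three successful attempts (in minutes).
print(task_length_minutes([42.0, 65.0, 51.0]))  # ~52 minutes
```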

AI Agent Evaluation:

Thirteen prominent AI models released between 2019 (GPT-2) and early 2025 (including Claude 3.7 Sonnet and a model referred to as "o1") were evaluated. The models were integrated into agentic frameworks, primarily using a generic ReAct-based scaffold (modular-public) equipped with tools like Python and Bash interpreters, a file editor, and web search capabilities. Critically, no task-specific prompting or fine-tuning was employed beyond basic instructions for the SWAA tasks. Each AI agent attempted each task multiple times (approximately 8 runs), and success was determined based on predefined, automatically verifiable scoring criteria, often calibrated to represent human-level performance.
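
For intuition only, here is a minimal sketch of a ReAct-style agent loop of the kind the generic scaffold implements (Python; `query_model` is a hypothetical stand-in for the underlying LLM API, and the tool set is reduced to a single Bash tool rather than the full toolkit described above):

```python
import subprocess

def query_model(transcript: str) -> str:
    """Hypothetical LLM call; returns either 'bash: <cmd>' or 'submit: <answer>'."""
    raise NotImplementedError

def run_bash(cmd: str) -> str:
    # Execute a shell command and capture its output as the agent's observation.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def react_agent(task_instructions: str, max_steps: int = 30) -> str:
    transcript = f"Task:\n{task_instructions}\n"
    for _ in range(max_steps):
        action = query_model(transcript)
        if action.startswith("submit:"):
            return action[len("submit:"):].strip()  # final answer, scored automatically
        if action.startswith("bash:"):
            observation = run_bash(action[len("bash:"):].strip())
            transcript += f"\nAction: {action}\nObservation: {observation}\n"
    return ""  # no answer within the step budget
```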

Metric Calculation:

The core metric, the 50%-task-completion time horizon ($h_{model}$), was derived using a logistic regression model inspired by Item Response Theory (IRT). For each AI model, the probability of successfully completing a task was modeled as a function of the task's difficulty (log human time, $\log t_{task}$) and two model-specific parameters: the time horizon ($h_{model}$) and a slope parameter ($\beta_{model}$). The relationship is given by:

$$p_{success}(\text{model}, \text{task}) = \sigma\big((\log h_{model} - \log t_{task}) \times \beta_{model}\big)$$

where $\sigma$ is the sigmoid function. The parameter $h_{model}$ represents the task length $t_{task}$ at which the model is predicted to have a 50% success probability ($p_{success} = 0.5$). The model fitting process weighted tasks by the inverse square root of their "family" size to mitigate the influence of large groups of similar tasks.
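
A minimal sketch of this horizon fit (Python; the data below are synthetic, and the weighted maximum-likelihood parameterization is re-derived from the formula above rather than taken from the paper's code):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic per-run data for one model: log task length (minutes), success, family size.
log_t = np.log(np.array([0.5, 2.0, 8.0, 30.0, 120.0, 480.0]))
success = np.array([1, 1, 1, 0, 1, 0])
family_size = np.array([4, 4, 2, 2, 1, 1])
weights = 1.0 / np.sqrt(family_size)  # down-weight large task families

def neg_log_likelihood(params):
    log_h, beta = params
    p = 1.0 / (1.0 + np.exp(-beta * (log_h - log_t)))  # sigmoid from the formula above
    p = np.clip(p, 1e-9, 1 - 1e-9)
    ll = success * np.log(p) + (1 - success) * np.log(1 - p)
    return -np.sum(weights * ll)

fit = minimize(neg_log_likelihood, x0=[np.log(60.0), 1.0], method="Nelder-Mead")
log_h, beta = fit.x
print(f"50% horizon ~ {np.exp(log_h):.1f} minutes, slope beta ~ {beta:.2f}")
```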

Results

The paper yielded several significant quantitative results regarding AI capabilities on long-horizon tasks.

Current Capabilities:

As of early 2025, frontier models like Claude 3.7 Sonnet and "o1" demonstrate a 50%-task-completion time horizon of approximately 50-60 minutes on the HCAST+RE-Bench+SWAA task suite. In other words, within the benchmark's constraints, these models succeed about half the time on tasks that typically take a skilled human about an hour to complete.

Exponential Growth Trend:

A striking finding is the consistent exponential growth of the 50% time horizon since 2019. The analysis indicates a doubling time of approximately 7 months (212 days, 95% CI: 171-249 days). This trend appears robust across various methodological choices and task subsets (Figure 1 and the multiverse-analysis boxplot in the paper). There is some preliminary evidence of a possible acceleration in 2024-2025, though it may be within the noise.
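
A sketch of how a doubling time can be read off such a trend (Python; the release dates and horizons below are illustrative placeholders, not the paper's data):

```python
import numpy as np

# Illustrative (not the paper's) model release dates (fractional years) and 50% horizons (minutes).
years = np.array([2019.1, 2020.5, 2022.2, 2023.2, 2024.5, 2025.1])
horizon_min = np.array([0.03, 0.15, 1.0, 5.0, 25.0, 55.0])

# Ordinary least squares on log2(horizon) vs. time gives the exponential trend;
# the slope is doublings per year, so its reciprocal is the doubling time.
slope, intercept = np.polyfit(years, np.log2(horizon_min), 1)
doubling_time_days = 365.25 / slope
print(f"doubling time ~ {doubling_time_days:.0f} days")
```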

Reliability Gap:

The paper also computed an 80%-task-completion time horizon to assess higher levels of reliability. While this metric also exhibits an exponential trend with a similar doubling time (~213 days), the absolute horizon is significantly shorter. For instance, Claude 3.7 Sonnet's 80% time horizon is around 15 minutes, roughly one-fifth of its 50% horizon. This highlights a substantial gap between achieving moderate success and achieving high reliability on complex tasks: capability ceilings are rising, but ensuring consistent performance remains a major challenge.
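
Under the logistic model above, the 80% horizon follows directly from the 50% horizon and the slope; a small worked example (Python; the 55-minute horizon and slope value here are assumptions for illustration, not the paper's fitted parameters):

```python
import math

def horizon_at(p: float, h50_minutes: float, beta: float) -> float:
    """Task length at which the fitted logistic predicts success probability p."""
    logit = math.log(p / (1 - p))
    return h50_minutes * math.exp(-logit / beta)

# Assumed values for illustration: 55-minute 50% horizon and slope beta = 1.0.
print(horizon_at(0.8, 55.0, 1.0))  # ~13.8 minutes, about a quarter of the 50% horizon
```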

Performance Correlation:

Across models and tasks, AI success rates show a strong negative correlation with the logarithm of human completion time ($R^2 \approx 0.83$), reinforcing the choice of log human time as a suitable proxy for task difficulty in the IRT model.

Drivers of Improvement

Qualitative analysis and comparative failure analysis (e.g., between GPT-4 and "o1") suggest that the observed improvements in time horizon are primarily driven by advancements in several areas:

  • Enhanced Logical Reasoning and Code Generation: Models are becoming better at complex reasoning steps and generating correct, functional code.
  • Improved Tool Use: More effective utilization of provided tools (like code execution and file system interaction) is crucial for complex task completion.
  • Increased Reliability and Adaptability: Newer models exhibit better self-awareness, including the ability to recognize failures, adapt strategies, and avoid repeating unproductive action sequences. This was quantitatively supported by analysis showing newer models were less prone to certain types of repetitive failures.

Limitations and External Validity

The authors explicitly discuss several limitations and concerns regarding the generalizability of their findings to real-world AI application scenarios.

Task Representativeness:

The benchmark tasks, while diverse, differ systematically from many real-world tasks. They are automatically scorable, typically require less extensive context than large software projects, involve no interaction with human collaborators or other agents, and operate in static environments. While analysis on "messier" subsets within the benchmark showed similar trends, absolute performance was lower, suggesting that performance on truly complex, unstructured real-world problems might be lower than indicated by the benchmark results. Comparison with SWE-Bench Verified showed a similar exponential trend but a faster doubling time, potentially due to differences in human time estimation methods.

Human Baseline Fidelity:

The human baseline times, while carefully collected, are subject to noise due to small sample sizes for some tasks. Defining task length based only on successful human attempts might bias the difficulty measure. Furthermore, factors like participant skill variability, the specific incentives used, and the relatively low-context nature of the tasks for the human baseliners (compared to a project maintainer) could affect the accuracy and representativeness of the human time benchmarks. Internal experiments suggested AI performance aligned better with times from contractors possessing low project-specific context rather than high-context project maintainers.

AI Evaluation Constraints:

The paper used generic agent scaffolds with limited model-specific prompt engineering ("elicitation"). More tailored approaches could potentially unlock higher performance. Additionally, the evaluations employed limited inference-time compute (e.g., not heavily relying on techniques like best-of-k sampling or tree-of-thought exploration), meaning the measured horizons might represent a lower bound on the models' potential capabilities.

Forecasting Uncertainty:

While the observed 7-month doubling trend is consistent over several years, its extrapolation into the future is inherently uncertain. Factors such as breakthroughs in agent architectures, dedicated training for agency, shifts in compute scaling laws, fundamental algorithmic progress limits, and the potential recursive impact of AI assisting in AI R&D could all significantly alter the trajectory. The paper presents a naive extrapolation suggesting AI could automate tasks currently taking humans a month (approx. 167 work hours) between late 2028 and early 2031, but explicitly flags this as dependent on the trend holding and generalizing to real-world software tasks.
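
The extrapolation itself is simple arithmetic; a sketch (Python; the starting horizon and date are taken from the summary above, the doubling time from the fitted trend, and the output is a naive point estimate rather than the paper's full forecast):

```python
import math

current_horizon_hours = 1.0         # ~50-60 minute 50% horizon as of early 2025
target_horizon_hours = 167.0        # roughly one month of human work
doubling_time_years = 212 / 365.25  # fitted doubling time of about 7 months

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
years_needed = doublings_needed * doubling_time_years
print(f"{doublings_needed:.1f} doublings ~ {years_needed:.1f} years from early 2025")
# ~7.4 doublings ~ 4.3 years, i.e. around 2029 as a naive point estimate;
# the paper's late-2028 to early-2031 range reflects uncertainty in the trend.
```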

Conclusion

The paper introduces the 50%-task-completion time horizon as a valuable metric for tracking AI progress on complex tasks in human-relatable terms. The finding of a consistent ~7-month doubling time for this metric since 2019 provides a strong quantitative signal of rapid capability advancement in frontier models, primarily driven by improvements in reasoning, tool use, and reliability. However, significant challenges remain, particularly in achieving high reliability (the 80% horizon lags considerably). Moreover, the external validity of these benchmark results for predicting performance on complex, high-context, real-world tasks requires cautious interpretation, given the methodological limitations and the differences between the benchmark and deployment environments.
