Measuring AI Ability to Complete Long Tasks

This presentation explores a groundbreaking metric for evaluating AI capability: the 50%-task-completion time horizon. By measuring the length of tasks AI can reliably complete, expressed in terms of how long they take human experts, the researchers find that AI's time horizon has been doubling roughly every 7 months over the past 6 years. Through a diverse suite of 170 real-world tasks, the study demonstrates exponential progress in AI autonomy, showing frontier models now handle tasks taking humans around 50 minutes with 50% reliability, with profound implications for workforce automation and AI safety governance.
Script
Imagine asking an AI to complete a task that takes you an hour. How confident are you that it will succeed? This paper introduces a revolutionary way to measure AI capability by asking a deceptively simple question: how long a task, measured by the time it takes human experts, can an AI reliably complete?
Building on that question, the authors developed a precise metric called the time horizon: the task length at which AI succeeds 50% of the time, with length measured by how long skilled humans take to do the same work.
Let's examine how they actually measured this capability.
To ground their findings, the researchers assembled 170 carefully selected tasks drawn from existing benchmarks plus a newly created set of short software tasks. They tested both human experts and AI agents on identical work, then applied statistical modeling to find each model's 50% success threshold.
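To make that fitting step concrete, here is a minimal sketch in Python. It uses a hypothetical table of per-task outcomes for a single model (the human baseline time in minutes and a 0/1 success flag); the paper's actual estimation is more careful, so treat this only as an illustration of solving a logistic fit for the 50% point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results for one model (not the paper's data):
# human baseline time in minutes, and whether the agent succeeded.
human_minutes = np.array([2, 5, 8, 15, 30, 45, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Model the success probability as a logistic function of log2(task length).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# The 50% time horizon is where the fitted log-odds cross zero:
# w * log2(t) + b = 0  =>  t = 2 ** (-b / w)
w = clf.coef_[0, 0]
b = clf.intercept_[0]
horizon_minutes = 2 ** (-b / w)
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```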
The central finding is striking: over the last 6 years, the length of tasks AI can reliably complete has doubled every 7 months. This exponential trend reveals consistent, predictable progress in AI autonomy, with frontier models like Claude 3.7 Sonnet now achieving 50% success on tasks taking humans around 50 minutes.
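To show how a doubling time like 7 months falls out of such measurements, here is a short sketch using illustrative placeholder values (not the paper's estimates): exponential growth is a straight line in log2 space, so the slope of that line gives doublings per year and its inverse gives the doubling period.

```python
import numpy as np

# Illustrative (release year, 50% horizon in minutes) pairs -- placeholder
# values for demonstration only, not the paper's measurements.
years = np.array([2019.5, 2020.5, 2022.0, 2023.2, 2024.1, 2025.1])
horizons = np.array([0.1, 0.7, 4.0, 12.0, 30.0, 50.0])

# Fit log2(horizon) = slope * year + intercept.
slope, intercept = np.polyfit(years, np.log2(horizons), 1)

doubling_time_months = 12 / slope
print(f"Doublings per year: {slope:.2f}")
print(f"Implied doubling time: {doubling_time_months:.1f} months")
```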
Drilling into the results, the authors identify three key improvements driving AI progress: better reasoning, more sophisticated tool use, and enhanced reliability. However, success rates decline sharply as task length increases, and real-world messiness still poses significant challenges compared to clean benchmark tasks.
The authors didn't stop at their core suite. They validated their metric against independent benchmarks and deliberately messy tasks that mirror real-world chaos, finding that while absolute performance drops on messier work, the fundamental trend of exponential improvement holds steady.
The researchers acknowledge important boundaries to their work. Their task suite, while diverse, cannot capture every facet of AI capability, and they observed unexplained performance gaps between task categories that suggest we're still learning how to measure machine intelligence comprehensively.
Extrapolating this 7-month doubling trend forward paints a striking picture: within 5 years, AI systems could reliably handle tasks currently taking humans a full month. This projection provides essential scaffolding for policymakers and organizations preparing for increasingly autonomous AI in economically critical domains.
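The arithmetic behind that projection is easy to sketch. Assuming a current horizon of roughly 50 minutes, a 7-month doubling time, and a working month of about 167 hours (treat all three as rounded assumptions for illustration), the implied calendar time to a month-long horizon is:

```python
import math

# Assumed inputs (rounded, for illustration).
current_horizon_min = 50          # roughly the frontier horizon today
doubling_time_months = 7          # observed doubling period
work_month_min = 167 * 60         # ~167 working hours in a month

# Number of doublings needed, then the implied calendar time.
doublings = math.log2(work_month_min / current_horizon_min)
months_needed = doublings * doubling_time_months

print(f"Doublings needed: {doublings:.1f}")
print(f"Projected time: {months_needed:.0f} months (~{months_needed / 12:.1f} years)")
```

With these assumptions the answer comes out to roughly 54 months, which is where the "within 5 years" framing comes from.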
By giving us a precise lens to track AI's expanding time horizon, this research transforms how we understand and anticipate machine capability. Visit EmergentMind.com to explore the full methodology and implications.