Remote Labor Index: Measuring AI Automation of Remote Work
This presentation introduces the Remote Labor Index (RLI), a groundbreaking benchmark that measures AI's ability to automate real remote work by testing AI agents on 240 authentic freelance projects spanning 23 professional categories. Unlike previous benchmarks that rely on simplified tasks, RLI uses actual work projects with deliverables valued at over $140,000 and representing more than 6,000 hours of human professional labor. The results reveal a stark reality: current frontier AI agents automate at most 2.5% of economically valuable remote work, exposing a dramatic gap between AI performance on research benchmarks and the ability to perform complex, real-world professional tasks.

Script
Most AI benchmarks test narrow skills on simplified tasks. But what happens when you measure AI against the messy, complex reality of actual professional work? The Remote Labor Index answers that question with 240 real freelance projects that human professionals completed for paying clients.
The researchers built this benchmark by sourcing actual freelance projects from online labor platforms. These aren't toy problems. They're game development, architectural design, data analysis, marketing campaigns. Each project includes the original brief, input files, and the gold-standard deliverable that a human professional created for a real client.
Evaluating whether an AI can do real work turns out to be as complex as the work itself.
Because these projects produce complex, varied deliverables, automated evaluation is simply not feasible. The researchers developed a rigorous manual evaluation process where trained evaluators compare AI outputs to human deliverables, judging from the perspective of a reasonable client who hired the freelancer. Would you accept this work? Would you pay for it? The process achieves over 94% agreement between evaluators, ensuring reliability.
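To make that agreement figure concrete, here is a minimal sketch of simple percent agreement between two evaluators. This is illustrative only: the function name, the boolean accept/reject labeling, and the example counts are assumptions, and the paper's actual agreement protocol may be computed differently.

```python
def percent_agreement(verdicts_a, verdicts_b):
    """Share of projects on which two evaluators reached the same verdict.

    verdicts_a, verdicts_b: parallel lists of accept/reject decisions
    (True = "a reasonable client would accept this deliverable").
    """
    assert len(verdicts_a) == len(verdicts_b)
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)

# Hypothetical example: evaluators disagree on 1 of 18 projects
a = [True] * 6 + [False] * 12
b = [True] * 5 + [False] * 13
print(f"{percent_agreement(a, b):.1%}")  # 94.4%
```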
Here's the headline finding: the best AI agent tested automated just 2.5% of these real-world projects. Frontier models that dominate research benchmarks fail on more than 97 of every 100 authentic professional tasks. The gap between benchmark performance and real economic value is enormous.
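As a sanity check on that number, a minimal sketch of the arithmetic, assuming the automation rate is simply the share of the 240 projects with an accepted AI deliverable (the accepted count here is hypothetical, chosen to match the reported ~2.5% rate):

```python
total_projects = 240
accepted = 6  # hypothetical count consistent with the reported rate

automation_rate = accepted / total_projects
print(f"{automation_rate:.1%}")  # 2.5%
```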
AI agents succeed only on tasks that fit their current strengths: generating images, writing simple code, editing audio. But most remote work demands robust multimodal reasoning, attention to detail, and the ability to synthesize multiple requirements into a coherent whole. Current systems lack these capabilities, leading to technical failures, quality issues, and deliverables that miss the mark.
The economic impact metric, called autoflation, measures how much total labor cost falls when AI completes projects acceptably and more cheaply than human professionals. The result? Less than 2% cost reduction. Even when AI agents occasionally succeed, the aggregate economic impact is negligible. The promise of widespread labor automation remains far from reality.
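Here is a minimal sketch of one way an autoflation-style figure could be computed, assuming it is the fractional reduction in total portfolio cost when accepted AI deliverables replace human ones at a lower price. The field names, fallback accounting, and example portfolio are all assumptions for illustration, not the paper's definition.

```python
def autoflation(projects):
    """Fractional cost reduction across a portfolio of projects.

    Each project dict carries (hypothetical schema):
      human_price: what the client paid the human professional
      ai_cost:     cost of producing the AI deliverable
      ai_accepted: whether a reasonable client would accept the AI output
    Failed or more expensive AI attempts fall back to the human price.
    """
    human_total = sum(p["human_price"] for p in projects)
    realized = sum(
        p["ai_cost"] if p["ai_accepted"] and p["ai_cost"] < p["human_price"]
        else p["human_price"]
        for p in projects
    )
    return 1 - realized / human_total

# Hypothetical portfolio: AI succeeds cheaply on one small project only
portfolio = [
    {"human_price": 500, "ai_cost": 20, "ai_accepted": True},
    {"human_price": 10000, "ai_cost": 60, "ai_accepted": False},
    {"human_price": 15000, "ai_cost": 45, "ai_accepted": False},
]
print(f"{autoflation(portfolio):.1%}")  # 1.9%: one cheap win barely moves aggregate cost
```

The fallback to the human price is the key design choice here: projects the AI fails still cost full human price, which is why scattered successes on small projects translate into only a tiny aggregate saving.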
This benchmark matters because it anchors discussions of AI and labor in economic reality, not hype. It shows that current AI systems, despite impressive performance on narrow benchmarks, cannot autonomously complete the vast majority of economically valuable remote work. For researchers, it highlights the capabilities AI still lacks. For policymakers, it offers a tool to monitor progress and anticipate labor market impacts with evidence, not speculation.
The researchers highlight a critical insight: unlike past automation technologies, AI is being developed to capture general cognitive skills. If successful, this means AI could eventually automate new job categories as they emerge. But that future is not here yet. Current agents need fundamentally better reasoning, reliability, and adaptability before they can tackle the breadth of human professional work.
The Remote Labor Index shows us where we actually stand in the automation story: at the very beginning, with 97.5% of real work still requiring human skill, judgment, and creativity. To explore this research further and create your own presentations, visit EmergentMind.com.