Remote Labor Index (RLI) Benchmark
- The Remote Labor Index (RLI) is an evaluative benchmark that measures AI-driven automation of diverse remote freelance projects across 23 market categories.
- Its methodology draws on real Upwork projects with recorded costs (average \$633) and completion times, and deliverables are scored through a manual expert review process.
- Current AI models achieve minimal automation, with rates between 0.8% and 2.5%, highlighting the substantial gap between AI performance and the standards of human freelancers.
The Remote Labor Index (RLI) quantifies the extent to which AI agents can automate real-world, economically valuable remote freelance work. Developed as an empirical benchmark, the RLI leverages authentic projects sourced from global online labor markets, evaluating the practical economic impact of AI on diverse sectors of remote work. Its methodology, evaluation pipeline, and principal findings offer a unique data-driven foundation for interpreting the relationship between AI progress and labor automation (Mazeika et al., 30 Oct 2025).
1. Definition and Scope
The Remote Labor Index is an evaluative benchmark designed to measure AI-driven automation in remote freelance economies. The RLI diverges from conventional agent benchmarks—often limited to narrow technical tasks—by representing a broad spectrum of freelance projects with economic grounding. It consists of 240 projects, each sampled from actual remote labor platforms (primarily Upwork), spanning 23 distinct categories such as design, architecture, marketing, data analysis, and creative fields.
Each project comprises three essential components: a textual brief detailing client requirements, input files necessary for task completion, and a gold-standard deliverable created by an experienced freelancer. The data is directly tied to market transactions, with associated realized costs (average: \$633; median: \$200 per project) and reported completion times (average: 29 hours; median: 11.5 hours). The full dataset encapsulates over \$140,000 in labor value and 6,000+ hours of work.
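For concreteness, the sketch below shows one way such a project record could be represented in code; the `RLIProject` class, its field names, and the example values are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RLIProject:
    """Hypothetical record layout for a single RLI project (illustrative only)."""
    project_id: str
    category: str                     # one of the 23 market categories, e.g. "design"
    brief: str                        # textual brief detailing client requirements
    input_files: List[str] = field(default_factory=list)  # files needed to complete the task
    reference_deliverable: str = ""   # path to the gold-standard freelancer deliverable
    cost_usd: float = 0.0             # realized market cost of the project
    hours: float = 0.0                # reported completion time

# Invented example record (values chosen to echo the dataset's medians, not real data)
example = RLIProject(
    project_id="rli-0001",
    category="data analysis",
    brief="Clean the attached sales spreadsheet and build a one-page summary dashboard.",
    input_files=["inputs/sales_q3.xlsx"],
    reference_deliverable="gold/dashboard.pdf",
    cost_usd=200.0,
    hours=11.5,
)
```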
2. Benchmark Construction and Project Selection
Projects are sourced directly from verified Upwork freelancers to ensure both authenticity and representativeness. Supplementary data is drawn from long-tail, non-Upwork labor markets, with explicit permissions granted for research use. Rigorous protocols for cleaning, anonymizing, and standardizing datasets are enforced, including redaction of personally identifiable information and proprietary material.
The RLI is characterized by pronounced diversity and complexity: its projects are over twice as long and as difficult to complete as those in prior AI agent benchmarks, and they require handling a wide array of file types (code, multimedia, 3D models, CAD files, and more). For evaluation uniformity, projects requiring non-remote, interactive, or ungradable deliverables are excluded.
| Attribute | Details |
|---|---|
| Total Projects | 240 |
| Project Categories | 23 (e.g., design, data, writing, architecture) |
| Avg/Median Project Cost | \$633 / \$200 |
| Avg/Median Completion Time | 29 hours / 11.5 hours |
| Total Labor Value | \$140,000+ |
3. Agent Evaluation Protocol and Core Metrics
AI agents are evaluated under the same conditions as human freelancers: they receive the original brief and input files and must produce a deliverable in an appropriate format. There are no architectural constraints; agents may leverage command-line interfaces, interactive environments, or integrated toolchains as needed.
Deliverables are scored through a manual, expert-driven evaluation pipeline built on a specialized open-source web platform that supports review of complex, multimodal files. Raters operate from a client-centered perspective, judging whether a “reasonable client” would accept the output, and use holistic acceptance judgments rather than granular rubrics.
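A minimal sketch of this evaluation flow is given below; the `Agent` protocol and `holistic_review` placeholder are hypothetical constructs introduced here for illustration, not the benchmark's actual tooling.

```python
from typing import List, Protocol

class Agent(Protocol):
    """Illustrative agent interface: given the brief and input files, return deliverable paths."""
    def run(self, brief: str, input_files: List[str]) -> List[str]:
        ...

def holistic_review(deliverable_files: List[str], reference_deliverable: str) -> int:
    """Placeholder for the manual expert review step.

    In the RLI pipeline, a human rater inspects the deliverable from a
    "reasonable client" perspective and issues a holistic judgment rather
    than applying a granular rubric. This stub raises deliberately: the
    score comes from human experts, not from code.
    """
    raise NotImplementedError("Scores are assigned by human expert raters.")
```

The design point this mirrors is that the agent side is unconstrained, while acceptance is decided entirely by human judgment.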
Four principal metrics govern the benchmark:
- Automation Rate: The fraction of projects on which an AI deliverable is judged at least as good as the human reference. Human evaluators assign a score on a 3-point scale; scores of “2” (equal to human) or “3” (superior) count as automation.
- Elo Score: A relative progress measure obtained by fitting a Bradley-Terry model to human pairwise preference data, with human performance anchored at 1,000 Elo and 400-point gaps corresponding to 10:1 odds (formalized after this list). These scores facilitate longitudinal tracking of agent progress and model-to-model comparison.
- Dollars Earned: The sum total of market value for all projects an agent automates, reflecting tangible economic impact.
- Autoflation: A measure of cost deflation, representing the percentage reduction in total project cost if the cheapest available method (AI or human) is adopted project-by-project; a formulation and computation sketch follow this list.
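The conventions stated above pin down both the relative and the cost-based measures. The formalization below is a sketch under those conventions; the notation (ratings R_A and R_B, project index i, human cost c_i, and AI-alternative cost c_i^{AI}) is introduced here for illustration and is not taken from the paper.

```latex
% Elo / Bradley-Terry win probability under the stated convention
% (human performance anchored at 1,000 Elo; a 400-point gap yields 10:1 odds):
\[
  P(\text{A beats B}) = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_{\text{human}} = 1000.
\]
% Autoflation, following the definition above: c_i is the human market cost of
% project i, and c_i^{AI} is the cost of an AI alternative judged at least on par
% with the human deliverable (take c_i^{AI} = c_i when no such alternative exists):
\[
  \text{Autoflation} = 1 - \frac{\sum_i \min\bigl(c_i,\, c_i^{\mathrm{AI}}\bigr)}{\sum_i c_i}.
\]
```

A minimal computational sketch of the absolute metrics follows, assuming an AI attempt counts as a viable substitute only when rated at least on par with the human deliverable; the `ProjectResult` fields and the toy values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProjectResult:
    """Hypothetical per-project evaluation outcome (fields and values are illustrative)."""
    human_cost_usd: float          # realized market cost paid to the freelancer
    rater_score: int               # holistic 3-point score: 1 below human, 2 on par, 3 superior
    ai_cost_usd: Optional[float]   # cost of the AI attempt, if tracked

def automation_rate(results: List[ProjectResult]) -> float:
    """Fraction of projects whose AI deliverable is judged at least on par (score >= 2)."""
    return sum(r.rater_score >= 2 for r in results) / len(results)

def dollars_earned(results: List[ProjectResult]) -> float:
    """Total market value of the projects the agent automates."""
    return sum(r.human_cost_usd for r in results if r.rater_score >= 2)

def autoflation(results: List[ProjectResult]) -> float:
    """Percentage reduction in total cost when the cheaper viable option is chosen per project."""
    total_human = sum(r.human_cost_usd for r in results)
    total_cheapest = sum(
        min(r.human_cost_usd, r.ai_cost_usd)
        if (r.rater_score >= 2 and r.ai_cost_usd is not None)
        else r.human_cost_usd
        for r in results
    )
    return 100.0 * (1.0 - total_cheapest / total_human)

# Toy usage with invented numbers:
results = [
    ProjectResult(human_cost_usd=200.0, rater_score=2, ai_cost_usd=5.0),
    ProjectResult(human_cost_usd=633.0, rater_score=1, ai_cost_usd=8.0),
]
print(automation_rate(results), dollars_earned(results), autoflation(results))
```

On the published evaluations, these quantities correspond to the sub-3% automation rates and sub-\$2,000 cumulative earnings reported in the next section.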
4. AI Performance on the RLI Benchmark
Empirical results indicate that current frontier AI agents achieve only minimal automation. The highest-performing agent, Manus, attained an automation rate of 2.5%. All other tested models—such as Grok 4, Sonnet 4.5, GPT-5, ChatGPT agent, and Gemini 2.5 Pro—demonstrated automation rates ranging from 0.8% to 2.1%. Economic impact, as measured by “Dollars Earned,” remains low: the best models cumulatively capture less than \$2,000 in value across the entire \$140,000 dataset. Elo scores corroborate these findings, showing all AI agents far below the human baseline.
| Model | Automation Rate |
|---|---|
| Manus | 2.5% |
| Grok 4 | 2.1% |
| Sonnet 4.5 | 2.1% |
| GPT-5 | 1.7% |
| ChatGPT agent | 1.3% |
| Gemini 2.5 Pro | 0.8% |
A detailed error analysis reveals that failures predominantly stem from sub-professional quality (46%), incompleteness (36%), file errors (18%), and inconsistencies (15%); these categories are not mutually exclusive, so the percentages sum to more than 100%. AI successes are typically confined to creative domains (e.g., audio/image generation, basic writing, or data scraping). Inter-annotator agreement on pass/fail ratings for automation is 94.4%, demonstrating the robustness of the evaluation pipeline.
5. Economic and Research Significance
The RLI serves as a direct, economically calibrated measure of AI automation impact. By grounding project selection in real freelance market dynamics, with explicit cost and labor-time references, it offers empirical clarity on how AI progress translates into economic automation. Unlike knowledge benchmarks or synthetic tasks, the RLI applies client-oriented, end-to-end evaluation that enforces a high standard for what constitutes genuine automation.
For researchers and policy stakeholders, the RLI tracks both absolute (automation rate, “Dollars Earned”) and incremental (Elo) progress. It provides an evolving dataset to monitor the intersection of AI capabilities and labor substitution, with implications for economic forecasting, labor policy, and AI safety research. Its design is sensitive to early gains as well as potential step changes in agent ability.
6. Limitations and Exclusion Criteria
The RLI currently excludes jobs requiring in-person execution, team or client interaction, or outputs not amenable to standardized evaluation. It covers 23 of 64 Upwork categories and thus does not represent all remote labor domains. Cost and time figures are static snapshots and are not inflation-adjusted over time. As AI systems advance, evaluation may demand greater domain expertise from raters, and current automation rates may understate achievable future performance.
A plausible implication is that coverage and assessment complexity will need periodic revision as both AI and remote work ecosystems evolve.
7. Role in Policy and Future Directions
By empirically anchoring debates surrounding AI labor automation, the RLI provides policymakers and economists with a common evidentiary baseline for intervention design and monitoring. Its capacity to distinguish relative from absolute progress helps avoid both exaggerated claims and undue complacency regarding AI's labor market impact. The RLI is constructed to be updatable, enabling the tracking of automation trends over time as system capabilities and the remote work market co-evolve.
Current results indicate that even frontier AI systems remain essentially unable to automate the broad, complex array of human remote labor: less than 3% of evaluated tasks can be considered ‘automated’ under client-viable standards. The RLI thus functions as a stable, economically salient reference for ongoing analysis and discourse as AI deployment progresses (Mazeika et al., 30 Oct 2025).