Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks

Published 1 Apr 2026 in cs.AI and econ.GN | (2604.01363v1)

Abstract: We propose that AI automation is a continuum between: (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is more continuous and broad-based. We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable. Based on more than 17,000 evaluations by workers from these jobs, we find little evidence of crashing waves (in contrast to recent work by METR), but substantial evidence that rising tides are the primary form of AI automation. AI performance is high and improving rapidly across a wide range of tasks. We estimate that, in 2024-Q2, AI models successfully complete tasks that take humans approximately 3-4 hours with about a 50% success rate, increasing to about 65% by 2025-Q3. If recent trends in AI capability growth persist, this pace of AI improvement implies that LLMs will be able to complete most text-related tasks with success rates of, on average, 80%-95% by 2029 at a minimally sufficient quality level. Achieving near-perfect success rates at this quality level or comparable success rates at superior quality would require several additional years. These AI capability improvements would impact the economy and labor market as organizations adopt AI, which could have a substantially longer timeline.

Authors (9)

Summary

The paper demonstrates that AI automation in labor tasks follows a rising-tide model with steady, broad improvements instead of abrupt surges.
It draws on more than 17,000 worker evaluations of over 3,000 O*NET-derived tasks and uses logistic regression to show that AI success declines only gently as task duration grows.
The findings imply that near-term automation will improve universally for text-based tasks, allowing gradual labor market adaptation despite integration challenges.

Empirical Characterization of AI Automation: Rising Tides Prevail in Labor Market Task Evaluations

Introduction and Theoretical Framing

"Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks" (2604.01363) provides the most comprehensive empirical study to date on the trajectory of AI automation potential across realistic labor-market tasks addressable by LLMs. The authors distinguish between two archetypes of AI advancement: the "crashing waves" model, in which capabilities surge abruptly for subsets of tasks—echoing findings from prior software/benchmark-driven studies—and the "rising tides" model, where progress is broad-based and capabilities increase continuously across the task space. Their central inquiry tests which dynamic currently dominates in the automation of text-based labor tasks using a large-scale, expert-annotated dataset.

Figure 1: Visual comparison of "crashing wave" vs "rising tide" dynamics in AI automation: abrupt vs. broad-based performance gains.

Data, Evaluation Protocol, and Task Construction

Leveraging the O*NET taxonomy, the study curates 11,000+ text-addressable tasks (filtered via GPT-4 for predicted time savings of at least 10%), giving far broader occupational coverage than existing benchmarks. For each task, multiple realistic instances are generated and rated by U.S.-based workers with relevant on-the-job experience. Over 40 LLMs, spanning proprietary and open-weight models from multiple vintages and scales, produce completions that are assessed via a standardized nine-point manager acceptance rubric. The core automation indicator is a score ≥7 ("minimally sufficient, no edits required"), which serves as the primary operationalization of "task automation." Human-reported task durations range from minutes to multiple weeks.

Figure 2: Histogram of human-reported task durations, demonstrating broad coverage from sub-hour tasks to week-long assignments.

Main Results: Success-Duration Relationship

The study’s primary empirical innovation is the systematic estimation of the relationship between AI task success and log human completion time, pooling across models. Across the full dataset, the logistic regression slope linking log duration to success rate is shallow (e.g., $\beta \approx -0.31$ for the no-edits sufficiency threshold), with a tenfold task duration increment corresponding to only a 7.6 percentage point decrease in success at a 60% baseline acceptance rate. Higher quality thresholds (≥8 or =9) yield downward-shifted but similarly flat curves.

Figure 3: Probability curves of LLM automation (various acceptance thresholds) against log task duration, with binned averages overlaying the modeled relationships.

This empirical result directly contrasts with previously reported "crashing waves" on software benchmarks, where logistic slopes are an order of magnitude steeper. It indicates that, for real-world labor tasks, current LLM performance does not exhibit abrupt capability cliffs with respect to duration/complexity.
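The slope figures above can be checked with a few lines of arithmetic. The sketch below assumes the reported coefficient multiplies base-10 log duration (an inference from the stated 7.6-point drop, not something the summary says outright). The function name and the comparison slope of -3.1, standing in for "an order of magnitude steeper," are illustrative choices rather than values from the paper.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def success_after_tenfold(baseline, beta):
    """Success probability after task duration grows tenfold.

    Assumes beta is the coefficient on log10(duration), so a tenfold
    increase in duration shifts the log-odds by exactly beta.
    """
    return sigmoid(logit(baseline) + beta)


shallow = success_after_tenfold(0.60, -0.31)  # slope reported in the text
steep = success_after_tenfold(0.60, -3.1)     # hypothetical 10x steeper slope

print(f"shallow slope: drop of {100 * (0.60 - shallow):.1f} percentage points")
print(f"steep slope:   drop of {100 * (0.60 - steep):.1f} percentage points")
```

Under that assumption the shallow slope reproduces the roughly 7.6-point drop quoted above, while the hypothetical steep slope removes most of the baseline success over the same tenfold increase in duration.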

Heterogeneity: Job Families, Model Scale, and Vintage

Task-level automation is decomposed by O*NET job families. Slope coefficients and baseline success rates vary by sector, but the success-duration relationship remains flat or nearly so in most families. The heterogeneity in slopes is substantial: personal care and service exhibits $\beta = -0.93$, while most technical and knowledge sectors cluster near zero. This suggests that domain-specific task structure affects sequential dependency without overturning the general rising-tide pattern.

Large models ($>100$B parameters) outperform small models primarily on short-duration tasks, which rotates the success-duration curve outward. Newer model vintages at a fixed scale instead shift the entire curve upward in parallel, improving short and long tasks by similar amounts.

Figure 4: Estimated success-duration logistic fits by model size and vintage, showing outward rotation for larger scale but parallel upward shifts for newer models.

Alternative specifications, including occupation and model fixed effects, reaffirm these patterns (Figure 5).

Temporal Dynamics: Success Trajectories and Doubling Times

Success rates for frontier models have increased rapidly and evenly across the duration spectrum in the observation window (2024-Q2 to 2025-Q3). The estimated "doubling time," defined as the period required for the feasible (50% success) task duration to double, is approximately 3.8 months, which is at the fast end of, or faster than, estimates from prior technical benchmarks [kwa2025measuring]. Critically, this rate reflects steady, broad-based gains rather than abrupt jumps for any subset of tasks.
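To see what a 3.8-month doubling time means in practice, the feasible duration can be compounded forward. This is a minimal sketch: the 3.5-hour starting point (midpoint of the abstract's "3-4 hours") and the 15-month span (2024-Q2 to 2025-Q3, counted quarter to quarter) are assumptions made for illustration.

```python
def feasible_duration(start_hours, months_elapsed, doubling_months=3.8):
    """Task duration handled at 50% success after months_elapsed.

    Assumes pure exponential growth with a constant doubling time.
    """
    return start_hours * 2 ** (months_elapsed / doubling_months)


start = 3.5   # assumed midpoint of the 3-4 hour range in the abstract
span = 15     # assumed months between 2024-Q2 and 2025-Q3

print(f"after one doubling time: {feasible_duration(start, 3.8):.1f} hours")
print(f"after {span} months: {feasible_duration(start, span):.0f} hours")
```

Fifteen months is a little under four doubling times, so the feasible duration grows by a factor of roughly fifteen over the observation window under these assumptions.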

Figure 6: Projected success rate trajectories for frontier models by task duration, with solid lines indicating ranges well-supported by current data.

Figure 7: Evolution of achievable duration for defined success thresholds, illustrating parallel and rapid extension of tasks with high automation probability.

Modeling alternative link functions (logistic vs. complementary log-log) or stricter acceptance thresholds only slightly alters quantitative outcomes (Figure 8).

Exemplification: LLM Output Evaluation

The authors include concrete examples of both poor and strong LLM-generated responses for illustrative tasks, highlighting the differential impact of current system limitations on nuanced or multi-constraint tasks.

Figure 9: Example of a deficient LLM response (inadequate check-splitting for a restaurant scenario).

Figure 10: Example of a high-quality, manager-acceptable LLM response to the same scenario.

Implications for AI Automation Forecasting and Labor Market Dynamics

Practical Implications

The results imply that, if current improvement rates persist, LLMs could achieve 80–95% success at the minimally sufficient level on the vast majority of text-based tasks (median duration 2.5 hours) by 2029. Because the success-duration curves lack steep logistic tails, progress will be rapid but visible and gradual across domains, giving labor market actors more time to adapt than a "crashing waves" paradigm would. However, reaching near-perfect automation, or automating tasks with very low current success rates, is likely to take longer because the logistic function flattens at high performance.
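A rough version of this extrapolation can be reproduced from the two data points in the abstract (about 50% success in 2024-Q2 and about 65% in 2025-Q3 for tasks of 3 to 4 hours). The sketch assumes a constant gain in log-odds per quarter, which is one reading of the "parallel shift" pattern. It is not the authors' forecasting model, and the choice of 14 quarters (2025-Q3 to 2029-Q1) is an assumption.

```python
import math


def project_success(p_start, p_end, quarters_observed, quarters_ahead):
    """Extrapolate a success rate assuming constant log-odds gain per quarter."""
    def logit(p):
        return math.log(p / (1.0 - p))

    step = (logit(p_end) - logit(p_start)) / quarters_observed
    z = logit(p_end) + step * quarters_ahead
    return 1.0 / (1.0 + math.exp(-z))


# 50% -> 65% over 5 quarters (abstract), projected 14 quarters further.
projected = project_success(0.50, 0.65, quarters_observed=5, quarters_ahead=14)
print(f"projected success in 2029-Q1: {projected:.0%}")
```

For these inputs the projection lands inside the 80–95% range quoted above. Because the gain is constant in log-odds rather than in percentage points, each additional quarter adds less to the success rate as it approaches 100%.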

The authors underscore that these task-level rates should not be naïvely interpreted as immediate economic automation, owing to integration costs, context-dependent adoption barriers, and the last-mile problem. Additionally, the sample is biased toward more text-heavy, white-collar, and easier-to-survey occupations.

Theoretical Consequences

The rising-tide dynamic implies that model innovations, for realistic labor tasks, are more closely associated with broad improvements in robustness and serial chain-of-thought capabilities than with crossing algorithmic bottlenecks for specific task classes. The micro-foundation formalized in the appendix supports an interpretation where logistic slope is governed by the average serial step-coupling across domains, affected by task-type but not by model size or vintage per se.

Given a logistic (or even complementary log-log) structure, progress near the tails requires exponentially larger increments, making the achievement of perfect performance a distinct challenge—especially as tasks become less text-centric or require real-world interaction.
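The flattening near the tails follows from the logistic link alone. The worked numbers below are illustrative and are not taken from the paper. With success probability $p$ and log-odds $z$:

$$
p = \frac{1}{1 + e^{-z}}, \qquad z = \ln\frac{p}{1-p}, \qquad \frac{dp}{dz} = p\,(1-p)
$$

The log-odds gain needed to move between two success rates is then:

$$
\Delta z_{0.50 \to 0.90} = \ln 9 \approx 2.20, \qquad \Delta z_{0.90 \to 0.99} = \ln 99 - \ln 9 = \ln 11 \approx 2.40
$$

If log-odds improve at a constant rate over time, going from 90% to 99% takes slightly longer than going from 50% to 90%, yet it delivers only 9 percentage points instead of 40.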

Limitations and Future Research Avenues

Limitations are acknowledged: (i) ongoing data collection may shift findings, (ii) tasks requiring physical manipulation or integrating multiple procedural/embodied steps are underrepresented, and (iii) extrapolations assume that recent rates of technical progress persist. The interplay between task-automation rates and occupational re-structuring remains an open question. Adoption lags, “last mile” constraints, and the economic attractiveness of implementation will all modulate the labor market shock even with substantial underlying capability improvement.

Future work should extend these results as more tasks and occupations are incorporated, validate with observational firm-level automation data, and explicitly decompose process versus outcome-based evaluation.

Conclusion

This paper provides robust, domain-spanning empirical evidence that current and near-future LLM-driven AI automation manifests primarily as a rising tide in labor-market tasks: progress is broad, rapid, and predictable across task durations, rather than concentrated in abrupt surges for a small subset of tasks. The projected trajectory is highly consequential, potentially enabling high-coverage automation of text-centric knowledge labor within years, but the road to flawless automation is significantly flatter and longer than crashing-waves models predict. These findings recalibrate both technical forecasting and policy anticipation, emphasizing continuous monitoring, dynamic workforce upskilling, and further study of real-world deployment frictions.

Explain it Like I'm 14

Overview

This paper asks: When AI gets better, does it suddenly master a few new tasks all at once like a crashing wave, or does it improve steadily across many tasks like a rising tide? Using thousands of real-world, text-based job tasks, the authors find the “rising tide” story fits best: AI is already good at many tasks and is getting better across the board, fairly quickly but mostly without sudden jumps.

The big questions

The study focuses on three simple questions:

Do AI systems improve in sudden bursts on certain tasks (“crashing waves”) or more smoothly and broadly across many tasks (“rising tides”)?
How does AI success change as tasks get longer or more complex (for humans)?
How fast are AI systems improving over time, and what might that mean for workers and the economy?

How they did the study

Think of a “task” as a piece of work a person might do on the job: write an email, summarize a report, draft a lesson plan, outline a legal memo, and so on. Here’s the approach in everyday terms:

Picking tasks: The team started from the U.S. Department of Labor’s O*NET list of tasks people do at work. They kept only tasks where a language-based AI could realistically help (for example, writing or analysis, not lifting boxes). They also focused on tasks where AI could save at least about 10% of a person’s time.
Creating examples: For each chosen task, they wrote up realistic examples (called “instances”)—like specific prompts or scenarios an AI could try to complete.
Running many AIs: More than 40 AI models attempted these task instances. This covered models of different sizes and release dates.
Human scoring: People with experience in those jobs graded the AI’s answers on a 1–9 scale:
- 7 means “minimally sufficient as-is” (a manager would accept it without edits).
- 8 means “average human quality as-is.”
- 9 means “superior quality as-is.”
Main measure: The key outcome was whether the AI’s answer needed no edits to be minimally acceptable (score ≥ 7). That’s treated as “AI can handle it.”
Task length: For each task instance, evaluators reported how long it would take a human to complete (from about 10 minutes to several days). The study then looked at how AI success changes as human task time gets longer.
Simple pattern check: They fit a smooth curve that shows how the chance of AI success changes with task length. A steep drop would look like a “crashing wave” (AI suddenly fails on tasks just a bit longer). A gentle, flatter drop would look like a “rising tide” (AI performance is more similar across short and long tasks).
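The curve-fitting step in the list above can be sketched as follows. This is a minimal Newton-method logistic fit in plain Python, run on synthetic, noise-free success shares generated from a known slope purely for illustration. It is not the authors' estimation code, and none of the data points come from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    xs: log10 task durations; ys: observed success shares in [0, 1].
    Returns the intercept a and slope b.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient wrt intercept
            g_b += (y - p) * x      # gradient wrt slope
            h_aa += w               # entries of the information matrix
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton update: solve the 2x2 system by hand.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Synthetic data (NOT from the study): shares generated from a = 0.4, b = -0.31.
log_hours = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
shares = [sigmoid(0.4 - 0.31 * x) for x in log_hours]

intercept, slope = fit_logistic(log_hours, shares)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

A slope near zero corresponds to the gentle "rising tide" decline, while a large negative slope would correspond to the sharp "crashing wave" drop.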

Why “task length” matters: Longer tasks usually involve more steps in a row. If AI needs to get each step right, longer tasks give it more chances to slip up—so you’d expect success to fall as tasks get longer. The question is how sharply it falls.
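The "more steps, more chances to slip" idea can be made concrete with a toy calculation. It assumes every step succeeds independently with the same probability and that all steps must succeed. That is a deliberate simplification for illustration, not the paper's own model.

```python
def chain_success(step_success, n_steps):
    """Chance of finishing a task when all n_steps must succeed independently."""
    return step_success ** n_steps


# Hypothetical per-step reliability of 99%.
for steps in (1, 10, 100):
    print(f"{steps:>3} steps: {chain_success(0.99, steps):.1%} overall success")
```

Even with very reliable individual steps, success falls as the chain gets longer. The study's question is how steep that fall is for real job tasks.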

What they found

Here are the main findings, explained simply:

Rising tide, not crashing waves: AI does a bit worse on longer tasks, but the drop is surprisingly small on average. In other words, AI performance is fairly similar for short and long text-based tasks. That’s a “rising tide” pattern, not big sudden jumps.
AI is already pretty capable: Across many tasks, AIs can produce minimally acceptable work (no edits needed) around half to three-quarters of the time, depending on the job area and model.
Fast improvement across the board:
- From mid-2024 to mid-2025, AI went from handling 3–4 hour tasks with a 50% success rate to handling about 1-week tasks at 50%.
- For tasks from 5 minutes to 24 hours long, success rates rose by roughly 8–11 percentage points over the study period.
- A practical way to picture it: the “longest task the AI can handle at a given success rate” roughly doubled every ~4 months.
Newer vs. bigger models:
- Newer models (released later) improve performance about equally across short and long tasks—a parallel lift.
- Bigger models (released at the same time as smaller ones) help more on short tasks than on long ones—the advantage fades as tasks get longer.
Different job areas, different slopes: Most job families show the same rising-tide pattern, but the strength varies. Some areas (like Personal Care and Service) show a bigger drop with longer tasks; others (like Office/Admin or Business/Finance) show smaller drops. Average success also varies (for example, “Legal” tasks had lower success on average).
Looking ahead (if trends continue): By around 2029, the authors project that AI could complete most text-related tasks with 80–95% success at a “minimally sufficient” level. Reaching near-perfect levels—or reaching the same success at “average” or “superior” quality—would likely take several more years.

Why this matters:

If improvement is a rising tide, individual workers are less likely to be blindsided by sudden, narrow bursts of automation. But because the tide lifts many tasks at once—and quickly—it can still be very disruptive across the whole labor market.

Limits, cautions, and context

A few important caveats keep this grounded:

Text-focused tasks only: The study tested tasks where LLMs make sense. It did not include purely physical tasks or tasks with no meaningful text component.
“No edits” is a minimal bar: “Minimally sufficient” means a manager would accept it as-is, not that it’s excellent. Higher bars (average or superior quality) are harder and have lower current success rates.
Not the same as whole jobs: Jobs are bundles of tasks, plus teamwork, accountability, real-world constraints, and “last mile” effort to fit outputs into complex workflows. Even if AI can do many tasks, turning that into full job automation takes longer and is costlier.
Future progress could slow: The projections assume recent improvement continues. But compute costs, hardware limits, and slower algorithmic gains could all slow progress. The authors present the 2029 forecast as an upper-bound scenario.

What this could mean for people and organizations

For workers: Expect broad, steady improvement in AI help across many text-based tasks—writing, analysis, planning, summarizing—rather than sudden jumps in just a few. That gives more time to adapt, but the pace is still fast enough to matter.
For managers and teams: AI can already handle a significant share of routine text tasks and is improving quickly. Planning for training, oversight, and quality control remains important, especially in areas with low tolerance for error.
For the economy: Even if AI capabilities rise fast, actual adoption takes time. Processes, tools, and rules must change. The bigger effect on jobs and productivity may show up over several years, not overnight.

Simple takeaway

AI progress looks less like a sudden crashing wave and more like a fast-rising tide. It is lifting performance across many kinds of text-based work at once. If recent trends continue, by the end of the decade AI could handle most such tasks at a “good enough” level, while truly top-tier, near-flawless performance will take longer. This gives people some time to adjust—but the tide is rising quickly, and it will touch a lot of jobs.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the study, framed to guide actionable follow‑up research.

Sampling bias from task pre‑filtering: Tasks are included only if GPT‑4 predicted ≥10% time savings and are text/partially text-based, likely inflating success rates and limiting representativeness. Future work should evaluate a random, unfiltered O*NET task sample and report how results change.
Generalizability beyond text tasks: Findings exclude non-text, multimodal, and physically embodied work components. Assess whether “rising tide” patterns hold when tasks require images, audio, code execution, sensors, or physical actions.
Coverage of task-duration tails: Extremely short (seconds) and very long (multi-week) tasks are scarce. The logistic slope in the tails—and the possibility of “crashing waves” at extremes—remains undetermined. Expand coverage to the tails and test slope stability.
Measurement of task duration: Human time-on-task is provided by evaluators and may be noisy or biased. Validate duration measures with independent benchmarks (e.g., time-use diaries, instrumented workflows, or archival logs) and quantify measurement error.
Subjective outcome metric and inter-rater reliability: “Manager acceptance without edits” is subjective and evaluator agreement rates are not reported. Collect inter-rater reliability statistics (e.g., Krippendorff’s alpha), use anchored vignettes, and calibrate raters across domains.
Binary success thresholds obscure edit costs: “No edits needed” does not quantify the time or difficulty of edits when outputs are “useful with edits.” Directly measure edit time/effort and incorporate “last‑mile” costs into success/automation estimates.
Single-shot vs. multi-run variance: Non-determinism in LLM outputs is not captured if only one sample is scored. Use multiple generations per model-task with fixed temperatures to estimate variance and tail risks.
Protocol transparency and prompt sensitivity: Details on prompts, chain-of-thought use, few-shot examples, system prompts, context window limits, and temperature settings are not fully specified. Release complete evaluation protocols and perform prompt-robustness ablations.
Tool use and realistic agent workflows: It is unclear whether models used retrieval, browsing, code execution, or plugins. Since real deployments do, re-run evaluations with standardized tool stacks to gauge the gap between “bare LLM” and tool-augmented performance.
Instance construction and realism: Task instances are constructed and later screened as “realistic,” but creation may introduce systematic biases (e.g., towards well-specified tasks). Audit instance-generation procedures and compare against naturally occurring artifacts (e.g., real reports/emails).
Model pooling and slope attenuation: Pooling across models and domains may flatten estimated slopes. Fit hierarchical/mixed-effects models and report model-specific slopes to test whether pooling conceals steepness for particular systems or domains.
Mechanisms behind job-family heterogeneity: Large slope differences across job families are reported, but causal mechanisms (e.g., sequential dependence, domain ambiguity, compliance constraints) are not tested. Link slope estimates to independently measured task-structure features.
Micro-foundation remains unvalidated: The mapping from the slope β to “number of sequentially dependent steps” is theoretical. Construct tasks with known step dependency profiles to empirically validate or refine the mapping.
Size vs. vintage contributions: The paper notes different patterns for larger models versus newer vintages but does not quantify their relative contributions or underlying causes (architecture, data, RL, post-training). Decompose gains with controlled comparisons and ablation studies.
Functional-form risk in projections: The time trend assumes a parallel upward shift (constant slope) in logistic space; alternative links (e.g., probit, cloglog, spline) could change forecasts materially, especially in the tails. Report model-averaged forecasts and stress tests across functional forms.
Trend-extrapolation uncertainty: Projections to 2029 presume recent improvement rates persist; potential deceleration from compute, data, or algorithmic limits is acknowledged but not modeled. Provide scenario analyses (e.g., plateau, slowdown, step-change) with confidence bands.
External validity to deployment: “Minimally sufficient” acceptance may not meet organizational, legal, or safety requirements in high-stakes domains. Incorporate domain-specific acceptance thresholds, compliance checks, and error cost models.
Error severity and risk profiles: Success probabilities do not distinguish benign from catastrophic errors or hallucinations. Classify error types and severities, and integrate risk-weighted performance metrics.
Economic and organizational adoption gap: The translation from capability to realized automation (costs, integration, workflow redesign, complementarity with workers) is not estimated. Conduct field studies to map capability curves to adoption timelines and net productivity effects.
Cross-period comparability: API changes, default safety layers, or context-window expansions across time may confound “vintage” effects. Control for platform changes and report sensitivity to API/version drift.
Small-sample job families: Some families (e.g., N≈40) yield wide confidence intervals and inconclusive slopes. Increase samples in underrepresented families to stabilize estimates.
Language, geography, and equity: The study targets U.S. O*NET tasks; non-English tasks and global contexts are not assessed. Test multilingual tasks and examine differential impacts across languages and regions.
Fairness and disparate impact: Task success by population served (e.g., legal, healthcare, social services) isn’t analyzed for bias. Evaluate subgroup performance and fairness constraints in domains with vulnerable populations.
Collaboration and hybrid workflows: The study evaluates full AI completion, not mixed human–AI teaming where AI drafts and humans refine. Measure productivity and quality in hybrid conditions and compare to full automation.
Cost-performance tradeoffs: Inference cost, latency, and throughput are not considered; economic feasibility at scale is unknown. Add cost and latency metrics to produce $/task and time-to-completion frontiers.
Release criteria and “frontier” definition: The set of frontier models changes over time; sensitivity to alternative frontier definitions is only briefly noted. Pre-register and test multiple frontier definitions to ensure robustness.
Reconciliation with METR findings: Differences with METR’s steeper curves are hypothesized (task types, duration coverage) but not experimentally resolved. Run matched-task head-to-head comparisons and meta-analyses to pinpoint sources of discrepancy.
Data and code availability: Public release plans for datasets, prompts, rubrics, and scoring tools are not stated. Open materials to enable replication, independent audits, and method extensions.

Practical Applications

Overview

Drawing on the paper’s central finding—that AI automation is proceeding as a broad, “rising tide” across many text-based labor-market tasks rather than as sudden “crashing waves”—the applications below translate the results, methods, and observed improvement rates into concrete actions. Each item notes relevant sectors, potential tools/products/workflows, and key assumptions/dependencies that could affect feasibility.

Immediate Applications

These applications can be piloted or deployed now using current frontier models and the paper’s evaluation methodology.

Cross-functional “automation opportunity” audits using task-duration triage
- Sectors: enterprise operations (management, office/admin support), software, education, healthcare administration, finance, sales/marketing, media/content
- What: Inventory department workflows; estimate human task duration; prioritize LLM automation where current success rates are ≥50–70% at “minimally sufficient” quality (score ≥7). Use duration bands (e.g., 5–60 minutes; 1–4 hours; 4–24 hours) to set targets.
- Tools/workflows: “Duration-aware” automation planner integrated with ticketing (Jira, ServiceNow), RPA, and document systems; dashboards showing expected success by duration.
- Assumptions/dependencies: Tasks must be LLM-addressable (text or partially text); success thresholds align with manager expectations; reliable duration estimates exist; privacy/compliance needs are met.
Model selection policy: newer-vintage for broad gains; larger models for short tasks
- Sectors: all that deploy LLMs
- What: Adopt a policy that prioritizes newer models for across-the-board improvements (parallel shift), and routes ≤1-hour tasks to larger models for a stronger uplift on short tasks (outward rotation).
- Tools/workflows: Model router that considers task duration and target quality (≥7/≥8/9) to choose model size and vintage automatically.
- Assumptions/dependencies: Cost/performance trade-offs acceptable; access to multiple model families; latency constraints for larger models are manageable.
Human-in-the-loop QA calibrated to error tolerance and task duration
- Sectors: healthcare (documentation, prior auth), finance (memos, reconciliations), legal (drafts), education (rubrics/lesson plans), customer operations
- What: For domains with low error tolerance or longer-duration tasks, pair LLM outputs with lightweight QA checklists or manager sign-off; for short, routine tasks, accept “minimally sufficient” outputs with spot checks.
- Tools/workflows: Quality gates aligned to score ≥7/≥8/9; sampling-based audits; automated uncertainty flags; review SLAs by job family.
- Assumptions/dependencies: Clear definition of “minimally sufficient”; audit capacity; regulatory requirements for human oversight.
Task redesign to decompose long tasks into serial substeps
- Sectors: software (docs/tests/specs), management (reports/briefs), education (unit plans), research (summaries/lit scans), media (packages)
- What: Break multi-day tasks into coupled steps that map to higher LLM success on short segments; sequence with orchestration tools and insert QA after critical steps.
- Tools/workflows: Agentic planners; checklists that mirror sequential dependence; step-wise templates; “chain-of-steps” libraries.
- Assumptions/dependencies: Staff skilled in decomposition; orchestration reliability; data/context provisioning for each substep.
Organization-wide “acceptability thresholds” and governance for LLM outputs
- Sectors: policy, regulated industries (health, finance, legal), enterprise governance
- What: Define when “no edits required” at ≥7 is acceptable vs. when ≥8/9 is needed; align to risk profiles by job family (e.g., legal shows lower success, thus stricter thresholds).
- Tools/workflows: Policy matrices by domain and duration; automated enforcement in content pipelines; exception logging.
- Assumptions/dependencies: Risk appetite agreed; stakeholders trained; monitoring in place.
Workforce planning and training that anticipates broad-based improvement
- Sectors: HR across industries; public workforce agencies
- What: Use the study’s rising-tide pattern to prioritize upskilling in review, task decomposition, prompt engineering, and AI supervision across many roles rather than only a few.
- Tools/workflows: Role-by-role curricula; micro-credentialing for AI QA; “AI supervisor” training paths.
- Assumptions/dependencies: Training time and budget; employee buy-in; task mix remains text-heavy enough to benefit.
Procurement and budgeting based on projected improvement rates
- Sectors: enterprise IT, public sector
- What: Budget for faster ROI in short tasks (with larger models) and steady improvement for longer tasks (newer vintages); plan upgrade cadence (~3–6 months) to capture parallel shifts.
- Tools/workflows: Cost models tying API spend to duration-weighted gains; refresh schedules; A/B tests for each release.
- Assumptions/dependencies: Model pricing stability; access to new models; measurable KPIs tied to “minimally sufficient” acceptance.
Domain-specific pilots where success rates already strong
- Sectors: office/admin support (scheduling, correspondence), sales/marketing (briefs, sequences), healthcare support and practitioners (documentation), education (materials), installation/maintenance (text tasks)
- What: Launch production pilots for tasks commonly in the several-minutes-to-few-hours range where predicted success ≥60–70% already.
- Tools/workflows: Pre-approved prompt libraries; red-team tests; fallback to human.
- Assumptions/dependencies: Sufficient task volume; robust data security; change management readiness.
Academic replication and benchmarking using the paper’s methodology
- Sectors: academia, applied research labs
- What: Recreate the O*NET-mapped, worker-evaluated pipeline; estimate success–duration curves by domain; test alternative micro-foundations for the logistic slope (sequential dependence).
- Tools/workflows: Open task-instance repositories; evaluator panels; reproducible scoring protocols.
- Assumptions/dependencies: Access to evaluators; IRB/ethics for worker studies; funding for cross-model tests.
Policy monitoring dashboards for gradual capability shifts
- Sectors: labor departments, economic development, regulators
- What: Track success rates and failure-rate halving in priority occupations; identify domains with steeper slopes (e.g., personal care & service text-tasks) for proactive support.
- Tools/workflows: Occupational “automation exposure” dashboards; early-warning indicators for training/reskilling.
- Assumptions/dependencies: Continuous data collection; accepted metrics; stakeholder coordination.
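
As a rough illustration of the failure-rate halving metric mentioned under policy monitoring, the sketch below derives a halving time from the two success rates quoted in the abstract (about 50% in 2024-Q2 and about 65% in 2025-Q3). It assumes the failure rate decays exponentially at a constant rate between the two quarters. That is a simplification for illustration, not the authors' estimation procedure, and the function name is hypothetical.

```python
import math


def failure_rate_halving_time(success_start, success_end, years_elapsed):
    """Years for the failure rate (1 - success) to halve.

    Assumes constant exponential decay of the failure rate between the
    two observations (an illustrative assumption, not the paper's method).
    """
    fail_start = 1.0 - success_start
    fail_end = 1.0 - success_end
    if not (0.0 < fail_end < fail_start <= 1.0):
        raise ValueError("failure rate must fall and stay strictly above 0")
    if years_elapsed <= 0:
        raise ValueError("years_elapsed must be positive")
    # Continuous decay rate of the failure probability, per year.
    decay_per_year = math.log(fail_start / fail_end) / years_elapsed
    return math.log(2.0) / decay_per_year


# Abstract figures: ~50% success in 2024-Q2, ~65% in 2025-Q3 (5 quarters apart).
halving_years = failure_rate_halving_time(0.50, 0.65, years_elapsed=1.25)
print(f"approximate failure-rate halving time: {halving_years:.2f} years")
```

With these inputs the result is roughly 2.4 years. A dashboard could apply the same calculation per occupation, using whatever success-rate series it tracks.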

Long-Term Applications

These require additional research, scaling, integration, or regulatory development—often leveraging the study’s projections (e.g., 80–95% minimally sufficient success by ~2029) and acknowledging uncertainties about compute/algorithmic slowdowns.

Duration-aware autonomous agents for end-to-end workflows
- Sectors: software (feature specs → tests → docs), finance (closing packs, compliance), healthcare admin (end-to-end prior auth), education (course design), media (multi-asset campaigns)
- What: Agents plan and execute multi-day workflows by decomposing into substeps, selecting models (size/vintage) per step, and inserting QA at error-intolerant points.
- Tools/products: Orchestration platforms with “serial-dependence” planners; acceptance-threshold routers; learned QA checkpoints.
- Assumptions/dependencies: Reliable task segmentation; tool-use integrations; alignment with domain regulations; improved long-duration performance as projected.
Sector-specific certification of AI-generated work at defined quality thresholds
- Sectors: healthcare, finance, legal, public sector
- What: Standards defining when AI outputs can be accepted without edits (≥7), when average/superior quality is required (≥8/9), and required human oversight.
- Tools/products: Certification audits; third-party conformity assessment; provenance tracking.
- Assumptions/dependencies: Regulator consensus; measurable, repeatable scoring; liability frameworks.
Dynamic labor policy “glidepaths” keyed to rising-tide trends
- Sectors: policy/labor economics
- What: Phase-in of benefits, mobility support, and reskilling tied to observed success-rate trajectories (rather than sudden shocks). Target job families with consistent gains.
- Tools/workflows: Trigger-based funding releases; regional training consortia; employer incentives for upskilling.
- Assumptions/dependencies: High-quality, timely monitoring data; political consensus; program evaluation capacity.
Enterprise-wide task-to-AI mapping systems rooted in O*NET taxonomies
- Sectors: large enterprises, HR tech vendors
- What: Maintain live catalogs mapping internal tasks to O*NET-like descriptors, with duration, error tolerance, and current model performance overlays to guide automation roadmaps.
- Tools/products: “Task graph” platforms; ROI and risk simulators; model upgrade impact forecasts.
- Assumptions/dependencies: Accurate task capture; change management; integration with HRIS/ITSM.
Research programs to improve long-duration task performance
- Sectors: AI R&D, academia, tool vendors
- What: Target methods that close the gap on longer, serially dependent tasks (e.g., improved planning/RL, memory, tool-use, verification); quantify how slope relates to task structure.
- Tools/workflows: Benchmarks reflecting realistic multi-step tasks; longitudinal evaluations by job family.
- Assumptions/dependencies: Sustained compute and algorithmic progress; data availability for end-to-end tasks.
Industry-specific AI QA and incident reporting ecosystems
- Sectors: regulated industries, critical infrastructure, public sector
- What: Shared registries for AI errors, near-misses, and mitigations—by duration band and domain—to refine acceptance thresholds and oversight.
- Tools/products: Incident databases; safety pattern libraries; continuous assurance services.
- Assumptions/dependencies: Legal protection for reporting; standardized taxonomies; cultural adoption.
Education system redesign for “AI supervisor” competencies
- Sectors: K–12, higher education, vocational training
- What: Curricula that teach task decomposition, acceptance criteria, uncertainty detection, and domain-specific QA aligned to rising-tide adoption across disciplines.
- Tools/products: Micro-credentials; assessment rubrics tied to ≥7/≥8/9 thresholds; capstones on orchestrating multi-step AI work.
- Assumptions/dependencies: Standards bodies’ buy-in; teacher training; equitable access to models.
Macro-planning for compute, privacy, and compliance infrastructure
- Sectors: government, hyperscalers, large enterprises
- What: Investment roadmaps anticipating broad-based AI use in text-heavy workflows (storage, secure data interfaces, audit trails) and potential slowdowns in hardware/algorithmic progress.
- Tools/workflows: Capacity forecasting models; privacy-enhancing tech; cost-containment via caching/distillation.
- Assumptions/dependencies: Budget stability; regulatory clarity; evolution of model licensing/costs.
Adaptive compensation and job design
- Sectors: enterprise HR, gig platforms
- What: Redesign roles to focus on oversight and value-added tasks; compensation models recognizing productivity increases from AI assistance and time reallocation from short to complex work.
- Tools/workflows: Task-mix tracking; performance metrics tied to QA and throughput; internal marketplaces for decomposed tasks.
- Assumptions/dependencies: Labor relations; fair attribution of AI contributions; safeguards against over-automation.
Longitudinal occupational forecasting grounded in success–duration curves
- Sectors: labor economists, think tanks, policymakers
- What: Forecast employment shifts by combining observed slopes and progress rates with task mixes and “last-mile” costs to predict timing/scale of impacts.
- Tools/workflows: Open models integrating success rates, failure-rate halving times, and adoption lags.
- Assumptions/dependencies: Stable or transparently updated improvement rates; credible adoption models; data on firm integration costs.
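
The serial-dependence idea behind the duration-aware agents and the long-duration research items above can be illustrated with a small calculation. The sketch assumes every step succeeds independently with the same probability, so the chance of finishing the whole chain is the product of the per-step probabilities. This is a textbook simplification chosen for illustration, not the paper's micro-foundation, and `chain_success` is a hypothetical helper.

```python
def chain_success(step_success, n_steps):
    """Probability that all n serially dependent steps succeed.

    Assumes independent, identical per-step success probabilities
    (an illustrative simplification).
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_success ** n_steps


# Even a reliable step (95%) compounds into a much lower end-to-end rate.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {chain_success(0.95, n):.3f}")
```

Under these assumptions a 10-step chain at 95% per step completes about 60% of the time, which is one reason to insert QA checkpoints at error-intolerant points rather than only at the end.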

Notes on Key Assumptions and Dependencies

Applicability is limited to text-based or partially text-based tasks; purely physical tasks are out of scope.
“Minimally sufficient” (score ≥7) is not synonymous with average or superior quality (≥8/9); domains with low error tolerance require stricter thresholds and stronger oversight.
Task sampling was pre-filtered for ≥10% time-savings potential via GPT-4, which may bias toward LLM-relevant tasks.
Human evaluator judgments introduce variance; organizations should calibrate scoring to their standards.
Extrapolations (e.g., 80–95% minimally sufficient by ~2029) assume continued recent capability growth; potential slowdowns in compute, hardware, or algorithms could extend timelines.
Adoption depends on “last-mile” integration costs, data access, privacy/security, and change management—factors that can materially lag capability.
Differences across job families are meaningful; legal tasks show lower success levels, while some families (e.g., personal care & service text tasks) exhibit steeper duration effects, requiring customized strategies.
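
To make the threshold caveat concrete, the sketch below computes success rates from the same set of evaluator scores at three acceptance thresholds. The scores are invented for illustration. Reading "≥8/9" as "average at ≥8, superior at ≥9" is an assumption about the scale, so the labels should be adjusted to an organization's own rubric.

```python
def success_rate(scores, threshold):
    """Share of evaluations scoring at or above the acceptance threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical evaluator scores, invented for illustration only.
scores = [9, 8, 7, 7, 6, 8, 5, 7, 9, 4]

# Threshold labels are an assumed reading of the summary's ">=7" and ">=8/9".
thresholds = (("minimally sufficient", 7), ("average", 8), ("superior", 9))

for label, threshold in thresholds:
    rate = success_rate(scores, threshold)
    print(f"{label:<22} (>= {threshold}): {rate:.0%}")
```

The same outputs that count as a 70% success rate at the minimally sufficient level count as only 40% at the stricter level here, so projections stated at ≥7 should not be read as projections for higher quality bars.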


Glossary

binned scatter: A plotting approach that groups observations into bins to display average relationships in noisy data. "Binned scatter points summarize the raw data."
complementary log-log specification: A statistical link function alternative to the logit for modeling binary outcomes, often used when event probabilities are near 0 or 1. "These results are robust to using alternative functional forms, such as a complementary log-log specification (Appendix Figure \ref{start_prob_projection_logit_vs_cloglog})."
doubling time: The time required for a capability metric to double; here, how quickly feasible task duration grows for a fixed success rate. "When estimating a linear-trend model across all periods, the implied "doubling time," or the calendar time between model releases needed for newer models to achieve the same success rate on tasks which are twice as long, equals 3.8 months and is estimated with relatively high precision."
failure-rate halving time: The time it takes for the failure probability (1 − success rate) to be reduced by half. "Based on these curves, we approximate failure-rate halving times (the failure rate is 1 minus the success rate), which equal 2.4–3.2 years over this period."
frontier models: The most capable or latest-generation AI models at a given time used for state-of-the-art evaluation. "We estimate Eq. \eqref{eq:logit_reg_time} using only frontier models (see figure notes)."
job family: A group of occupations with related activities, used here per the O*NET taxonomy. "Automation within particular "job families" (e.g., management or community and social service) also follows the same rising-tide pattern in most cases."
last mile costs: The additional costs and effort required to take a system from adequate to fully deployable or near-perfect performance. "As discussed later, these findings will not translate directly to shares of job automation, because of sampling issues, "last mile" costs (\cite{fleming2024last}), and other reasons."
linear-trend model: A model that adds a term linear in time (or release date) to capture systematic temporal improvement. "When estimating a linear-trend model across all periods, the implied "doubling time," or the calendar time between model releases needed for newer models to achieve the same success rate on tasks which are twice as long, equals 3.8 months and is estimated with relatively high precision."
LLM-addressable: Tasks that can be attempted or assisted by LLMs because they are text-based. "We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable."
log-odds: The logarithm of the odds p/(1−p), a common scale for coefficients in logistic models. "Coefficients are shown as log-odds on the figure."
log-spaced time bins: Bins whose boundaries increase geometrically, used to evenly cover orders of magnitude in time. "we partition task instances into 40 equally sized, log-spaced time bins and compute success rates and sample sizes within each bin."
logistic CDF: The cumulative distribution function of the logistic distribution, used as the link function in logistic regression. "Here, $\Lambda(\cdot)$ denotes the logistic CDF and $\alpha$ is a constant."
logistic curve: An S-shaped function describing probabilities as a function of predictors; here, success versus task duration. "For crashing waves, this relationship can be well described by a steep logistic curve."
logistic model: A regression model for binary outcomes using the logistic CDF as the link function. "Our main specification estimates the following logistic model:"
logit: The inverse of the logistic function, mapping probabilities to the real line; also the canonical link for logistic regression. "Because $\text{logit}(0.60)\approx0.405$ and $\frac{1}{1+\exp(-(0.405-0.31))}\approx0.524$."
maximum likelihood: A method of parameter estimation that maximizes the likelihood of observed data under the model. "We estimate Eq. \eqref{eq:logit_reg} by maximum likelihood."
micro-foundation: A theoretical modeling rationale that explains an empirical relationship in terms of underlying mechanisms. "In Section \ref{theory_subsection}, we provide one possible micro-foundation for Eq. \eqref{eq:logit_reg} under which the slope coefficient $\beta$ admits a structural interpretation: it can be mapped to the number of sequentially dependent steps required to complete a task."
model vintage: The release cohort or time period of a model, used to distinguish improvements due to newer generations. "The performance gains from increasing model size are different than those from newer model vintages."
non-deterministic: Describing tasks or benchmarks where multiple valid outputs or stochastic elements exist, reducing predictability. "By contrast, we focus on non-deterministic, realistic, and representative labor-market tasks."
O*NET: The U.S. Department of Labor’s Occupational Information Network, a taxonomy of occupations, tasks, and skills. "over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization"
reinforcement learning: A training paradigm where agents learn by receiving feedback signals (rewards) for actions, potentially improving task sequencing. "A natural interpretation is that improving longer-duration tasks is more demanding than improving short-duration tasks --- and in particular that long-duration tasks, even if they are ultimately sequences of coupled short-duration ones (see Section \ref{theory_subsection}), could require additional training / reinforcement learning over how to combine them."
serial dependence: Dependence across sequential steps, where later parts of a task rely on earlier parts. "Task duration can plausibly relate to the serial dependence of tasks: longer tasks may require completing more coupled sequential sub-steps."
sigmoidal: Having an S-shaped form; characteristic of logistic-type curves in probability space. "Because the release-date term, $\delta R_m$, enters additively in the logit specification of Eq. \eqref{eq:logit_reg_time}, it implies a sigmoidal path in probability space."
standard errors clustered by participant: An adjustment to variance estimates that accounts for within-cluster correlation in residuals at the participant level. "Standard errors are clustered by participant in parentheses."
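
The arithmetic quoted in the logit entry can be checked with a few lines of code. The sketch converts a 60% success rate to log-odds, subtracts the 0.31 shift that appears in the quoted expression, and maps the result back to a probability with the logistic CDF. What the 0.31 represents is not stated in this excerpt, so it is treated here as a given number, and the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def logistic_cdf(x):
    """Logistic CDF, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


base_log_odds = logit(0.60)
shifted_probability = logistic_cdf(base_log_odds - 0.31)

print(f"logit(0.60)          = {base_log_odds:.3f}")
print(f"Lambda(logit - 0.31) = {shifted_probability:.3f}")
```

Both values agree with the quoted figures (about 0.405 and 0.524), and the example shows that an additive change in log-odds does not translate into a fixed change in percentage points.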
