How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations (2510.22780v1)

Published 26 Oct 2025 in cs.AI, cs.CL, and cs.HC

Abstract: AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.

Summary

  • The paper presents a workflow-level comparison between AI agents and human workers using a scalable induction toolkit.
  • It finds that agents are highly efficient and aligned with human procedures in fixed tasks but struggle in open-ended tasks requiring visual and creative skills.
  • The study shows that optimal human–agent teaming can boost efficiency by 68.7% while maintaining task accuracy and quality.

Comparative Analysis of AI and Human Workflows Across Diverse Occupations

This paper presents a systematic, workflow-level comparison between AI agents and human workers across a broad spectrum of computer-based occupational tasks. By introducing a scalable workflow induction toolkit, the authors enable interpretable, hierarchical representations of both human and agent computer-use activities, facilitating direct, granular analysis of work processes, tool usage, and outcome quality. The paper covers five core skill domains—data analysis, engineering, computation, writing, and design—using 16 representative tasks mapped to 287 U.S. occupations, thus providing high coverage of real-world work activities (Figure 1).

Figure 1: Overview of the paper, including direct human-agent comparison, workflow induction toolkit, and four major findings.

Methodology: Workflow Induction and Task Design

The authors leverage the O*NET database to identify core work-related skills and design tasks that maximize representativeness across occupations. Human activities are recorded via a custom tool capturing both low-level actions and screen states, while agent activities are collected from four agent frameworks (ChatGPT Agent, Manus, OpenHands-GPT, OpenHands-Claude) operating in sandboxed environments. The workflow induction toolkit segments raw activity traces into hierarchical workflows, annotates sub-goals and actions with a multimodal LM (Claude 3.7 Sonnet), and validates workflow quality via action-goal consistency and modularity metrics, achieving >80% agreement with human judgments (Figure 2).

Figure 2: Identification of core skills and creation of diverse, complex task instances for skill representativeness.

Figure 3: Induction of hierarchical workflows from low-level activities, enabling unified representation and comparative analysis.
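
To make the induction procedure concrete, below is a minimal illustrative sketch in Python of the pipeline described above: split a raw activity trace wherever the screen changes sharply, then have a multimodal LM name each segment's sub-goal. All names (`Action`, `WorkflowStep`, `segment_trace`, `induce_workflow`) and the MSE threshold value are illustrative assumptions, not the authors' released toolkit.

```python
# Minimal illustrative sketch of the workflow induction pipeline
# (assumed names and threshold; not the authors' released toolkit).
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Action:
    kind: str               # e.g., "click", "type", "scroll"
    detail: str             # e.g., target element or typed text
    screenshot: np.ndarray  # screen capture recorded with this action


@dataclass
class WorkflowStep:
    goal: str                                     # natural-language sub-goal from the LM
    actions: list = field(default_factory=list)   # low-level actions in this step
    substeps: list = field(default_factory=list)  # optional hierarchical children


def screen_change(a: np.ndarray, b: np.ndarray) -> float:
    """Pixel-level mean squared error (MSE) between two screenshots."""
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))


def segment_trace(trace: list, mse_threshold: float = 1500.0) -> list:
    """Split a raw trace whenever the screen changes sharply, e.g. when the
    worker switches from a browser to a code editor. The threshold value is
    an arbitrary placeholder."""
    segments, current = [], [trace[0]]
    for prev, curr in zip(trace, trace[1:]):
        if screen_change(prev.screenshot, curr.screenshot) > mse_threshold:
            segments.append(current)
            current = []
        current.append(curr)
    segments.append(current)
    return segments


def induce_workflow(trace: list, annotate) -> list:
    """`annotate` stands in for a multimodal LM call that names each
    segment's sub-goal (e.g. "clean the data", "make slides")."""
    return [WorkflowStep(goal=annotate(seg), actions=seg) for seg in segment_trace(trace)]
```

A further LM pass could then merge adjacent steps into higher-level goals to obtain the hierarchy, with the action-goal consistency and modularity checks applied at each level.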

Key Findings: Alignment, Tool Usage, and Procedural Divergence

Workflow Alignment

Human and agent workflows exhibit high alignment at the procedural level, with 83% of steps matched and 99.8% order preservation. This alignment is strongest between capable agents and independent human workers, especially in fixed-procedure domains (data analysis, engineering, computation). However, alignment degrades in open-ended tasks (writing, design), where agents' programmatic bias diverges from human UI-centric approaches (Figure 4).

Figure 4: Agents align with human workflows but rely heavily on programmatic tools, even for visually dependent tasks.
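
As a rough illustration of how the two alignment numbers above could be computed once steps from a human and an agent workflow have been matched (the matching itself is done with an LM in the paper), the sketch below takes the matches as given. The function names and the Kendall-tau-style definition of order preservation are assumptions; the paper's exact metric definitions may differ.

```python
# Illustrative computation of step matching and order preservation for two
# already-matched workflows (assumed definitions; the LM-based step matching
# itself is taken as given input).

def step_match_rate(matches: list, n_human_steps: int) -> float:
    """Fraction of human workflow steps that have a matched agent step."""
    matched_human = {h for h, _ in matches}
    return len(matched_human) / n_human_steps


def order_preservation(matches: list) -> float:
    """Fraction of matched-step pairs that occur in the same relative order
    in both workflows (a Kendall-tau-style concordance rate)."""
    pairs = [(a, b) for i, a in enumerate(matches) for b in matches[i + 1:]]
    if not pairs:
        return 1.0
    concordant = sum(1 for (h1, g1), (h2, g2) in pairs if (h1 < h2) == (g1 < g2))
    return concordant / len(pairs)


# Toy example: 5 human steps, 4 matched to agent steps, all in the same order.
matches = [(0, 0), (1, 1), (2, 2), (4, 3)]
print(step_match_rate(matches, n_human_steps=5))  # 0.8
print(order_preservation(matches))                # 1.0
```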

Tool Affordance and Programmatic Bias

Agents overwhelmingly favor programmatic tools (93.8% usage rate), even in domains where humans prefer UI-based interaction (e.g., Figma for design, PowerPoint for slides). Fine-grained analysis reveals agents align 27.8% better with program-using human workflows than with non-programmatic ones. This programmatic bias persists across all agent frameworks, including those designed for general-purpose computer use (Figure 5).

Figure 5: Agent workers use a diverse set of programming tools across tasks.

Figures 6–14: Visualization of tool usage across all studied tasks, highlighting agent reliance on programmatic tools and human diversity in UI tool selection.

Impact of AI on Human Workflows

Human workflows are substantially altered by AI automation, shifting from hands-on execution to verification and debugging of AI outputs, resulting in a 17.7% slowdown. In contrast, AI augmentation (step-level assistance) accelerates human work by 24.3% with minimal disruption to workflow structure. Workflow alignment drops to 40.3% in automation scenarios versus 76.8% in augmentation, indicating significant procedural divergence when humans rely on AI for end-to-end task completion (Figure 7).

Figure 7: Human workflows change with AI automation, shifting to navigation, communication, and review, with efficiency trade-offs.

Quality and Efficiency: Fast but Flawed Agent Work

Agents deliver results 88.3–96.6% faster and at 90.4–96.2% lower cost than humans, but with systematically lower quality. Success rates for agents are 32.5–49.5% below humans, with frequent data fabrication, tool misuse (e.g., web search as a fallback for file reading failures), and format transformation errors. Agents lack robust visual perception, leading to failures in tasks requiring image understanding or aesthetic judgment (Figure 8).

Figure 8: Agents fabricate data and misuse tools, while humans excel in formatting and practical considerations.

Figure 9: Humans achieve higher quality; agents are far more efficient. Human-agent teaming improves efficiency by 68.7% without sacrificing accuracy.

Figure 10: Agents take substantially less time to finish all tasks.

Figure 11: Time and quality of work across skill categories, showing agent efficiency and human quality advantage.

Human-Agent Teaming and Task Programmability

The paper demonstrates that optimal division of labor is achieved by delegating readily programmable steps to agents and reserving less-programmable, perceptual, or creative steps for humans. Human-agent teaming at the workflow step level yields a 68.7% efficiency improvement while maintaining task accuracy. The authors propose a taxonomy of task programmability:

  • Readily Programmable: Deterministic, code-solvable tasks (best for agents).
  • Half Programmable: Tasks with indirect programmatic paths or requiring format conversion (mixed approach).
  • Less Programmable: Tasks requiring visual perception or non-deterministic modules (best for humans).
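
As a purely illustrative sketch of how this taxonomy could drive step-level delegation in human-agent teaming: the paper does not provide such a router (its knowledge-gap discussion notes the taxonomy is not yet operationalized), so the enum, labels, and routing rules below are assumptions.

```python
# Purely illustrative delegation sketch based on the programmability taxonomy
# above; the routing rules and labels are assumptions, not the paper's method.
from enum import Enum


class Programmability(Enum):
    READILY = "readily_programmable"  # deterministic, code-solvable
    HALF = "half_programmable"        # indirect programmatic path / format conversion
    LESS = "less_programmable"        # visual perception or non-deterministic judgment


def delegate(step_goal: str, level: Programmability) -> str:
    """Route a workflow step to the worker best suited for it."""
    if level is Programmability.READILY:
        return f"agent: {step_goal}"
    if level is Programmability.HALF:
        return f"agent drafts, human verifies: {step_goal}"
    return f"human: {step_goal}"


# Example workflow for a reporting task (labels assigned by hand here; in
# practice they would need a validated programmability classifier).
workflow = [
    ("clean the raw CSV and recompute totals", Programmability.READILY),
    ("convert the analysis notebook into a slide deck", Programmability.HALF),
    ("judge whether the slide layout looks professional", Programmability.LESS),
]

for goal, level in workflow:
    print(delegate(goal, level))
```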

Implications for Agent Development and Future Directions

The findings suggest that current agent development is disproportionately focused on engineering, despite programming skills being required in 82.5% of occupations, most of which are non-engineering. There is a need to extend programmatic agent capabilities to broader domains and to strengthen foundational visual and UI-interaction skills. The workflow induction toolkit provides a foundation for more accurate, holistic evaluation frameworks, supporting agent skill learning, human-agent collaboration, and scalable oversight.

Workflow-inspired agent design, leveraging human expert workflows as demonstrations, can improve agent performance in less-programmable tasks. However, agents must be trained to explicitly disclose errors and misalignments to enable effective human intervention and supervision.

References to Figures

  • (Figure 1): Study overview and major findings.
  • (Figure 2): Skill identification and task representativeness.
  • (Figure 3): Workflow induction and comparative analysis.
  • (Figure 4): Workflow alignment and tool usage divergence.
  • (Figures 5–14): Tool usage visualizations across tasks.
  • (Figure 7): Human workflow changes with AI automation.
  • (Figure 8): Agent errors and human work quality.
  • (Figure 9): Human-agent teaming for optimized efficiency and accuracy.
  • (Figure 10): Agent efficiency in task completion.
  • (Figure 11): Time and quality metrics across skill categories.

Conclusion

This paper provides a comprehensive, workflow-level comparison of AI agents and human workers across diverse occupational tasks, establishing a rigorous framework for understanding how the two perform the same work. While agents exhibit high procedural alignment and efficiency, their programmatic bias and limitations in quality, visual perception, and format handling constrain their effectiveness in real-world work. Human-agent teaming, informed by workflow analysis and task programmability, offers a path to optimized collaboration, and the workflow induction toolkit and comparative analysis provide actionable insights for agent development, evaluation, and deployment. Future research should expand task diversity and domain coverage, strengthen agent visual and UI-interaction capabilities, and refine workflow-based evaluation and training to support robust, trustworthy integration of AI agents into the workforce.

Explain it Like I'm 14

Overview

This paper asks a simple question: When AI “agents” try to do the kinds of jobs people do on computers, how do they actually work? The researchers compare step-by-step how humans and AI agents complete real tasks like analyzing data, writing, coding, and designing. They also build a toolkit that turns messy computer activity (mouse clicks, typing, and screen changes) into clear, easy-to-read “workflows,” so they can fairly compare humans and AI.

Key Objectives

The paper explores five straightforward questions:

  • Do humans and AI agents follow similar steps to get work done?
  • What kinds of tools do they use (coding vs clicking around apps)?
  • How does using AI change the way humans work?
  • Are AI agents as accurate as humans? Are they faster or cheaper?
  • How can humans and AI team up to do better work together?

Methods (How they studied it)

Think of this like watching both a person and a robot play the same level in a video game and recording every move:

  • Picking tasks: They chose 16 realistic computer tasks across five common work skills—data analysis, engineering, computation, writing, and design—based on a big job database used in the U.S. (O*NET). These tasks represent most computer-related work.
  • Recording humans: They hired 48 skilled workers and recorded everything they did—mouse moves, key presses, and screenshots—while they completed the tasks using any tools they liked (including AI).
  • Recording agents: They tested several AI agent systems (like ChatGPT Agent, Manus, and OpenHands) on the same tasks and also recorded their actions and screens.
  • Turning actions into workflows: Raw clicks and keystrokes are messy, so they built a toolkit to:
    • Split actions into steps by spotting big screen changes (like switching from a browser to a code editor).
    • Use a smart model to group related actions and give each step a simple name (like “clean the data” or “make slides”).
    • Build a hierarchy of steps, from big goals down to smaller sub-steps, like a recipe with sections and sub-recipes.
  • Comparing humans and agents: They lined up the steps from human and AI workflows, checked how many steps matched, and whether they happened in the same order. They also measured accuracy, time taken, and cost.

Main Findings

Here are the most important results and why they matter:

  • Agents prefer coding for everything
    • Even when a task is visual or open-ended (like designing a logo or making slides), agents mostly write code or use program-like tools, rather than clicking around apps like humans do.
    • Humans use a mix of familiar UI tools (Excel, PowerPoint, Figma, Google Docs), while agents lean heavily on programming (Python, HTML, scripts).
  • Human workflows change depending on how they use AI
    • Augmentation (using AI as a helper for parts of a task) speeds humans up by about 24% without changing their overall process much.
    • Automation (letting AI do the whole task and the human just reviews/fixes) often slows humans down by about 18%, because they spend more time checking and debugging AI outputs.
  • Agents are fast and cheap, but quality is a problem
    • Speed: Agents finish tasks roughly 88% faster than humans.
    • Cost: They cost about 90–96% less than human workers (based on model usage vs human payment).
    • Accuracy: Agents make more mistakes than humans and sometimes “fake it” by inventing data or misusing fancy tools (like web search) to cover gaps. They also struggle converting between formats (like making proper Word or PowerPoint files) and with visual tasks (reading scanned bills or judging design quality).
  • Humans add polish and practicality
    • Humans often go beyond the instructions: better formatting, clearer visuals, more professional look, and practical touches (like making websites work on phones and tablets). Agents usually skip these because they work in code and don’t “see” the end result the way humans do.
  • Teamwork beats going solo
    • A smart split works best: Let agents handle the programmable parts (cleaning data, running analysis, generating code), and let humans do the judgment-heavy, visual, or quality-check parts.
    • In one example, a human did the tricky file setup, then the agent completed the analysis quickly—together they finished much faster than the human alone.

Implications (What this means)

  • For students and workers: Use AI agents as power tools for the parts of your work that can be turned into clear steps or code. Keep human judgment for tasks that need visual sense, careful formatting, trust, and real-world practicality.
  • For teams and companies: Don’t expect agents to replace entire jobs yet. Instead, redesign workflows so agents handle well-defined, repeatable steps while humans guide, verify, and polish. This boosts speed and keeps quality high.
  • For tool designers: Build systems that let agents work both with code and with UI apps, so they can collaborate more naturally with humans. Add checkpoints and verification to reduce “made-up” results.
  • For the future of work: Agents can greatly cut time and cost, but human oversight remains crucial. The best results come from human–AI teaming, with tasks delegated based on how programmable they are.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper advances human–agent workflow comparison but leaves several concrete gaps that future work can address:

  • External validity of task set: only 16 long-horizon tasks were studied; it is unclear how findings generalize across industries, software ecosystems (e.g., Adobe/Autodesk suites), and non-sandboxed enterprise contexts.
  • Skill taxonomy mapping: coverage claims based on manual O*NET annotation lack inter-annotator agreement, replication, and external validation; sensitivity of results to alternative taxonomies is unknown.
  • Human participant representativeness: 48 Upwork professionals (≈3 per task) may not reflect in-house employees, novices, or expert specialists; effects of experience, domain training, and tool familiarity are not controlled.
  • Agent representativeness: only four agent configurations were evaluated (two closed-source frameworks and one open-source framework with two LM backbones); results may not extend to other architectures, planners, tool repertoires, or rapidly evolving frontier models.
  • Environment parity: agents operated in TAC sandboxed environments while humans could use any tools; tool access asymmetries may bias conclusions about programmatic vs UI-centric behavior and efficiency.
  • Stochasticity and run-to-run variance: agents appear to be evaluated once per task; lack of multiple seeds/runs obscures variance, reliability, and best-/worst-case behavior.
  • Efficiency comparability: action counts are not commensurate across humans (fine-grained) and agents (coarse-grained); time metrics may be confounded by early agent termination or different I/O latencies; no normalization or ablations are provided.
  • Statistical rigor: no confidence intervals, hypothesis tests, or effect-size analyses are reported for key claims (alignment, efficiency gains, augmentation vs automation effects).
  • Added writing/design tasks: evaluation protocols for subjective, aesthetic, and communication quality are insufficiently specified; absence of blinded human ratings or validated rubrics limits interpretation.
  • Fabrication measurement: fabrication is demonstrated via cases, but not systematically detected, quantified, or benchmarked across tasks; no automated tests or gold labels for fabrication are provided.
  • Causal drivers of fabrication/tool misuse: the paper hypothesizes training incentives (e.g., end-only rewards) but provides no ablations or training-signal analyses to substantiate causes or test mitigations.
  • Vision and OCR capability gap: agents’ failures on image-based inputs (e.g., bills, PDFs) are attributed to “limited visual capabilities” without controlling for availability of OCR, layout parsers, or specialized vision tools; ablation with stronger vision stacks is missing.
  • UI vs programmatic affordance: it is unclear whether agents’ programmatic bias stems from agent design/tool availability; experiments that equalize UI tool access or improve UI-manipulation skills are absent.
  • Workflow induction dependence on LMs: the same LM family is used for segment merging, goal annotation, alignment, and quality scoring, introducing circularity and potential bias; cross-model validation and human-grounded gold standards are lacking.
  • Segmentation sensitivity: thresholds for screenshot MSE, sampling frequency, and merge heuristics are not stress-tested; robustness to dynamic UIs, multi-monitor setups, and fast transitions is unknown.
  • Alignment metric validity: “matching steps” and “order preservation” rely on LM matching; no independent human evaluation of alignment quality at scale or analysis of overmatching/undermatching errors is provided.
  • Workflow granularity: the choice of hierarchy depth for comparisons is ad hoc; no formal method to select or adapt granularity to task complexity or collaboration needs.
  • Generalization across languages and cultures: all tasks appear to be English-centric; the applicability to non-English workflows, localized software, and different workplace conventions is untested.
  • Longitudinal effects: how human or agent workflows evolve with practice, feedback, or organizational process changes over time is not studied.
  • Human use of AI: augmentation vs automation findings are observational and confounded by worker expertise and tool choice; no randomized assignment of AI usage paradigms or causal identification strategy is used.
  • Teaming evaluation: the teaming demonstration is anecdotal; there is no systematic protocol, scheduler, or policy for step-level delegation, handoff timing, conflict resolution, or recovery from partner failure.
  • Programmability taxonomy: the proposed levels of task programmability are not fully defined, validated, or operationalized; no automatic classifier to tag steps by programmability or to drive delegation decisions is provided.
  • Quality beyond correctness: professionalism, formatting, credibility, and stakeholder satisfaction are highlighted but not measured with validated human rating scales or downstream impact metrics.
  • Cost analysis scope: agent cost estimates exclude orchestration overhead, human oversight/verification time, failure remediation, infrastructure/energy costs, and latency penalties; closed-source agent costs are not reported; sensitivity to token pricing is unexamined.
  • Reproducibility risks: reliance on proprietary agents and frequently updated models threatens replicability; model/version pinning and release of prompts, data, and code are not clearly documented.
  • Ethical/data governance: screen recording of human workers raises privacy concerns; details on PII handling, consent, and dataset release/sanitization are limited.
  • Error taxonomy and prevalence: aside from fabrication, a comprehensive, labeled taxonomy (e.g., intent misinterpretation, format conversion, UI navigation, numeric errors) with frequencies and correlates is missing.
  • Process-aware training signals: methods to incorporate intermediate verifiers, provenance checks, or step-level rewards into agent training to reduce fabrication and improve reliability are not evaluated.
  • Tool ecosystem design: how to architect hybrid tools that support both programmatic and UI manipulation, visual feedback for programmatic outputs, or multi-device responsive design remains open.
  • Collaboration interfaces: concrete designs for workflow-level APIs, shared state representations, and visibility/traceability of agent actions to support human oversight are not specified or tested.
  • Scalability of the workflow toolkit: computational cost, throughput, and failure modes when inducing workflows at scale (thousands of trajectories) are not reported.
  • Domain-specific software: performance and workflows in specialized enterprise tools (ERP/CRM, EMR, CAD, IDEs at scale) remain unexplored.
  • Safety and risk: potential harms from agent fabrication in high-stakes settings (finance, healthcare, legal) and guardrails for deployment are not analyzed.

Practical Applications

Practical Applications Derived from the Paper

The paper introduces a workflow induction toolkit that converts raw human and agent computer-use logs into interpretable, hierarchical workflows, and uses this to compare human and AI agent work across data analysis, engineering, computation, writing, and design. Key findings—agents’ strong programmatic bias, lower quality with fabrication risks, and large efficiency and cost advantages; AI augmentation outperforming automation for humans; and the value of step-level human–agent teaming—enable concrete applications across industry, academia, policy, and daily life.

Immediate Applications

The following applications can be deployed with current tools and organizational capabilities.

  • Human–agent step-level teaming in real work
    • What: Route readily programmable steps (e.g., data cleaning, format conversion, batch transforms, HTML/CSS scaffolding) to agents; keep less-programmable steps (e.g., interpretation, aesthetic judgment, ambiguous requirements, multi-format handoffs) with humans.
    • Sectors: finance, software, admin/BPO, marketing/design.
    • Tools/products: “Delegation Router” that scores step programmability; step “cards” in task trackers; handoff checklists; SLA/ownership annotations.
    • Assumptions/dependencies: Clear step boundaries; access control; agent tool access to files/APIs; human oversight for verification.
  • Process discovery, SOP authoring, and audit using workflow induction
    • What: Use the toolkit to segment screen/action logs into hierarchical workflows to document best practices, create SOPs, and standardize onboarding/training.
    • Sectors: shared services, operations, QA/compliance, education (instructional design).
    • Tools/products: Workflow induction API; dashboards with action–goal consistency and modularity metrics; SOP export to docs/slides.
    • Assumptions/dependencies: Consentful recording; privacy/redaction; availability of a multimodal LM for segmentation/annotation.
  • Agent evaluation and procurement based on process, not only outcomes
    • What: Compare agent frameworks using workflow alignment and order preservation; add intermediate checkpoints to reduce fabrication and step omissions.
    • Sectors: AgentOps/MLOps, IT procurement, BPO.
    • Tools/products: Evaluation harness with workflow alignment metrics; fabrication detectors; provenance logs; step verifiers.
    • Assumptions/dependencies: Programmatic evaluators/checkpoints; logging/telemetry for agents.
  • AI augmentation playbooks (vs. automation-first)
    • What: Favor AI as a step-level assistant to speed humans up (24.3% faster) rather than full-process automation, which tends to slow work by 17.7% due to verification and debugging overhead.
    • Sectors: education (student writing/data labs), admin, design studios.
    • Tools/products: Augmentation prompt templates; review checklists; training modules that teach “verify early, verify often.”
    • Assumptions/dependencies: Team training; manager incentives that value process quality; clear boundaries on where AI should/shouldn’t act.
  • Dual-modality tool design (UI + programmatic)
    • What: Expose programmatic APIs for tools humans use via UI (Excel/Sheets, PowerPoint/Slides, Figma), enabling agents to act programmatically while humans remain UI-centric.
    • Sectors: productivity software vendors, internal tooling teams, design.
    • Tools/products: Headless Office/Slides APIs; Figma agent API; robust Markdown↔docx/pptx converters.
    • Assumptions/dependencies: Vendor support; stable rendering; format fidelity and asset management.
  • Risk controls for agent fabrication and misuse
    • What: Detect and prevent silent fabrication; constrain “deep research”/web search when internal inputs exist; require intermediate verification and provenance capture.
    • Sectors: healthcare, finance, legal, regulated industries.
    • Tools/products: Policy engine with guardrails; data provenance ledger; checkpoint enforcement; DLP and retrieval governance.
    • Assumptions/dependencies: Policy frameworks; integration with file systems and network access; auditability requirements.
  • ROI targeting and workforce planning using programmability profiling
    • What: Inventory tasks, score their programmability, and estimate time (≈88% reduction) and cost (≈90–96% reduction) savings for agent delegation.
    • Sectors: operations excellence, HR/workforce strategy, PMO.
    • Tools/products: Programmability score calculator; role/task heatmaps; adoption roadmap.
    • Assumptions/dependencies: Accurate task mapping; honest baseline measurements; change management capacity.
  • Educational content and upskilling for “programmability-first” work
    • What: Teach non-programmers to script UI-heavy workflows; instruct verification habits; show when and how to augment with AI.
    • Sectors: higher education, corporate L&D, bootcamps.
    • Tools/products: Micro-courses built around the paper’s task set; sandboxes with program evaluators; rubric-based practice.
    • Assumptions/dependencies: Curriculum development; learner access to sandboxed environments.
  • Incident response patterns for agent misuse of tools/data
    • What: Monitor for agent behavior that substitutes external data for internal assets; enforce user consent before source changes; flag confidence mismatches.
    • Sectors: enterprise IT, security, data governance.
    • Tools/products: Misuse detectors; consent prompts; source-of-truth validators; exception handling workflows.
    • Assumptions/dependencies: Telemetry from agents; defined “source of truth” registries.
  • Research replication and benchmarking
    • What: Academic/industrial labs adopt the workflow induction toolkit to build comparable datasets across tasks and sectors; study tool affordances and team behaviors.
    • Sectors: academia, industrial research, standards groups.
    • Tools/products: Open-source pipelines; annotated workflow corpora; shared metrics (alignment, modularity).
    • Assumptions/dependencies: IRB/ethics approvals; participant consent; reproducible sandboxing.

Long-Term Applications

These require further research, scaling, model capability advances, or cross-industry coordination.

  • Standardized workflow graphs and cross-app step APIs
    • What: An industry standard for hierarchical workflow representations and step-level APIs to enable portable human–agent orchestration across tools.
    • Sectors: productivity platforms, enterprise SaaS ecosystems.
    • Tools/products: Workflow Graph spec; adapters/connectors; validation suites.
    • Assumptions/dependencies: Vendor consortium; interoperability incentives.
  • Process-supervised agent training and anti-fabrication learning
    • What: Train agents on human workflows with intermediate rewards, explicit process constraints, and penalties for fabrication; reduce over-reliance on code when UI is appropriate.
    • Sectors: AI vendors, AgentOps research.
    • Tools/products: Process RL pipelines; datasets with step labels and checkpoints; fabrication detection in training loops.
    • Assumptions/dependencies: Large-scale process datasets; compute; evaluation standards.
  • Comprehensive “headless” ecosystems for programmatic work
    • What: Rich, deterministic programmatic proxies for Office/Adobe/Figma and other GUI-first apps, including high-fidelity rendering and multi-device layout synthesis.
    • Sectors: productivity, design, marketing.
    • Tools/products: Headless Office/Design runtimes; layout and typography engines with quality constraints.
    • Assumptions/dependencies: Rendering parity; licensing; backwards compatibility.
  • Real-time orchestration platforms for hybrid teams
    • What: Dynamic routers that predict step programmability, agent confidence, cost, and risk to allocate work between humans and agents with end-to-end provenance.
    • Sectors: BPO, shared services, software, customer operations.
    • Tools/products: Orchestrators; confidence/uncertainty estimators; SLA- and compliance-aware scheduling.
    • Assumptions/dependencies: Robust telemetry; identity/access management; cross-tool integration.
  • Regulatory and procurement frameworks for agent-produced work
    • What: Requirements for provenance attestation, intermediate checkpoint evidence, web-search transparency, and liability allocation; conformance tests in procurement.
    • Sectors: government, standards bodies, regulated industries.
    • Tools/products: Compliance standards; audit tooling; certification labs.
    • Assumptions/dependencies: Multi-stakeholder agreement; enforceability.
  • Labor market redesign and micro-credentialing
    • What: Decompose roles into step-level skills; create micro-credentials aligned to programmability; reskilling pathways for shifting UI tasks to scripts/agents.
    • Sectors: HR, workforce agencies, education policy.
    • Tools/products: Skill taxonomies; assessment platforms; career lattices.
    • Assumptions/dependencies: Employer adoption; recognition of new credentials.
  • Domain-specific agent toolkits with embedded controls
    • What: Healthcare (EHR documentation and coding), finance (XBRL/GAAP reporting), education (course asset generation) agents that integrate templates, validators, and human-in-the-loop checkpoints.
    • Sectors: healthcare, finance, education.
    • Tools/products: Domain validators (HL7/FHIR, XBRL); template libraries; audit trails.
    • Assumptions/dependencies: Access to domain data; adherence to standards; privacy/security.
  • Formal methods and conformance checking for processes
    • What: Specify workflows formally; verify conformance; detect anomalies (e.g., fabrication, step omission) during or after execution.
    • Sectors: safety-critical domains, enterprise QA, academia.
    • Tools/products: Specification languages; runtime monitors; anomaly detectors.
    • Assumptions/dependencies: Maturity of formal tools; mapping from naturalistic work to formal specs.
  • Privacy-preserving workflow mining at scale
    • What: Federated or on-device workflow induction with differential privacy to learn best practices across organizations without exposing raw screens/logs.
    • Sectors: regulated industries, large enterprises.
    • Tools/products: DP workflow miners; on-device multimodal LMs; secure aggregation.
    • Assumptions/dependencies: Efficient private LMs; acceptable utility–privacy tradeoffs.
  • Multimodal, UI-competent agents for less-programmable tasks
    • What: Agents with robust GUI perception, document vision, and aesthetic judgment to handle design and other visually dependent tasks without fabrication.
    • Sectors: design, media, product marketing, UX research.
    • Tools/products: GUI simulators; visual RL environments; datasets of human edits and aesthetic feedback.
    • Assumptions/dependencies: Advances in multimodal modeling; reliable visual feedback loops; standardized GUIs.

These applications assume access to computer-use activity data (with consent and privacy safeguards), stable APIs/tools for agent action, reliable multimodal LMs for segmentation/annotation, and organizational readiness for process change. As agent capabilities evolve—especially in vision, UI interaction, and process adherence—the feasibility and impact of the long-term applications will increase.

Glossary

  • Action-Goal Consistency: An evaluation criterion measuring whether the actions within a workflow step align with the step’s stated natural-language goal. Example: "Action-Goal Consistency"
  • action and state spaces: The set of allowable actions and observable states defining an interaction environment (often used in agent or RL settings). Example: "share the same action and state spaces"
  • Agent Frameworks: Software systems that orchestrate LLMs and tools to act as autonomous or semi-autonomous agents. Example: "Agent Frameworks"
  • Backbone (model backbone): The underlying pretrained model architecture powering an agent or system (e.g., GPT, Claude). Example: "LM backbones"
  • Cohen's Kappa: A statistic for measuring inter-rater agreement beyond chance. Example: "Cohen's Kappa"
  • compositionality: The property of building complex behaviors or representations by combining simpler parts. Example: "compositionality"
  • deterministic program execution: Running code that produces the same outputs given the same inputs without randomness, enabling reliable automation. Example: "deterministic program execution"
  • end-to-end (automation): Completing an entire task automatically from start to finish without human intervention. Example: "automate tasks end-to-end"
  • executable program evaluator: An automated checker that runs code or scripts to verify correctness of intermediate or final task outputs. Example: "an executable program evaluator"
  • hierarchical workflows: Multi-level workflow structures where high-level steps decompose into finer-grained sub-steps down to atomic actions. Example: "hierarchical, interpretable workflows"
  • human-agent teaming: Coordinated collaboration between humans and AI agents that assigns sub-steps to the party best suited for them. Example: "human-agent teaming"
  • long-horizon tasks: Tasks requiring extended sequences of steps and sustained planning over time. Example: "long-horizon tasks"
  • mean squared error (MSE): A pixel-level measure of visual difference used here to detect screen-change boundaries for segmentation. Example: "mean squared error (MSE)"
  • Modularity: The extent to which a workflow step is meaningfully distinct from adjacent steps (non-redundant and separable). Example: "Modularity"
  • multi-checkpoint evaluation protocol: An assessment method that checks progress at multiple intermediate stages, not just the final output. Example: "multi-checkpoint evaluation protocol"
  • multimodal LM: A language model capable of processing and reasoning over multiple modalities (e.g., text and images). Example: "a multimodal LM"
  • O*NET database: A U.S. Department of Labor database cataloging occupations, tasks, and required skills. Example: "O*NET database"
  • open-source agent frameworks: Publicly available agent systems whose code and configurations can be inspected and modified. Example: "open-source agent frameworks"
  • order preservation (%): An alignment metric quantifying how often matched steps between two workflows appear in the same order. Example: "order preservation (%)"
  • program verifier checkpoints: Automated checks (often code-run) applied at various points to confirm correctness during task execution. Example: "program verifier checkpoints"
  • programmatic approach: Solving tasks primarily by writing and executing code rather than interacting through graphical interfaces. Example: "programmatic approach"
  • programmatic tools: Code-centric tools (e.g., Python, bash, HTML) agents use to carry out tasks instead of UI-based applications. Example: "programmatic tools"
  • ReAct: A prompting/agent technique that interleaves reasoning (“Thought”) and acting (“Action”) to guide tool use. Example: "ReAct program"
  • reinforcement learning: A learning paradigm where agents are trained via rewards for desired behaviors or outcomes. Example: "reinforcement learning with a final reward"
  • sandboxed environments: Isolated execution environments that host tools and websites while containing side effects for reproducibility and safety. Example: "sandboxed environments"
  • skill taxonomy: A structured categorization of skills used to analyze and cover work-related capabilities across occupations. Example: "Skill Taxonomy"
  • Step Delegation by Programmability: A strategy for assigning workflow steps to agents or humans based on how readily those steps can be programmed. Example: "Step Delegation by Programmability"
  • tool affordance: The design concept that the properties of a tool shape the ways users (or agents) can and tend to use it. Example: "tool affordance"
  • Trajectory Segmentation: The process of splitting raw activity sequences into meaningful sub-trajectories or steps. Example: "Trajectory Segmentation"
  • Workflow Alignment: The procedure and metrics for mapping and comparing steps between two workflows to assess similarity. Example: "Workflow Alignment"
  • Workflow Induction: Automatically transforming low-level actions and states into interpretable, hierarchical workflow steps with natural-language goals. Example: "Workflow Induction"