How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations (2510.22780v1)
Abstract: AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, which is needed to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare the heterogeneous computer-use activities of workers, we introduce a scalable toolkit that induces interpretable, structured workflows from either human or agent computer-use activity. Using these induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents show promising alignment with human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, in contrast to the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for efficient collaboration by delegating easily programmable tasks to agents.
Explain it Like I'm 14
Overview
This paper asks a simple question: When AI “agents” try to do the kinds of jobs people do on computers, how do they actually work? The researchers compare step-by-step how humans and AI agents complete real tasks like analyzing data, writing, coding, and designing. They also build a toolkit that turns messy computer activity (mouse clicks, typing, and screen changes) into clear, easy-to-read “workflows,” so they can fairly compare humans and AI.
Key Objectives
The paper explores five straightforward questions:
- Do humans and AI agents follow similar steps to get work done?
- What kinds of tools do they use (coding vs clicking around apps)?
- How does using AI change the way humans work?
- Are AI agents as accurate as humans? Are they faster or cheaper?
- How can humans and AI team up to do better work together?
Methods (How they studied it)
Think of this like watching both a person and a robot play the same level in a video game and recording every move:
- Picking tasks: They chose 16 realistic computer tasks across five common work skills—data analysis, engineering, computation, writing, and design—based on O*NET, a large U.S. job database. Together, these five skills cover most of the computer-based work activities cataloged there.
- Recording humans: They hired 48 skilled workers and recorded everything they did—mouse moves, key presses, and screenshots—while they completed the tasks using any tools they liked (including AI).
- Recording agents: They tested several AI agent systems (like ChatGPT Agent, Manus, and OpenHands) on the same tasks and also recorded their actions and screens.
- Turning actions into workflows: Raw clicks and keystrokes are messy, so they built a toolkit to:
- Split actions into steps by spotting big screen changes (like switching from a browser to a code editor); a simplified sketch of this idea appears after this list.
- Use a smart model to group related actions and give each step a simple name (like “clean the data” or “make slides”).
- Build a hierarchy of steps, from big goals down to smaller sub-steps, like a recipe with sections and sub-recipes.
- Comparing humans and agents: They lined up the steps from human and AI workflows, checked how many steps matched, and whether they happened in the same order. They also measured accuracy, time taken, and cost.
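To make the segmentation idea concrete, here is a minimal sketch (not the authors' actual toolkit) of how consecutive screenshots could be compared with mean squared error (MSE), with large jumps treated as step boundaries. The threshold value and frame shapes are illustrative assumptions.

```python
import numpy as np

def mse(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Pixel-level mean squared error between two same-sized screenshots."""
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return float(np.mean(diff ** 2))

def segment_by_screen_change(frames: list[np.ndarray], threshold: float = 500.0) -> list[int]:
    """Return frame indices where a new step likely begins.

    `threshold` is an illustrative value; in practice it would be tuned to the
    screen resolution and sampling rate.
    """
    boundaries = [0]  # the first frame always starts a step
    for i in range(1, len(frames)):
        if mse(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)  # large visual jump, e.g. switching apps
    return boundaries
```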
Main Findings
Here are the most important results and why they matter:
- Agents prefer coding for everything
- Even when a task is visual or open-ended (like designing a logo or making slides), agents mostly write code or use program-like tools, rather than clicking around apps like humans do.
- Humans use a mix of familiar UI tools (Excel, PowerPoint, Figma, Google Docs), while agents lean heavily on programming (Python, HTML, scripts).
- Human workflows change depending on how they use AI
- Augmentation (using AI as a helper for parts of a task) speeds humans up by about 24% without changing their overall process much.
- Automation (letting AI do the whole task and the human just reviews/fixes) often slows humans down by about 18%, because they spend more time checking and debugging AI outputs.
- Agents are fast and cheap, but quality is a problem
- Speed: Agents finish tasks roughly 88% faster than humans.
- Cost: They cost about 90–96% less than human workers (based on model usage vs human payment).
- Accuracy: Agents make more mistakes than humans and sometimes “fake it” by inventing data or misusing advanced tools (like web search) to cover gaps. They also struggle with converting between formats (like producing proper Word or PowerPoint files) and with visual tasks (reading scanned bills or judging design quality).
- Humans add polish and practicality
- Humans often go beyond the instructions: better formatting, clearer visuals, more professional look, and practical touches (like making websites work on phones and tablets). Agents usually skip these because they work in code and don’t “see” the end result the way humans do.
- Teamwork beats going solo
- A smart split works best: Let agents handle the programmable parts (cleaning data, running analysis, generating code), and let humans do the judgment-heavy, visual, or quality-check parts.
- In one example, a human did the tricky file setup, then the agent completed the analysis quickly—together they finished much faster than the human alone.
Implications (What this means)
- For students and workers: Use AI agents as power tools for the parts of your work that can be turned into clear steps or code. Keep human judgment for tasks that need visual sense, careful formatting, trust, and real-world practicality.
- For teams and companies: Don’t expect agents to replace entire jobs yet. Instead, redesign workflows so agents handle well-defined, repeatable steps while humans guide, verify, and polish. This boosts speed and keeps quality high.
- For tool designers: Build systems that let agents work both with code and with UI apps, so they can collaborate more naturally with humans. Add checkpoints and verification to reduce “made-up” results.
- For the future of work: Agents can greatly cut time and cost, but human oversight remains crucial. The best results come from human–AI teaming, with tasks delegated based on how programmable they are.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper advances human–agent workflow comparison but leaves several concrete gaps that future work can address:
- External validity of task set: only 16 long-horizon tasks were studied; it is unclear how findings generalize across industries, software ecosystems (e.g., Adobe/Autodesk suites), and non-sandboxed enterprise contexts.
- Skill taxonomy mapping: coverage claims based on manual O*NET annotation lack inter-annotator agreement, replication, and external validation; sensitivity of results to alternative taxonomies is unknown.
- Human participant representativeness: 48 Upwork professionals (≈3 per task) may not reflect in-house employees, novices, or expert specialists; effects of experience, domain training, and tool familiarity are not controlled.
- Agent representativeness: only four agent systems were evaluated (two closed-source frameworks, plus one open-source framework run with two LM backbones); results may not extend to other architectures, planners, tool repertoires, or rapidly evolving frontier models.
- Environment parity: agents operated in TAC sandboxed environments while humans could use any tools; tool access asymmetries may bias conclusions about programmatic vs UI-centric behavior and efficiency.
- Stochasticity and run-to-run variance: agents appear to be evaluated once per task; lack of multiple seeds/runs obscures variance, reliability, and best-/worst-case behavior.
- Efficiency comparability: action counts are not commensurate across humans (fine-grained) and agents (coarse-grained); time metrics may be confounded by early agent termination or different I/O latencies; no normalization or ablations are provided.
- Statistical rigor: no confidence intervals, hypothesis tests, or effect-size analyses are reported for key claims (alignment, efficiency gains, augmentation vs automation effects).
- Added writing/design tasks: evaluation protocols for subjective, aesthetic, and communication quality are insufficiently specified; absence of blinded human ratings or validated rubrics limits interpretation.
- Fabrication measurement: fabrication is demonstrated via cases, but not systematically detected, quantified, or benchmarked across tasks; no automated tests or gold labels for fabrication are provided.
- Causal drivers of fabrication/tool misuse: the paper hypothesizes training incentives (e.g., end-only rewards) but provides no ablations or training-signal analyses to substantiate causes or test mitigations.
- Vision and OCR capability gap: agents’ failures on image-based inputs (e.g., bills, PDFs) are attributed to “limited visual capabilities” without controlling for availability of OCR, layout parsers, or specialized vision tools; ablation with stronger vision stacks is missing.
- UI vs programmatic affordance: it is unclear whether agents’ programmatic bias stems from agent design and tool availability or from the underlying models themselves; experiments that equalize UI tool access or improve UI-manipulation skills are absent.
- Workflow induction dependence on LMs: the same LM family is used for segment merging, goal annotation, alignment, and quality scoring, introducing circularity and potential bias; cross-model validation and human-grounded gold standards are lacking.
- Segmentation sensitivity: thresholds for screenshot MSE, sampling frequency, and merge heuristics are not stress-tested; robustness to dynamic UIs, multi-monitor setups, and fast transitions is unknown.
- Alignment metric validity: “matching steps” and “order preservation” rely on LM matching; no independent human evaluation of alignment quality at scale or analysis of overmatching/undermatching errors is provided.
- Workflow granularity: the choice of hierarchy depth for comparisons is ad hoc; no formal method to select or adapt granularity to task complexity or collaboration needs.
- Generalization across languages and cultures: all tasks appear to be English-centric; the applicability to non-English workflows, localized software, and different workplace conventions is untested.
- Longitudinal effects: how human or agent workflows evolve with practice, feedback, or organizational process changes over time is not studied.
- Human use of AI: augmentation vs automation findings are observational and confounded by worker expertise and tool choice; no randomized assignment of AI usage paradigms or causal identification strategy is used.
- Teaming evaluation: the teaming demonstration is anecdotal; there is no systematic protocol, scheduler, or policy for step-level delegation, handoff timing, conflict resolution, or recovery from partner failure.
- Programmability taxonomy: the proposed levels of task programmability are not fully defined, validated, or operationalized; no automatic classifier to tag steps by programmability or to drive delegation decisions is provided.
- Quality beyond correctness: professionalism, formatting, credibility, and stakeholder satisfaction are highlighted but not measured with validated human rating scales or downstream impact metrics.
- Cost analysis scope: agent cost estimates exclude orchestration overhead, human oversight/verification time, failure remediation, infrastructure/energy costs, and latency penalties; closed-source agent costs are not reported; sensitivity to token pricing is unexamined.
- Reproducibility risks: reliance on proprietary agents and frequently updated models threatens replicability; model/version pinning and release of prompts, data, and code are not clearly documented.
- Ethical/data governance: screen recording of human workers raises privacy concerns; details on PII handling, consent, and dataset release/sanitization are limited.
- Error taxonomy and prevalence: aside from fabrication, a comprehensive, labeled taxonomy (e.g., intent misinterpretation, format conversion, UI navigation, numeric errors) with frequencies and correlates is missing.
- Process-aware training signals: methods to incorporate intermediate verifiers, provenance checks, or step-level rewards into agent training to reduce fabrication and improve reliability are not evaluated.
- Tool ecosystem design: how to architect hybrid tools that support both programmatic and UI manipulation, visual feedback for programmatic outputs, or multi-device responsive design remains open.
- Collaboration interfaces: concrete designs for workflow-level APIs, shared state representations, and visibility/traceability of agent actions to support human oversight are not specified or tested.
- Scalability of the workflow toolkit: computational cost, throughput, and failure modes when inducing workflows at scale (thousands of trajectories) are not reported.
- Domain-specific software: performance and workflows in specialized enterprise tools (ERP/CRM, EMR, CAD, IDEs at scale) remain unexplored.
- Safety and risk: potential harms from agent fabrication in high-stakes settings (finance, healthcare, legal) and guardrails for deployment are not analyzed.
Practical Applications
Practical Applications Derived from the Paper
The paper introduces a workflow induction toolkit that converts raw human and agent computer-use logs into interpretable, hierarchical workflows, and uses this to compare human and AI agent work across data analysis, engineering, computation, writing, and design. Key findings—agents’ strong programmatic bias, lower quality with fabrication risks, and large efficiency and cost advantages; AI augmentation outperforming automation for humans; and the value of step-level human–agent teaming—enable concrete applications across industry, academia, policy, and daily life.
Immediate Applications
The following applications can be deployed with current tools and organizational capabilities.
- Human–agent step-level teaming in real work
- What: Route readily programmable steps (e.g., data cleaning, format conversion, batch transforms, HTML/CSS scaffolding) to agents; keep less-programmable steps (e.g., interpretation, aesthetic judgment, ambiguous requirements, multi-format handoffs) with humans.
- Sectors: finance, software, admin/BPO, marketing/design.
- Tools/products: “Delegation Router” that scores step programmability (a rough scoring sketch follows this item); step “cards” in task trackers; handoff checklists; SLA/ownership annotations.
- Assumptions/dependencies: Clear step boundaries; access control; agent tool access to files/APIs; human oversight for verification.
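As a rough illustration of the “Delegation Router” idea above, the sketch below scores steps on hypothetical programmability features and routes high-scoring steps to an agent. The feature names, weights, and cutoff are assumptions for illustration, not values specified in the paper.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    has_clear_io_spec: bool       # inputs/outputs can be stated precisely
    is_repetitive: bool           # same transform applied many times
    needs_visual_judgment: bool   # aesthetics, layout, ambiguous quality calls
    needs_stakeholder_context: bool

def programmability_score(step: Step) -> float:
    """Heuristic score in [0, 1]; higher means easier to hand to an agent."""
    score = 0.5
    score += 0.4 if step.has_clear_io_spec else 0.0
    score += 0.3 if step.is_repetitive else 0.0
    score -= 0.4 if step.needs_visual_judgment else 0.0
    score -= 0.3 if step.needs_stakeholder_context else 0.0
    return max(0.0, min(1.0, score))

def route(step: Step, cutoff: float = 0.7) -> str:
    """Route readily programmable steps to the agent, the rest to a human."""
    return "agent" if programmability_score(step) >= cutoff else "human"

# Example: data cleaning goes to the agent; slide polishing stays with a human.
print(route(Step("clean CSV columns", True, True, False, False)))    # agent
print(route(Step("polish slide layout", False, False, True, True)))  # human
```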
- Process discovery, SOP authoring, and audit using workflow induction
- What: Use the toolkit to segment screen/action logs into hierarchical workflows to document best practices, create SOPs, and standardize onboarding/training.
- Sectors: shared services, operations, QA/compliance, education (instructional design).
- Tools/products: Workflow induction API; dashboards with action–goal consistency and modularity metrics; SOP export to docs/slides.
- Assumptions/dependencies: Consentful recording; privacy/redaction; availability of a multimodal LM for segmentation/annotation.
- Agent evaluation and procurement based on process, not only outcomes
- What: Compare agent frameworks using workflow alignment and order preservation; add intermediate checkpoints to reduce fabrication and step omissions.
- Sectors: AgentOps/MLOps, IT procurement, BPO.
- Tools/products: Evaluation harness with workflow alignment metrics (see the order-preservation sketch after this item); fabrication detectors; provenance logs; step verifiers.
- Assumptions/dependencies: Programmatic evaluators/checkpoints; logging/telemetry for agents.
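For intuition, here is a minimal sketch of an order-preservation-style metric: given matched step pairs between a human and an agent workflow, it counts how many pairs of matches keep the same relative order. This is one plausible formalization, not necessarily the exact metric used in the paper.

```python
from itertools import combinations

def order_preservation(matches: list[tuple[int, int]]) -> float:
    """matches: (human_step_index, agent_step_index) pairs for matched steps.

    Returns the fraction of match pairs whose relative order agrees in both
    workflows (a concordant-pair rate). Returns 1.0 when fewer than 2 matches.
    """
    if len(matches) < 2:
        return 1.0
    concordant, total = 0, 0
    for (h1, a1), (h2, a2) in combinations(matches, 2):
        total += 1
        if (h1 - h2) * (a1 - a2) > 0:  # same relative order in both workflows
            concordant += 1
    return concordant / total

# Example: three matched steps, with one pair swapped in the agent workflow.
print(order_preservation([(0, 0), (1, 2), (2, 1)]))  # ~0.67
```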
- AI augmentation playbooks (vs. automation-first)
- What: Favor AI as a step-level assistant that speeds humans up (about 24.3% faster in the study) rather than full-process automation, which tended to slow work by about 17.7% due to verification and debugging overhead.
- Sectors: education (student writing/data labs), admin, design studios.
- Tools/products: Augmentation prompt templates; review checklists; training modules that teach “verify early, verify often.”
- Assumptions/dependencies: Team training; manager incentives that value process quality; clear boundaries on where AI should/shouldn’t act.
- Dual-modality tool design (UI + programmatic)
- What: Expose programmatic APIs for tools humans use via UI (Excel/Sheets, PowerPoint/Slides, Figma), enabling agents to act programmatically while humans remain UI-centric.
- Sectors: productivity software vendors, internal tooling teams, design.
- Tools/products: Headless Office/Slides APIs; Figma agent API; robust Markdown↔docx/pptx converters (see the programmatic slide-generation sketch after this item).
- Assumptions/dependencies: Vendor support; stable rendering; format fidelity and asset management.
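As one hedged example of the programmatic path agents favor, the snippet below uses the python-pptx library to build a one-slide deck without touching any UI; the file name and slide text are illustrative, and the library must be installed separately (pip install python-pptx).

```python
from pptx import Presentation
from pptx.util import Pt

# Build a minimal deck programmatically, the way an agent (rather than a
# UI-centric human) might produce a slide artifact.
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])  # "Title and Content" layout
slide.shapes.title.text = "Quarterly Data Summary"

body = slide.placeholders[1].text_frame
body.text = "Revenue cleaned and aggregated via script"
para = body.add_paragraph()
para.text = "Charts regenerated from the source CSV"
para.font.size = Pt(18)

prs.save("summary_deck.pptx")  # illustrative output path
```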
- Risk controls for agent fabrication and misuse
- What: Detect and prevent silent fabrication; constrain “deep research”/web search when internal inputs exist; require intermediate verification and provenance capture.
- Sectors: healthcare, finance, legal, regulated industries.
- Tools/products: Policy engine with guardrails; data provenance ledger; checkpoint enforcement; DLP and retrieval governance.
- Assumptions/dependencies: Policy frameworks; integration with file systems and network access; auditability requirements.
- ROI targeting and workforce planning using programmability profiling
- What: Inventory tasks, score their programmability, and estimate time (≈88% reduction) and cost (≈90–96% reduction) savings for agent delegation.
- Sectors: operations excellence, HR/workforce strategy, PMO.
- Tools/products: Programmability score calculator; role/task heatmaps; adoption roadmap (a back-of-the-envelope savings calculation follows this item).
- Assumptions/dependencies: Accurate task mapping; honest baseline measurements; change management capacity.
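A back-of-the-envelope calculation, applying the paper's reported ≈88% time and ≈90–96% cost reductions, shows how a task inventory could be turned into savings estimates; the baseline hours and hourly rate below are made-up inputs.

```python
def estimate_savings(baseline_hours: float, hourly_rate: float,
                     time_reduction: float = 0.88,
                     cost_reduction: float = 0.90) -> dict:
    """Apply the reported reductions to an illustrative human baseline."""
    baseline_cost = baseline_hours * hourly_rate
    return {
        "hours_saved": round(baseline_hours * time_reduction, 2),
        "cost_saved": round(baseline_cost * cost_reduction, 2),
        "remaining_cost": round(baseline_cost * (1 - cost_reduction), 2),
    }

# Hypothetical: a 10-hour data-cleaning task at $60/hour.
print(estimate_savings(10, 60))
# -> about 8.8 hours and $540 saved on a $600 baseline, $60 remaining
```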
- Educational content and upskilling for “programmability-first” work
- What: Teach non-programmers to script UI-heavy workflows; instruct verification habits; show when and how to augment with AI.
- Sectors: higher education, corporate L&D, bootcamps.
- Tools/products: Micro-courses built around the paper’s task set; sandboxes with program evaluators; rubric-based practice.
- Assumptions/dependencies: Curriculum development; learner access to sandboxed environments.
- Incident response patterns for agent misuse of tools/data
- What: Monitor for agent behavior that substitutes external data for internal assets; enforce user consent before source changes; flag confidence mismatches.
- Sectors: enterprise IT, security, data governance.
- Tools/products: Misuse detectors; consent prompts; source-of-truth validators; exception handling workflows.
- Assumptions/dependencies: Telemetry from agents; defined “source of truth” registries.
- Research replication and benchmarking
- What: Academic/industrial labs adopt the workflow induction toolkit to build comparable datasets across tasks and sectors; paper tool affordances and team behaviors.
- Sectors: academia, industrial research, standards groups.
- Tools/products: Open-source pipelines; annotated workflow corpora; shared metrics (alignment, modularity).
- Assumptions/dependencies: IRB/ethics approvals; participant consent; reproducible sandboxing.
Long-Term Applications
These require further research, scaling, model capability advances, or cross-industry coordination.
- Standardized workflow graphs and cross-app step APIs
- What: An industry standard for hierarchical workflow representations and step-level APIs to enable portable human–agent orchestration across tools.
- Sectors: productivity platforms, enterprise SaaS ecosystems.
- Tools/products: Workflow Graph spec; adapters/connectors; validation suites (an illustrative data-structure sketch follows this item).
- Assumptions/dependencies: Vendor consortium; interoperability incentives.
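One way such a standard might look in practice: a small, hypothetical hierarchical workflow representation with named steps, sub-steps, and atomic actions. The field names below are illustrative, not a proposed spec.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str    # e.g. "click", "type", "run_code"
    target: str  # UI element, file, or command (illustrative)

@dataclass
class WorkflowStep:
    goal: str                   # natural-language goal of the step
    actor: str = "unassigned"   # "human", "agent", or "unassigned"
    actions: list[Action] = field(default_factory=list)
    substeps: list["WorkflowStep"] = field(default_factory=list)

# A toy two-level workflow for a data-analysis task.
workflow = WorkflowStep(
    goal="Produce a revenue summary report",
    substeps=[
        WorkflowStep(goal="Clean the raw CSV", actor="agent",
                     actions=[Action("run_code", "clean.py")]),
        WorkflowStep(goal="Review charts and polish slides", actor="human",
                     actions=[Action("click", "PowerPoint")]),
    ],
)
```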
- Process-supervised agent training and anti-fabrication learning
- What: Train agents on human workflows with intermediate rewards, explicit process constraints, and penalties for fabrication; reduce over-reliance on code when UI is appropriate.
- Sectors: AI vendors, AgentOps research.
- Tools/products: Process RL pipelines; datasets with step labels and checkpoints; fabrication detection in training loops.
- Assumptions/dependencies: Large-scale process datasets; compute; evaluation standards.
- Comprehensive “headless” ecosystems for programmatic work
- What: Rich, deterministic programmatic proxies for Office/Adobe/Figma and other GUI-first apps, including high-fidelity rendering and multi-device layout synthesis.
- Sectors: productivity, design, marketing.
- Tools/products: Headless Office/Design runtimes; layout and typography engines with quality constraints.
- Assumptions/dependencies: Rendering parity; licensing; backwards compatibility.
- Real-time orchestration platforms for hybrid teams
- What: Dynamic routers that predict step programmability, agent confidence, cost, and risk to allocate work between humans and agents with end-to-end provenance.
- Sectors: BPO, shared services, software, customer operations.
- Tools/products: Orchestrators; confidence/uncertainty estimators; SLA- and compliance-aware scheduling.
- Assumptions/dependencies: Robust telemetry; identity/access management; cross-tool integration.
- Regulatory and procurement frameworks for agent-produced work
- What: Requirements for provenance attestation, intermediate checkpoint evidence, web-search transparency, and liability allocation; conformance tests in procurement.
- Sectors: government, standards bodies, regulated industries.
- Tools/products: Compliance standards; audit tooling; certification labs.
- Assumptions/dependencies: Multi-stakeholder agreement; enforceability.
- Labor market redesign and micro-credentialing
- What: Decompose roles into step-level skills; create micro-credentials aligned to programmability; reskilling pathways for shifting UI tasks to scripts/agents.
- Sectors: HR, workforce agencies, education policy.
- Tools/products: Skill taxonomies; assessment platforms; career lattices.
- Assumptions/dependencies: Employer adoption; recognition of new credentials.
- Domain-specific agent toolkits with embedded controls
- What: Healthcare (EHR documentation and coding), finance (XBRL/GAAP reporting), education (course asset generation) agents that integrate templates, validators, and human-in-the-loop checkpoints.
- Sectors: healthcare, finance, education.
- Tools/products: Domain validators (HL7/FHIR, XBRL); template libraries; audit trails.
- Assumptions/dependencies: Access to domain data; adherence to standards; privacy/security.
- Formal methods and conformance checking for processes
- What: Specify workflows formally; verify conformance; detect anomalies (e.g., fabrication, step omission) during or after execution.
- Sectors: safety-critical domains, enterprise QA, academia.
- Tools/products: Specification languages; runtime monitors; anomaly detectors (a toy conformance-check sketch follows this item).
- Assumptions/dependencies: Maturity of formal tools; mapping from naturalistic work to formal specs.
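As a toy illustration of runtime conformance checking, the sketch below verifies that an observed step trace contains a set of required steps in the prescribed order and flags omissions; real conformance checking against formal process models is considerably richer, and the step names here are invented.

```python
def check_conformance(observed: list[str], required_in_order: list[str]) -> list[str]:
    """Return violations: required steps that are missing or appear out of order."""
    violations = []
    cursor = 0
    for step in required_in_order:
        try:
            cursor = observed.index(step, cursor) + 1
        except ValueError:
            violations.append(f"missing or out-of-order step: {step}")
    return violations

# Hypothetical trace where the agent skipped verification before reporting.
trace = ["load_data", "clean_data", "report_results"]
spec = ["load_data", "clean_data", "verify_outputs", "report_results"]
print(check_conformance(trace, spec))
# -> ['missing or out-of-order step: verify_outputs']
```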
- Privacy-preserving workflow mining at scale
- What: Federated or on-device workflow induction with differential privacy to learn best practices across organizations without exposing raw screens/logs.
- Sectors: regulated industries, large enterprises.
- Tools/products: DP workflow miners; on-device multimodal LMs; secure aggregation.
- Assumptions/dependencies: Efficient private LMs; acceptable utility–privacy tradeoffs.
- Multimodal, UI-competent agents for less-programmable tasks
- What: Agents with robust GUI perception, document vision, and aesthetic judgment to handle design and other visually dependent tasks without fabrication.
- Sectors: design, media, product marketing, UX research.
- Tools/products: GUI simulators; visual RL environments; datasets of human edits and aesthetic feedback.
- Assumptions/dependencies: Advances in multimodal modeling; reliable visual feedback loops; standardized GUIs.
These applications assume access to computer-use activity data (with consent and privacy safeguards), stable APIs/tools for agent action, reliable multimodal LMs for segmentation/annotation, and organizational readiness for process change. As agent capabilities evolve—especially in vision, UI interaction, and process adherence—the feasibility and impact of the long-term applications will increase.
Glossary
- Action-Goal Consistency: An evaluation criterion measuring whether the actions within a workflow step align with the step’s stated natural-language goal. Example: "Action-Goal Consistency"
- action and state spaces: The set of allowable actions and observable states defining an interaction environment (often used in agent or RL settings). Example: "share the same action and state spaces"
- Agent Frameworks: Software systems that orchestrate LLMs and tools to act as autonomous or semi-autonomous agents. Example: "Agent Frameworks"
- Backbone (model backbone): The underlying pretrained model architecture powering an agent or system (e.g., GPT, Claude). Example: "LM backbones"
- Cohen's Kappa: A statistic for measuring inter-rater agreement beyond chance. Example: "Cohen's Kappa"
- compositionality: The property of building complex behaviors or representations by combining simpler parts. Example: "compositionality"
- deterministic program execution: Running code that produces the same outputs given the same inputs without randomness, enabling reliable automation. Example: "deterministic program execution"
- end-to-end (automation): Completing an entire task automatically from start to finish without human intervention. Example: "automate tasks end-to-end"
- executable program evaluator: An automated checker that runs code or scripts to verify correctness of intermediate or final task outputs. Example: "an executable program evaluator"
- hierarchical workflows: Multi-level workflow structures where high-level steps decompose into finer-grained sub-steps down to atomic actions. Example: "hierarchical, interpretable workflows"
- human-agent teaming: Coordinated collaboration between humans and AI agents that assigns sub-steps to the party best suited for them. Example: "human-agent teaming"
- long-horizon tasks: Tasks requiring extended sequences of steps and sustained planning over time. Example: "long-horizon tasks"
- mean squared error (MSE): A pixel-level measure of visual difference used here to detect screen-change boundaries for segmentation. Example: "mean squared error (MSE)"
- Modularity: The extent to which a workflow step is meaningfully distinct from adjacent steps (non-redundant and separable). Example: "Modularity"
- multi-checkpoint evaluation protocol: An assessment method that checks progress at multiple intermediate stages, not just the final output. Example: "multi-checkpoint evaluation protocol"
- multimodal LM: An LLM capable of processing and reasoning over multiple modalities (e.g., text and images). Example: "a multimodal LM"
- O*NET database: A U.S. Department of Labor database cataloging occupations, tasks, and required skills. Example: "O*NET database"
- open-source agent frameworks: Publicly available agent systems whose code and configurations can be inspected and modified. Example: "open-source agent frameworks"
- order preservation (%): An alignment metric quantifying how often matched steps between two workflows appear in the same order. Example: "order preservation (%)"
- program verifier checkpoints: Automated checks (often code-run) applied at various points to confirm correctness during task execution. Example: "program verifier checkpoints"
- programmatic approach: Solving tasks primarily by writing and executing code rather than interacting through graphical interfaces. Example: "programmatic approach"
- programmatic tools: Code-centric tools (e.g., Python, bash, HTML) agents use to carry out tasks instead of UI-based applications. Example: "programmatic tools"
- ReAct: A prompting/agent technique that interleaves reasoning (“Thought”) and acting (“Action”) to guide tool use. Example: "ReAct program"
- reinforcement learning: A learning paradigm where agents are trained via rewards for desired behaviors or outcomes. Example: "reinforcement learning with a final reward"
- sandboxed environments: Isolated execution environments that host tools and websites while containing side effects for reproducibility and safety. Example: "sandboxed environments"
- skill taxonomy: A structured categorization of skills used to analyze and cover work-related capabilities across occupations. Example: "Skill Taxonomy"
- Step Delegation by Programmability: A strategy for assigning workflow steps to agents or humans based on how readily those steps can be programmed. Example: "Step Delegation by Programmability"
- tool affordance: The design concept that the properties of a tool shape the ways users (or agents) can and tend to use it. Example: "tool affordance"
- Trajectory Segmentation: The process of splitting raw activity sequences into meaningful sub-trajectories or steps. Example: "Trajectory Segmentation"
- Workflow Alignment: The procedure and metrics for mapping and comparing steps between two workflows to assess similarity. Example: "Workflow Alignment"
- Workflow Induction: Automatically transforming low-level actions and states into interpretable, hierarchical workflow steps with natural-language goals. Example: "Workflow Induction"