Papers
Topics
Authors
Recent
Search
2000 character limit reached

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Published 5 Jun 2026 in cs.AI and econ.GN | (2606.07489v1)

Abstract: Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

Summary

  • The paper demonstrates that autonomous AI agents expand the set of feasible tasks, enabling higher-value and complex outputs.
  • It employs a knapsack optimization model and large-scale empirical data to show dramatic efficiency gains and cost reductions.
  • The study reveals that agent-driven autonomy reshapes task selection, promoting cross-occupational work and enhanced productivity.

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Introduction

The study "How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope" (2606.07489) systematically investigates the transition from conversational AI assistants to fully autonomous AI agents, analyzing empirical behavior and economic implications using large-scale production data from Perplexity's Search (representing conversational assistants) and Computer (representing general-purpose agent orchestration). By examining adoption, autonomy, efficiency, and scope, the paper establishes a comprehensive empirical and theoretical framework for characterizing agent-driven transformation in knowledge work.

Conceptual Framework

The authors model task completion as a knapsack optimization under resource constraints, distinguishing between conversational assistants and agents via contrasting cost structures—higher fixed delegation cost but lower marginal execution cost for agents. This economic abstraction leads to several testable predictions: (1) agent access will expand the set of feasible (affordable) tasks to those with higher step counts and value, (2) realized value and surplus are weakly increased, and (3) resource-constrained users will reallocate time toward more complex, higher-value outputs. Importantly, a sorting threshold emerges where agents are preferred for long-horizon, compositional tasks, while conversational assistants retain an advantage for atomic or one-off queries.

The adoption of agents is thus predicted to (a) restructure task selection, (b) fundamentally alter cost/value tradeoffs in knowledge-intensive domains, and (c) enable novel recombinations of vertical (complexity) and horizontal (occupational) task dimensions.

Adoption Patterns and Use Cases

Empirical analysis reveals aggressive adoption rates for Perplexity Computer, with cumulative queries exhibiting 84x growth in three months post-launch, significantly outpacing Search among power users. Figure 1

Figure 1: Cumulative adoption growth, with Computer queries exhibiting 84x growth relative to their first week.

Task-level classification (n=100,000 queries) indicates dominance of complex or productive categories, with Research and Analysis accounting for 25.8% and Document/Asset Creation for 18.6%. Subject domains are broad, spanning software, finance, marketing, healthcare, legal, and education, with a sizable share targeting structured artifact creation (documents, codebases, dashboards). Figure 2

Figure 2: Use case distribution for Computer—complex, generative, and artifact-oriented workloads predominate.

Autonomy and Workflow Transformation

Matched user-session analysis of near-identical initial queries demonstrates a 48x increase in machine execution time per session for Computer (~26 minutes) versus Search (~33 seconds), with session distributions for machine time nearly disjoint. Figure 3

Figure 3: Computer session runtimes are right-skewed and exponentially higher than Search, reflecting deep autonomy.

Domain-level analysis confirms this effect across all fields, ranging from a 26x (science/education) to 75x (local, finance) increase in autonomy. Figure 4

Figure 4: Average per-session machine execution time is dramatically higher for Computer across all knowledge domains.

Analysis of follow-up interactions uncovers substantive shifts: Computer substitutes clarification and drill-down for extension and artifact iteration, reflecting its capacity to deliver near-complete outputs autonomously. Despite increased autonomy, observed dissatisfaction rates on next-turn follow-up are 55% lower for Computer (1.3% vs 2.9%), invalidating common hypotheses of quality/autonomy tradeoff.

Gains in Efficiency

By operationalizing "Search+Human" (manual orchestration) and "Computer+Human" (agent-executed, human oversight) regimes for matched tasks, the authors estimate dramatic time and cost reductions. Across 18 knowledge domains, Computer+Human yields an 87% reduction in completion time (269 → 36 minutes average) and a 94% reduction in cost, mostly attributable to humans' avoided execution labor. Figure 5

Figure 5: Despite higher model inference costs, Computer+Human's dramatic shift of labor from humans to machines yields 87–96% reduction in overall cost across domains.

Sensitivity analyses (stress-testing per-tool time and oversight estimates up to 26x) confirm that the efficiency gains are robust to substantial variation in human-time/cost assumptions. The savings persist even under deflationary estimates of counterfactual human labor, with Computer's cost advantage dissipating only under implausibly aggressive inputs.

Expansion in Scope: Horizontal and Vertical Dimensions

Horizontal expansion: Agent-driven workflows induce a significant increase in cross-occupation task attempts. Analysis of 8,000 dual-product users reveals that Computer queries are, on average, 9 percentage points more likely to address tasks outside the user's primary occupational cluster (59% vs 50%). This effect is most pronounced in management, technology, arts/design, and healthcare. Figure 6

Figure 6: Agents drive more frequent cross-occupation work, especially in high-leverage domains.

Further analysis shows that, unlike Search, which channels cross-domain traffic almost exclusively toward Digital Technology (lookup, coding, technical operations), Computer distributes cross-boundary queries across core professional domains, indicative of higher-level delegation and occupational recomposition.

Vertical expansion: Along cognitive/complexity axes, Computer queries concentrate at the "Create" level (Bloom's taxonomy: 50% vs 26%) and are nearly 3x as likely as Search to require expertise in ≥3 O*NET knowledge domains. Figure 7

Figure 7: Computer queries are disproportionately abstract, generative, and compose higher-order cognition relative to Search.

Figure 8

Figure 8: Computer queries require broader and deeper knowledge domain expertise, especially in creative/executional fields.

Computer also generates markedly more composite tasks as measured by engaged O*NET activity units (per-query: 32–60% broader at increasingly fine task granularity; e.g., 3.81 vs 2.38 distinct occupation-specific task statements engaged on average). Figure 9

Figure 9: Per-query breadth of engaged O*NET activities is substantially higher for Computer, and the gap widens with task specificity.

Distinctiveness is further underscored by "new task unlocked" analysis: 23% of Computer queries engage at least one fine-grained (task-statement) activity essentially absent from Search usage—even among the same users. Figure 10

Figure 10: A large proportion of Computer queries perform tasks never present in equivalent Search logs for dual-product users, concentrated on fine-grained executional actions.

Implications and Future Directions

The data establish that agent-enabled autonomy (1) accelerates execution and reduces costs, (2) improves user-perceived output quality, and (3) substantively expands both the vertical (complexity) and horizontal (occupational) scope of work attempted by knowledge workers. Machine labor substitutes for manual human execution, shifting user time toward supervisory, verification, and extension roles.

The findings indicate a reduction in coordination costs, recomposition of job bundles, and erosion of traditionally rigid occupational boundaries—trends expected to yield both organizational and labor-market restructuring with increased agent integration.

Theoretically, the results validate the predicted cost-structure-driven expansion of the feasible task/value frontier due to the lowered marginal costs of autonomous agents. Practically, the magnitude of observed time and cost savings has direct implications for enterprise productivity, workforce composition, and tool design.

Future research should extend analysis to longitudinal firm-level outcomes, team-level production, and labor market displacement/creation dynamics, as well as aggregate agent augmentation effects on institutional and economic structures. The possibility of widespread task recomposition and cross-functional role consolidation warrants particular attention.

Conclusion

This study provides clear evidence that deployment of autonomous AI agents in real-world knowledge work sharply increases autonomy, reduces human execution burden, and enables users to attempt both more sophisticated and more cross-disciplinary work. The scalability and generality of these efficiency and scope expansions suggest significant downstream ramifications for the organization of knowledge work and occupational structures as agentic automation matures.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

Explain it Like I'm 14

Overview: What this paper is about

This paper asks a big question: how do new AI “agents” change the way people do knowledge work (things like researching, writing, coding, analyzing, and planning)? The authors compare two kinds of AI tools:

  • A conversational assistant (Perplexity Search), which answers questions and helps step by step.
  • An autonomous agent (Perplexity Computer), which can plan and carry out many steps on its own to deliver finished work.

Using real-world usage data, they show that agents don’t just answer questions—they execute multi-step tasks, save a lot of time and money, and let people tackle bigger and more complex projects.

Key questions the paper tries to answer

The paper focuses on four simple questions:

  1. Autonomy: Do AI agents actually do more of the work by themselves?
  2. Efficiency: Do agents finish tasks faster and at lower cost than people using a regular AI assistant?
  3. Scope: Do agents change the kinds of tasks people even attempt—across different jobs and difficulty levels?
  4. Economics: When do agents make sense? How do they shift what’s “worth doing” with a given budget of time and money?

How the researchers studied it

The authors combined a simple economic model with real usage data from Perplexity’s products.

  • A simple model of task costs:
    • Think of a task as a number of steps (like a recipe with many actions).
    • Using an agent has a higher “setup” cost (you spend time giving clear instructions and later checking the result).
    • But agents have a lower “per-step” cost because they carry out many steps on their own.
    • Bottom line: for short tasks, a quick chat assistant can be enough; for longer, multi-step tasks, an agent is usually better.
  • Real-world comparisons:
    • They looked at millions of sessions from Feb–May 2026.
    • To make fair comparisons, they paired near-identical initial queries from the same user sent to both tools (“matched pairs”). This means the same person asked the same question in both products.
    • They required the agent sessions to actually “do work” (like browsing, coding, creating files, calling external services), not just chat.
    • They measured:
    • How much “machine work time” the AI did per session (agents vs assistants).
    • What users asked in follow-up messages (clarifying vs extending vs verifying).
    • How often users showed dissatisfaction.
    • How long a human would take to do the same task with only the assistant (time and cost estimates from tools, an independent AI judge, and user interviews).
    • Whether agent tasks crossed job boundaries (e.g., a marketer doing light coding) and required higher-order thinking or multiple skills.

Main findings and why they matter

Here are the main takeaways, written in everyday terms:

  • Agents actually “do the work,” not just talk about it.
    • In matched tasks, the agent (Computer) performed about 26 minutes of autonomous work per session, while the assistant (Search) did about 33 seconds—about 48 times more machine work.
    • Agents also connected to other apps and services more often, chaining many actions together that a human would otherwise do manually.
  • Follow-up messages shifted from “tell me more” to “build more.”
    • With agents, users spent relatively more time checking and extending the output (e.g., “now add a section,” “export this”), and relatively less on back-and-forth clarifications.
    • This means the agent delivers more complete drafts up front, and the human moves into “editor/manager” mode.
  • Quality improved, not just speed.
    • Dissatisfaction rates per query were 55% lower with the agent than with the assistant in matched cases.
  • Big time and cost savings.
    • For the same tasks, humans using only the assistant took about 269 minutes on average.
    • With the agent, the combined agent+human workflow took about 36 minutes—a reduction of ~87% in time and ~94% in cost.
    • To match the cost of the agent+human approach, a human using only the assistant would have to complete all the manual steps in under 20 minutes, which is rarely realistic for multi-step work.
  • Agents expand what people try to do.
    • Across 8 job areas (like technology, finance, healthcare, education), people using agents more often crossed into tasks outside their usual field.
    • Agent tasks required higher-order thinking (more “create” and “analyze,” less routine lookup), drew on more areas of knowledge at once, and bundled multiple subtasks into a single request.
    • About 23% of agent tasks included fine-grained activities that the same users never attempted with the assistant—suggesting agents unlock new kinds of work, especially at the execution level (like generating code, documents, spreadsheets, or websites).
  • Adoption grew quickly.
    • After launch, agent usage grew fast, and people who adopted agents also used the assistant more—suggesting the tools can complement each other instead of replacing one another.

Why this matters: These results show agents don’t just speed up existing steps; they re-shape workflows. People can act more like small teams: the agent produces, and the person reviews, guides, and extends.

What it means going forward

  • Bigger tasks become “worth it.”
    • The model and the data agree: agents have a higher setup cost but far lower per-step cost. That means tasks with many steps—once too slow or expensive—become doable within a normal time or budget.
    • As a result, the “frontier” of affordable work moves outward: more ambitious projects fit into the same day.
  • Skills and roles shift.
    • Humans spend more time specifying goals, checking results, and deciding what to do next—in other words, managing the agent and verifying quality—rather than doing every step by hand.
  • Less coordination across multiple people for certain tasks.
    • Because agents bundle and execute many steps, individuals can handle tasks that used to require handoffs across different roles, lowering coordination costs.
  • Broader impact on jobs and organizations.
    • If individuals can cross boundaries and complete more complex, multi-skill work, teams may reorganize around oversight, judgment, and integration, not just step-by-step production.
  • Important limits and care points.
    • Not every task fits the agent model—short, simple tasks are still faster with a quick assistant.
    • Verification remains essential: agents are powerful but still need human review.
    • The study is based on one product ecosystem and on observed logs; results may vary by task type, domain, or tool.

In short: AI agents don’t just give answers—they carry out work. They save lots of time and money, produce higher-quality outputs, and encourage people to attempt bigger, more complex projects. As this shift continues, people’s jobs may involve more directing and validating, and less manual execution.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what the paper leaves missing, uncertain, or unexplored, framed to guide concrete future research.

  • Causal identification beyond matched sessions: Despite near-identical initial queries, users self-select which product to use and when; unobserved factors (e.g., urgency, stakes, user confidence) may still confound comparisons. A randomized assignment of tasks to Search vs. Computer (or encouragement designs) is needed to establish causal effects.
  • Representativeness of dual-product users: Analyses focus on users who adopt both products, likely skewing toward tech-savvy early adopters. It remains unclear how results generalize to typical users, late adopters, or low-skill populations.
  • Short observation window: The three-month post-launch window may capture novelty and rapid iteration effects, not steady-state behavior. Longitudinal studies should track learning curves, sustained usage, changing task mix, and durability of gains over 6–18 months.
  • Framework quantities unmeasured: The theory’s fixed delegation cost, marginal step cost, and task step counts are not empirically identified. Concrete measurement/estimation of fAgentf_{\text{Agent}}, fConversationalf_{\text{Conversational}}, mAgentm_{\text{Agent}}, mConversationalm_{\text{Conversational}}, and step counts per task is needed to validate the model and its sorting threshold ss^*.
  • Welfare and value are assumed, not measured: The core predictions about realized value, surplus, and value-to-cost ratios remain untested because task value is unobserved. Future work should elicit deliverable value via blinded expert ratings, downstream business metrics, or user willingness-to-pay.
  • Autonomy metric conflates efficiency and latency: “Machine execution time” (wall-clock) may mix productive work with retries, waits, or inefficiencies and underestimates parallelized work. More granular telemetry (e.g., successful vs. failed tool calls, retry rates, idle/wait time) is required to isolate true autonomy.
  • Runtime capping and parallelism: Capping at 3 hours truncates the right tail, and parallel execution means wall-clock undercounts total machine work. Tail behavior and full machine-time accounting (aggregate CPU/GPU-seconds) remain unknown.
  • Task equivalence beyond the first turn: Near-identical initial queries can diverge due to agent planning or tool suggestions, so matched sessions may not reflect the same completed task. Auditing deliverables for functional equivalence is needed.
  • Quality measurement is narrow: The paper uses a “next-turn dissatisfaction” proxy (incomplete in the excerpt) and lacks ground-truth accuracy, expert grading, or post-hoc human evaluations of artifacts. Failure modes (e.g., subtle errors, hallucinations, non-compliance) and their rates are not quantified.
  • Success/failure rates and rework: The share of tasks successfully completed, the extent of rework required, and downstream defect rates are unreported. Benchmarking task success under explicit acceptance criteria is an open need.
  • Cost model omissions: Efficiency analyses do not incorporate platform subscription fees, compute surcharges for long agent runs, overhead from integrations/security reviews, or user training time. A full cost accounting (including total cost of ownership) is needed.
  • Human-time estimation uncertainty: Mapping “do” tool invocations to human-equivalent minutes rests on unvalidated assumptions; the LLM-based time estimate lacks ground truth; user interviews (n=25) are small and subject to self-report bias. Larger observational time-and-motion studies are needed.
  • External validity across domains and languages: The study does not report performance by language, locale, or accessibility needs. Whether autonomy and quality gains persist in non-English or specialized regulatory environments is unknown.
  • LLM-based labeling reliability: Task types, domains, Bloom levels, and O*NET mappings rely on single-model classifications without reported inter-rater reliability, calibration, or human validation. Future work should report agreement metrics and use multi-annotator adjudication.
  • O*NET coverage limitations: Modern digital tasks may not map cleanly to O*NET hierarchies; “new tasks unlocked” may partly reflect taxonomy gaps rather than genuine capability expansion. Alternative and updated taxonomies should be tested.
  • Heterogeneity of effects: The paper does not quantify how gains vary by user skill, occupation seniority, task complexity, or project size. Stratified analyses (e.g., novices vs. experts, small vs. large tasks) are necessary.
  • Boundaries and failure cases: Conditions under which the agent underperforms (e.g., tasks beyond capability frontier, ambiguous objectives, high-stakes compliance) are not characterized. A “jagged frontier” analysis for agents is missing.
  • Risk, safety, and compliance: The incidence and impact of risky behaviors (e.g., unsafe actions via connectors, privacy breaches, biased outputs, insecure code) are not measured. Audits and red-teaming focused on agentic actions and connector usage are needed.
  • Connector ecosystems and reliability: Autonomy is partly driven by connector calls, but success rates, latency, brittleness, vendor differences, and error handling across connectors are not analyzed. Robustness across tool ecosystems is an open question.
  • Organizational-level impacts: The paper infers reduced coordination costs but provides no team- or firm-level outcomes (throughput, cycle time, quality, release cadence). Field experiments at team and project levels are needed to track how gains propagate.
  • Complementarity with existing tools: The DiD result (appendix) suggests complementarity with Search, but mechanisms (learning vs. task spillovers) are unclear and causal identification may rely on strong assumptions. Deeper causal analysis of complementarities and substitutions across tools is warranted.
  • Human oversight burden and cognitive load: While delegation reduces manual steps, the cognitive demands of scoping, monitoring, and verifying agent work are not quantified. Measuring oversight time, error detection costs, and reviewer fatigue is necessary.
  • Pricing and unit economics: The paper does not analyze how agent usage scales with pricing, compute costs, and billing models (e.g., per-minute vs. per-output). Demand elasticity and optimal pricing for long-horizon autonomy remain open.
  • Environmental footprint: Increased machine work per session likely raises compute and energy usage; environmental costs per task are not reported. Carbon/energy intensity comparisons between workflows are needed.
  • Security and data governance: Working across user environments raises security and privacy questions (e.g., data exfiltration, permission scopes). The incidence and mitigation of such risks are not quantified.
  • Generalization to other agent systems: Findings are from Perplexity’s stack during early post-launch; portability to other agent frameworks, models, and enterprise settings is untested.
  • Measurement under privacy constraints: The “no raw queries to analysts” policy safeguards privacy but limits deep content validation and error analysis. Methods for privacy-preserving ground-truth evaluation (e.g., secure enclaves, differentially private auditing) are needed.
  • Dynamic product improvements as confounders: Rapid model and product updates during the study window may confound estimates. Version-controlled analyses that lock model/tool versions are needed for clean before-after comparisons.
  • Downstream business outcomes: The paper measures task-level time reductions but not whether faster tasks translate into better project outcomes, revenue, customer satisfaction, or time-to-market in non-coding domains. Linking task gains to business KPIs remains open.

Practical Applications

Immediate Applications

The paper’s evidence that autonomous AI agents perform 26 minutes of machine work per session (vs. 33 seconds for assistants), cut task time by 87% and cost by 94%, reduce dissatisfaction by 55%, and expand cross-occupational scope yields the following deployable use cases.

  • Application: Agent-led research briefs and literature reviews (Sectors: academia, healthcare, legal, finance, media)
    • Value: Faster, higher-quality syntheses with citations; shifts human effort to verification and extension.
    • Workflow/tools: Perplexity Computer orchestrates search, browsing, source cross-checks, and file creation; pause-for-approval prompts; connectors to academic indexes and document stores.
    • Dependencies/assumptions: Access to licensed content; domain-expert verification; provenance-capture and audit logs for compliance.
  • Application: Document and asset creation pipelines (Sectors: marketing, sales, product, HR)
    • Value: Structured artifacts (briefs, policies, slide decks, emails) produced end-to-end; aligns with observed high share of document/asset creation.
    • Workflow/tools: Agent generates drafts in Docs/Notion/CMS via connectors; approval gates; iterative extensions.
    • Dependencies/assumptions: Brand/style guardrails; content review workflows; data-loss prevention controls.
  • Application: Market/technical due diligence and competitive landscaping (Sectors: finance, BD, legal, strategy)
    • Value: Comprehensive profiles and market maps with source triangulation; composite tasks bundled in one run.
    • Workflow/tools: Browsing + API calls, spreadsheet generation, memo drafting; verification turns to validate key claims.
    • Dependencies/assumptions: Data-license compliance; configurable source whitelists; risk disclosures.
  • Application: FP&A and operational reporting automation (Sectors: finance, operations)
    • Value: Month-end reports and commentary assembled autonomously; significant time savings on multi-step workflows.
    • Workflow/tools: Connectors to BI/warehouses (Snowflake/BigQuery), Excel/Sheets; templated narratives; pause for manager approval.
    • Dependencies/assumptions: Read-only, least-privilege access; reconciliations; segregation of duties.
  • Application: Engineering scaffolding and test generation (Sectors: software)
    • Value: Agents plan, scaffold features, generate tests, and draft PRs; humans supervise and merge (consistent with agentic coding gains in prior work).
    • Workflow/tools: Repo connectors (GitHub/GitLab), code execution sandboxes, CI hooks; approval checkpoints.
    • Dependencies/assumptions: Branch protection; secure token management; codebase size and context limits.
  • Application: Policy/SOP drafting and regulatory mapping (Sectors: compliance, healthcare admin, manufacturing)
    • Value: Drafts mapped to cited regulations; verification-focused human review.
    • Workflow/tools: Knowledge retrieval + drafting; cross-references; redlining in DMS.
    • Dependencies/assumptions: Attorney/compliance officer sign-off; up-to-date regulatory sources.
  • Application: Curriculum, lesson planning, and assessment authoring (Sectors: education)
    • Value: Create lesson plans, rubrics, quizzes; supports higher-order task creation observed in data.
    • Workflow/tools: LMS connectors; banks of aligned objectives; iterative improvements based on outcomes.
    • Dependencies/assumptions: Academic integrity policies; teacher oversight.
  • Application: Customer support knowledge base refresh and reply drafting (Sectors: customer service, SaaS)
    • Value: Faster article updates and suggested responses; verification/extension turns for correctness.
    • Workflow/tools: CRM/helpdesk connectors; log parsing; doc publishing via CMS.
    • Dependencies/assumptions: Access to tickets/logs; escalation rules; hallucination mitigation.
  • Application: Cross-functional project bundling by single operator (Sectors: SMBs, startups, freelancers)
    • Value: One person produces multi-asset packages (copy, landing page, email flows, analytics setup), reducing coordination costs.
    • Workflow/tools: Web/CMS builders, email platforms, analytics connectors; agent orchestrates; user verifies.
    • Dependencies/assumptions: API keys and account permissions; template libraries; clear acceptance criteria.
  • Application: Verification-first QA workflows for AI outputs (Sectors: all)
    • Value: Institutionalize approve/hold prompts and checklists, mirroring observed shift to verification and extension.
    • Workflow/tools: Pause-for-user states, automated checklists/tests, diff viewers, sign-off trails.
    • Dependencies/assumptions: Defined QA criteria; auditable logs; role-based access.
  • Application: Task router for “assistant vs. agent” selection (Sectors: software/IT, enterprise IT)
    • Value: Route short tasks to conversational assistants and long, multi-step tasks to agents, leveraging the step-count threshold insight.
    • Workflow/tools: Step-count or complexity estimators; policy engine; monitoring.
    • Dependencies/assumptions: Accurate task-complexity prediction; fallback paths; usage analytics.
  • Application: Cost/time accounting dashboards for AI runs (Sectors: operations, ITFM)
    • Value: Real-time ROI using per-session runtime and human/model costs; breakeven alerts (e.g., <20 min human threshold from paper).
    • Workflow/tools: Telemetry on agent runtime, tool calls, human approvals; cost attribution; reporting.
    • Dependencies/assumptions: Consistent logging; cost catalogs for human and model rates; tagging standards.
  • Application: Talent and HR workflows (Sectors: HR, recruiting)
    • Value: Draft job descriptions, screen resumes to criteria, generate interview guides; supervisors verify.
    • Workflow/tools: ATS connectors; rubric-driven scoring; structured outputs.
    • Dependencies/assumptions: Bias auditing; EEOC compliance; human decision-making retained.
  • Application: Public-sector grant and policy memo drafting (Sectors: government, nonprofits)
    • Value: Faster briefs and grant narratives; citations and data tables included; human verification loop.
    • Workflow/tools: Data source connectors; document generation; approval workflows.
    • Dependencies/assumptions: Privacy and data residency rules; procurement approvals; records retention.
  • Application: Personal productivity automations (Sectors: daily life, sole proprietors)
    • Value: Trip planning, vendor comparisons, resume tailoring, website setup with verification-first review.
    • Workflow/tools: Calendar/email/CMS connectors; checklists; budget/time calculators.
    • Dependencies/assumptions: Caution on high-stakes (legal/medical); explicit user permissions; privacy hygiene.

Long-Term Applications

These applications will benefit from further advances in reliability, integration, governance, and scaling, reflecting the paper’s long-horizon autonomy trajectory and broader scope shifts.

  • Application: End-to-end autonomous back office (Sectors: finance, HR, procurement)
    • Value: From data pulls to reconciliations and narrative commentary with minimal human touch.
    • Dependencies/assumptions: Robust auditability, segregation of duties, regulatory clearance, near-zero hallucination rates.
  • Application: Organizational redesign around “agent supervisors” and cross-occupational roles (Sectors: all)
    • Value: Individuals own composite workflows spanning multiple occupations (consistent with observed 9 pp cross-occupation and higher cognitive complexity).
    • Dependencies/assumptions: Change management, new performance metrics, labor agreements, upskilling programs.
  • Application: Marketplaces and routers for specialized agents with cost-aware bidding (Sectors: software platforms)
    • Value: Route tasks to the most cost-effective, capable agent; enforce SLAs.
    • Dependencies/assumptions: Interoperability standards (e.g., MCP), benchmarking, billing transparency.
  • Application: Regulated-domain certified agents (Sectors: healthcare, legal, finance)
    • Value: Agents that meet domain-specific safety/quality thresholds (e.g., clinical documentation + order sets; first-draft legal filings).
    • Dependencies/assumptions: Liability frameworks, certification regimes, human-in-the-loop mandates.
  • Application: Persistent, long-horizon autonomous projects (Sectors: R&D, software migrations, policy analysis)
    • Value: Agents run for days/weeks across workstreams, checkpointing progress and adapting plans.
    • Dependencies/assumptions: Reliable memory/state management, robust failure recovery, monitoring and pause/override controls.
  • Application: Standardized agent metrics and audits for procurement and compliance (Sectors: enterprise, public sector)
    • Value: Adopt “time horizon,” autonomy minutes, and dissatisfaction rates as purchase and oversight metrics.
    • Dependencies/assumptions: Industry consensus, third-party auditors, shared benchmarks.
  • Application: Education transformation with agent-tutor orchestration (Sectors: education)
    • Value: Mastery learning loops, individualized curricula, auto-grading with human spot checks.
    • Dependencies/assumptions: Pedagogical validation, equity considerations, academic integrity safeguards.
  • Application: Large-scale public service automation (Sectors: government)
    • Value: Benefits processing, FOIA responses, case triage with transparency and fairness constraints.
    • Dependencies/assumptions: Legacy system integration, privacy laws, bias mitigation and oversight.
  • Application: Real-world action via RPA/IoT/edge devices (Sectors: facilities, labs, logistics)
    • Value: Agents that act beyond software—instrument control, facility scheduling, routine operations.
    • Dependencies/assumptions: Secure physical interfaces, safety interlocks, incident response protocols.
  • Application: Updated risk, insurance, and compliance models for agent-run operations (Sectors: finance, cyber)
    • Value: New underwriting for agent-related operational/cyber risks; continuous control monitoring.
    • Dependencies/assumptions: Incident datasets, reporting standards, regulatory guidance.
  • Application: Labor-market policy responses (Sectors: policy)
    • Value: Targeted reskilling, wage insurance, portable benefits aligned with task recomposition dynamics.
    • Dependencies/assumptions: Funding and political will; longitudinal evidence on displacement vs. reinstatement.
  • Application: Enterprise data clean rooms and fine-grained permissions for agents (Sectors: enterprise IT)
    • Value: Secure, auditable agent access to sensitive data with context integration at scale.
    • Dependencies/assumptions: Robust identity/permissions, data residency controls, privacy-by-design.

Notes on Feasibility Assumptions

  • Task economics: Gains are largest on high-step tasks that amortize delegation overhead; task routers should estimate step/complexity to maximize ROI.
  • Reliability and safety: Verification-first workflows, audit trails, and human approvals are critical, especially in regulated or high-stakes contexts.
  • Integration: Benefits depend on connectors (APIs, MCP) to apps, data warehouses, repos, and content systems; permissions and security are prerequisites.
  • Measurement: Telemetry on autonomy time, connector calls, dissatisfaction signals, and costs is necessary for governance and continuous improvement.
  • Change management: Cross-occupational scope expansion implies training, new roles (agent supervisors), and updated performance and accountability frameworks.

Glossary

  • 0-1 knapsack problem: A combinatorial optimization problem where items with values and costs are selected to maximize value under a budget constraint, each chosen at most once. "The user therefore solves a standard 0-1 knapsack problem"
  • Affordable task frontier: The boundary of tasks that can be completed within a given budget or resource constraint. "agent access expands the affordable task frontier toward weakly higher-value tasks"
  • Agent orchestration: Coordinating multiple tools, steps, and sub-agents to autonomously plan and execute tasks. "the shift from conversational assistants to agent orchestration across a wide spectrum of knowledge work."
  • Agent orchestration system: A platform that manages planning and multi-tool execution across environments to fulfill delegated outcomes. "Computer is a general-purpose agent orchestration system that performs work across increasingly broad environments and long horizons."
  • Agent orchestrator: An agent that coordinates other tools or agents to carry out complex workflows. "Computer combines long-horizon asynchronous execution with even deeper and broader context integration as an agent orchestrator."
  • Agentic browser: A web browser that embeds autonomous agent capabilities to act within web contexts. "Perplexity's Comet agentic browser"
  • Answer engine (product category): A system that returns synthesized, cited answers to user queries rather than just links. "Perplexity Search introduced the answer engine product category: it allows a user to ask a question and receive a cited, synthesized answer from a knowledge base comprising billions of documents."
  • Application Programming Interface (API): A standardized interface for software to communicate with external services or applications. "via Model Context Protocol (MCP) or an Application Programming Interface (API) endpoint"
  • Asynchronous delegation: Assigning tasks to an agent to execute independently over time without continuous user interaction. "Our setting differs because Computer replaces the interactive loop with asynchronous delegation."
  • Autonomy premium: The cost advantage per step gained by autonomous execution compared to human-in-the-loop execution. "their gap is the autonomy premium."
  • Bloom's Revised Taxonomy: A framework classifying cognitive complexity levels from remembering to creating. "cognitive complexity (Bloom's Revised Taxonomy and routine versus abstract task types)"
  • Breakeven analysis: An evaluation identifying the point at which two options have equal cost or performance. "A breakeven analysis shows that a Search-aided human professional would need to complete all manual steps in under 20~minutes to match the cost of Computer~+~Human."
  • Capability frontier: The boundary delineating tasks that a model or system can perform successfully. "tasks within the model's capability frontier"
  • Composite tasks: Single requests that bundle multiple interdependent subtasks into one larger task. "take the form of composite tasks that bundle interdependent subtasks into a single query"
  • Connector call: An invocation from an agent to an external application or service via a connector. "7.9\% of Computer sessions invoke at least one connector call versus 1.8\% of Search sessions"
  • Cosine similarity: A metric for measuring similarity between two vector embeddings based on the cosine of the angle between them. "(cosine similarity >0.99> 0.99)"
  • Difference-in-differences design: A causal inference method comparing changes over time between treated and control groups. "we isolate the causal effect with a matched difference-in-differences design"
  • Dynamic programming: An algorithmic technique solving problems by breaking them into overlapping subproblems and reusing solutions. "We describe a standard dynamic programming solution in Appendix~\ref{app:dp}."
  • End-to-end work execution engine: A system that plans and completes tasks from start to finish with minimal human intervention. "The shift is from AI as a conversational assistant to AI as an end-to-end work execution engine, characterized by greater autonomy and deeper integration into the user's entire digital environment."
  • Fixed per-task cost: A one-time cost incurred for delegating or initiating a task, independent of the number of steps. "Each mode has a fixed per-task cost:"
  • Frontier AI systems: The most advanced AI models and products at the cutting edge of capabilities. "Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end."
  • Generalized Work Activities: High-level categories of work behaviors from the O*NET taxonomy used to describe job tasks. "Computer queries engage 2.95 of O*NET's Generalized Work Activities on average"
  • Human bottlenecks: Constraints in workflows where human involvement limits throughput or progress. "consistent with human bottlenecks limiting how much reaches shipped software."
  • Human-equivalent minutes: An estimate translating automated tool execution into the minutes a human would spend to perform the same steps. "The tool-based estimate sums human-equivalent minutes for each ``do'' tool invocation observed in the Computer thread"
  • Human-in-the-loop: A system design where humans participate in or supervise the execution of task steps. "the human-in-the-loop cost per step"
  • Indivisible (tasks): Tasks that must be completed in full to realize value; partial completion yields no value. "Equivalently, task opportunities are indivisible: a user attempts task jj in full or not at all."
  • Jagged frontier: The uneven boundary where AI excels at some tasks but fails at others of similar apparent difficulty. "a ``jagged frontier'' that underscores the importance of task--tool fit."
  • Long-horizon: Involving extended durations or many steps to complete tasks. "long-horizon asynchronous execution"
  • Model Context Protocol (MCP): A protocol for connecting models to external tools and data sources. "via Model Context Protocol (MCP) or an Application Programming Interface (API) endpoint"
  • Natural experiments: Observational settings where external conditions approximate random assignment, allowing causal inference. "using sessions with near-identical initial query pairs as natural experiments for the same underlying task"
  • O*NET Knowledge domains: Subject areas from O*NET describing the bodies of knowledge required for jobs. "each Computer query requires substantive expertise in an average of 2.40 distinct O*NET Knowledge domains"
  • O-ring model: An economic model of multi-step production emphasizing that failures in any step can reduce overall output. "extend the O-ring model of multi-step production to show that the return to automation is limited by the human bottlenecks in the process."
  • Occupation-specific Task Statements: Fine-grained task descriptors tied to specific occupations in O*NET. "and 60\% more occupation-specific Task Statements (3.81 vs.\ 2.38) engaged per query."
  • Task recomposition: The restructuring of job tasks into new bundles due to technology or process changes. "autonomous agent capabilities, and task recomposition."
  • Task-based framework: An analytical approach modeling work as collections of discrete steps or tasks with associated costs and values. "We adopt an individual-level task-based framework"
  • Time horizon metric: A measure of the duration at which agents achieve a target success rate on tasks. "\citet{kwa2025measuring} introduce the ``time horizon'' metric (the task duration at which agents achieve 50\% success rate)"
  • Usage-based measure: An empirical metric constructed from real usage data to assess exposure or impact. "\citet{massenkoff2026labor} introduce a usage-based measure of AI exposure to document a capability-deployment gap across occupations."
  • Value-to-cost ratio: A performance metric comparing the value generated to the cost incurred. "when the pre-agent budget binds, surplus and the value-to-cost ratio also weakly increase."
  • Wall-clock time: Actual elapsed time from start to finish, as experienced by users, regardless of parallelization. "For Computer, we compute per-turn wall-clock time"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 4 tweets with 475 likes about this paper.