The Shift to Agentic AI: Evidence from Codex

Published 25 Jun 2026 in econ.GN | (2606.26959v1)

Abstract: We analyze usage data from OpenAI's Codex tool to present large-scale evidence of how agentic AI technology, which can take actions on a user's behalf, changes how people work. We use an automated, privacy-protecting pipeline to contrast usage across three populations: external personal-account users, external organizational-account users, and workers within OpenAI. We find that agentic AI usage is growing rapidly: the number of active users has grown more than fivefold in the first half of 2026, with the most rapid increase occurring outside the initial audience of software developers. Uptake is uneven: within OpenAI, Codex usage is nearly universal and has largely replaced business usage of ChatGPT. We document a similar shift to agentic tooling outside OpenAI, particularly within organizations, although external adoption remains lower and more uneven. In addition to headline usage figures, we observe measures of sophistication, and find that a growing number of users have used Codex to change their workflows substantially. More than 10% of users manage three or more concurrent Codex agents at some point each week, and 26.6% use skills, which allow users to share instructions for complex workflows. Alongside these changes in usage practices, request complexity has increased: since the start of the year, the share of individual Codex users who submit at least one request for a task estimated to require more than eight hours for an experienced human to complete has increased nearly tenfold. Concurrently, output has grown rapidly -- in June 2026, the median OpenAI employee in a legal role generated 13 times more monthly output tokens across Codex and ChatGPT than they did in November 2025, while the median researcher generated more than 50 times as many. We conclude by discussing the implications of these patterns for productivity, job reorganization, and workforce restructuring.


Authors (6)

Summary

The paper’s main finding is that agentic AI, as exemplified by Codex, shifts work delegation from routine tasks to complex, multi-step workflows.
The study employs automated classifiers on anonymized usage logs to segment user roles and quantify task complexity using token ratios and parallel task analysis.
Results suggest that organizational complements and systematized workflows are associated with rapid growth in AI-mediated output in both technical and non-technical domains.

The Shift to Agentic AI: A Comprehensive Analysis of Codex Usage and Implications

Introduction

"The Shift to Agentic AI: Evidence from Codex" (2606.26959) presents a rigorous empirical investigation into how the introduction and adoption of agentic AI technology, operationalized through OpenAI's Codex, is restructuring the landscape of knowledge work, software development, and workforce organization. The study leverages a large-scale, privacy-preserving analysis spanning individual, organizational, and OpenAI internal users, providing a granular and comparative perspective on usage dynamics as agentic tools become increasingly integrated into professional workflows.

Methodological Framework

The core methodology utilizes automated classifiers on anonymized usage logs to infer user roles, task types, seniority, and complexity of assigned tasks. By segmenting users into three populations—individual accounts, business/enterprise organizational users, and OpenAI employees—the analysis controls for context-specific adoption frictions, knowledge-sharing practices, and proximity to the AI development frontier. The paper's indicators include user counts, output token ratios, task-by-task breakdowns, concurrent agent management, workflow systematization via skills/plugins, and validation of complexity assessment pipelines.

Patterns of Adoption and Output

Adoption trajectories for Codex underscore both a rapid increase in active user counts and heterogeneity of integration depth across contexts. Codex’s penetration is most rapid and universal within OpenAI, where organizational alignment, lowered marginal cost, and extensive internal expertise yield nearly complete replacement of conversational ChatGPT for business usage. In this environment, Codex accounts for over 99% of output tokens, serving as the primary interface for substantive work.

Figure 1: Codex usage relative to ChatGPT, by user population.

Among organizational accounts, adoption is broader than among individual users and reflects an ongoing transition, with Codex processing the majority of output tokens, especially among technical job families. In contrast, while only a small fraction of individual users engage Codex, those who do are markedly more intensive users, leading to a disproportionate share of work—measured by output tokens—handled through the agentic interface.

Figure 2: Codex share of output tokens for the average user, split by worker type.
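The gap between "few individual users adopt Codex" and "Codex handles a disproportionate share of output" comes down to which average is taken. The sketch below contrasts a mean of per-user shares (one plausible reading of "the average user" in Figure 2) with a pooled token share. The record layout, function names, and numbers are illustrative assumptions, not the paper's pipeline or data.

```python
from collections import defaultdict

def per_user_codex_share(records):
    """records: list of (user_id, product, output_tokens) tuples (assumed layout)."""
    codex = defaultdict(int)
    total = defaultdict(int)
    for user_id, product, tokens in records:
        total[user_id] += tokens
        if product == "codex":
            codex[user_id] += tokens
    # Skip users with no output at all to avoid dividing by zero.
    return {u: codex[u] / total[u] for u in total if total[u] > 0}

def average_user_share(records):
    """Mean of per-user Codex shares: every user counts equally."""
    shares = per_user_codex_share(records)
    return sum(shares.values()) / len(shares)

def pooled_share(records):
    """Codex tokens over all tokens: heavy users dominate."""
    codex = sum(t for _, p, t in records if p == "codex")
    total = sum(t for _, _, t in records)
    return codex / total

# Made-up example: one heavy Codex user, two light ChatGPT-only users.
records = [
    ("u1", "codex", 9000), ("u1", "chatgpt", 1000),
    ("u2", "chatgpt", 500),
    ("u3", "chatgpt", 500),
]
avg = average_user_share(records)   # mean of 0.9, 0.0, 0.0
pooled = pooled_share(records)      # 9000 of 11000 tokens
```

In this toy data the average user routes 30% of output through Codex while roughly 82% of all tokens come from Codex, which is the kind of divergence the paragraph above describes for individual accounts.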

Growth is not confined to developers; automation is spreading among non-technical and managerial roles, especially where workflow systematization can amplify impact, such as in legal, HR, and business operations.

Evolution of Delegated Work and Task Complexity

The architecture of agentic AI enables users to delegate not just isolated queries but multistep, tool-using, result-modifying workflows. The study’s automated taxonomy reveals that although the leading edge of Codex usage remains in core software development (engineering ops, implementation, validation, and application management), the functionalities utilized extend markedly into non-coding knowledge work, especially among users with deeper adoption.

Figure 3 indicates a marked shift in task complexity among individual users. Since the start of 2026, the share of users submitting at least one task estimated to require more than eight hours of experienced human effort has increased nearly tenfold. This suggests the tool is moving from an assistant for incremental improvements toward a platform used for substantive projects.

Figure 3: Model-estimated complexity of Codex turns among Individual account users.
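The complexity metric is a per-user indicator: a user counts once if any of their requests exceeds the eight-hour threshold. A minimal sketch of that calculation follows, assuming each request already carries a model-estimated duration in hours; the data structure, function name, and numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def share_with_long_task(estimates_by_user, threshold_hours=8.0):
    """estimates_by_user: dict of user_id -> list of estimated human-hours per request.

    Returns the fraction of users with at least one request above the threshold.
    """
    if not estimates_by_user:
        return 0.0
    hits = sum(
        1 for hours in estimates_by_user.values()
        if any(h > threshold_hours for h in hours)
    )
    return hits / len(estimates_by_user)

# Illustrative periods with 20 users each (not the paper's data).
early = {f"u{i}": [0.5, 3.0] for i in range(20)}
early["u0"].append(12.0)            # 1 of 20 users has a long task

late = {f"u{i}": [1.0] for i in range(20)}
for i in range(8):
    late[f"u{i}"].append(10.0)      # 8 of 20 users have a long task

fold_change = share_with_long_task(late) / share_with_long_task(early)
```

Because the indicator is "at least one", the metric is sensitive to how many requests each user submits: heavier usage alone can raise the share even if the mix of request sizes is unchanged.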

This trend towards higher complexity is most noticeable at the initiation of new agent interactions, suggesting that users start with broader or ill-structured problems, subsequently narrowing focus through iterative refinement.

Workflow Transformation: Concurrency, Runtime, and Systematization

An agentic architecture facilitates concurrency and persistent delegation, shifting the human role towards high-level supervision, integration, and quality assurance. While most external users exhibit minimal parallel agent use, within OpenAI, the norm for intensive users is to manage multiple long-running agents executing in parallel—effectively operating as managers of lightweight, specialized agentic teams.

Figure 4: Codex turn concurrency among user groups, highlighting the prevalence of parallel workflows among OpenAI workers.
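Concurrency measures such as "three or more concurrent agents" can be derived from turn start and end times. The summary does not say how the paper computes this, so the following is a standard sweep-line sketch under the assumption that each turn is logged as a (start, end) interval; the function name and sample intervals are illustrative.

```python
def max_concurrent(intervals):
    """Peak number of overlapping intervals.

    intervals: list of (start, end) pairs with end > start, in any time unit.
    A turn that ends at time t is not counted as overlapping one that starts at t.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a turn begins
        events.append((end, -1))     # a turn finishes
    # Sort by time; at equal times, -1 sorts before +1, so ends are processed first.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

# Illustrative turns, in minutes since the start of a session.
turns = [(0, 10), (5, 15), (8, 12), (15, 20)]
peak = max_concurrent(turns)
```

Here three turns overlap between minutes 8 and 10, so this user would count toward a "three or more concurrent agents" threshold for that week.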

Additionally, cumulative daily runtime for Codex agents is several orders of magnitude greater at the extreme tail within OpenAI compared to external populations, indicative of workflows where technical and non-technical workers run numerous, possibly semi-autonomous, persistent agents.

Figure 5: Cumulative Codex turn duration per day, contrasting median and upper-tail users across account types.
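Cumulative runtime sums turn durations per user, so with parallel agents it can exceed 24 hours in a single day. The sketch below shows one way to compute per-user daily totals and compare the median with an upper-tail percentile. The input layout, the nearest-rank percentile definition, and the sample values are assumptions for illustration, not the paper's method.

```python
import math
from collections import defaultdict
from statistics import median

def daily_runtime_by_user(turns):
    """turns: iterable of (user_id, duration_seconds) for a single day."""
    totals = defaultdict(float)
    for user_id, seconds in turns:
        totals[user_id] += seconds
    return dict(totals)

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative day: two light users and one user running long parallel agents.
turns = [("a", 60), ("a", 120), ("b", 30), ("c", 50_000), ("c", 50_000)]
totals = daily_runtime_by_user(turns)
typical = median(totals.values())
tail = nearest_rank_percentile(list(totals.values()), 99)
```

User "c" accumulates 100,000 seconds in one day, more than the 86,400 seconds of wall-clock time, which is only possible with agents running in parallel.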

Systematized workflows are increasingly codified as reusable skills and plugins, reflecting a transition from ad hoc assistance to persistent workflow engineering. Skills not only leverage core Codex capabilities but also encode domain-specific procedural knowledge, tool integrations, and organizational norms, diffusing rapidly and demonstrating sharp differences in adoption rates between OpenAI and other populations.

Figure 6: The growth and distribution of skill use (systematized workflows) across time and user categories.

Task Domain and Persona Analysis

The persona classifier isolates developers, general knowledge workers, and personal users, finding accelerating adoption among the former two categories and across both technical and non-technical verticals. Task decomposition by role and seniority further demonstrates that, while the majority of Codex output in organizations is in complex engineering operations, the relative share for documentation, planning, research, and business workflows rises significantly in higher seniority and managerial positions, as well as in functions beyond engineering.

Figure 7: Task portfolio breakdown by persona for individual accounts, with non-developer growth observable.

Figure 8: Codex persona mix by account category, showing broadening beyond initial developer-centric adoption.

Evaluation and Implications

The paper presents robust evidence that agentic AI’s value proposition is fundamentally distinct from prior conversational paradigms: the critical margin is not merely “usage” but “work delegated”, measured in terms of complexity, duration, parallelism, and standardization. Especially within organizations that have invested in complementary assets (procedural knowledge, training, management buy-in), adoption leads to rapid growth in AI-mediated output—median monthly Codex/ChatGPT token volume increased 13-fold for legal job roles and over 50-fold for researchers within OpenAI over a seven-month period.

This pattern highlights a non-trivial dependency on organizational complements for realizing productivity gains: agentic systems yield transformative leverage only when supported by redesigned workflows, enabling repeatable, parallelized, and verifiable delegation. As adoption deepens in technical domains, the boundary of “agentic” work expands, with growing evidence of functional diffusion into planning, communication, analysis, and even organizational knowledge management.
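To put the 13-fold and 50-fold figures on a common scale, they can be converted to an implied constant monthly growth factor over the seven months from November 2025 to June 2026. This is a back-of-the-envelope sketch: it assumes smooth compounding, which the paper does not claim, and the 50-fold figure is a lower bound ("more than 50 times").

```python
def implied_monthly_factor(fold_change, months):
    """Constant monthly multiplier g such that g ** months == fold_change."""
    if fold_change <= 0 or months <= 0:
        raise ValueError("fold_change and months must be positive")
    return fold_change ** (1.0 / months)

MONTHS = 7  # November 2025 to June 2026

legal = implied_monthly_factor(13, MONTHS)
research_lower_bound = implied_monthly_factor(50, MONTHS)

print(f"legal roles: about {legal - 1:.0%} growth per month")
print(f"researchers: at least about {research_lower_bound - 1:.0%} growth per month")
```

Under that assumption, the median legal-role employee's output tokens grew by roughly 44% per month and the median researcher's by at least roughly 75% per month. These are growth rates in token volume, not in productivity.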

Theoretical and Practical Consequences

The acceleration in task complexity and workflow systematization portends substantial impending change in workforce organization. Roles are observed to shift from individual execution toward overseeing distributed portfolios of agentic activity. Supervision, verification, and coordination become central, increasing the value of domain expertise, oversight, and meta-level management skills relative to routine subtask execution. Job ladders, hiring criteria, team composition, and management structures will likely need to adjust rapidly as this shift matures.

The observation that productivity gains are highly concentrated among those who adapt and re-architect workflows—rather than merely substituting conversational AI with more powerful tools—suggests that measures of AI impact should transition from user engagement metrics to direct quantification of delegated work, systematization, and parallel agent deployment.

Future Directions

Future research should focus on isolating the causal impact of agentic AI on productivity at the organizational and societal levels, decomposing effects across substitution, augmentation, and redesign of work. Automating chains of interdependent tasks, formalizing agent-agent and agent-human interfaces, and studying emergent practices around verification and oversight will be crucial. There is also significant opportunity for studying how workforce reskilling must evolve, and how enterprise architectures co-evolve with increasingly agentic AI.

Conclusion

The evidence presented demonstrates that agentic AI systems, exemplified by Codex, are advancing the locus of AI adoption from interactive consultation towards full-scale delegated production. The profound increase in complexity, concurrency, and persistent workflow management among intensive users, most notably within OpenAI, provides a preview of anticipated transformations across knowledge industries. Unleashing the full general-purpose potential of agentic AI will require organizations to continually adapt workflows, institutional knowledge, and management structures, foregrounding delegation, integration, and oversight as central features of future work.


Explain it Like I'm 14

What is this paper about?

This paper looks at a new kind of AI called “agentic AI.” Unlike a normal chatbot that mainly talks with you, agentic AI can actually do things for you: open files, run tools, write and edit code, draft documents, analyze data, and more. The authors study how people are starting to use OpenAI’s agentic tool, Codex, and how that changes the way they work.

The big questions the authors asked

Who is using agentic AI, and how fast is it spreading?
What kinds of tasks are people giving to these AI agents?
How are people organizing their work around agents (for example, running multiple tasks at once)?
What does this shift mean for productivity and jobs?

How did they study it?

The researchers analyzed real-world, privacy-protected usage data from Codex. They compared three groups:

Individual users (people with personal accounts)
Organizational users (people using Codex through their company)
OpenAI employees (who are very familiar with advanced AI and had strong incentives and support to try Codex)

To understand what people were doing, they used automated systems (think of them as smart sorters) that:

Labeled tasks (e.g., coding, writing, data analysis) based on what users asked Codex to produce.
Estimated task complexity by asking, “How long would this take an experienced human without AI?” (for example, under 1 hour, 1–8 hours, or more than 8 hours, which is roughly a full working day).
Measured “output tokens,” which are like tiny pieces of text (similar to Lego bricks of language). Counting tokens is a way to measure how much work the AI produced.
Tracked how many agents people ran at the same time (parallel “threads”) and how long agents worked on a user’s behalf.
Noted the use of “skills” (reusable instructions for complex workflows), which is like saving a recipe so you can repeat a multi-step process easily.

All of this was done without reading private messages: the pipeline produced aggregated, anonymized statistics.

What did they find?

1) Adoption is rapid but uneven

Codex usage grew very quickly: the number of active users grew more than fivefold in the first half of 2026.
Inside OpenAI, use is almost universal and has largely replaced standard chat use for work.
In companies outside OpenAI, agentic AI is spreading and taking a big share of work, but it’s less widespread than inside OpenAI.
Among individuals on personal accounts, adoption is still early and patchy.
Measuring “how much work the AI did” (output tokens) shows an even sharper shift than just counting users—people who adopt Codex often use it heavily.

Why this matters: It shows we’re moving beyond chatting with AI and toward delegating real work—especially where organizations have the right tools, access, and training.

2) People are delegating real production work, not just asking for advice

Users tell Codex to do hands-on tasks: debug code, refactor programs, validate changes, configure apps, draft documents, and analyze data.
Over time, tasks got more complex. Many more users now ask Codex to handle jobs that would take an experienced human many hours or even more than a day.

Why this matters: Agentic AI is shifting from “answering questions” to “doing jobs,” which can change how people plan and execute their work.

3) It started with coding, but grows broader as adoption deepens

The biggest chunk of use is still software-related: writing code, understanding large codebases, testing, managing applications, and keeping systems running.
Where adoption is deepest (like inside OpenAI), Codex is also used for research, planning, communication, data work, recruiting, sales, and more.

Why this matters: Agentic AI can plug into the full software lifecycle and, when teams get comfortable, spreads into general knowledge work.

4) Power users run large, repeatable, and parallel workflows

Many users now run multiple agents at the same time (think: managing a small team of AI helpers in parallel). More than 10% of users run three or more agents concurrently at least once a week.
Heavy users rely on longer-running tasks, reusable “skills,” and complex chains of steps.
Inside OpenAI, this looks like a new way of working: delegate, monitor, review, and coordinate several agents at once—less “type a request, get an answer,” more “manage a mini factory.”

Why this matters: When people organize work this way, AI isn’t just a smart assistant; it becomes a system you manage—like a set of digital coworkers.

Extra signs of change

Task complexity jumped: the share of users attempting multi-hour and full-day tasks grew dramatically.
Output exploded: for example, by June 2026 the median OpenAI employee in a legal role produced 13 times more monthly AI output tokens than in November 2025; median researchers produced over 50 times more.

Why is this important?

This shift could reshape how work is organized:

Productivity: If agents can take on bigger, longer tasks, people can get more done—especially when they run several agents in parallel.
Job roles: Work may shift from “doing every step yourself” to “delegating, supervising, and verifying.” Skills like planning, reviewing, and domain expertise (knowing what “good” looks like) become more valuable.
Organizations: Big gains come when companies redesign workflows around agents—giving access to the right files and tools, setting up review steps, training people, and sharing best practices. That’s why OpenAI (with strong support and training) shifted faster than most.

In simple terms: Think of agentic AI as a team of reliable digital helpers. The people and companies that learn to plan, assign, and check their work can move faster and do more.

Bottom line

The paper shows a clear move from chat-style AI to agentic AI that actually does work. Adoption is booming where teams have support and the right setup, tasks are getting more complex, and power users are managing multiple agents at once. If organizations redesign their processes to fit this new reality—teaching people how to delegate, review, and coordinate—agentic AI could bring large and lasting productivity gains and change how many jobs are done.


Knowledge Gaps

Unresolved Knowledge Gaps, Limitations, and Open Questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future research:

Causal impact on productivity: No causal identification of how agentic AI affects task time, throughput, or quality (e.g., randomized access, staggered rollout, or instrumented feature gating to estimate treatment effects).
Output tokens vs. value: Reliance on token counts as a proxy for work/output lacks validation against business value, human time saved, or quality-adjusted deliverables.
Quality and error rates: Absence of measures of correctness, rework, defect rates, or downstream incidents resulting from agent-generated artifacts (e.g., code bugs, legal drafting errors).
Completion and success metrics: No tracking of whether agentic tasks reach successful completion, require human takeover, or fail silently; lack of “task success” KPIs for workflows.
Human oversight costs: Unmeasured verification, supervision, and coordination burden imposed on users managing agents; no time-use or attention-cost accounting.
Learning curves and skill acquisition: No longitudinal analysis of user learning, ramp-up dynamics, or which training interventions accelerate effective agentic use.
Confounding improvements: Inability to disentangle effects of model upgrades, UI/UX changes, pricing, and organizational initiatives on adoption and intensity.
External validity beyond OpenAI: OpenAI’s near-frictionless internal environment is acknowledged as atypical, but the paper does not quantify which complements (e.g., permissions, repo access, internal skills libraries) are necessary for replication in typical firms.
Organizational heterogeneity: Limited evidence on differences across industries, geographies, firm size/maturity, regulatory environments, and security postures; unclear barriers to adoption in less tech-centric contexts.
Job redesign and role evolution: No systematic measurement of how responsibilities, team structures, and managerial layers change when agentic workflows scale.
Seniority dynamics: While task mix by seniority is described, the causal implications for delegation patterns (e.g., manager-as-orchestrator vs. IC-as-implementer) remain untested.
Verification pipelines: Lack of concrete designs/benchmarks for effective review, audit trails, or automated guardrails that reduce supervisory burden without compromising safety.
Safety, compliance, and data governance: No analysis of data leakage risks, policy violations, or compliance incidents arising from agent tool access and file operations.
Economic tradeoffs and costs: No cost accounting for agent runtime, parallelism, or long-running tasks; missing cost-benefit analyses at user, team, and org levels.
Task taxonomy fidelity: Task labels are assigned from the initial request rather than the full action graph; execution-phase actions and outcomes may diverge from the labeled intent.
Classifier validity and bias: Automated classifiers (persona, job title, task type, complexity) have limited validation; potential misclassification across roles, domains, and languages is unexplored.
Complexity estimation scope: Task complexity is measured on a 0.1% sample of Individual users who opted in; no analogous estimates for organizational or OpenAI users, and no external ground-truth benchmarking.
Measuring actual “agency”: Tool invocation and thread structure are imperfect proxies for autonomous action; the paper does not specify or validate stronger agency metrics (e.g., autonomous toolchains, file diffs, multi-step execution without user input).
Concurrency interpretation: Parallel turns are measured, but cognitive load, interruption costs, coordination strategies, and the marginal value of additional concurrent agents are unmeasured.
Workflow persistence and reuse: Skill adoption is cited but not deeply analyzed; no measurement of retention, standardization, or performance gains from reusable skills/plugins over time.
End-to-end software outcomes: For software tasks, there is no linkage to repo-level outcomes (commit acceptance, rollback frequency, CI/CD failures, incident rates, mean time to restore).
Non-software work outcomes: For legal, sales, recruiting, and operations tasks, there are no domain-specific outcome metrics (e.g., contract accuracy, pipeline conversion, time-to-fill, SLA adherence).
Human-AI task boundary: Unclear which task segments (planning, implementation, validation) benefit most from delegation; no mapping of “automation adjacency” or chainable segments that drive the largest gains.
Inequality and distributional effects: No analysis of whether agentic AI widens performance dispersion across workers or firms, or how benefits accrue by skill level and role.
Retention and cohort dynamics: Adoption is presented cross-sectionally; missing cohort analyses of retention, escalation from conversational to agentic use, and saturation points.
Comparative benchmarks: No head-to-head, task-matched comparisons between Codex and conversational ChatGPT (or other agentic products) to quantify differential value on identical tasks.
Access and permissions constraints: The role of system access (files, repos, SaaS integrations) in enabling or constraining agentic workflows is asserted but not measured or experimentally varied.
Error recovery and escalation: No telemetry on how users detect, diagnose, and correct agent failures; absence of patterns or tools that reduce recovery cost.
Governance of shared skills: Unexplored questions about versioning, ownership, provenance, and quality assurance for shared skills within and across organizations.
Long-run dynamics: Short observation window (late 2025 to mid-2026) limits conclusions about durability, saturation, or post-novelty usage patterns.
Human factors and UX: No evaluation of the cognitive ergonomics of agent orchestration (thread design, notifications, dashboards) and their impact on outcomes.
Privacy-preserving analysis limits: The privacy pipeline restricts content inspection, which may systematically bias classification and task inference; alternative methods (e.g., secure enclaves, federated analytics) are not explored.
Policy and regulatory implications: The organizational prerequisites and controls needed to meet sectoral regulations (e.g., finance, health, public sector) are not examined.
Spillovers and complementarities: Interactions between agentic AI and adjacent tools (RPA, BI platforms, ticketing, design systems) and the resultant workflow synergies or conflicts are not measured.
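The first gap in the list above mentions staggered rollout as a route to causal estimates. The simplest version of that idea is a two-group, two-period difference-in-differences comparison, sketched below. The outcome variable (for example, tasks completed per week), the group labels, and the numbers are hypothetical, and the estimate is only meaningful if the two groups would have followed parallel trends without access.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical outcome: tasks completed per worker per week.
treated_pre = [10, 12]    # teams that later receive agent access
treated_post = [18, 20]   # same teams after access
control_pre = [9, 11]     # teams whose access is delayed
control_post = [12, 14]   # same teams over the same period, still without access

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
```

In this toy data the treated teams improve by 8 and the control teams by 3, so the estimated effect of access is 5 tasks per week. A real study would also need standard errors and a check of pre-period trends.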


Practical Applications

Overview

Based on the paper’s evidence about rapid but uneven diffusion of agentic AI (OpenAI Codex) across individuals, organizations, and OpenAI itself—and its documented shifts in delegation, task complexity, parallelization, and workflow design—the following practical applications emerge for industry, academia, policymakers, and daily life. Where relevant, sector links, prospective tools/products/workflows, and feasibility assumptions/dependencies are included.

Immediate Applications

These can be deployed now with today’s agentic capabilities and standard enterprise IT practices.

Industry

Software engineering life cycle automation (software)
- Use agents for code implementation, debugging, refactoring, code understanding, validation, engineering operations, and application management; integrate into CI/CD and repo workflows to generate/validate PRs, maintain documentation, and manage environments.
- Tools/workflows: agentic IDEs; CI bots that run tests/lint/refactor; repo-aware “skills” libraries for repeatable runbooks; threaded agents for parallel tasks (e.g., refactor + doc + tests).
- Dependencies: repo and ticketing access (GitHub/GitLab/Jira); permissioning and audit logs; human-in-the-loop code review; model/tool reliability; cost/latency budgeting.
DevOps and IT operations runbooks (software, IT)
- Encode common operational runbooks as reusable “skills” to standardize configuration, deployment, and incident triage; enable long-running agents for routine checks with human approval gates.
- Tools/workflows: “Skills” repositories; approval workflows in chat/ITSM; agent action logs; recovery playbooks.
- Dependencies: secure tool access; change management; rollback plans; observability integration.
Knowledge artifact production at scale (legal, HR/recruiting, sales/marketing, product)
- Draft and iterate memos, proposals, contracts, job descriptions, interview packets, sales collateral, and product requirement documents with agentic workflows that pull from internal files.
- Tools/workflows: document agents tied to drive/wiki/CRM; templated skill packs; parallel thread drafting and SME review.
- Dependencies: content governance; source-of-truth linking; PII redaction; review/approval policies.
Data analysis and reporting (analytics, finance, operations)
- Delegate spreadsheet transformation, EDA, charting, and recurring report generation to agents; use “skills” for repeatable pipelines; route higher-complexity requests to more capable runs.
- Tools/workflows: analysis agents tethered to BI/warehouse; scheduled runs; thread-based revision cycles.
- Dependencies: read-only access to data sources (initially); validation checklists; ACLs and privacy controls.
Parallelized task management for power users (cross-function)
- Adopt threaded, concurrent delegation (e.g., running 2–5 agents at once) for complex projects; monitor, review, and merge outputs rather than serial “ask–answer.”
- Tools/workflows: agent dashboards; progress status summaries; concurrency guardrails (compute/priority).
- Dependencies: user training on delegation/verification; clear escalation paths; compute quotas.
Adoption analytics and governance (enterprise IT, risk)
- Track shift from conversational to agentic interfaces using token-share metrics; establish autonomy tiers, tool-access policies, and auditability for agent actions.
- Tools/workflows: usage dashboards; autonomy level catalog; action ledgers; exception monitoring.
- Dependencies: logging standards; RBAC; data-retention policies; internal buy-in.
Delegation and verification training (L&D, HR)
- Teach employees to scope tasks, set acceptance criteria, design verification, and reuse “skills”; emphasize supervision, error handling, and parallel thread management.
- Tools/workflows: internal playbooks; code/doc review checklists; “prompt-to-skill” templates.
- Dependencies: time for upskilling; departmental champions; feedback loops.
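The "concurrency guardrails" mentioned under parallelized task management can be as simple as a cap on how many agent tasks run at once. Below is a minimal sketch using Python's asyncio.Semaphore. The `fake_agent` coroutine is a hypothetical stand-in for a real agent call, and nothing here reflects how Codex schedules work.

```python
import asyncio

async def run_with_cap(factories, max_concurrent):
    """Run coroutine factories with at most max_concurrent in flight.

    Returns (results in input order, peak number running at once).
    """
    sem = asyncio.Semaphore(max_concurrent)
    state = {"running": 0, "peak": 0}

    async def guarded(factory):
        async with sem:  # wait here if the cap is already reached
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
            try:
                return await factory()
            finally:
                state["running"] -= 1

    results = await asyncio.gather(*(guarded(f) for f in factories))
    return results, state["peak"]

async def fake_agent(name):
    # Hypothetical placeholder for a long-running agent task.
    await asyncio.sleep(0.01)
    return f"{name}: done"

def demo():
    names = ("refactor", "docs", "tests", "lint", "review")
    factories = [lambda n=n: fake_agent(n) for n in names]
    return asyncio.run(run_with_cap(factories, max_concurrent=2))

results, peak = demo()
```

Five tasks are submitted but no more than two run at once. The same pattern extends to per-user or per-team quotas by keeping one semaphore per quota holder.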

Academia

Privacy-preserving usage measurement and taxonomy research
- Apply the paper’s task-taxonomy and persona/job-title classifiers to study diffusion and task mix without inspecting content; measure complexity, runtime, concurrency.
- Tools/workflows: classifier prompts and labels; token-based intensity metrics; opt-in datasets.
- Dependencies: IRB approvals; anonymization pipelines; institutional data-sharing agreements.
Curricula for agent supervision and workflow design
- Incorporate agentic task design, verification, and multi-agent orchestration into CS/IS/business courses and capstones.
- Dependencies: access to agentic tooling; sandboxed repos/data; assessment rubrics focused on delegation.

Policy and Governance

Enterprise guidance on agent autonomy, logging, and access
- Issue internal standards for tool invocation, system access, and auditable action logs; define approval tiers by task risk.
- Dependencies: CISO/legal approval; secure integration; workforce communication.
Targeted upskilling support
- Leverage evidence that productivity gains concentrate where complements (skills, processes) exist to fund organizational training in delegation/verification.
- Dependencies: program funding; outcome metrics tied to adoption and quality.
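Approval tiers by task risk, as suggested in the enterprise guidance item above, can be expressed as a small policy table. The sketch below is purely illustrative: the risk rules, tier names, and the `AgentAction` fields are assumptions, and a real policy would be set by security and legal teams.

```python
from dataclasses import dataclass
from enum import IntEnum

class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass(frozen=True)
class AgentAction:
    tool: str                  # hypothetical tool name, e.g. "read_file"
    writes_data: bool          # does the action modify files or records?
    touches_production: bool   # does it affect live systems?

# Hypothetical mapping from risk level to the approval an action needs.
APPROVALS = {
    Risk.LOW: "auto",
    Risk.MEDIUM: "peer_review",
    Risk.HIGH: "manager_signoff",
}

def classify(action):
    """Assign the highest risk level that applies to the action."""
    if action.touches_production:
        return Risk.HIGH
    if action.writes_data:
        return Risk.MEDIUM
    return Risk.LOW

def required_approval(action):
    return APPROVALS[classify(action)]

read_only = AgentAction("read_file", writes_data=False, touches_production=False)
edit_doc = AgentAction("edit_file", writes_data=True, touches_production=False)
deploy = AgentAction("deploy", writes_data=True, touches_production=True)
```

Keeping the rules in one table makes the policy auditable: each logged agent action can be stored alongside the tier that applied when it ran.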

Daily Life

Personal productivity automation
- Use agents for drafting resumes/letters, organizing notes, managing small projects, and learning tasks; encode personal “skills” (e.g., budgeting template updates, study plans).
- Tools/workflows: file-linked agents; recurring checklist skills; parallel drafting + revision threads.
- Dependencies: cautious file permissions; verification; awareness of privacy and data-sharing settings.
Parallel microtasking with supervision
- Run 2–3 concurrent threads for discrete tasks (e.g., travel plan + email draft + budget update) and review outputs.
- Dependencies: user familiarity with threaded UI; time to validate outputs; tolerance for occasional error.

Long-Term Applications

These require additional research, scaling, integration, or organizational change; they follow from observed frontier usage (e.g., heavy concurrency, long-running tasks, complex delegation) and the paper’s emphasis on complements.

Industry

Enterprise-wide multi-agent orchestration platforms (software, IT, cross-function)
- Operate fleets of agents orchestrated by “supervisor” agents/humans, coordinating dozens of parallel threads across engineering, operations, and business functions.
- Tools/products: AgentOps platforms (scheduling, retry, dependency graphs, SLAs); hierarchical agent architectures; cross-system connectors (ERP/CRM/ITSM).
- Dependencies: robust verification pipelines; strong identity and fine-grained permissions; reliability/latency SLAs; cultural adoption.
Delegation-first workflow redesign and new roles
- Redesign jobs around delegation, verification, and coordination; formalize roles like “AI production manager” and “agent supervisor” overseeing throughput and quality.
- Tools/workflows: delegation Kanban; automated acceptance tests; throughput/quality dashboards.
- Dependencies: job architecture changes; incentives; performance evaluation aligned to oversight.
Sector-specific scaled deployments
- Healthcare: clinical documentation, prior authorization packets, and quality reporting, agent-generated with clinician verification.
  - Dependencies: HIPAA/GDPR compliance; EHR integrations; medical QA; liability frameworks.
- Education: large-scale tutoring, rubric-aligned grading assistance, courseware generation with instructor review.
  - Dependencies: academic integrity policies; LMS integrations; fairness/consistency checks.
- Finance: reconciliations, regulatory reporting, audit preparation with immutable action ledgers.
  - Dependencies: model risk management; SOX-compliant logs; segregation of duties; data lineage.
- Legal: e-discovery triage, contract lifecycle automation with redline explainability and approval workflows.
  - Dependencies: privilege protection; explainable diffs; firm-specific clause libraries.
- HR/Recruiting: sourcing, screening artifact prep, structured interview packs, candidate comms.
  - Dependencies: bias mitigation; consent; ATS integrations; auditability.
- Energy/Utilities: operations documentation, maintenance planning, grid optimization analyses.
  - Dependencies: secure OT/IT separation; systems models; fail-safe controls.
Verification-at-scale and safety pipelines
- Continuous verification, synthetic test generation, proof obligations for agent actions; automated “gatekeepers” for high-risk steps.
- Tools/products: test harnesses for non-code artifacts; policy checks; red-teaming simulators.
- Dependencies: domain test libraries; human adjudication; traceability.
Skill marketplaces and interoperability standards
- Internal/external marketplaces for reusable “skills” with versioning, permissions, and provenance; cross-platform skill standards.
- Dependencies: vendor-neutral formats; signing/attestation; governance for updates and deprecation.
Complexity-aware task routers and schedulers
- Route tasks by estimated human-hours and risk level to appropriate agents/humans; schedule long-running jobs for off-peak compute.
- Dependencies: accurate complexity estimation; queueing infrastructure; escalation rules.
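As a rough illustration of the complexity-aware routing idea above, the following minimal sketch assigns each task to a queue from an estimated human-hours figure and a risk label. Everything here is a hypothetical assumption made for illustration: the `Task` fields, the queue names, the three risk labels, and the 8-hour cutoff (chosen only because the paper uses an eight-hour task-complexity threshold in its reporting). It is not a description of how Codex or any existing scheduler works.

```python
from dataclasses import dataclass

# Hypothetical risk labels and queue names, chosen for illustration only.
VALID_RISKS = {"low", "medium", "high"}


@dataclass(frozen=True)
class Task:
    name: str
    est_human_hours: float  # estimated time for an experienced human
    risk: str               # one of VALID_RISKS


def route(task: Task, long_task_hours: float = 8.0) -> str:
    """Return the name of the queue this task should go to."""
    if task.risk not in VALID_RISKS:
        raise ValueError(f"unknown risk level: {task.risk!r}")
    if task.risk == "high":
        # Escalation rule: high-risk work always passes through a human gate.
        return "human_review"
    if task.est_human_hours > long_task_hours:
        # Long-running jobs are deferred to off-peak compute.
        return "agent_off_peak"
    return "agent_immediate"


if __name__ == "__main__":
    examples = [
        Task("email draft", 0.25, "low"),
        Task("repo-wide refactor", 12.0, "medium"),
        Task("production database migration", 3.0, "high"),
    ]
    for t in examples:
        print(f"{t.name}: {route(t)}")
```

Risk is checked before duration so that a short but high-risk task still reaches a human. A real router would also depend on how accurate the complexity estimate is, which the dependencies above list as an open requirement.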

Academia

Longitudinal studies on organizational complements and productivity
- Measure how redesigned workflows (parallelization, skill reuse, supervision intensity) map to firm-level outcomes; replicate across industries.
- Dependencies: multi-tenant data access; standardized metrics; cooperation from firms.
Benchmarks and evaluation frameworks for agentic tasks
- Create benchmarks for delegated multi-step tasks (beyond chat), including verification quality and supervision cost.
- Dependencies: reproducible task suites; measurement standards; community adoption.

Policy and Governance

Regulatory standards for agent action logging and accountability
- Mandate actionable, tamper-evident logs for agent actions; define responsibility in delegated workflows (human-on-the-loop).
- Dependencies: technical standards bodies; industry alignment; enforcement mechanisms.
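One common way to make a log tamper-evident is a hash chain, in which each entry stores the hash of the previous entry. The sketch below shows the idea under stated assumptions: the entry fields (`actor`, `action`, `prev_hash`, `hash`) and the function names are hypothetical, and no regulation or product is known to prescribe this exact layout.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(actor: str, action: str, prev_hash: str) -> str:
    """Hash the entry body with a stable JSON encoding."""
    body = {"actor": actor, "action": action, "prev_hash": prev_hash}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: list, actor: str, action: str) -> dict:
    """Append an entry that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "actor": actor,
        "action": action,
        "prev_hash": prev_hash,
        "hash": _digest(actor, action, prev_hash),
    }
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain links are intact."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["actor"], entry["action"], entry["prev_hash"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, "agent-1", "read file report.csv")
    append_entry(log, "agent-1", "ran tests")
    print(verify(log))            # chain is intact
    log[0]["action"] = "deleted file report.csv"
    print(verify(log))            # edit to an earlier entry is detected
```

A chain like this only detects edits if the latest hash is also recorded somewhere the writer cannot alter, for example with a separate auditor. Otherwise the whole chain could be recomputed after a change.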
Data privacy and access frameworks for agentic execution
- Update data-protection rules to cover agents that execute commands/read files; standardize consent and minimization for tool use.
- Dependencies: legal clarity; certifiable controls; third-party audits.
Workforce transition and reskilling programs at scale
- Invest in large-scale training for agent supervision, verification, and domain expertise; support transitions as roles shift.
- Dependencies: funding; credentialing pathways; outcome tracking.
Interoperability for tool/skill ecosystems
- Support open APIs and schemas so skills/agents interoperate across vendors and enterprise systems.
- Dependencies: open standards; vendor participation; security reviews.

Daily Life

Personal multi-agent “household ops” (longer horizon)
- Coordinated agents for finance management, home maintenance scheduling, learning plans, and trip logistics with shared context and calendars.
- Dependencies: secure integrations (banks, utilities, calendars); household policy settings; robust fail-safes.
Agent-mediated marketplaces and concierge services
- Agents negotiate appointments, purchases, and services under user policies, with receipts and audit trails.
- Dependencies: merchant APIs; identity/payment safeguards; dispute resolution frameworks.

Notes on Feasibility and Dependencies

Organizational complements are critical: the paper’s evidence shows adoption is deepest where training, access, and review processes exist (e.g., within OpenAI).
Secure integration is a gating factor: value depends on access to files, repos, and tools with proper permissions and auditability.
Verification and supervision are central: as complexity and autonomy increase, human review and automated checks determine realized productivity gains.
Workflow adjacency and chaining matter: productivity is highest when agents can execute contiguous task chains; fragmented processes limit gains.
Heterogeneous adoption persists: non-technical roles can benefit, but require tailored skills, templates, and training.
Cost, latency, and reliability constraints will shape parallelization and long-running agent use until infrastructure and models improve.


Glossary

Agentic AI: AI systems that can autonomously take actions on a user’s behalf, beyond simple conversation. "agentic AI technology, which can take actions on a user's behalf,"
Application management: Tasks related to configuring, operating, and maintaining software applications. "Engineering operations, code implementation, code understanding, application management, and code validation account for a large share of Codex activity across groups."
Business-function workflows: Delegated processes tied to specific business domains (e.g., sales, recruiting, marketing). "as well as broader knowledge-work categories such as data analysis, research, knowledge artifacts, collaboration, and business-function workflows."
Code implementation: Creating or modifying program code to add features or fix issues. "including code implementation, code understanding, code validation, engineering operations, and application management."
Code understanding: Analyzing existing code to comprehend behavior, structure, or dependencies. "including code implementation, code understanding, code validation, engineering operations, and application management."
Code validation: Verifying that code changes are correct, robust, and meet requirements (e.g., via tests, checks). "including code implementation, code understanding, code validation, engineering operations, and application management."
Concurrency: Running multiple AI tasks at the same time across different threads or agents. "Concurrency measures whether users run multiple Codex turns at the same time."
Delegated production: Having AI carry out concrete work tasks end-to-end rather than merely advising. "Codex use is strongly oriented toward delegated production."
Delegated workflow: A sequence of steps handed off to AI to execute autonomously toward a user-defined goal. "the relevant unit of analysis is a delegated workflow rather than a conversation."
Electrification: Historical shift to electric power used as an analogy for reorganizing production around new technology. "In the early stages of electrification, many factories replaced centralized steam engines with centralized electric motors while preserving existing factory layouts and work patterns."
Engineering operations: Operational tasks supporting software engineering (e.g., CI/CD, environment setup, repo management). "including code implementation, code understanding, code validation, engineering operations, and application management."
Extensive margin: Whether users adopt or use a tool at all, irrespective of intensity. "Panel A shows the extensive margin: whether active users of either product use Codex at all."
General-purpose technologies: Broad innovations that enable widespread changes in production and productivity. "the literature on general-purpose technologies suggests that the largest productivity gains often arise when firms reorganize production around the new technology rather than merely substitute it into existing workflows."
Human–AI collaboration: Joint work where humans and AI systems contribute complementary capabilities. "increased demand and skill complexity in jobs involving human–AI collaboration."
Intensive margin: How much a tool is used among adopters, often measured by output share or volume. "Panel B shows the intensive margin: the share of output tokens produced through Codex rather than ChatGPT."
Knowledge artifacts: Written or structured outputs that codify knowledge (e.g., docs, specs, reports). "Inside OpenAI, across developer and non-developer roles, knowledge artifacts, collaboration, and application management are common tasks."
Organizational complements: Processes, skills, and structures that firms must develop to realize value from new tech. "technology diffusion, organizational complements, and workplace change."
Output tokens: The model-generated token units used to quantify AI output volume. "Panel B shows the intensive margin: the share of output tokens produced through Codex rather than ChatGPT."
Persona classifier: An automated method to label users by usage persona (e.g., Developer, General Knowledge Worker). "We validated the persona classifier using a small sample of employees."
Runtime: The amount of time an agent is actively working on a user’s behalf. "Runtime measures how much active agent work occurs on a user's behalf."
Skills: Reusable, shareable instructions or integrations that encode complex, repeatable workflows. "skills, which allow users to share instructions for complex workflows."
Task complexity: An estimate of human time required to complete a delegated task without AI. "Task-complexity measures the estimated time it would take an experienced human to complete the tasks that users delegate."
Task taxonomy: A structured label space for categorizing tasks delegated to AI. "we classify Codex requests into a fixed two-level task taxonomy."
Technology diffusion: The spread of new technologies across users, firms, and contexts. "technology diffusion, organizational complements, and workplace change."
Threaded interaction model: Interface paradigm where multiple agent threads run independently and in parallel. "Codex, like many AI agents, uses a threaded interaction model in which users can initiate multiple agents and interact with each one in a largely independent workspace."
Tool invocation: Calls made by an AI agent to external tools or services during execution. "some tool invocations are part of simple conversational interactions"
Turn: A discrete unit of interaction or execution within an agent thread. "we calculate the number of overlapping turns they have in different threads"
Verification: Processes to check, review, or validate AI-produced work for correctness and quality. "making supervision, verification, and coordination central determinants of value creation"
Workflow system: A coordinated environment for delegating, monitoring, and integrating multiple streams of AI work. "Codex is less an assistant answering requests and more like a workflow system in which the user delegates, monitors, reviews, and coordinates multiple streams of work."
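The Concurrency and Turn entries above describe counting overlapping turns across different threads. A minimal sketch of one way to compute a user's peak concurrency follows, assuming each turn is available as a `(thread_id, start, end)` tuple with numeric timestamps. That input format, the function name, and the half-open treatment of intervals (a turn ending at time t does not overlap one starting at t) are assumptions made for illustration; the paper's exact procedure is not reproduced here.

```python
from collections import Counter


def max_concurrent_threads(turns):
    """Return the peak number of distinct threads with a turn running at once.

    `turns` is an iterable of (thread_id, start, end) tuples. Turns with
    end <= start are ignored.
    """
    events = []
    for thread_id, start, end in turns:
        if end <= start:
            continue
        events.append((start, 1, thread_id))  # 1 = turn starts
        events.append((end, 0, thread_id))    # 0 = turn ends
    # Sort by time; at equal times, process ends (0) before starts (1).
    events.sort(key=lambda e: (e[0], e[1]))

    active = Counter()  # thread_id -> number of turns currently running
    peak = 0
    for _, kind, thread_id in events:
        if kind == 1:
            active[thread_id] += 1
        else:
            active[thread_id] -= 1
            if active[thread_id] == 0:
                del active[thread_id]
        peak = max(peak, len(active))
    return peak


if __name__ == "__main__":
    turns = [
        ("a", 0, 10),
        ("b", 5, 15),
        ("c", 8, 9),
        ("a", 20, 30),
    ]
    peak = max_concurrent_threads(turns)
    print(peak, peak >= 3)  # threads a, b, and c all overlap between t=8 and t=9
```

Counting distinct threads rather than raw turns means two overlapping turns in the same thread count once, which matches the glossary's emphasis on turns "in different threads". The `peak >= 3` check mirrors the abstract's "three or more concurrent Codex agents" statistic.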
