Generative AI at Work

Published 23 Apr 2023 in econ.GN, q-fin.EC, and q-fin.GN (arXiv:2304.11771v2)

Abstract: We study the staggered introduction of a generative AI-based conversational assistant using data from 5,172 customer support agents. Access to AI assistance increases worker productivity, as measured by issues resolved per hour, by 15% on average, with substantial heterogeneity across workers. Less experienced and lower-skilled workers improve both the speed and quality of their output while the most experienced and highest-skilled workers see small gains in speed and small declines in quality. We also find evidence that AI assistance facilitates worker learning and improves English fluency, particularly among international agents. While AI systems improve with more training data, we find that the gains from AI adoption are largest for relatively rare problems, where human agents have less baseline training and experience. Finally, we provide evidence that AI assistance improves the experience of work along two key dimensions: customers are more polite and less likely to ask to speak to a manager.

Citations (283)

Summary

  • The paper demonstrates a 15% increase in resolutions per hour, particularly benefiting less experienced customer service agents.
  • The paper analyzes 3 million chat records to assess handle time, resolution rates, and customer satisfaction as key productivity metrics.
  • The paper finds a 38% adherence rate to AI recommendations, indicating that diligent use of AI accelerates agent learning and narrows skill gaps.

Generative AI at Work: A Study on Productivity and Work Experience in the Customer Service Industry

The paper "Generative AI at Work" presents an empirical analysis of the implementation of generative AI in the form of conversational assistants in a large-scale customer support setting. Using data from 5,172 customer service agents, the authors assess the impacts of AI-driven assistance on worker productivity, worker learning, and the overall experience of work. This study provides valuable insights into the role of AI technologies in workplaces, specifically within the customer service sector, which is known for its high adoption rates of AI.

Overview of Findings

The authors employ a comprehensive dataset of 3 million chat records to explore the effects of AI deployment on key productivity metrics: resolutions per hour, average handle time, chats handled per hour, resolution rates, and customer satisfaction. Their analysis reveals that AI assistance increases resolutions per hour by 15%, with less experienced and lower-skilled workers experiencing the most pronounced gains; AI assistance helps newer agents ascend the learning curve faster, significantly reducing their ramp-up time. More experienced agents, by contrast, exhibit smaller gains, suggesting differential impacts of AI assistance across skill levels.

Another pivotal finding concerns the dynamics of AI recommendation adherence. Agents exhibit a 38% adherence rate to AI suggestions, with those adopting recommendations more diligently reaping greater productivity benefits. Over time, adherence increases, particularly among initially skeptical or more experienced workers. This adherence pattern indicates a learning effect facilitated by AI advice, further corroborated by the observation that workers perform better even during system outages if they have had prior exposure to AI assistance.
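Because the rollout was staggered rather than simultaneous, effects like the 15% productivity gain are naturally estimated by comparing changes in treated groups against changes in not-yet-treated groups. A minimal difference-in-differences sketch, using hypothetical pre/post averages rather than the paper's data:

```python
# Minimal two-group, two-period difference-in-differences sketch.
# All numbers below are illustrative placeholders, not the paper's data.

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group net of the change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical mean resolutions per hour, before and after AI access.
effect = did_estimate(treated_pre=2.0, treated_post=2.4,
                      control_pre=2.0, control_post=2.1)
pct_gain = effect / 2.0 * 100  # relative to the treated baseline
print(f"DiD estimate: {effect:.2f} resolutions/hour ({pct_gain:.0f}% gain)")
```

The actual study uses regressions with agent and time fixed effects rather than this bare two-by-two comparison, but the underlying contrast is the same.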

Theoretical and Practical Implications

Theoretically, this study challenges the conventional narrative of skill-biased technological change, suggesting that generative AI may narrow the productivity gap between lower- and higher-skilled workers within the same occupation by disseminating best practices. This raises new questions about the interplay between AI tools and tacit knowledge, as these systems not only provide performance support but also act as vehicles for transferring skills and disseminating tacit knowledge.

Practically, the deployment of AI in customer service settings shows clear potential for transformative impact. Organizational strategies could be refined to prioritize high-impact AI training for less experienced workers, while also addressing potential declines in quality among top performers who over-rely on AI recommendations. These findings also suggest that AI can play a significant role in improving the overall employee experience by reducing negative customer interactions and enhancing communication skills, particularly for international agents.

Future Developments and Speculations

Looking forward, the implications of this study suggest several avenues for future research and policy development in the field of AI in the workplace. The long-term effects of AI on job design, employee satisfaction, and labor demand require further investigation. Future studies might also explore the macroeconomic effects of AI adoption across different sectors and occupations, focusing on how AI-facilitated learning can influence the trajectory of workforce development.

Moreover, questions regarding the compensation of workers who contribute to AI training datasets are becoming increasingly relevant. As AI systems evolve and potentially reshape the landscape of skill acquisition and deployment, it will be crucial for organizations and policymakers to develop frameworks that balance innovation with fair compensation and job satisfaction.

In conclusion, the paper "Generative AI at Work" provides a detailed examination of the first large-scale deployment of generative AI in a workplace setting, offering insights into the nuanced impacts of AI technologies on productivity and work environment. It contributes to our understanding of AI's role in modern work settings and paves the way for further exploration into the socio-economic implications of such technologies.

Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • External validity beyond a single firm and function: Does the observed 15% productivity gain generalize to other firms, industries, job types (e.g., phone support, sales, healthcare), languages, and cultural contexts?
  • Long-run impacts: How persistent are productivity and learning effects over multiple years? Do gains plateau or decay, and do agents deskill or overfit to AI prompts over time?
  • Labor-market outcomes: What are the effects on wages, bonuses, promotion rates, hiring composition, staffing levels, and overall labor demand at the team/firm level?
  • Firm-level performance: How do AI-assisted productivity gains translate into profitability, revenue, customer retention, and service-level agreements (SLAs)?
  • Cost–benefit evaluation: What is the return on investment once licensing, onboarding time, supervision changes, and model maintenance costs are netted against productivity and retention gains?
  • Identification and selection: Managers scheduled onboarding to minimize service disruptions rather than randomly; how much does non-random timing bias treatment effects despite difference-in-differences and team-level instruments?
  • Spillovers and contamination: Do treated agents affect untreated teammates via knowledge sharing, queue dynamics, or manager coaching reallocations, biasing estimated treatment effects?
  • Parallel trends validation: Are there pre-trend diagnostics at the agent or team level that support the difference-in-differences identification assumptions?
  • Outage-based learning inference: Are AI “outages” exogenous with respect to workload, time-of-day, or problem mix? Could correlated shocks confound the interpretation of durable learning?
  • Model drift and versioning: Did underlying LLMs or fine-tuning parameters change during the rollout, and if so, how do model updates affect estimated impacts?
  • Mechanism clarity: Through which concrete channels do gains arise (e.g., improved triage, better probing questions, faster document retrieval, tone/empathy templates, multitasking support), and how do these channels differ by skill level?
  • Adherence and anchoring: Why do top performers adhere to suggestions that marginally reduce quality? Is there AI-induced anchoring, loss of originality, or reduced exploration by experts?
  • Quality measurement gaps: Resolution rate and NPS are noisy and incomplete (especially for subcontractors); what happens to other quality measures like first-contact resolution, reopen/recidivism rates, refunds, escalations, or compliance errors?
  • Hallucination and error audits: How often are AI suggestions factually incorrect, off-topic, or misleading, and what is the downstream impact on customer outcomes and rework?
  • Topic rarity and difficulty: Gains appear larger on rare issues; are “rare” topics also harder? Can researchers separate rarity from complexity to understand where AI adds the most value?
  • Customer outcomes beyond sentiment: Do improvements in tone and reduced supervisor requests translate into lower churn, higher renewals, upsells, or long-run satisfaction?
  • Heterogeneity beyond experience/skill: How do effects vary by geography, shift (overnight vs daytime), customer segment, product line, or agent language proficiency levels?
  • International language effects: English fluency gains are suggested—are these validated with standardized language assessments, and do they generalize to non-English support channels?
  • UI and product design: Which interface/design choices (e.g., suggestion frequency, explanation visibility, ranking, timing) drive adherence and performance? Are there optimal thresholds for intervention?
  • Prompting and human-in-the-loop practices: How do prompting strategies, coaching on when to accept/reject suggestions, or calibrated confidence scores affect outcomes and reduce overreliance?
  • Organizational complements: Did coaching time, training curricula, or knowledge-base maintenance change post-adoption? What complementary investments are required to realize gains at scale?
  • Equity and bias: Do AI suggestions embed or amplify biases (e.g., against non-native speakers), and do gains differ systematically across demographic groups or contractors vs in-house staff?
  • Data governance and incentives: How should firms compensate top performers whose data improve the model? What governance prevents model degradation if expert behaviors are crowded out?
  • Privacy, security, and compliance: How are sensitive customer data used in fine-tuning handled, and what are the legal/ethical implications in regulated contexts?
  • Robustness to alternative interventions: How do AI-assistance gains compare to lower-cost alternatives (e.g., better scripts, enhanced knowledge bases, targeted coaching, or search tools)?
  • Workload and multitasking trade-offs: Does AI encourage higher concurrency (more simultaneous chats) at the expense of depth or accuracy, and where is the optimal balance?
  • Team and queue dynamics: With faster agents, do queue assignments, wait times, or customer selection change in ways that confound agent-level measures?
  • Saturation and scale: As adoption approaches 100%, do average effects change due to general-equilibrium adjustments in staffing, routing, or customer expectations?
  • Generalization to voice channels: Do similar effects hold in phone-based support where prosody and real-time pacing matter and textual suggestions are harder to use?
  • Safety nets and escalation: When the model withholds suggestions due to low confidence, how are those cases handled, and what is the impact on performance and learning?
  • Measuring originality and knowledge creation: Does reliance on AI reduce novel problem-solving by experts and slow discovery of new best practices or documentation updates?
  • Long-term retention and well-being: The paper notes short-run attrition changes; what are the long-run effects on burnout, stress, job satisfaction, and career progression?
  • Customer authenticity perceptions: Do customers detect templated or AI-influenced language over time, and does perceived inauthenticity eventually erode satisfaction?
  • Cross-functional effects: Do AI-assisted transcripts create spillover value for product, engineering, or marketing teams (e.g., faster bug detection) that alter the net impact?
  • Crisis-period confounds: The 2020–2021 rollout overlaps with pandemic-related shocks; to what extent do unusual demand patterns or remote work environments affect generalizability?
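Several of the identification questions above (parallel trends, pre-trend diagnostics) can be probed with a simple check: the treated-minus-control gap should be roughly flat in the periods before adoption. A minimal sketch, with hypothetical monthly averages standing in for real agent-level data:

```python
# Pre-trend diagnostic sketch: if the parallel-trends assumption is
# plausible, the treated-minus-control gap should be stable before
# adoption. All numbers here are hypothetical.

def pre_trend_gap_spread(treated, control):
    """Spread of the treated-minus-control gap across pre-adoption periods."""
    gaps = [t - c for t, c in zip(treated, control)]
    return max(gaps) - min(gaps)

# Hypothetical pre-adoption resolutions per hour, by month.
treated_pre = [1.90, 1.92, 1.91, 1.93]
control_pre = [2.00, 2.01, 2.00, 2.02]

spread = pre_trend_gap_spread(treated_pre, control_pre)
print(f"Pre-period gap spread: {spread:.3f}")  # small spread => flat pre-trends
```

A full event-study version would plot period-by-period coefficients with confidence intervals rather than a single spread statistic, but the diagnostic logic is the same.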

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, grounded in the paper’s findings (15% average productivity gain; ~30% for less-experienced agents; durable learning; improved English fluency; larger gains on rare issues; improved customer sentiment; fewer escalations; reduced new-worker attrition).

  • Deploy an LLM-based “copilot” for live chat support (customer service, software/BPO)
    • Tools/products/workflows: Fine-tuned LLM integrated into CRM/CCaaS (e.g., Salesforce Service Cloud, Zendesk, Genesys), offering real-time reply suggestions and knowledge-base links; human-in-the-loop edit/approve; confidence gating for low-data situations.
    • Assumptions/dependencies: Sufficient domain chat logs for fine-tuning; robust guardrails (RLHF, toxicity filters); PII redaction and compliance (GDPR/CCPA); agent uptake and UI integration; reliable latency and uptime.
  • Prioritized rollout to novices and lower-performing teams for maximum ROI (operations, workforce management)
    • Tools/products/workflows: License allocation rules favoring <6-month tenure and lower QA scores; staggered onboarding to maintain queue coverage; adherence analytics to identify who benefits most.
    • Assumptions/dependencies: Management willingness to target licenses; training capacity across time zones; change management support.
  • Accelerate onboarding and shorten the “experience curve” (HR/L&D)
    • Tools/products/workflows: Pair AI copilot with early training cohorts; embed exemplar prompts and tone templates; use AI-labeled transcripts for role-play and feedback.
    • Assumptions/dependencies: Curated prompt library; baseline training content; metrics to track time-to-proficiency and quality.
  • Best-practice diffusion and coaching at scale (HR/L&D, QA)
    • Tools/products/workflows: Use AI to surface “top-performer style” patterns (empathetic phrasing, better diagnosis), provide side-by-side rewrites, and build weekly manager coaching guides from chat logs.
    • Assumptions/dependencies: High-quality historical transcripts labeled for outcomes; manager bandwidth; privacy-safe analytics.
  • Language support for non-native English agents (education/L&D, global operations)
    • Tools/products/workflows: Inline rewrites for fluency and tone; microlearning modules generated from agent chats; “explain why” feedback to reinforce learning.
    • Assumptions/dependencies: Strong LLM English performance; culturally appropriate tone guidelines; non-punitive learning culture.
  • Rare-issue triage workflow (support operations)
    • Tools/products/workflows: Topic detection flags rare/low-frequency issues; route those cases to AI-augmented agents; AI retrieves niche documentation and past resolved cases.
    • Assumptions/dependencies: Topic taxonomy; metadata on issue frequency; knowledge base freshness; low-confidence fallback.
  • Customer experience uplift and de-escalation (CX)
    • Tools/products/workflows: Standardize empathetic/opening/closing phrasing; monitor sentiment and escalation rates; A/B test AI tone libraries for impact on NPS and supervisor-requests.
    • Assumptions/dependencies: Sentiment pipeline; QA acceptance of tone shifts; alignment with brand voice.
  • Scheduling and hiring strategy adjustments (HR/workforce planning)
    • Tools/products/workflows: Increase share of entry-level hires in AI-enabled queues; rebalance shift coverage given higher throughput; update bonus structures to avoid penalizing deliberate diagnosis.
    • Assumptions/dependencies: Local labor market supply; compliance with contracting terms; continuous monitoring of quality.
  • Measure “learning beyond assistance” using planned AI-off windows (analytics, academia/industry R&D)
    • Tools/products/workflows: Schedule brief, staggered “AI-off” periods (akin to outages) to estimate durable skill transfer; tie gains to prior adherence levels.
    • Assumptions/dependencies: Ethical/operational approval; careful scheduling to avoid service degradation; robust measurement.
  • Extend to adjacent channels (email/tickets/knowledge-base authoring) (IT helpdesk, enterprise support)
    • Tools/products/workflows: Outlook/Gmail plug-ins for suggested responses; ticket triage and summary generation; draft knowledge articles from resolved chats.
    • Assumptions/dependencies: Email/ticket data access; human review; documentation standards.
  • Operational guardrails and governance (software governance, policy compliance)
    • Tools/products/workflows: Confidence thresholds; block suggestions when training coverage is thin; automated hallucination checks; audit logs linking suggestions to outcomes.
    • Assumptions/dependencies: Model monitoring; calibration datasets; legal review; incident response playbooks.
  • Well-being and retention monitoring (HR)
    • Tools/products/workflows: Track reductions in abusive language and escalations as a wellness KPI; incorporate into retention risk dashboards; offer targeted support.
    • Assumptions/dependencies: Ethical sentiment analysis; informed consent; anonymization.
  • Public-sector and education service desks (government, education)
    • Tools/products/workflows: Deploy AI assistants for university IT/registrar helpdesks and municipal service portals with human oversight; pre-validated response sets.
    • Assumptions/dependencies: Privacy/security constraints; accessible language and multilingual needs; procurement frameworks.
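The confidence-gating and rare-topic guardrails described above can be combined into a single dispatch rule. A minimal sketch, where the `Suggestion` shape and the threshold values are illustrative assumptions rather than the system studied in the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Suggestion:
    text: str
    confidence: float    # model's self-reported confidence, 0..1
    topic_frequency: int  # number of historical training examples for this topic

# Illustrative thresholds; real values would be tuned against quality audits.
MIN_CONFIDENCE = 0.7
MIN_TOPIC_COVERAGE = 50

def gate_suggestion(s: Suggestion) -> Optional[str]:
    """Surface a suggestion only when the model is confident and the topic
    has enough training coverage; otherwise withhold it and let the human
    agent handle the chat unassisted."""
    if s.confidence < MIN_CONFIDENCE or s.topic_frequency < MIN_TOPIC_COVERAGE:
        return None
    return s.text

print(gate_suggestion(Suggestion("Try resetting the router.", 0.92, 400)))
print(gate_suggestion(Suggestion("Unverified fix.", 0.55, 12)))
```

In practice the withheld cases would also be logged, since (as the "Safety nets and escalation" gap above notes) how low-confidence cases are handled is itself an open question.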

Long-Term Applications

These opportunities require additional R&D, scaling, integration, or governance design before broad deployment.

  • Hybrid automation: AI handles common intents, humans handle edge cases (customer service, telecom, e-commerce)
    • Tools/products/workflows: Intent classification with automated workflows for high-confidence, frequent issues; dynamic handoff to humans with transcript summaries.
    • Assumptions/dependencies: High-precision intent models; safety nets for “jagged frontier” tasks; continuous evaluation to prevent quality drift.
  • Organizational redesign around a compressed productivity distribution (operations, HR)
    • Tools/products/workflows: Rebalance coaching from basics to advanced diagnostics; redesign KPIs to reward quality and appropriate skepticism; update bonus schemes to avoid over-alignment with AI when it reduces expert quality.
    • Assumptions/dependencies: Longitudinal QA data; change-management buy-in; union/contractual considerations.
  • Data contribution incentives for experts (policy, platform economics)
    • Tools/products/workflows: “Data dividends” or recognition programs for novel solutions; sampling strategies that overweight expert contributions; gated adherence nudges for experts.
    • Assumptions/dependencies: Legal frameworks for data ownership; privacy; platform accounting for contribution value.
  • Dynamic routing by “AI complementarity score” (workforce optimization, software)
    • Tools/products/workflows: Match cases to agents whose outcomes improve most with AI based on topic, language, and complexity; adaptive routing over time.
    • Assumptions/dependencies: Per-agent uplift models; fair allocation policies; monitoring for unintended bias.
  • Knowledge systems that prioritize rare problems (knowledge management)
    • Tools/products/workflows: Automatically elevate rare-issue solutions into curated articles; active-learning loops that seek supervision on low-coverage topics.
    • Assumptions/dependencies: Editorial workflows; human validation; feedback capture from resolved cases.
  • Sectoral expansion to regulated domains (healthcare admin, insurance, finance, legal intake, government service)
    • Tools/products/workflows: AI assistants for coded interactions (eligibility questions, claim status, benefits guidance) with strict human oversight and compliance templates.
    • Assumptions/dependencies: Domain-specific fine-tuning; audit trails; strict PII/PHI controls; liability frameworks.
  • Standardized evaluation for human–AI collaboration (academia, industry standards)
    • Tools/products/workflows: Benchmarks separating “within-capability” vs “outside-capability” tasks; adherence-quality tradeoff metrics; standardized A/B protocols with AI-off tests.
    • Assumptions/dependencies: Shared datasets; cross-firm collaborations; IRB/ethics for worker studies.
  • Multilingual and cross-cultural assistants (global operations)
    • Tools/products/workflows: Native-language suggestions and localized empathy styles; cross-lingual KB retrieval; dynamic tone calibration by locale.
    • Assumptions/dependencies: Strong multilingual LLMs; cultural validation; translation QA.
  • Well-being policy and occupational health (policy, HR)
    • Tools/products/workflows: Recognize AI-mediated de-escalation as a psychosocial risk mitigant; set guidelines for minimum guardrails to reduce abuse; possible incentives for adoption.
    • Assumptions/dependencies: Evidence generalization beyond the studied firm; stakeholder engagement; measurable standards.
  • Education and workforce development built around AI-mediated learning (education, public policy)
    • Tools/products/workflows: Micro-credentials for AI literacy in support roles; L2 English programs that use AI feedback loops; public subsidies tied to proven productivity gains.
    • Assumptions/dependencies: Curriculum design; equitable access; quality assurance.
  • “AI adherence dashboards,” confidence meters, and prompt libraries as products (software)
    • Tools/products/workflows: Commercial analytics to track when following AI helps/hurts; per-topic prompt/tone packs; configuration for brand voice and risk thresholds.
    • Assumptions/dependencies: Standardized APIs; vendor partnerships; privacy-preserving telemetry.
  • Continuous learning with privacy-preserving data pipelines (software infrastructure)
    • Tools/products/workflows: Federated or differential-privacy updates to incorporate new edge cases without exposing PII; automated regression tests for model updates.
    • Assumptions/dependencies: Mature MLOps; legal clearance; compute budgets.
  • Labor market and policy responses to shifting demand (macro policy)
    • Tools/products/workflows: Monitoring frameworks for wage/occupation effects; displacement mitigation (wage insurance, reskilling vouchers); procurement rules specifying human-in-the-loop in public services.
    • Assumptions/dependencies: Longitudinal labor data; political consensus; fiscal resources.
  • Cross-functional adoption (sales, HR, internal IT helpdesks) (enterprise software)
    • Tools/products/workflows: AI suggestions for objection handling in sales; HR ticketing and policy Q&A; internal IT support triage.
    • Assumptions/dependencies: Domain adaptation; clear escalation paths; quality controls.
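The "AI complementarity score" routing idea above could, in its simplest form, assign each incoming case to the available agent with the highest estimated uplift for that case's topic. A sketch under the assumption that per-agent, per-topic uplift estimates already exist (the values below are hypothetical placeholders):

```python
# Route each incoming case to the available agent whose estimated
# AI-assisted uplift is largest for that case's topic.
# Uplift values here are hypothetical placeholders.

uplift = {  # (agent, topic) -> estimated gain in resolutions/hour with AI
    ("agent_a", "billing"): 0.30,
    ("agent_a", "rare_bug"): 0.05,
    ("agent_b", "billing"): 0.10,
    ("agent_b", "rare_bug"): 0.25,
}

def route(topic, available_agents):
    """Pick the available agent with the highest estimated uplift for topic."""
    return max(available_agents, key=lambda a: uplift.get((a, topic), 0.0))

print(route("billing", ["agent_a", "agent_b"]))   # agent_a
print(route("rare_bug", ["agent_a", "agent_b"]))  # agent_b
```

A production version would also need the fairness monitoring the bullet above calls for, since greedy uplift routing can systematically concentrate easy or hard cases on particular agents.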

Each application above reflects specific findings of the paper (who benefits, when, and why) and notes key assumptions/dependencies that will influence feasibility and impact.
