
A paradox of AI fluency

Published 28 Apr 2026 in cs.CL | (2604.25905v1)

Abstract: How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes

Summary

  • The paper reveals that expert users, despite higher visible failure rates, achieve greater success on complex tasks through iterative correction.
  • It leverages a large-scale analysis of 26,958 conversations from WildChat-4.8M, employing multi-tier fluency assessments and a failure-mode taxonomy.
  • The findings emphasize the need for redesigned AI interfaces and user training to foster augmentative engagement and reduce undetected errors.

AI Fluency and the Paradox of User Expertise

Introduction

The study "A paradox of AI fluency" (2604.25905) investigates the interplay between user expertise ("AI fluency") and the actual outcomes users achieve with state-of-the-art conversational AI systems. Leveraging a large-scale, richly annotated sample from the WildChat-4.8M dataset, the authors analyze how fluency shapes not just task complexity and user-AI interaction modes, but also the nuanced taxonomy of failure modes that emerge. The central finding is a counterintuitive "fluency paradox": expert users encounter more overt failures than novices, yet their sophisticated engagement also yields greater success, especially on complex tasks, while minimizing the insidious, undetected ("invisible") failures that novices disproportionately face.

Data and Annotation Methodology

The dataset is sampled from WildChat-4.8M, comprising deidentified transcripts of user-AI chats with various iterations of GPT-3.5 and GPT-4 models. Analyses are based on 26,958 English-language conversations annotated monthly from May 2023 to July 2025. Annotations encompass:

  • AI Fluency: Multi-tiered assessment (minimal, low, moderate, high) grounded in a behaviorally explicit protocol distinguishing augmentative (iterative, collaborative) versus delegative (passive) engagement styles.
  • Task Complexity: Five-point scale incorporating facets such as cognitive challenge, domain expertise demands, ambiguity, and novelty.
  • Failure Modes: Extension of prior work [potts2026invisible], with LLM-in-the-loop pipelines tagging visible/invisible failures and classifying archetypes of invisible failure (e.g., Confidence Trap, Silent Mismatch, Death Spiral).
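To make these annotation layers concrete, the sketch below shows one plausible per-conversation record combining the three label families. This is a minimal illustration assuming a flat schema; the field names and enum values are inferred from the paper's descriptions, not taken from its actual pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FLUENCY_LEVELS = ("minimal", "low", "moderate", "high")   # multi-tier fluency assessment
INTERACTION_STYLES = ("augmentative", "delegative")       # engagement style
FAILURE_MODES = ("visible", "invisible", "mixed")         # basic failure classification

@dataclass
class ConversationAnnotation:
    """Hypothetical per-conversation record; names are illustrative assumptions."""
    conversation_id: str
    fluency_level: str                  # one of FLUENCY_LEVELS
    interaction_style: str              # one of INTERACTION_STYLES
    fluency_behaviors: List[str] = field(default_factory=list)       # e.g. "iterative_refinement"
    anti_fluency_behaviors: List[str] = field(default_factory=list)  # e.g. "passive_acceptance"
    complexity_score: int = 1           # five-point task complexity scale
    failure_mode: Optional[str] = None  # one of FAILURE_MODES, or None when no failure
    failure_archetypes: List[str] = field(default_factory=list)      # e.g. "The Confidence Trap"
```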

Special care is taken to control for exogenous data artifacts (spikes from viral prompting strategies, "Midjourney", and multi-user gaming dialogs, "Blockman"), focusing core analyses on a "Standard" subset cleansed of these phenomena.

Across all data variants, high-fluency users are rare and their proportion remains stably low over time, while the low- and minimal-fluency categories dominate usage and exhibit dynamic trends. The effect of exogenous events is substantial in shifting the apparent aggregate fluency distribution, underscoring the need to disentangle underlying user behaviors from transient viral interaction modes.

Figure 1: Overall fluency distribution across the "Standard," "Midjourney," and "Blockman" variants, highlighting the systematic impact of exogenous data subsets.


Figure 2: Temporal dynamics in fluency levels, exposing the influence of exogenous events and persistent prevalence of minimal and low fluency.

A salient result is the near-perfect alignment between high fluency and augmentative interaction style: 93% of high-fluency users demonstrate iterative refinement and critical engagement, versus under 1% among minimal-fluency users.

Figure 3: Interactional style distribution, with augmentative behaviors strongly concentrated in high-fluency users in the Standard dataset.

Behavioral analysis reveals that iterative refinement, active correction, and explicit goal negotiation are hallmarks of high fluency. Conversely, low-fluency users disproportionately display passive acceptance and anti-fluency behaviors.


Figure 4: Mean rates of fluency and anti-fluency behaviors across fluency levels.


Figure 5: Granular distribution of fluency and anti-fluency markers by fluency category, emphasizing absence of iterative strategies in low-fluency users.

The Fluency Paradox: Failure and Success

A central empirical contribution concerns the relationship between fluency and failure. Notably, both the rate and the type of failure vary systematically with fluency. High-fluency users exhibit sharply higher overall failure rates (64% for high fluency vs. 24% for minimal), yet these are dominated by visible, recoverable failures (59% of high-fluency failures are visible) arising from active engagement and correction. In contrast, minimal-fluency users experience predominantly invisible failures (86% of their failures are invisible), exiting the interaction with undetected errors or misaligned results.

Figure 6: Failure rates stratified by fluency. High-fluency users show more failures, with a greater proportion being visible.

Crucially, the increased overt failure among experts accompanies a pronounced willingness to attempt, and succeed at, substantially more complex tasks, evidenced by a 1.6-point mean complexity gap (3.1 vs. 1.5).

Figure 7: Relationship between task complexity, fluency, and success. High-fluency users succeed more often on substantially harder tasks.

Regression analysis confirms that fluency is a statistically significant positive predictor of both success and failure visibility (p < 0.01 for both), controlling for conversation length, complexity, and domain effects.
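The reported setup is a pair of generalized linear mixed-effects models with a random intercept per task domain, fit with the BOBYQA optimizer (which suggests lme4 in R). The Python sketch below is only an approximate analogue using statsmodels' variational Bayes mixed GLM; the formula reuses the fixed-effects predictors named in the paper, but the file name and exact encoding are assumptions.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# One row per conversation: binary `success` outcome, fluency assumed to be
# numerically encoded as a linear scalar (e.g. minimal=0 ... high=3), plus the
# other fixed-effects predictors named in the paper.
df = pd.read_csv("annotated_conversations.csv")  # hypothetical file

# Random intercept per task domain, approximating the paper's (1 | domain) term.
vc_formulas = {"domain": "0 + C(domain)"}

model = BinomialBayesMixedGLM.from_formula(
    "success ~ fluency_level + n_turns + complexity_score + fluency_behavior_count",
    vc_formulas,
    df,
)
result = model.fit_vb()  # variational Bayes; not identical to lme4 + BOBYQA
print(result.summary())
```

The paper's second model, predicting failure visibility rather than success, would be analogous with a different binary outcome.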

Subtlety in Failure Archetypes

Analyzing invisible failure archetypes reveals a qualitative shift linked to fluency. High-fluency users are mostly associated with recoverable failures ("partial recovery") and visible contradictions ("contradiction unravel", "drift"), while low-fluency users disproportionately fall into "walkaway" and "confidence trap" modes: failures that are undetectable to the user and leave intentions unfulfilled without overt signals.

Figure 8: Heatmap of failure archetype association across fluency levels, highlighting strong co-occurrence between high fluency and visible/partial recovery failures.
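The heatmap in Figure 8 is described as a PPMI reweighting of the archetype-fluency co-occurrence matrix. A minimal numpy sketch of that computation, on invented counts, might look like this:

```python
import numpy as np

# Rows: failure archetypes; columns: fluency levels (minimal..high).
# The counts below are invented purely for illustration.
counts = np.array([
    [ 2.0, 10.0, 40.0, 90.0],  # e.g. "The Partial Recovery"
    [80.0, 60.0, 20.0,  5.0],  # e.g. "The Confidence Trap"
    [70.0, 50.0, 15.0,  3.0],  # e.g. "The Walkaway"
])

def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive pointwise mutual information: max(0, log[P(x, y) / (P(x) P(y))])."""
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal over archetypes
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal over fluency levels
    with np.errstate(divide="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)  # clip negative associations (and -inf) to zero

print(ppmi(counts).round(2))
```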


Figure 9: Dataset-wide distribution of archetype labels, illustrating the overall landscape of failure types.

Implications for AI Product Design and Theory

These results foreground the critical, underappreciated role of user behavior in AI-mediated task outcomes. The paradox, that greater expertise yields greater visible failure yet also greater ultimate success, demands a rethinking of both AI system evaluation and product/interface design. The data indicate that "frictionless", passive user experiences amplify undetected failure and undermine the practical utility of even the best LLMs. Effective AI usage is fundamentally augmentative: it requires users to treat the AI as an imperfect collaborator, not a perfect oracle.

Practically, this suggests several prescriptions:

  • User Training: Develop interventions, educational resources, and feedback systems that cultivate augmentative, iterative engagement with AI, particularly among novices.
  • Interface Design: Move beyond assumptions of user passivity; embed scaffolding that encourages (or even requires) critical engagement, iterative prompt refinement, and explicit goal tracking.
  • Evaluation: Consider failure visibility and user interaction mode as core metrics in system benchmarks, not just aggregate success rates.

Theoretically, these findings point to AI capability and user interface design as only partial determinants of success; user cognitive stance and interaction style are equally pivotal. This challenges modeling assumptions that treat the user-AI interaction as a one-way function from prompt to output.

Prospective Directions

Given the persistence of passive engagement among the majority of users, future research directions include:

  • Design and randomized evaluation of educational or UX interventions to increase the prevalence of augmentative behaviors and monitor their impact on invisible failure rates.
  • Longitudinal tracking to assess whether iterative exposure or feedback improves fluency over time, and the sociotechnical factors that facilitate this development.
  • Systematic investigation of how model-side changes (e.g., more robust clarification seeking, proactive error signaling) interact with user fluency to shape outcomes.

Conclusion

This study demonstrates that the quality of outcomes in AI-driven conversation is determined not solely by system capabilities, but by user engagement, fluency, and interaction patterns. The paradoxical increase in overt failures among experts is a sign of effective, critical AI usage: experts expose and recover from failures more reliably and achieve higher success on challenging tasks, while novices risk undetected errors due to passive interaction. These insights have direct implications for the design and deployment of AI systems, suggesting a move away from passivity-oriented user experiences towards interfaces and educational strategies that actively promote iterative, collaborative engagement.


Explain it Like I'm 14

A simple guide to "A paradox of AI fluency"

Overview: What is this paper about?

This paper asks a big question: Does being good at using AI change what you get out of it? By studying thousands of real chat conversations between people and an AI, the authors show that skilled users act like active teammates with the AI, while beginners are more passive. Surprisingly, skilled users see more failures, but those failures are easier to spot and fix, and the skilled users also succeed more often on harder tasks.

What questions did the researchers ask?

They focused on three easy-to-understand questions:

  • Do more skilled users try harder, more complex tasks?
  • Do skilled users "work with" the AI differently than beginners do?
  • How do skill levels relate to failures, especially failures that are hard to notice, and to overall success?

How did they study it?

They looked at 27,000 anonymous ChatGPT conversations from a public dataset, covering 27 months (May 2023 to July 2025). Then they asked AI tools to label each conversation in three ways:

  • User fluency (how skilled the person seemed):
    • High, moderate, low, or minimal.
    • Interaction style: "augmentative" (treating AI like a thought partner) vs "delegative" (handing off the task and accepting whatever comes back).
    • Specific behaviors (like asking for clarification, iterating, or passively accepting answers).
  • Task complexity (how hard the task was):
    • A 1-5 scale, plus notes about what made it hard (like needing domain knowledge or having an unclear goal).
  • Failure type:
    • Visible failures: the problem is obvious in the conversation (for example, the user says, "That's wrong, try again").
    • Invisible failures: the chat looks fine, but the result is off-target or wrong (for example, the AI sounds confident but is incorrect, and no one notices).
    • They also tracked patterns like "The Confidence Trap" (AI is wrong but sounds sure) or "The Walkaway" (the user quietly gives up).

To avoid skewed results, they mostly excluded two unusual groups of chats:

  • "Midjourney" prompting scripts that went viral in 2023 (lots of similar prompts).
  • "Blockman Go" group chats that looked like an automated agent using the system.

Finally, they used a statistical model (think: a careful math check) to see whether user fluency still predicts success after accounting for other things like task difficulty, conversation length, and topic.

What did they find, and why does it matter?

Here are the main results, with simple examples and why theyโ€™re important:

  • Skilled users take on harder work.
    • On a 1-5 difficulty scale, the highest-fluency users tackled tasks averaging about 3.1 vs about 1.5 for the lowest-fluency users (a 1.6-point gap).
    • Why it matters: Skilled users push AI toward more ambitious goals.
  • Skilled users collaborate; beginners delegate.
    • 93% of high-fluency chats were "augmentative" (the user iterates, asks follow-up questions, sets constraints, and critiques the AI).
    • Under 1% of minimal-fluency chats were augmentative; these users mostly "delegate" and accept whatever the AI replies.
    • Why it matters: Treating AI like a teammate you guide leads to better outcomes.
  • The paradox: more skilled users see more failures, but that's good.
    • High-fluency users: 64% of their chats showed at least one failure signal.
    • Minimal-fluency users: 24% showed a failure signal.
    • Why this isn't bad: For high-fluency users, 59% of failures were visible (they noticed issues and engaged to fix them). For minimal-fluency users, only 12% were visible; most failures were invisible (they "looked" fine but missed the goal).
    • Translation: Beginners often walk away with results that seem OK but arenโ€™t. Experts catch and correct mistakes.
  • Skilled users are better at hard tasks and recovery.
    • They succeed more often on complex tasks.
    • They also show "partial recovery" patterns (they notice a problem, steer the AI, and get closer to the goal).
    • Why it matters: Real progress often comes from noticing errors early and iterating.
  • Fluency itself predicts outcomes, even after controlling for other stuff.
    • After accounting for task difficulty, conversation length, and topic, higher user fluency still:
    • Increases the chance of success overall.
    • Increases the chance that failures are visible (and therefore fixable).
  • Extra context: Who's using AI, and how is it changing?
    • High-fluency users are rare and stayed rare throughout the time period.
    • The number of minimal-fluency users grew over time.
    • Why it matters: Most people need help learning how to use AI effectively.

What does this mean for the future?

Here are the practical takeaways:

  • For everyday users:
    • Be an active partner. Ask the AI to clarify, add constraints, compare options, show sources, and check its work.
    • Don't accept the first answer. Iterate. If something feels off, say so and steer the conversation.
    • Think of AI like a smart but sometimes overconfident teammate. Your job is to guide and verify.
  • For teachers and trainers:
    • Teach "AI fluency" skills: how to refine prompts, question outputs, detect subtle misses, and recover from errors.
    • Emphasize that "sounds confident" doesn't equal "correct."
  • For product designers and AI builders:
    • Design for engagement, not just convenience. Interfaces should invite users to check, compare, and iterate.
    • Build features that surface uncertainty, encourage critique, and make it easy to revise.
    • Short-term "friction" (more steps) can improve long-term outcomes.

In short: Success with AI isn't just about how powerful the model is. It's also about how you use it. The people getting the most value don't passively accept answers; they actively work with the AI, catch problems early, and improve the results step by step.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves several important issues unresolved that future work could address to strengthen validity, causal claims, and generalizability.

  • External validity: Results are drawn from a free, English-only, deidentified ChatGPT dataset (WildChat-4.8M). It is unclear whether findings hold for paid/professional users, other platforms/models, or non-English interactions.
  • Model-version confounds: Analyses do not control for the specific LLM used (e.g., GPT-3.5 vs GPT-4 variants) or model updates over time, which could affect failure visibility and success rates independently of user fluency.
  • User identity and clustering: Lack of persistent user IDs prevents modeling user-level random effects or learning dynamics; per-user trajectories and within-user changes in fluency are unobservable.
  • Representativeness of the "Standard" subset: Excluding the large Midjourney and Blockman subsets improves internal focus but may reduce ecological validity; sensitivity analyses beyond fluency distributions (e.g., outcomes, failures) are not fully reported.
  • Non-toxic filter bias: Using only the non-toxic subset may remove important failure patterns (e.g., safety refusals, abuse), potentially underestimating or reshaping failure visibility and archetype distributions.
  • LLM-driven annotation reliability: Fluency, complexity, and failure labels are produced by LLMs; no human validation, inter-annotator agreement, or calibration is reported, leaving measurement error and bias unquantified.
  • Cross-construct dependence: The same or closely related LLM families annotate multiple constructs (fluency, complexity, failures), risking correlated errors or circularity in measuring predictors and outcomes.
  • Annotation pipeline consistency over time: Early months reuse failure labels from a prior study; later months are newly annotated. Cross-period consistency checks for annotation criteria and thresholds are not reported.
  • Construct validity of "fluency": The 17 fluency and 7 anti-fluency behaviors are LLM-defined; evidence that these behaviors measure a coherent, stable construct across tasks, domains, and time is not provided.
  • Complexity metric ambiguity: The paper alternates between "confidence_score" and "complexity_score" for a 1-5 scale; the exact variable used, its semantic meaning, and its validation are unclear.
  • Success definition tied to failure labels: "Success" is defined as the absence of any failure indicators, which could penalize high-fluency users who make failures visible and may not reflect whether the user's goal was actually achieved.
  • Ground truth outcomes: There is no independent ground truth (e.g., task completion judged by humans, external result checks, or user-reported outcomes) to validate success/failure labels, especially for "invisible" failures.
  • Domain confounds: Task domain is only a random intercept; domain-by-fluency interactions and domain-specific effects (e.g., coding vs writing vs brainstorming) are not analyzed, limiting actionable guidance by task type.
  • Causal inference: Observational correlations between fluency and outcomes cannot establish that augmentative behaviors cause better outcomes; randomized or quasi-experimental interventions are needed.
  • UI/product implications untested: The recommendation to "encourage friction" and deeper engagement is not evaluated via controlled UX experiments (A/B tests), leaving user satisfaction and long-term retention effects unknown.
  • Mediation mechanisms: Which specific fluency behaviors mediate improvements (e.g., iterative refinement vs verification steps) is not disentangled; mediation analyses are absent.
  • Ordinal treatment of fluency: Fluency levels are encoded as a linear scalar in GLMMs; ordered-logit models or non-linear effects (e.g., thresholds) are not explored.
  • Missing interactions: Potential interactions (e.g., fluency × complexity, fluency × number of turns) are not modeled, obscuring how fluency advantages vary with task difficulty or conversation length.
  • Statistical uncertainty: Figures and PPMI heatmaps lack confidence intervals or significance testing; small high-fluency sample sizes may inflate apparent associations.
  • Archetype stability: The prevalence and interpretation of invisible failure archetypes are not validated across annotator models or with human judgment; portability of archetype detection is uncertain.
  • Partial recovery vs success: "Partial recovery" is treated as a failure archetype; alternative success taxonomies that recognize partial goal achievement are not tested, potentially undercounting meaningful success.
  • Time trends vs composition effects: Reported temporal trends in fluency may reflect changing user mix, task mix, or exogenous events rather than genuine population-level skill shifts; decomposition analyses are absent.
  • Sensitivity to conversation length: Length (n_turns) is a covariate but also an outcome of engagement; causal direction (does engagement cause length or vice versa?) remains ambiguous.
  • Domain expertise confound: The study does not measure users' domain expertise; higher "AI fluency" may proxy for general task competence, confounding interpretation of fluency effects.
  • Cross-language generalization: No analysis examines whether patterns hold in non-English conversations or multilingual users, limiting applicability to global user bases.
  • Per-turn dynamics: Interaction styles are assigned at the transcript level; within-conversation shifts (e.g., delegative → augmentative) and their impact on outcomes are not modeled.
  • Replicability across annotator models: Robustness of findings to different annotator LLMs or model settings (temperature, prompts) is not reported; version drift may affect reproducibility.
  • Exogenous-event analytics: While Midjourney and Blockman are identified as influential, the paper stops short of modeling how such events systematically reshape fluency, failure visibility, and outcomes, which would be useful for forecasting future shocks.
  • Downstream impact of invisible failures: The real-world consequences of "invisible" failures (e.g., errors propagated to decisions or artifacts) are not measured; risk assessment remains speculative.
  • Demographic and equity considerations: Without user demographics, it is unknown whether invisible failures disproportionately affect specific groups, or whether fluency interventions widen/narrow performance gaps.
  • Task-type heterogeneity: The study aggregates across task types; identifying which tasks benefit most from augmentative strategies (and which do not) remains an open question.
  • Data loss characterization: "Small random data loss" during annotation is mentioned but not quantified or audited for non-random patterns that could bias results.
  • Guidance granularity: Recommendations (e.g., teach users to iterate) are high-level; concrete, testable pedagogical interventions and curricula for improving AI fluency are not specified.
  • Measurement invariance over time: No checks assess whether fluency or complexity scales remain consistent across months and model eras, risking temporal drift in labels.
  • Ethical implications of making failures visible: The trade-off between surfacing failures and user experience (frustration, trust erosion) is not empirically evaluated.

Practical Applications

Immediate Applications

Below are actionable uses you can deploy now, mapped to sectors where relevant and noting key dependencies/assumptions.

  • UX patterns that nudge "augmentative" use over "delegative" use (Software, Enterprise SaaS)
    • What: Add lightweight "engagement scaffolds" in LLM apps: intent-recap prompts ("Here's what I think you want; confirm or correct"), success-criteria checks, critique buttons ("challenge this," "show uncertainties," "list assumptions"), plan-review-execute toggles, and end-of-session alignment checks ("Did we meet your goal?").
    • Why (from paper): High-fluency users iterate, question, and steer; novices passively accept. Making failure visible reduces "invisible failures."
    • Dependencies/assumptions: Modest UI/UX changes; minor latency budget for extra prompts; willingness to trade some short-term satisfaction for higher long-term success.
  • "Invisible failure" watchdogs embedded in chat workflows (Software, Customer Support, Education)
    • What: Heuristics or LLM-based detectors that flag likely archetypes in-session: silent mismatch (intent recap mismatch), drift (topic deviation), contradiction (response conflicts with prior turns), death spiral (repetition loops), walkaway risk (no closure). Trigger recovery macros ("ask clarifying Qs," "summarize agreed goal," "propose next step"); a minimal heuristic sketch appears after this list.
    • Why: Novices disproportionately suffer silent/invisible failures; experts recover more often.
    • Dependencies/assumptions: Use zero-/few-shot detectors today; instrument for precision/recall and human review; privacy-safe logging.
  • Fluency-aware "coach mode" for novices (Software, EdTech, Enterprise enablement)
    • What: Real-time guidance that spots anti-fluency behaviors (passive acceptance, one-shot asks) and suggests augmentative moves (iterative refinement, critical review). Provide micro-tips, e.g., "Ask me to restate your constraints," "Request alternatives and trade-offs."
    • Why: The paper's 17 fluency behaviors and 7 anti-fluency behaviors give a ready-made rubric for coaching.
    • Dependencies/assumptions: On-device or server-side classifiers approximating the paperโ€™s tagging (tag_user_fluency.py); careful UX to avoid nagging.
  • Analytics that track "failure visibility ratio," fluency behaviors, and task complexity (Software, Product Ops, Policy evaluation)
    • What: Add product telemetry to measure visible vs. invisible failure proxies, counts of augmentative behaviors, conversation length, and complexity attempted. Build dashboards to A/B test engagement scaffolds (see the metric sketch after this list).
    • Why: The paper's regressions show fluency predicts success and visibility, while complexity depresses success; teams need to see these trade-offs.
    • Dependencies/assumptions: Consent, anonymization, and governance for transcript analytics; adopt the paper's tagging scripts as baselines.
  • Rapid upskilling programs on AI fluency (Academia, Workforce Development, Corporate L&D)
    • What: Short courses and micro-credentials that train augmentative habits: iterative refinement, goal negotiation, verification, contradiction checks, uncertainty awareness, and recovery techniques tied to the archetypes.
    • Why: Success depends on user behavior as much as model capability; experts show higher success on complex tasks.
    • Dependencies/assumptions: Curriculum built from the paper's behavior taxonomy; pre/post assessments using classifier-based rubrics.
  • Domain-specific "alignment checks" embedded in existing workflows
    • Healthcare: In clinical documentation assistants, require: patient-specific constraints echo, uncertainty lists, source links, and clinician confirmation before saving. Dependencies: Clinical validation and privacy compliance.
    • Finance/Legal: For memos/reports, require: constraints checklist, regulatory references, contradiction scan across sections, and sign-off prompts. Dependencies: Compliance frameworks; auditable logs.
    • Software engineering: In code-gen flows, require: "spec echo" (restate requirements), test-plan proposal, and contradiction checks vs. prior requirements. Dependencies: CI integration; unit-test harnesses.
    • Customer support: Ticket assistants perform: intent recap with the customer, drift alerts, and "resolution confirmation" before closure. Dependencies: CRM integration.
  • Retrospective audits of internal chat logs using the paper's methods (Industry, Academia)
    • What: Apply tag_task_complexity.py and failure-archetype annotators to historical interactions to quantify invisible failures, identify high-impact archetypes, and prioritize product fixes and training.
    • Dependencies/assumptions: Secure data access; acceptable LLM annotator costs; sampling for validation given LLM-based annotation noise.
  • Procurement and IT governance guidance emphasizing engagement (Policy, Public Sector)
    • What: Update evaluation checklists to require augmentative affordances (clarification prompts, goal restatements, recovery pathways) and training plans for end users.
    • Dependencies/assumptions: Cross-agency consensus; alignment with accessibility and language access policies.
  • Everyday personal assistant features that reduce invisible failures (Daily Life)
    • What: Travel/booking/task assistants that auto-verify hours/prices, prompt for constraints and preferences, and propose alternatives with rationale.
    • Dependencies/assumptions: API access to verification sources; users accept one or two added questions for higher accuracy.
  • Product KPIs that reward "engagement quality," not just CSAT (Software, Management)
    • What: Add metrics like "visible failure ratio," "recovery rate after drift," and "goal confirmation rate" to performance dashboards.
    • Dependencies/assumptions: Leadership buy-in to re-balance incentives away from frictionless experiences.
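Picking up the forward references above: the watchdog and telemetry items can start as plain heuristics before any LLM-based detector is involved. The sketch below flags two archetypes (repetition loops suggesting a Death Spiral, missing closure suggesting Walkaway risk) and computes a visible-failure ratio; all thresholds, marker lists, and function names are illustrative assumptions, not the paper's pipeline.

```python
from difflib import SequenceMatcher
from typing import List

def _near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude similarity check between two turns."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def death_spiral_risk(assistant_turns: List[str]) -> bool:
    """Repetition-loop signal: consecutive assistant replies that barely change."""
    repeats = sum(
        _near_duplicate(prev, curr)
        for prev, curr in zip(assistant_turns, assistant_turns[1:])
    )
    return repeats >= 2

def walkaway_risk(user_turns: List[str]) -> bool:
    """Missing-closure signal: final user turn contains no confirmation marker."""
    if not user_turns:
        return True
    closing = user_turns[-1].lower()
    markers = ("thanks", "thank you", "that works", "perfect", "got it")
    return not any(m in closing for m in markers)

def visible_failure_ratio(failure_modes: List[str]) -> float:
    """Telemetry metric: share of failed conversations whose failure was visible."""
    failures = [m for m in failure_modes if m in ("visible", "invisible", "mixed")]
    return sum(m == "visible" for m in failures) / len(failures) if failures else 0.0

# Example: a looping conversation that ends without closure trips both flags.
assistant = ["Here is the code...", "Here is the code...", "Here is the code..."]
user = ["Write a CSV parser", "It crashes", "Still crashes"]
print(death_spiral_risk(assistant), walkaway_risk(user))             # True True
print(visible_failure_ratio(["visible", "invisible", "invisible"]))  # ~0.33
```

In production these heuristics would only be a first-pass filter feeding the LLM-based detectors described above.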

Long-Term Applications

These require additional research, scaling, validation, or regulatory work before broad deployment.

  • Adaptive UIs that personalize scaffolding to detected fluency level (Software, EdTech, Enterprise)
    • What: Real-time user modeling that tunes the degree of guidance: novices get structured plans and mandatory alignment checks, while experts get advanced controls and optional scaffolds.
    • Dependencies/assumptions: Reliable fluency classifiers across domains/languages; guardrails to avoid misclassification harms.
  • LLMs trained to be proactive collaborators (Model training, RLHF/RLAIF)
    • What: Fine-tune/reward models to initiate intent recap, ask clarifying questions, surface uncertainties, detect contradictions, and propose partial recoveries, explicitly optimizing to reduce invisible failures.
    • Dependencies/assumptions: High-quality supervised signals for archetype detection; evaluation benchmarks for invisible failure reduction.
  • Standardized "Invisible Failure Prevention" SDKs and telemetry schemas (Software ecosystem)
    • What: Open libraries that provide conversation-state tracking, archetype detectors, contradiction/drift checkers, and plug-in recovery macros; common telemetry fields for cross-app comparisons.
    • Dependencies/assumptions: Community consensus on archetype taxonomy; performance guarantees; privacy-first designs.
  • Sector-specific co-pilots with safety-grade visibility guarantees
    • Healthcare: Co-pilots that meet evidence-tracking and uncertainty-exposure standards, with measurable reductions in silent mismatches; EHR-integrated recovery workflows.
    • Finance: Advisors with audit trails linking claims to sources and policy constraints; automated contradiction scans across reports.
    • Dependencies/assumptions: Regulatory approval; third-party audits; liability frameworks.
  • Regulatory standards and audits for humanโ€“AI interaction quality (Policy, Standards bodies)
    • What: ISO-/NIST-like guidance that requires augmentative affordances, measures of failure visibility, and user training provisions for high-stakes deployments.
    • Dependencies/assumptions: Multi-stakeholder processes; sector-specific tailoring; enforcement mechanisms.
  • Longitudinal causal studies on training impacts (Academia, Policy evaluation)
    • What: Randomized field experiments that quantify how AI fluency courses change failure visibility, recovery rates, task complexity attempted, and real productivity outcomes.
    • Dependencies/assumptions: Access to telemetry; ethical approvals; representative samples beyond public chat datasets.
  • "Engagement OS" for enterprises (Software, IT)
    • What: A cross-app layer that standardizes intent recap, alignment checks, and recovery prompts across all in-house AI tools; central dashboards for invisible failure risk.
    • Dependencies/assumptions: Integration with heterogeneous systems; change management across teams.
  • Evaluation suites and leaderboards for invisible failure reduction (Research, Model evaluation)
    • What: Benchmarks that stress-test models for drift, contradiction, and silent mismatch across domains; report an "Invisible Failure Score" alongside accuracy.
    • Dependencies/assumptions: Public datasets with gold labels; consensus metrics; community adoption.
  • Agentic systems with internal "self-critique loops" (Robotics, Autonomy, Software agents)
    • What: Agents that monitor for drift and contradiction in their own plans, initiate clarifying queries to humans, and trigger safe fallbacks when recovery stalls.
    • Dependencies/assumptions: Reliable online detectors; human-in-the-loop protocols; real-time constraints.
  • Credentialing for AI fluency (Education, HR)
    • What: Recognized certifications tied to demonstrable augmentative behaviors and recovery skills; used in hiring and promotion for AI-augmented roles.
    • Dependencies/assumptions: Validated assessments; industry recognition; avoidance of credential inflation.
  • Incentive redesign in organizations (Management, Operations)
    • What: Shift success metrics from speed-only to "successful completion without invisible failures," rewarding teams for visible failure surfacing and recovery.
    • Dependencies/assumptions: Cultural buy-in; alignment with customer satisfaction and compliance goals.

Cross-cutting assumptions and dependencies

  • Generalizability: Findings are derived from WildChat-4.8M (public ChatGPT usage, English, 2023-2025) and LLM-based annotations; domain, language, and model differences may affect transfer.
  • Annotation reliability: Automated fluency/archetype tagging should be validated and periodically recalibrated as models evolve.
  • Privacy and governance: Any telemetry or transcript analysis requires strong anonymization, consent, and data minimization.
  • User experience trade-offs: Engagement scaffolds add "constructive friction"; measure and balance against user satisfaction and abandonment.
  • Model drift and upgrades: As LLMs improve, archetype prevalence may change; tooling and training must evolve accordingly.

Glossary

  • Agency: The capacity of a user to actively shape and direct AI interactions and outcomes. Example: "Agency is a prominent theme of recent work on how AI expertise relates to outcomes"
  • AI fluency: Expertise in effectively using AI systems, including prompting, iteration, and critical evaluation of outputs. Example: "Our own approach to AI fluency is inspired by \citet{anthropic2026fluency}"
  • Anti-Fluency Behaviors: Behaviors that undermine effective AI use, such as passivity or uncritical acceptance. Example: "Anti-Fluency Behaviors: A list of zero or more behavior_name annotations with metadata:"
  • Augmentative: An interaction style that treats the AI as a collaborator, iterating and refining toward the goal. Example: "Fluent users adopt an augmentative stance: they iterate collaboratively with the AI, refining their goals and critically assessing outputs as they go"
  • Augmentative mode: A collaborative mode of working with the AI that surfaces and addresses failures during the interaction. Example: "these failures stem from experts' augmentative mode, which makes failure more visible and leads to a higher likelihood of recovery"
  • Augmentative stance: A user's posture of actively collaborating with the AI rather than delegating tasks entirely. Example: "Fluent users adopt an augmentative stance: they iterate collaboratively with the AI, refining their goals and critically assessing outputs as they go"
  • Binomial: A statistical family used in generalized linear models for binary outcomes. Example: "family=binomial"
  • BOBYQA optimizer: A derivative-free optimization algorithm (Bound Optimization BY Quadratic Approximation) used to fit complex models. Example: "with the BOBYQA optimizer."
  • Co-occurrence matrix: A matrix counting how often pairs of categories (e.g., archetypes and fluency levels) appear together. Example: "we first compiled a co-occurrence matrix between the archetypes and the fluency levels"
  • Common ground: Shared knowledge and assumptions that conversational partners rely on to coordinate meaning. Example: "with a familiar sense of how common ground is established and negotiated"
  • Conceptual replication: A study that tests the same underlying claim as prior work using different data or methods. Example: "We provide what is in effect a conceptual replication of this result using WildChat-4.8M"
  • Delegative: An interaction style where users hand off tasks to the AI and passively accept outputs. Example: "By contrast, novice users are delegative: they tend to passively accept the AI's plans and responses, and often fail to get the information they need as a result."
  • Domain expertise: Knowledge and skill in the task's subject area (distinct from AI expertise). Example: "identify connections between domain expertise and productivity."
  • Exogenous events: External, atypical factors that affect a dataset but are not inherent to the underlying process. Example: "they are exogenous events, and future events might have a different character."
  • Fixed-effects predictors: Predictors in a statistical model whose effects are assumed to be constant across groups or clusters. Example: "The other fixed-effects predictors are {n_turns}, {complexity_score}, and {fluency_behavior_count}."
  • Fluency behaviors: Observable actions indicating effective AI use (e.g., iteration, critical assessment). Example: "high-fluency users show many high-fluency behaviors, and low-fluency users show many anti-fluency behaviors."
  • Generalized linear mixed-effects models: Statistical models that handle non-normal outcomes and include both fixed and random effects. Example: "we developed two generalized linear mixed-effects models"
  • Invisible failure: A failure where the interaction appears successful but the user's goal is not actually met. Example: "Failure rates rise consistently by fluency level, but invisible failure rates fall by fluency level."
  • Invisible failure archetypes: Named patterns that characterize how subtle, unnoticed failures occur in AI interactions. Example: "the invisible failure archetypes of \citealt{potts2026invisible}"
  • Iterative_refinement: A fluency behavior involving repeated, focused improvement of the AI's outputs. Example: "One pattern that stands out is that of {iterative_refinement}: it is highly characteristic of high-fluency users and essentially absent from low-fluency transcripts."
  • Mixed (failure mode): A classification indicating that both visible and invisible failure signals are present. Example: "Basic failure mode classification: {visible}, {invisible}, or {mixed}"
  • Passive_acceptance: An anti-fluency behavior where users accept AI outputs without sufficient scrutiny or iteration. Example: "A counterpart of this is {passive_acceptance}, which is the strongest single indicator of minimal- and low-fluency."
  • Positive pointwise mutual information (PPMI): A nonnegative association measure highlighting unexpectedly frequent co-occurrences. Example: "positive pointwise mutual information (PPMI; \citealt{church-hanks-1990-word,Bullinaria2007})"
  • Pragmatic: Relating to how meaning is inferred from context and conversational norms beyond literal content. Example: "they will be pragmatic, human-like conversationalists"
  • PPMI matrix: A co-occurrence matrix reweighted by PPMI to emphasize informative associations. Example: "The heatmap is a PPMI matrix."
  • Random intercept: A model component allowing each group (e.g., domain) its own baseline level. Example: "we included a random intercept for each domain."
  • R2 (marginal/conditional): Goodness-of-fit measures for mixed models; marginal R2 for fixed effects only, conditional R2 including random effects. Example: "R2 Marg."
  • RMSE: Root Mean Squared Error, a measure of the typical prediction error magnitude. Example: "RMSE"
  • Scope ambiguity: Uncertainty about the boundaries or interpretation of the userโ€™s task or request. Example: "{scope_ambiguity}"
  • The Confidence Trap: An archetype where the AI's unwarranted certainty leads the user to accept incorrect information. Example: "The Confidence Trap"
  • The Contradiction Unravel: An archetype where the AI contradicts its prior statements without acknowledgment. Example: "The Contradiction Unravel"
  • The Death Spiral: An archetype characterized by repetitive, unproductive loops that fail to incorporate feedback. Example: "The Death Spiral"
  • The Drift: An archetype where the conversation veers off-topic from the user's original goal. Example: "The Drift"
  • The Mystery Failure: An archetype where the interaction fails without a clear, identifiable pattern. Example: "The Mystery Failure"
  • The Partial Recovery: An archetype where some progress is recovered after a failure, but the goal is not fully achieved. Example: "The Partial Recovery"
  • The Silent Mismatch: An archetype where the AI competently answers the wrong question without either party noticing. Example: "The Silent Mismatch"
  • The Walkaway: An archetype where the user silently stops the conversation without resolution. Example: "The Walkaway"
  • Visible failure: A failure that is overtly signaled in the interaction (e.g., user correction, explicit recognition). Example: "If the failure is {visible}, then only {Visible failure} is assigned."
