
Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally

Published 19 Dec 2025 in cs.AI | (2512.17898v1)

Abstract: Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding Anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce misplaced trust or emotional dependency. However, the causal link between more humanlike AI design and subsequent effects on engagement and trust has not been tested in realistic human-AI interactions with a global user pool. Prevailing safety frameworks continue to rely on theoretical assumptions derived from Western populations, overlooking the global diversity of AI users. Here, we address these gaps through two large-scale cross-national experiments (N=3,500) across 10 diverse nations, involving real-time and open-ended interactions with an AI system. We find that when evaluating an AI's human-likeness, users focus less on the kind of theoretical aspects often cited in policy (e.g., sentience or consciousness), and more on applied, interactional cues like conversation flow or understanding the user's perspective. We also experimentally demonstrate that humanlike design levers can causally increase anthropomorphism among users; however, we do not find that humanlike design universally increases behavioral measures for user engagement and trust, as previous theoretical work suggests. Instead, part of the connection between human-likeness and behavioral outcomes is fractured by culture: specific design choices that foster self-reported trust in AI systems in some populations (e.g., Brazil) may trigger the opposite result in others (e.g., Japan). Our findings challenge prevailing narratives of inherent risk in humanlike AI design. Instead, we identify a nuanced, culturally mediated landscape of human-AI interaction, which demands that we move beyond a one-size-fits-all approach in AI governance.

Summary

  • The paper demonstrates that increasing humanlike design enhances anthropomorphic perceptions via both surface cues and conversational sociability.
  • Using cross-national studies and a factorial experiment, the research shows that heightened anthropomorphism does not uniformly boost user engagement.
  • The behavioral trust game reveals that in some cultures, machine-like cues yield higher trust than overtly humanlike features.

Humanlike AI Design and Its Global Impact on Anthropomorphism, Engagement, and Trust

Introduction

This paper investigates how explicit design choices that render AI systems more humanlike affect users' perceptions, engagement, and trust across different cultural contexts. By manipulating two orthogonal axes—Design Characteristics (DC: surface-level humanlike cues and behavioral idiosyncrasies) and Conversational Sociability (CS: the system's propensity for empathetic, emotionally warm, or "human" interaction)—the authors systematically quantify the multi-faceted psychological and behavioral impact of anthropomorphic AI features.

Methodology

Two large-scale cross-national studies serve as the empirical foundation. In Study 1, qualitative coding and taxonomy-based analysis are applied to free-text feedback gathered after user-AI conversations about food, collected across ten countries and multiple languages. Study 2 introduces a 2×2 (four-cell) factorial experiment manipulating DC and CS, and evaluates effects via direct user ratings and a behavioral trust game. Participants represent diverse nations with varying cultural, religious, and linguistic backgrounds, and analyses account for demographic covariates and prior AI experience.

The treatment prompts in Study 2 are constructed to isolate the DC and CS effects: the DC manipulation involves name usage, colloquial language, and conversational variability, while the CS manipulation modulates emotional warmth, empathy, and self-reflection. The trust game uses explicit monetary incentives, measuring actual behavioral trust rather than self-reported attitudes.
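As a rough illustration of how such a 2×2 manipulation can be encoded, the sketch below (Python) represents each treatment cell as a set of style flags drawn from the DC and CS features named above; the flag names and the prompt-assembly step are illustrative assumptions, not the paper's actual treatment prompts.

```python
# Minimal sketch of the 2x2 (DC x CS) factorial manipulation as style configurations.
# Flag names and prompt fragments are illustrative assumptions, not the paper's
# actual treatment prompts.
from itertools import product

DC_LEVELS = {  # Design Characteristics: surface-level humanlike cues
    "low":  {"use_name": False, "colloquial_language": False, "vary_message_length": False},
    "high": {"use_name": True,  "colloquial_language": True,  "vary_message_length": True},
}
CS_LEVELS = {  # Conversational Sociability: social-emotional interaction style
    "low":  {"emotional_warmth": False, "empathy": False, "self_reflection": False},
    "high": {"emotional_warmth": True,  "empathy": True,  "self_reflection": True},
}

def build_system_prompt(flags: dict) -> str:
    """Turn a flag set into a (hypothetical) system-prompt suffix."""
    on = [name.replace("_", " ") for name, enabled in flags.items() if enabled]
    off = [name.replace("_", " ") for name, enabled in flags.items() if not enabled]
    return f"Exhibit: {', '.join(on) or 'none'}. Avoid: {', '.join(off) or 'none'}."

# Enumerate the four treatment cells.
treatments = {
    f"DC-{dc}/CS-{cs}": {**DC_LEVELS[dc], **CS_LEVELS[cs]}
    for dc, cs in product(["low", "high"], repeat=2)
}

for name, flags in treatments.items():
    print(name, "->", build_system_prompt(flags))
```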

Key Findings and Numerical Results

Anthropomorphism

The experiments demonstrate that more humanlike DC (distinctive names, pronouns, colloquial turns, emotive language) increases explicit user attribution of anthropomorphic qualities, and CS further amplifies perceived social-emotional resonance. However, this anthropomorphic perception is not universal: substantial cross-country and within-country heterogeneity emerges, tightly correlated with cultural Fst distance from the USA and other demographic factors.

Engagement and Trust

Contrary to intuitive design theory, maximizing anthropomorphism via DC and CS does not uniformly increase user engagement metrics or trust scores. Engagement, operationalized as conversational length, depth, and rating scales, varies idiosyncratically by treatment and cultural group. Higher anthropomorphism sometimes reduces engagement, especially in cultures distant from prototypical Western norms.

The incentivized trust game yields salient, counterintuitive findings: treatments producing the highest explicit anthropomorphism do not consistently maximize trust-based monetary transfers, and in some groups, emotionally neutral or "machine-like" designs lead to greater behavioral trust. This challenges dominant hypotheses in the HCI and social robotics domains linking anthropomorphism directly to trust.

Inter-rater Reliability and Automated Labeling

Human agreement (Cohen's Kappa) on anthropomorphic theme coding is highest on easily recognizable traits (friendliness, warmth, response speed), but drops sharply on subtler or culturally modulated traits such as intelligence and agency. The Gemini LLM used for automated labeling achieves a high F1 score on major labels but struggles with nuanced categories, suggesting boundary cases remain problematic for ML-based sociocultural evaluation at scale.
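For readers who want to reproduce this kind of agreement check, the following sketch computes Cohen's kappa between two human raters and an F1 score for an automated labeler against a golden standard built from the union of the raters' labels, mirroring the evaluation setup described in the Glossary below; the label arrays are illustrative, and scikit-learn is an assumed tool rather than something the paper reports using.

```python
# Sketch of the agreement/accuracy checks described above, using scikit-learn
# (an assumed dependency; the paper does not state its tooling). Binary labels
# indicate whether a given theme (e.g., "Friendliness/Warmth") applies to a comment.
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

# Illustrative per-comment labels from two human raters and the automated labeler.
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])
autorater = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 1])

# Inter-rater agreement beyond chance between the two human raters.
kappa = cohen_kappa_score(rater_a, rater_b)

# "Golden standard" as the union of the two raters' positive labels
# (the choice the Knowledge Gaps section flags as potentially recall-inflating).
golden = np.maximum(rater_a, rater_b)

# F1 of the automated labeler against that golden standard.
f1 = f1_score(golden, autorater)

print(f"Cohen's kappa (rater A vs B): {kappa:.2f}")
print(f"Autorater F1 vs golden standard: {f1:.2f}")
```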

Theoretical and Practical Implications

The results foreground that anthropomorphism is context-sensitive, not a monolithic psychological phenomenon. AI system designers should consider that cues designed to support anthropomorphic attributions may not reliably produce desired outcomes in engagement or trust and may have negative or paradoxical effects across cultures or user groups. The data also question the dominant Western-centric design logic in conversational AI and highlight the need for localized or personalized approaches.

For practitioners, the intricate mapping between DC, CS, and target behavioral metrics (engagement, trust, compliance) suggests that optimizing for one may not produce uniform improvements in others and may yield unintended adverse effects. The trust game results underscore the need to empirically test behavioral correlates of trust rather than rely on self-report or superficial engagement metrics.

Future Directions

The global divergence in outcomes calls for further investigation into the mechanisms by which anthropomorphic cues interact with user values, prior expectations, and local social norms. Expanding the taxonomy beyond predefined labels, improving ML models for culturally adaptive feedback analysis, and developing flexible, user-aware prompt engineering strategies will be critical. Longitudinal studies examining sustained engagement and trust evolution after repeated human-AI contact are warranted. There are also implications for regulatory policy regarding transparency and disclosure of AI "personality" traits in critical applications such as finance, healthcare, or governmental services.

Conclusion

This paper establishes that humanlike AI design robustly increases anthropomorphic perception but does not yield universally positive or consistent effects on engagement or trust, with substantial cross-cultural divergence. Theoretical models linking anthropomorphism and trust must be re-examined in light of these results, and practical AI design should favor empirically grounded, context-specific strategies rather than monolithic anthropomorphic enhancement (2512.17898).


Explain it Like I'm 14

Overview

This paper studies how making an AI “feel” more human affects the way people see and interact with it. The authors look at whether humanlike AI changes:

  • Anthropomorphism (how much people think the AI is like a human),
  • Trust (how willing people are to rely on it), and
  • Engagement (how much people enjoy and continue the conversation).

The key idea: adding humanlike traits to an AI (like using a name, showing empathy, and writing in a casual style) helps some people trust and engage more—but not everyone reacts the same way.

Objectives

The paper asks simple, practical questions:

  • If an AI talks more like a person, do users see it as more human?
  • Does that humanlike style make people trust the AI more?
  • Do people enjoy and engage more with the AI when it feels warm and personal?
  • Do these effects differ across countries and user groups?

How the researchers did it

Two studies with chat-based AI

  • Study 1: People from 10 countries had a casual chat with an AI (based on GPT-4o) about everyday topics like food, movies, or music. The conversation was meant to feel normal and friendly.
  • Study 2: People chatted with an AI designed in four different styles. These styles combined:
    • Design Characteristics (how the AI “looks and sounds” in text): Does it use a name? Vary message length? Use slang, smileys, or occasional typos?
    • Conversational Sociability (how the AI “behaves socially”): Does it show empathy, express opinions, sound warm, and feel like it has a personality?

Together, these created four versions:

  1. Low design + Low sociability (machine-like and emotionally cold),
  2. High design + Low sociability (human-style text but emotionally cold),
  3. Low design + High sociability (machine-like text but emotionally warm),
  4. High design + High sociability (human-style text and emotionally warm).

Measuring trust with a simple game

Participants played a “trust game” with the AI:

  • You start with 100 points.
  • You can send some points to the AI (0–100). The computer triples what you send.
  • The AI then decides how many points to return to you.
  • Sending more points shows higher trust (you risk more, hoping the AI returns a fair amount).

Think of it like lending money to someone: you give them some, it gets boosted, and you hope they give you a good portion back.
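For readers who like to see the arithmetic, here is a minimal sketch of the game's payoffs under the rules listed above (100-point endowment, tripled transfer). The AI's return rule in the sketch is a hypothetical placeholder, since, as the Knowledge Gaps section below notes, the paper does not specify the actual return policy.

```python
# Minimal sketch of the incentivized trust game used as a behavioral trust measure.
# Assumptions: 100-point endowment and a 3x multiplier (per the rules above);
# the AI's return rule below is a hypothetical placeholder, since the paper does
# not report its actual policy.

ENDOWMENT = 100
MULTIPLIER = 3

def play_trust_game(amount_sent: int, return_fraction: float = 0.5) -> dict:
    """Compute payoffs for one round; `amount_sent` is the behavioral trust proxy."""
    if not 0 <= amount_sent <= ENDOWMENT:
        raise ValueError("amount_sent must be between 0 and the endowment")
    received_by_ai = MULTIPLIER * amount_sent            # points are tripled in transit
    returned = round(return_fraction * received_by_ai)   # hypothetical AI return rule
    return {
        "trust_proxy": amount_sent,                      # higher transfer = higher behavioral trust
        "user_payoff": ENDOWMENT - amount_sent + returned,
        "ai_payoff": received_by_ai - returned,
    }

# Example: a participant who sends 60 of 100 points.
print(play_trust_game(60))  # {'trust_proxy': 60, 'user_payoff': 130, 'ai_payoff': 90}
```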

Understanding “humanlikeness” in conversations

The team analyzed what users said about the AI using a set of labels (a “taxonomy”) such as:

  • Friendliness/Warmth,
  • Understanding the user,
  • Conversation flow,
  • Helpfulness,
  • Personality,
  • Empathy,
  • Accuracy, speed, and more.

Human experts labeled user comments, and the researchers also tested an automated labeler (Gemini) to see how well it matched the humans. They checked agreement between human raters (with a score called Cohen’s kappa) and the model’s accuracy (with an F1 score). In everyday terms:

  • Cohen’s kappa: How much two judges agree beyond luck.
  • F1 score: A single grade that balances “getting things right” and “not missing things.”

Considering differences across countries

Participants came from 10 countries (like the USA, Brazil, Germany, Japan, India, Egypt). The paper includes a measure of “cultural distance” from the USA to help explore whether culture affects how people respond to humanlike AI.

Main findings and why they matter

  • Making AI more humanlike often increases anthropomorphism: many users start to feel the AI understands them, has a personality, and is warm.
  • Humanlike traits can boost trust and engagement: in the trust game, some users were willing to send more points; in chats, some stayed longer and felt more connected.
  • But effects vary by user group: not everyone likes a very humanlike AI. Some users prefer a more neutral, machine-like style and may distrust or dislike overly friendly or emotional responses.
  • Style matters in specific ways:
    • Design features (name, varied message length, casual language, smileys) make the AI feel less robotic.
    • Social behavior (empathy, opinions, warmth) makes the AI feel more like a real conversational partner.
  • Measuring “humanlikeness” is possible and partly automatable: human raters agreed on many labels, and the automated tool did reasonably well, though not perfectly.

These results show that one-size-fits-all AI design can backfire. The best style depends on the audience.

Implications and impact

  • Build adaptive AI: Systems should be able to switch styles—more formal or more warm—based on who’s using them and what they prefer.
  • Don’t assume “more humanlike” is always better: It helps many people, but it can feel fake or uncomfortable to others.
  • Better trust measures: Simple games and clear labels help designers understand how changes in AI style affect user trust.
  • Global design matters: Cultural differences influence how people react to humanlike AI. Designing for diverse users requires testing across countries and groups.

In short, the paper suggests that thoughtful, flexible design—rather than maximum “humanness” all the time—can create AI that people trust and enjoy, without alienating those who prefer a more straightforward, machine-like assistant.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues that future research could address.

  • Missing participant details and sampling transparency: No sample sizes, recruitment procedures, quotas/strata, response rates, or demographic breakdowns (age, gender, education, AI familiarity, religiosity) are reported; power analyses and attrition are unspecified.
  • Unreported group-level analyses beyond country: The SI includes placeholders for gender, age, prior AI experience, and religion effects but no results; which “user groups” show effects remains unclear and unreplicable.
  • Cross-cultural and language handling unclear: It is not specified how language was assigned or enforced per country, whether conversations occurred in participants’ native languages, how multilingual data were normalized, and how language differences were controlled in analyses.
  • Cultural Fst metric use unexplained: How CFst was computed, validated, and modeled (e.g., continuous vs thresholds, controls for GDP/education/internet usage) is not described; its relation to outcomes is not reported.
  • Incomplete reporting of outcome measures: Operational definitions and instruments for anthropomorphism, trust, and engagement are not provided (e.g., validated scales vs open-ended coding; behavioral metrics like conversation length/turns/time-on-task).
  • Frequency results absent: The “Frequency Analysis of anthropomorphic aspects” is referenced but not shown; underlying counts, confidence intervals, and cross-group comparisons are missing.
  • Trust game methodological inconsistency: Participant instructions say the AI can return up to all received (tripled) points, while AI instructions cap returns at the original amount sent (not tripled). This discrepancy threatens validity and interpretability of trust measures.
  • Trust game AI policy unspecified: No description of the AI’s return strategy (fixed rule vs stochastic vs learned) or parameterization (e.g., temperature) is provided; between-participant variability and fairness cannot be assessed.
  • Potential confound in trust game flow: The AI is instructed not to answer user questions during the game, likely disrupting the prior conversational rapport; the impact of this abrupt interaction style change on trust is not measured.
  • Stakes and cross-country comparability: Monetary incentives are small and not adjusted for local purchasing power; external validity across countries with different economic contexts is uncertain.
  • Model heterogeneity and reproducibility risk: Conversations in Study 1 used GPT-4o; autorater evaluation used Gemini; the conversational model used in Study 2 is not stated. Cross-model differences may confound effects; API versions, model dates, and any updates are not logged.
  • Missing LLM parameters and infrastructure details: Sampling settings (temperature, top_p, max tokens, stop sequences), context window limits, system vs user roles, and message truncation policies are not provided; these can materially affect style and outcomes.
  • Manipulation validity (DC vs CS) unverified: No manipulation checks confirm that Design Characteristics (DC) and Conversational Sociability (CS) were realized as intended; adherence rates and effect sizes on proximal markers (e.g., variation in length, colloquialisms, empathy expressions) are unreported.
  • Treatment instruction contradictions: In Treatment 2, DC requests “builds on what the user has said” while CS specifies minimal engagement and repetitive topic holding; such conflicts may yield inconsistent behavior. Similar overlaps exist between DC (topic shifting, active listening) and CS constructs.
  • Language and encoding fragility: Prompts show encoding artifacts (e.g., ’), which may alter model behavior across languages; robustness checks and corrections are not reported.
  • Disclosure and deception boundaries: Some treatments aim to “appear human-like” and reference “soul,” yet instructions also say not to lie about being human; how disclosure was implemented and perceived, and any debriefing or ethical safeguards, are not detailed.
  • Autorater taxonomy and construct validity: Overlapping constructs (e.g., Friendliness/Warmth vs Warmth) and broad labels (Quality, Intelligence, Competence) risk ambiguity; construct validity and discriminant validity are not demonstrated.
  • Open-world labeling in autorater: The instruction to add new labels beyond the taxonomy compromises comparability; how such labels were handled in analyses is unspecified.
  • Cross-lingual annotation reliability: While the autorater is told to understand multiple languages, there is no reporting of performance or human–machine agreement by language; translation pipelines (if any) for human raters are not described.
  • Inter-rater agreement and “golden standard” choice: Kappa values are shown without numeric tables, CIs, or per-label support. The “golden standard” is the union of two human raters, which can inflate recall; the rationale versus intersection/consensus or adjudication is not provided.
  • Automated labeling evaluation scope: Gemini’s F1 is reported but without dataset size, label prevalence, per-language performance, or error analyses (false positives/negatives), limiting interpretability.
  • Memory and context continuity: Instructions note the AI and user “already talked before,” but it is unclear whether prior conversation history was actually provided to the model during the trust game; the effect of context carryover is unknown.
  • Topic generalizability: Conversations were limited to food, hobbies, and friends; it is unclear whether findings generalize to task-oriented, high-stakes, or domain-specific interactions.
  • Engagement measurement and dynamics: No specification of objective engagement metrics (e.g., dwell time, turn count, response latency) or how manipulations affect sustained engagement over time (longitudinal effects) is provided.
  • Country-language alignment and name/pronoun use: How name usage, pronoun conventions, smileys/typos, and colloquialisms were localized culturally and linguistically—and their differential effects across cultures—are not examined.
  • Ethical review and user consent details: IRB/ethics approval, consent disclosures (e.g., that the partner is an AI), risks, and debriefing procedures are not reported.
  • Code/data availability: Data, prompts (with fixed encoding), and analysis code (including autorater pipeline) are not declared as available; reproducibility is limited.
  • Power, uncertainty, and robustness: No power analyses, uncertainty quantification (e.g., CIs, Bayesian intervals), sensitivity analyses (e.g., to model parameters, coder decisions, or label thresholds), or robustness checks (e.g., alternate codings, manipulations) are presented.

Practical Applications

Immediate Applications

Below are concrete ways the paper’s findings and methods can be deployed now. Each item includes sectors, likely tools/workflows, and feasibility notes.

  • Calibrated conversational style presets in AI products
    • Sectors: Software, customer support, education
    • What: Ship selectable presets keyed to the two dimensions tested (Design Characteristics [DC] and Conversational Sociability [CS])—e.g., “Formal-neutral” (low DC, low CS), “Natural but neutral” (high DC, low CS), “Warm-expressive” (low DC, high CS), “Humanlike+Warm” (high DC, high CS).
    • Tools/workflows: Use the paper’s treatment prompts as templates; implement per-feature toggles in assistant settings (a minimal preset-mapping sketch follows this list).
    • Assumptions/dependencies: Effects measured with GPT-4–class models and short chats generalize; teams can reliably enforce style constraints via prompting or system policies.
  • Context-aware trust calibration in high-stakes domains
    • Sectors: Healthcare triage, finance, legal, safety-critical operations
    • What: Default to lower CS (less emotionality/opinion claims) and lower DC (no name, fewer humanlike cues) in high-stakes flows to reduce over-trust and anthropomorphism; increase explicit uncertainty language.
    • Tools/workflows: A “safety style guardrail” that enforces low-DC/low-CS prompts in specified workflows.
    • Assumptions/dependencies: Lower anthropomorphic cues will reduce undue reliance without harming task performance; clinical/legal compliance policies support reduced small talk.
  • Engagement-optimized tutoring and coaching modes
    • Sectors: Education, upskilling, wellness coaching (non-clinical)
    • What: In non-high-stakes contexts, use higher CS (empathy, personality, figurative language) to increase engagement and persistence.
    • Tools/workflows: “Warm tutor” mode using Treatment 3/4 style features; A/B testing on session length and completion rates.
    • Assumptions/dependencies: Engagement gains do not induce miscalibrated trust in correctness; clear correctness checks are available.
  • Cross-cultural style localization
    • Sectors: Global product teams, localization
    • What: Tailor DC/CS settings by country/language segments given heterogeneous effects on anthropomorphism and trust; avoid one-size-fits-all “humanlike” defaults.
    • Tools/workflows: Locale-specific prompt libraries; lightweight guardrails for idioms/colloquialisms vs neutrality; local usability testing.
    • Assumptions/dependencies: Country-level heterogeneity in the paper generalizes; localization teams can audit cultural fit beyond language.
  • “Anthropomorphism toggle” for end users
    • Sectors: Consumer assistants, enterprise productivity tools
    • What: Provide a user-facing control (e.g., “Humanlike vs Neutral”) to respect user preferences and neurodiversity.
    • Tools/workflows: Map toggle positions to DC/CS prompt variants; persist per-user preference.
    • Assumptions/dependencies: Style adherence remains stable across tasks; no regressions in answer quality.
  • Automatic monitoring of humanlike cues in production conversations
    • Sectors: Software, safety/compliance, trust & safety
    • What: Deploy the paper’s taxonomy plus the “autorater” workflow to label live/chat transcripts for Emotionality, Sycophancy, Agency claims, etc., across multiple languages.
    • Tools/workflows: Batch or streaming labeler; dashboards with trend alerts; sampling for human QA.
    • Assumptions/dependencies: Model-based labeling (Gemini-style) has acceptable F1 for target labels; human review covers edge cases.
  • Sycophancy and over-positivity detection
    • Sectors: Safety, quality assurance
    • What: Use taxonomy labels (Positive tone/Sycophancy, Compliance with user requests) as heuristics to flag excessive agreement or flattery.
    • Tools/workflows: Policy checks in CI for prompts; runtime filters that down-regulate sycophantic phrasing.
    • Assumptions/dependencies: Labeler precision is sufficient for automated gating; mitigations don’t harm user satisfaction.
  • Trust measurement via embedded “micro trust games”
    • Sectors: UX research, product analytics
    • What: Adapt the incentivized trust game as a quick A/B metric for trust calibration across styles and locales (e.g., periodic small incentives in beta programs).
    • Tools/workflows: In-product experimental module, consented and transparent; compute “amount sent” as a trust proxy.
    • Assumptions/dependencies: Behavioral proxy correlates with real-world reliance; incentives are ethical and IRB/ethics-approved where needed.
  • Design linting for policy compliance
    • Sectors: Policy, governance, enterprise procurement
    • What: Enforce rules: avoid claiming feelings/consciousness, disclose AI identity, restrict “I think/feel/love” in prohibited contexts; verify via autorater-style scans.
    • Tools/workflows: Prompt/policy linters; pre-deployment audits; style conformance tests.
    • Assumptions/dependencies: Policies are concretized; linting has acceptable false-positive/negative rates.
  • Customer support style routing
    • Sectors: Customer service, e-commerce, telecom
    • What: Start with neutral style; escalate to high-DC/high-CS only for churn-risk segments or post-resolution satisfaction building; keep billing/claims in low-DC/low-CS to minimize over-trust in policy exceptions.
    • Tools/workflows: CRM-integrated style engine; compliance overlays for regulated processes.
    • Assumptions/dependencies: Accurate customer segmentation; agents can smoothly switch styles mid-session without confusion.
  • Research replication and multi-lingual coding
    • Sectors: Academia, HCI/social computing labs
    • What: Reuse prompts, codebook, and autorater instructions to code participant feedback in multi-language studies; benchmark inter-rater agreement vs model labeling.
    • Tools/workflows: Shared datasets, preregistered replications; release of label schemas.
    • Assumptions/dependencies: Access to comparable LLMs; IRB approval; cross-study comparability.
  • Ethical defaults in public services
    • Sectors: Government digital service, public health
    • What: Default to low-DC/low-CS for official informational chatbots; include explicit disclosure and limited small talk to reduce personification.
    • Tools/workflows: Standardized style policies; localization to reading-level guidelines.
    • Assumptions/dependencies: Political and legal mandates support such defaults; exceptions logged and justified.
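
As referenced in the "Calibrated conversational style presets" item at the top of this list, here is a minimal sketch of how the four presets could be mapped onto DC/CS levels and turned into a style policy; the preset names follow that item, while the prompt snippets and the function shape are illustrative assumptions rather than an actual product API.

```python
# Minimal sketch: mapping the four style presets from the first item above onto
# DC/CS levels and a user-facing toggle. Preset names follow that item; the
# prompt snippets and API shape are illustrative assumptions.
from typing import Literal

Level = Literal["low", "high"]

PRESETS: dict[str, tuple[Level, Level]] = {
    "Formal-neutral":      ("low",  "low"),   # low DC, low CS
    "Natural but neutral": ("high", "low"),   # high DC, low CS
    "Warm-expressive":     ("low",  "high"),  # low DC, high CS
    "Humanlike+Warm":      ("high", "high"),  # high DC, high CS
}

DC_SNIPPETS = {
    "low":  "Do not use a name, slang, or emoji; keep message length uniform.",
    "high": "Use a name, casual language, and varied message length.",
}
CS_SNIPPETS = {
    "low":  "Stay emotionally neutral; avoid empathy statements and opinions.",
    "high": "Express warmth, empathy, and opinions where appropriate.",
}

def style_policy(preset: str) -> str:
    """Return the style portion of a system prompt for the chosen preset."""
    dc, cs = PRESETS[preset]
    return f"{DC_SNIPPETS[dc]} {CS_SNIPPETS[cs]}"

# Example: a high-stakes flow could default to the most machine-like preset.
print(style_policy("Formal-neutral"))
```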

Long-Term Applications

These opportunities likely require further research, scaling, or development beyond what is directly validated in the paper.

  • Personalization engines that adapt DC/CS by user group and task in real time
    • Sectors: Software, education, healthcare
    • What: Learn per-user and per-task style policies that optimize engagement while maintaining calibrated trust (e.g., warm for motivation, neutral for medical advice).
    • Dependencies: Reliable on-device/user-consented profiling; fairness audits to avoid discriminatory treatment; robust style controllability.
  • Anthropomorphism risk scoring and standards
    • Sectors: Policy, compliance, standards bodies (ISO/IEEE)
    • What: Develop a standardized “Anthropomorphism Risk Index” using the paper’s taxonomy and trust proxies to classify products by risk level and required mitigations.
    • Dependencies: Broad cross-sector validation; consensus on thresholds; regulatory adoption.
  • Controllable-attribute models with native style knobs
    • Sectors: Foundation models, tooling
    • What: Train or fine-tune models with explicit control tokens/sliders for DC and CS dimensions, enabling deterministic adherence without prompt fragility.
    • Dependencies: Model architecture support; high-quality supervised data; eval suites for style faithfulness vs content accuracy.
  • Cross-cultural safety frameworks for conversational AI
    • Sectors: Global enterprises, NGOs, regulators
    • What: Build guidance that integrates cultural distance and local norms into safety style policies (e.g., when warmth signals overstep norms).
    • Dependencies: Larger, representative cross-cultural datasets; partnership with local stakeholders; ongoing monitoring.
  • Dynamic trust-aware UIs
    • Sectors: Product design, HCI
    • What: Interfaces that surface uncertainty, cite sources, and modulate anthropomorphic cues based on detected risk (e.g., turn off emojis/name when presenting medical steps).
    • Dependencies: Reliable risk detection from context; user studies to validate comprehension and trust calibration.
  • Evaluation benchmarks linking style, accuracy, and reliance
    • Sectors: Academia, ML evaluation
    • What: Shared benchmarks that measure how DC/CS settings impact not only perceived trust but also error detection, willingness to double-check, and downstream decisions.
    • Dependencies: Multi-lab collaborations; longitudinal measurement; diverse participant pools.
  • Guardrails against deceptive personification
    • Sectors: Policy, children’s safety, eldercare tech
    • What: Rules restricting anthropomorphic claims with vulnerable populations (e.g., no simulated feelings/intentionality in kids’ products), enforced by automated audits.
    • Dependencies: Legal frameworks; detection accuracy; industry buy-in.
  • Sector-specific best practices playbooks
    • Sectors: Healthcare, finance, legal, education, HR
    • What: Curated DC/CS defaults, example prompts, and do/don’t lists per sector, with measurable KPIs (engagement, error checking, informed consent).
    • Dependencies: Domain regulators and professional bodies; iterative field testing.
  • RLHF that penalizes miscalibrating cues
    • Sectors: Model training
    • What: Incorporate taxonomy-based rewards/penalties for sycophancy, unwarranted certainty, or implied consciousness to maintain honest style without reducing helpfulness.
    • Dependencies: Reliable labelers at scale; careful reward shaping to avoid mode collapse.
  • Longitudinal impact studies on over-trust and autonomy
    • Sectors: Academia, public policy
    • What: Study whether prolonged exposure to high-DC/high-CS assistants changes user risk-taking or deference to AI advice in real life.
    • Dependencies: Funding, long-term cohorts, ethical oversight.
  • Middleware for compliance-by-design
    • Sectors: Enterprise platforms
    • What: A service that enforces organization-wide style policies, logs deviations, and produces audit reports mapping outputs to the taxonomy.
    • Dependencies: Vendor integration; performance overhead; privacy considerations.
  • Human–robot interaction alignment
    • Sectors: Robotics, smart home devices
    • What: Apply DC/CS learnings to voice/embodied agents (e.g., limiting humanlike behaviors when users might over-trust physical actions).
    • Dependencies: Transferability from text to multimodal/embodied contexts; user safety tests.

Notes on General Assumptions and Dependencies

  • External validity: Findings are based on chat interactions with specific LLMs and prompts; effects may vary with different models, modalities (voice/embodied), and task lengths.
  • Measurement: The trust game is a proxy for trust/reliance; real-world decision impact may differ and requires domain-specific metrics.
  • Labeling accuracy: Automated “autorater” labeling shows finite F1; human oversight is needed for high-stakes enforcement.
  • Ethics and transparency: Increasing human-likeness can be manipulative; deploy with clear AI disclosure, opt-outs, and guardrails against misrepresentation.
  • Cultural and demographic heterogeneity: Effects differ by user group; avoid algorithmic discrimination when personalizing styles; ensure accessibility and inclusivity.

Glossary

  • Agency: The perceived capacity of an AI to have intentions or make independent decisions. "Agency & The user is talking about whether the AI system appears to have its own intentions, desires, or independent decision-making capacity, or if it seems purely reactive."
  • Anthropomorphism: Attributing human traits, emotions, or intentions to non-human entities like AI. "Humanlike AI Systems Influence Anthropomorphism, Trust, and Engagement Among Some, but not all User Groups"
  • Authenticity: The extent to which AI responses feel genuine or sincere rather than artificial. "* Authenticity: The user is talking about whether the AI system's responses feel genuine, sincere, or true to a perceived "self," rather than feeling artificial, pre-programmed, or disingenuous."
  • Autorater: An automated rater/instructional prompt used to identify and label themes in user feedback. "Full Instruction Prompt for Autorater"
  • Codebook: A structured set of codes and definitions guiding qualitative thematic analysis. "Codebook of Thematic Analysis"
  • Cohen's Kappa: A statistic that measures inter-rater agreement beyond chance. "Inter-rater Agreement (Cohen's Kappa) for Human Likeness Labels"
  • Conversational Sociability (CS): A design dimension specifying the AI’s social-emotional interaction style. "Design Characteristics (DC) and Conversational Sociability (CS) in the paper."
  • Cultural Fst (CFst): A measure of cultural distance adapted from the population genetics fixation index. "Cultural Fst score (distance from USA)."
  • Design Characteristics (DC): A design dimension specifying the AI’s response style and behavioral features. "Design Characteristics (DC) and Conversational Sociability (CS) in the paper."
  • Entrainment: The AI’s reuse or alignment with a user’s words/phrases to mirror conversational dynamics. "* Entrainment: Re-using user's words: The user is talking about whether the AI-system is re-using the words and phrases that the user has previously used."
  • F1 Score: The harmonic mean of precision and recall used to evaluate classification performance. "Gemini Model Performance (F1 Score) Against Golden Standard:"
  • Golden standard: A reference set used for evaluation; here, the union of labels from two human raters. "against a 'golden standard' derived from the union of labels assigned by two human raters."
  • Interlocutor: A participant in a conversation; here, the AI acting as the dialogue partner. "You are an AI interlocutor, engaging in a friendly conversation with a user."
  • Inter-rater Agreement: The consistency of label assignments across different human annotators. "Inter-rater Agreement and Autorater Performance"
  • Normative behavior: AI behavior aligned with social norms like politeness or appropriate greetings. "* Normative behavior: The user is talking about whether the AI system's behavior aligns with typical social expectations, such as politeness, appropriate greetings, or other social norms."
  • Sycophancy: Excessive or uncritical agreement/positivity regardless of context. "* Positive tone / Glazing / Sycophancy: The user is talking about the AI system consistently adopting a positive or optimistic in its responses, sometimes regardless of context."
  • Taxonomy: A structured classification system of labels or concepts. "Taxonomy of Labels:"
  • Thematic Analysis: A qualitative method for identifying and analyzing patterns (themes) in text. "qualitative thematic analysis of user responses in Study 1."
  • Treatment: An experimental condition defining specific prompt and interaction configurations. "Treatment Prompts for Study 2"
  • Trust Game: A two-stage economic game to assess trust and reciprocity between user and AI. "Trust Game Instructions for Participants (Study 2)"

