Prompt Politeness Effects in HCI and LLMs

Updated 19 December 2025
  • Prompt politeness effects are the measurable impacts of polite or impolite language cues on user experience and system performance in HCI and LLMs.
  • Researchers use controlled prompt manipulations and metrics like accuracy, response latency, and prosody to quantify these effects.
  • Empirical findings show context, domain, and language-dependent outcomes that guide optimal prompt design for enhanced user satisfaction and model reliability.

Prompt politeness effects designate the measurable impact of linguistic politeness or impoliteness in user, system, or experimental prompts on interaction outcomes in both human-computer interaction (HCI) and large language model (LLM) settings. This topic encompasses the implications of surface-level, stylized politeness markers (such as mitigators, deference formulas, expressive prosody, and explicit option-giving) for behavioral measures (accuracy, compliance, task satisfaction, and perceived politeness) as well as objective system performance. Research across modalities (text, voice, multilingual/cultural contexts) reveals that the effects of prompt politeness are highly context-, domain-, and system-dependent, raising fundamental questions about the pragmatic modeling capacities of contemporary AI systems.

1. Theoretical and Sociolinguistic Underpinnings

Frameworks for understanding politeness in prompts arise from human pragmatic theories. Brown and Levinson’s (1987) model posits negative-politeness strategies such as slower speech to signal deference. Lakoff (1973, 1990) and Grice (1975) contribute terms—Don’t Impose, Give Options, Be Friendly, and the Cooperative Principle—that are amenable to operationalization in HCI and computational settings. Core politeness moves include lexical mitigation ("could you please"), friendly markers (emojis, affective phrasing), and formal request frames. Importantly, sociolinguistically informed frameworks are employed to ensure that quantification of politeness can be instantiated in interface behavior and surface realization without relying exclusively on face-saving theory, which is less robust across cultures or HCI modalities (Bar-Or et al., 2022).

2. Measurement and Manipulation in Experimental Paradigms

Research operationalizes politeness via controlled manipulations of prompt templates or staged dialogue, using factorial experimental designs to isolate effects:

  • In HCI/game settings, “high-politeness” is induced using choice-rich, mitigated templates and friendly surface elements, while “low-politeness” employs direct or curt instructions (Bar-Or et al., 2022).
  • For speech interfaces, prosodic controls are mapped onto prompt styles: “polite and formal” versus “casual and informal,” measuring gross utterance durations and the computed speech rate $\mathrm{wpm} = \frac{60\,N_{\mathrm{words}}}{T}$ with $N_{\mathrm{words}} = 105$ (Rabin et al., 12 Nov 2025); a code sketch follows this list.
  • LLM studies generate multiple surface variants per question, spanning very polite, neutral, and very rude formulations, along with language- and culture-specific gradations (e.g., English, Chinese, Japanese: 8-point scales) (Yin et al., 2024).
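
The speech-rate metric above reduces to a one-line computation. A minimal sketch, assuming durations are measured in seconds for the fixed 105-word passage (the example durations and function name are hypothetical, not from the cited study):

```python
def words_per_minute(duration_s: float, n_words: int = 105) -> float:
    """Speech rate: wpm = 60 * N_words / T, for a fixed-length passage."""
    return 60.0 * n_words / duration_s

# Hypothetical durations (seconds) for the same 105-word passage rendered
# under the two prompt styles; real values come from the TTS output audio.
polite_wpm = words_per_minute(52.0)
casual_wpm = words_per_minute(41.0)
print(f"polite: {polite_wpm:.1f} wpm, casual: {casual_wpm:.1f} wpm")
```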

Assessment uses subjective Likert ratings (perceived politeness, engagement) and objective metrics—accuracy, task completion, response latency, performance on QA or summarization tasks, or bias (Bias Index %).
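
A typical analysis tabulates these subjective and objective measures per politeness condition. A minimal aggregation sketch with pandas, using a trial-level record format of our own devising:

```python
import pandas as pd

# Hypothetical trial-level records: one row per (participant, prompt condition).
trials = pd.DataFrame([
    {"condition": "polite",  "correct": 1, "latency_s": 4.2, "likert_politeness": 6},
    {"condition": "polite",  "correct": 0, "latency_s": 5.1, "likert_politeness": 7},
    {"condition": "rude",    "correct": 1, "latency_s": 3.8, "likert_politeness": 2},
    {"condition": "neutral", "correct": 1, "latency_s": 4.0, "likert_politeness": 4},
])

# One summary row per condition: accuracy, mean latency, mean Likert rating.
summary = trials.groupby("condition").agg(
    accuracy=("correct", "mean"),
    mean_latency_s=("latency_s", "mean"),
    perceived_politeness=("likert_politeness", "mean"),
)
print(summary)
```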

3. Empirical Findings: Human-Computer Interaction and Speech Systems

Prompt politeness is robustly perceivable and modulates user experience:

  • Laboratory HCI studies show that embedding “Give Options” and “Be Friendly” rules in system prompts increases both perceived software and partner politeness (partial $\eta^2 = .08$–$.12$, $p \leq .005$), enhances reported enjoyment, and supports users’ sense of communicative effectiveness, despite leaving objective task performance unchanged (Bar-Or et al., 2022).
  • In AI text-to-speech (TTS), explicit polite prompts produce a statistically significant slowing effect in both Google AI Studio and OpenAI systems: polite prompts increase duration by 7–15 s (d > 1.2–4.2) for AI Studio and 2–3 s (d = .56–4.21) for OpenAI, exceeding effect sizes observed in matched human studies (Rabin et al., 12 Nov 2025); a Cohen’s d sketch follows this list. AI models internalize the politeness–prosody mapping, likely via statistical association rather than explicit social comprehension.
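
The reported effect sizes are Cohen’s d values for between-condition duration differences. A minimal sketch with the pooled-standard-deviation formulation (the duration samples below are synthetic):

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d for two independent samples, pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical utterance durations (seconds) under the two prompt styles.
polite = np.array([54.1, 52.3, 55.0, 53.7, 51.9])
casual = np.array([42.8, 44.0, 41.5, 43.2, 42.1])
print(f"d = {cohens_d(polite, casual):.2f}")  # positive: polite prompts run slower
```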

The inferred mechanism is “stochastic parroting”: repeated exposure to politeness–prosody co-occurrence in training data leads to a learned mapping rather than genuine pragmatic understanding.

4. Prompt Politeness Effects in LLMs: Accuracy, Domain, and Cross-Linguistic Variation

Prompt politeness exerts nuanced, model-dependent effects on LLM performance:

  • Earlier studies on ChatGPT 3.5/Llama2 found that rudeness or impoliteness degraded accuracy (Dobariya et al., 6 Oct 2025, Yin et al., 2024), while for ChatGPT 4o, imperious and even overtly rude prompts outperformed polite ones in multiple-choice QA (accuracy: Very Polite 80.8%, Neutral 82.2%, Very Rude 84.8%; all pairwise differences $p < .05$ except between adjacent tones) (Dobariya et al., 6 Oct 2025).
  • Recent, wide-coverage benchmarking on GPT-4o mini, Gemini 2.0 Flash, and Llama 4 Scout across STEM and Humanities reveals only small, domain- and model-specific effects. Very Rude prompts significantly reduce accuracy on some interpretive tasks for GPT- and Llama-based models (Neutral or Friendly tones gain 1–3 percentage points over Rude), yet Gemini 2.0 Flash remains tone-insensitive (Cai et al., 14 Dec 2025). In STEM and mixed-domain usage, tone effects attenuate to negligible levels.
  • Cross-lingual analysis shows that the “optimal” politeness level for best LLM performance is language- and task-dependent. English and Japanese LLMs favor high or mid-level politeness; overly rude prompts degrade accuracy by 5–8 pp (GPT-3.5: 60.02% at level 8 vs. 51.98% at level 1). In Chinese, moderate exam-style politeness outperforms both extremes (gaps $\leq 2.6$ pp) (Yin et al., 2024). A minimal tone-evaluation sketch follows this list.
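
A minimal evaluation harness for such tone comparisons might look as follows. The `ask_model` client, tone templates, and item format are assumptions for illustration, not taken from any cited benchmark:

```python
from typing import Callable

# Illustrative tone variants of the same underlying question.
TONE_TEMPLATES = {
    "very_polite": "Would you kindly answer the following question? {q}\nPlease reply with only the letter.",
    "neutral":     "{q}\nReply with only the letter.",
    "very_rude":   "Figure this out: {q}\nJust give the letter.",
}

def accuracy_by_tone(items: list[dict], ask_model: Callable[[str], str]) -> dict[str, float]:
    """Run every item under every tone variant and score per-tone accuracy.

    items: [{"q": "...", "answer": "B"}, ...]  (hypothetical format)
    ask_model: callable wrapping whatever LLM API is under test.
    """
    scores = {}
    for tone, template in TONE_TEMPLATES.items():
        correct = sum(
            ask_model(template.format(q=it["q"])).strip().upper().startswith(it["answer"])
            for it in items
        )
        scores[tone] = correct / len(items)
    return scores
```

Because every item is run under every tone, per-tone accuracies can then be compared with a paired test, as in the cited studies.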

Summary lengths and bias scores also exhibit politeness-dependent variation, with extremes (either over-polite or insultingly rude) causing non-monotonic effects, such as longer defensive outputs or bias spikes.

5. Methodologies for Politeness Control and Estimation

To enable precise control and causal inference of politeness-driven effects, several methodologies are established:

  • Fine-grained politeness paraphrasing uses a formal inventory of pragmatic strategies and integer linear programming (ILP) to plan the insertion or deletion of surface politeness markers, followed by explicit realization with a delete-retrieve-generate (DRG) sequence model. This system minimizes sender–listener politeness “misalignment” under both channel noise (e.g., MT back-translation) and personalized receiver models, validated via a significant reduction in mean absolute error ($\mathrm{mae}_{\mathrm{gen}}$ reduced by more than 50%) without sacrificing naturalness (Fu et al., 2020); a toy ILP sketch follows this list.
  • For causal estimation, the TextCause estimator employs distant supervision to improve politeness proxies and neural adjustment via DistilBERT embeddings to control for confounds in observational text (e.g., regulatory complaints). In a US financial complaint corpus, politeness increases timely response probability by 10.3 ± 2.1 pp; naive proxies underestimate such effects due to text-borne confounding (Pryzant et al., 2020). A simplified estimator sketch appears at the end of this section.
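
In miniature, the ILP planning step can be framed as choosing which candidate markers to insert so that a predicted politeness score matches a receiver’s target. The sketch below (using the `pulp` library) assumes additive marker contributions; the inventory, weights, and target values are invented for illustration, and the cited system uses a learned perception model and a far richer strategy set:

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

# Hypothetical additive politeness contributions of candidate surface markers.
markers = {"please": 0.30, "could_you": 0.25, "greeting": 0.20, "thanks": 0.15}
base_score, target = 0.10, 0.60  # invented sentence score and receiver target

prob = LpProblem("politeness_plan", LpMinimize)
use = {m: LpVariable(f"use_{m}", cat=LpBinary) for m in markers}
gap = LpVariable("gap", lowBound=0)  # encodes |achieved - target|

achieved = base_score + lpSum(w * use[m] for m, w in markers.items())
prob += gap                       # objective: minimize misalignment
prob += achieved - target <= gap  # gap bounds the overshoot...
prob += target - achieved <= gap  # ...and the undershoot
prob.solve()

plan = [m for m in markers if use[m].value() == 1]
print("insert markers:", plan, "| residual gap:", gap.value())
```

The selected plan would then be handed to the DRG realizer to produce the actual surface sentence.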

These methodologies ensure reliable manipulation and estimation of politeness effects in both generative and observational settings.
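
To make the adjustment idea concrete, the following is a simplified plug-in (g-computation) estimator on synthetic data: fit an outcome model on text representations plus the politeness proxy, then contrast average predictions with the proxy toggled. This is a stand-in for the TextCause estimator, not its implementation; the actual method additionally improves the proxy via distant supervision and adjusts with DistilBERT embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 16
Z = rng.normal(size=(n, d))                      # stand-in for text embeddings
T = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))  # politeness proxy, confounded by text
p = 1 / (1 + np.exp(-(0.4 * T + Z[:, 0] - 0.5)))
Y = rng.binomial(1, p)                           # outcome: timely response

# Outcome model over (treatment, text representation).
X = np.column_stack([T, Z])
outcome_model = LogisticRegression(max_iter=1000).fit(X, Y)

# g-computation: average predicted outcome with treatment set to 1 vs. 0.
X1 = np.column_stack([np.ones(n), Z])
X0 = np.column_stack([np.zeros(n), Z])
ate = (outcome_model.predict_proba(X1)[:, 1] - outcome_model.predict_proba(X0)[:, 1]).mean()
print(f"adjusted effect of politeness: {ate * 100:.1f} pp")
```

Without the adjustment over the text representation, the confounded proxy would bias the estimated effect, which is the failure mode the cited work addresses.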

6. Practical Implications and Engineering Guidelines

The following procedural and practical guidelines are derived:

  • For HCI, explicit inclusion of option-rich and friendly formulations measurably increases user-perceived politeness and task enjoyment; directness or curtness impairs subjective ratings but does not change objective task outcomes (Bar-Or et al., 2022).
  • In LLM prompt design, politeness level should be customized to language and domain. For English, maximal accuracy requires high politeness; in Chinese, neutral exam-style prompts best align with model training data; for Japanese, moderate Keigo suffices (Yin et al., 2024).
  • Automated prompt-building is supported by a two-phase logic: detect the query language $L$, map $(L, \mathrm{task})$ to an optimal politeness level $\ell_{\mathrm{opt}}$, and synthesize the corresponding surface prompt (Yin et al., 2024); a sketch follows this list.
  • Rude or insulting prompts may transiently boost certain LLM accuracy under specific architectures (e.g., ChatGPT 4o), but ethical guidelines dissuade hostile interface design and favor concise, direct, non-demeaning alternatives (Dobariya et al., 6 Oct 2025, Cai et al., 14 Dec 2025).
  • For voice-based AI, polite prompts reliably slow prosody, supporting adoption in applications demanding high social appropriateness or norm reinforcement (Rabin et al., 12 Nov 2025).
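
The two-phase prompt-building logic above can be sketched directly. The lookup table and templates below are illustrative placeholders keyed to the 8-point scale of Yin et al. (2024), not values taken from the paper:

```python
# Phase 1: detect the query language L (here passed in directly; a detector
# such as langdetect could supply it). Phase 2: map (L, task) to an optimal
# politeness level and synthesize the surface prompt.

OPTIMAL_LEVEL = {  # (language, task) -> level (1 = rude, 8 = most polite); illustrative
    ("en", "qa"): 7,
    ("zh", "qa"): 4,   # moderate, exam-style
    ("ja", "qa"): 5,   # moderate keigo
}

TEMPLATES = {  # level band -> surface frame (hypothetical wording)
    "high": "Could you please answer the following question? {q}",
    "mid":  "Answer the following question. {q}",
    "low":  "{q}",
}

def build_prompt(question: str, lang: str, task: str = "qa") -> str:
    level = OPTIMAL_LEVEL.get((lang, task), 5)  # default to mid-level politeness
    band = "high" if level >= 6 else "mid" if level >= 3 else "low"
    return TEMPLATES[band].format(q=question)

print(build_prompt("What is the boiling point of water?", "en"))
```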

7. Limitations and Open Directions

Identified limitations include:

  • Generalizability across languages, genres, and communication settings remains incompletely tested; most studies focus on English and restricted task types (Rabin et al., 12 Nov 2025, Yin et al., 2024).
  • Most assessment captures only gross outcomes (accuracy, duration), with little joint modeling of pitch, intonation, conversational repair, or real-time dialogue adaptation.
  • LLMs’ implicit procedural encoding of politeness may not correspond to declarative reasoning or explanation; the relation between surface cues and deeper linguistic competence remains unresolved.
  • Model-dependent differences reflect unknowns in proprietary data composition and RLHF tuning (e.g., Gemini 2.0 Flash’s tone insensitivity vs. GPT-4o’s adversarial prompt responsiveness) (Cai et al., 14 Dec 2025).
  • Fine-grained or user-personalized politeness alignment is limited by linearity assumptions in perception models and exhaustiveness of strategy sets (Fu et al., 2020).

Future work should extend to richer contextual pragmatics, more culturally diverse LLMs, and live interactive evaluation of politeness perception and compliance.


Selected Reference Table:

| Study | Domain/Modality | Key Politeness Effect |
|---|---|---|
| (Bar-Or et al., 2022) | HCI game/text prompts | ↑ Perceived politeness, enjoyment; ↔ objective task performance |
| (Rabin et al., 12 Nov 2025) | TTS/voice | Polite prompt → slower prosody (d > 1.2–4.2) |
| (Dobariya et al., 6 Oct 2025; Cai et al., 14 Dec 2025) | LLM QA | ChatGPT 4o: rudeness ↑ accuracy; GPT-4o mini/Llama: rude ↓ accuracy in Humanities |
| (Yin et al., 2024) | LLM/multilingual | Optimal politeness level varies by language/task |
| (Fu et al., 2020) | Paraphrase generation | ILP planning + DRG minimizes sender–receiver politeness misalignment |
| (Pryzant et al., 2020) | Observational complaints | Politeness ↑ timely response by 10.3 pp |

All claims, statistics, and workflow details derive directly from the referenced empirical works.
