- The paper demonstrates that LLM accuracy increases from 80.8% to 84.8% as prompts shift from very polite to very rude, challenging conventional expectations.
- It employs a controlled dataset of 250 prompts (50 base questions, each rewritten in five tone variants) and uses paired sample t-tests to validate statistically significant differences across tone levels.
- The findings suggest that while impolite prompts improve model performance, ethical concerns remain, motivating further research into tone effects in LLMs.
Effects of Prompt Politeness on LLM Accuracy: An Empirical Analysis
Introduction
This paper presents a systematic investigation into the impact of prompt politeness on the accuracy of LLMs, specifically ChatGPT-4o, when answering multiple-choice questions. The paper addresses a gap in prompt engineering literature by focusing on the pragmatic dimension of prompt tone, operationalized across five levels from "Very Polite" to "Very Rude." The authors construct a controlled dataset and employ rigorous statistical analysis to quantify the relationship between prompt tone and model performance, challenging prior assumptions about the benefits of polite prompting.
Methodology
The experimental design centers on a dataset of 50 base multiple-choice questions spanning mathematics, science, and history, each rewritten into five tone variants, yielding 250 unique prompts. Standardized answer-format instructions are appended to each prompt to ensure consistent response formatting. The politeness spectrum is defined by explicit linguistic cues: neutral prompts carry no tone-specific prefix, while rude and very rude prompts incorporate disparaging or imperious language.
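The construction step is easy to picture in code. The sketch below is a minimal illustration under assumed tone prefixes and a hypothetical formatting instruction; it does not reproduce the paper's actual wording or question set.

```python
# Illustrative sketch of the prompt construction described above; the tone
# prefixes and the answer instruction are hypothetical stand-ins, not the
# paper's actual templates.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to answer the following question?",
    "polite": "Please answer the following question.",
    "neutral": "",  # neutral prompts carry no tone-specific prefix
    "rude": "If you're not completely clueless, answer this:",
    "very_rude": "You poor thing, do you even know how to solve this?",
}

ANSWER_INSTRUCTIONS = "Answer with only the letter (A, B, C, or D) of the correct option."

def build_prompts(base_questions):
    """Expand each base question into one prompt per tone variant."""
    prompts = []
    for qid, question in enumerate(base_questions):
        for tone, prefix in TONE_PREFIXES.items():
            body = f"{prefix}\n{question}".strip()
            prompts.append({
                "question_id": qid,
                "tone": tone,
                "text": f"{body}\n{ANSWER_INSTRUCTIONS}",
            })
    return prompts

# 50 base questions x 5 tone variants -> 250 unique prompts
```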
The evaluation pipeline uses a Python script to automate prompt submission to ChatGPT-4o, parse the responses, and compute accuracy. Each tone variant is tested across 10 runs to account for stochasticity in model outputs. Paired sample t-tests are conducted to assess the statistical significance of accuracy differences between tone levels, with a null hypothesis of no difference in mean accuracy.
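A minimal sketch of such an evaluation loop is shown below, assuming the standard `openai` Python client. The paper's actual script is not reproduced in this summary, so the function names, answer parsing, and run bookkeeping are illustrative assumptions.

```python
# Hypothetical sketch of the evaluation loop, assuming the `openai` client
# and a simple letter-extraction parse; details are illustrative.
import re
from collections import defaultdict
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
N_RUNS = 10

def ask(prompt_text):
    """Submit one prompt and return the answer letter the model chose, if any."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_text}],
    )
    match = re.search(r"\b([A-D])\b", resp.choices[0].message.content)
    return match.group(1) if match else None

def evaluate(prompts, answer_key):
    """Return per-tone accuracies across runs (tone -> list of N_RUNS values)."""
    per_tone = defaultdict(list)
    for _ in range(N_RUNS):
        correct, total = defaultdict(int), defaultdict(int)
        for p in prompts:
            total[p["tone"]] += 1
            if ask(p["text"]) == answer_key[p["question_id"]]:
                correct[p["tone"]] += 1
        for tone in total:
            per_tone[tone].append(correct[tone] / total[tone])
    return per_tone
```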
Results
Contrary to prevailing expectations and prior findings (e.g., Yin et al., 2024), the paper demonstrates a monotonic increase in accuracy as prompt tone shifts from very polite to very rude. Specifically, average accuracy rises from 80.8% (Very Polite) to 84.8% (Very Rude), with all pairwise comparisons between polite and rude tones yielding statistically significant differences (p<0.05). Neutral prompts outperform polite ones, but are themselves outperformed by rude and very rude prompts. These results are robust across multiple runs and question domains.
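Because each run scores the same questions under every tone, the per-run accuracies for two tone levels form paired samples. The snippet below shows how such a pairwise comparison might be run with SciPy's paired t-test; the accuracy values are placeholders, not the paper's reported data.

```python
# Minimal sketch of one pairwise significance test between tone levels.
# The per-run accuracies below are illustrative placeholders.
from scipy.stats import ttest_rel

very_polite = [0.81, 0.80, 0.82, 0.79, 0.81, 0.80, 0.82, 0.81, 0.80, 0.82]
very_rude   = [0.85, 0.84, 0.86, 0.84, 0.85, 0.84, 0.85, 0.86, 0.84, 0.85]

t_stat, p_value = ttest_rel(very_polite, very_rude)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis of equal mean accuracy when p < 0.05.
```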
The authors note that their findings diverge from earlier studies on ChatGPT-3.5 and Llama2-70B, which reported degraded performance for impolite prompts. The discrepancy is attributed to differences in model architecture, training data, and possibly the operationalization of rudeness. The results suggest that newer LLMs may be less sensitive to the emotional valence of prompt language, or may even exhibit improved task focus when presented with imperious or adversarial phrasing.
Discussion
The empirical evidence presented raises important questions about the mechanisms underlying LLM sensitivity to prompt tone. The authors speculate that the observed effect may be related to prompt perplexity, length, or the presence of imperative structures, rather than the emotional payload per se. This aligns with recent work suggesting that LLM performance is not solely determined by data similarity or sociolinguistic features, but may be influenced by low-level linguistic properties (Gonen et al., 2022).
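If prompt perplexity is indeed part of the explanation, it is at least cheap to measure. The sketch below scores tone variants with GPT-2 via Hugging Face Transformers as an off-the-shelf proxy scorer; this is an assumption of convenience for illustration, not the authors' analysis.

```python
# Hypothetical sketch: compare the perplexity of differently toned prompt
# variants under a small causal LM (GPT-2 used only as a convenient proxy).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """Perplexity of `text` under GPT-2 (lower = more 'expected' phrasing)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

for variant in ["Could you kindly solve this problem?", "Solve this problem."]:
    print(f"{perplexity(variant):8.2f}  {variant}")
```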
The practical implications are nontrivial: prompt engineering strategies that optimize for politeness may inadvertently reduce model accuracy in certain settings. However, the authors caution against deploying rude or toxic prompts in real-world applications, citing ethical concerns and potential negative impacts on user experience. The findings underscore the need for further research into the interaction between prompt pragmatics and LLM behavior, including cross-model and cross-linguistic replication.
Limitations
The paper's primary limitations include the modest dataset size (50 base questions), reliance on a single LLM (ChatGPT-4o), and focus on multiple-choice accuracy as the sole performance metric. The operationalization of politeness and rudeness is based on specific English-language cues, which may not generalize across cultures or languages. Preliminary results on other models (Claude, ChatGPT o3) suggest model-dependent effects, warranting broader evaluation.
Implications and Future Directions
The findings have both theoretical and practical significance. They challenge the assumption that polite prompting universally enhances LLM performance, and suggest that model architecture and training data may modulate sensitivity to pragmatic features. Future research should explore:
- The role of prompt length, perplexity, and syntactic structure in mediating tone effects.
- Cross-linguistic and cross-cultural generalization of tone sensitivity.
- The impact of tone on other dimensions of LLM output, such as reasoning, coherence, and fluency.
- Mechanisms for achieving performance gains without resorting to adversarial or toxic phrasing.
Conclusion
This paper provides compelling evidence that prompt politeness exerts a statistically significant and counterintuitive influence on LLM accuracy in multiple-choice settings. Rude and very rude prompts elicit higher accuracy from ChatGPT-4o than polite or neutral ones, contradicting prior findings on earlier models. The results highlight the complexity of LLM prompt sensitivity and the need for nuanced prompt engineering practices that balance performance optimization with ethical considerations. Further research is required to elucidate the underlying mechanisms and to generalize findings across models, languages, and task types.