Mechanism of Prompt Phrasing Effects on LLM Accuracy

Determine the specific mechanisms by which variations in prompt phrasing, including levels of politeness and rudeness, affect the accuracy of responses produced by large language models such as ChatGPT-4o on multiple-choice question tasks.

Background

The paper evaluates ChatGPT-4o on a dataset of 50 base multiple-choice questions rewritten into five tone variants (Very Polite, Polite, Neutral, Rude, Very Rude), resulting in 250 prompts. Accuracy differences across tones were statistically significant, with impolite prompts generally outperforming polite ones.
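The 50-by-5 design described above can be sketched as a simple cross product of questions and tones, followed by a per-tone accuracy tally. This is a minimal illustration, not the authors' code: the tone labels come from the paper, but the prefix wording, the helper names `build_prompts` and `accuracy_by_tone`, and the placeholder questions are all assumptions made for the example.

```python
from itertools import product

# Tone labels as named in the study; the prefix wording is hypothetical.
TONE_PREFIXES = {
    "Very Polite": "Would you be so kind as to answer the following question? ",
    "Polite": "Please answer the following question. ",
    "Neutral": "",
    "Rude": "Just answer this already. ",
    "Very Rude": "Answer this, if you can even manage it. ",
}


def build_prompts(base_questions):
    """Cross every base question with every tone variant."""
    return [
        {"question_id": qid, "tone": tone, "prompt": prefix + text}
        for (qid, text), (tone, prefix) in product(
            enumerate(base_questions), TONE_PREFIXES.items()
        )
    ]


def accuracy_by_tone(results):
    """Map an iterable of (tone, is_correct) pairs to {tone: accuracy}."""
    totals, correct = {}, {}
    for tone, is_correct in results:
        totals[tone] = totals.get(tone, 0) + 1
        correct[tone] = correct.get(tone, 0) + int(is_correct)
    return {tone: correct[tone] / totals[tone] for tone in totals}


# Placeholder questions stand in for the 50 base items (assumption).
base_questions = [f"Placeholder question {i}" for i in range(50)]
prompts = build_prompts(base_questions)
```

Because every base question appears once under each tone, the per-tone accuracies are paired by `question_id`, which is what makes a paired significance test across tones appropriate for this kind of design.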

Despite these findings, the authors explicitly note that the causal or algorithmic pathways by which phrasing influences performance are not understood. This uncertainty motivates an open problem to identify how tone-related linguistic features alter model behavior and accuracy.

References

"At any rate, while LLMs are sensitive to the actual phrasing of the prompt, it is not clear how exactly it affects the results."

— Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper) (2510.04950 - Dobariya et al., 6 Oct 2025) in Section 5, Discussion and conclusions
