
Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework (2505.03563v2)

Published 6 May 2025 in cs.CL

Abstract: LLMs are sensitive to subtle changes in prompt phrasing, complicating efforts to audit them reliably. Prior approaches often rely on arbitrary or ungrounded prompt variations, which may miss key linguistic and demographic factors in real-world usage. We introduce AUGMENT (Automated User-Grounded Modeling and Evaluation of Natural Language Transformations), a framework for systematically generating and evaluating controlled, realistic prompt paraphrases based on linguistic structure and user demographics. AUGMENT ensures paraphrase quality through a combination of semantic, stylistic, and instruction-following criteria. In a case study on the BBQ dataset, we show that user-grounded paraphrasing leads to significant shifts in LLM performance and bias metrics across nine models. Our findings highlight the need for more representative and structured approaches to prompt variation in LLM auditing.


Summary

An Expert Overview of "Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework"

The paper "Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework" presents an important contribution to understanding LLMs and their sensitivity to prompt variations. Given the increasing reliance on LLMs in diverse applications, the paper addresses a notable gap in how these models are evaluated: how subtle changes in prompt wording can influence model outputs.

The paper introduces AUGMENT (Automated User-Grounded Modeling and Evaluation of Natural Language Transformations), a controlled paraphrasing framework built on a taxonomy of minimal linguistic transformations, focusing on natural prompt variations rather than mere formatting adjustments. Previous research had not thoroughly captured the natural variability of language prompts encountered in real-world usage. The authors propose a systematic approach to paraphrasing using a predefined set of linguistic transformations drawn from Bhagat and Hovy's taxonomy, with particular emphasis on modifications such as preposition changes and dialect transformations into African American Vernacular English (AAVE). Paraphrase quality is enforced through a combination of semantic, stylistic, and instruction-following criteria.
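As a rough illustration of how such a controlled paraphrasing pipeline could be assembled, the sketch below pairs an instruction-following model that applies one named transformation at a time with an automatic semantic-similarity filter. The call_llm placeholder, the transformation prompts, the embedding model, and the 0.85 threshold are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a controlled paraphrasing pipeline in the spirit of AUGMENT.
# `call_llm` is a placeholder for any instruction-following LLM API; the
# transformation wording and similarity threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

TRANSFORMATIONS = {
    "preposition_change": "Rewrite the prompt, changing only a preposition while preserving meaning.",
    "aave_dialect": "Rewrite the prompt in African American Vernacular English, preserving meaning.",
}

_encoder = SentenceTransformer("all-MiniLM-L6-v2")


def call_llm(instruction: str, text: str) -> str:
    """Placeholder for an instruction-following LLM call."""
    raise NotImplementedError


def paraphrase(prompt: str, transformation: str, min_similarity: float = 0.85):
    """Apply one named linguistic transformation and keep the candidate only if
    it stays semantically close to the original prompt."""
    candidate = call_llm(TRANSFORMATIONS[transformation], prompt)
    emb = _encoder.encode([prompt, candidate], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return candidate if similarity >= min_similarity else None
```

In practice, rejected candidates would be regenerated or dropped, so that every prompt retains a comparable set of validated paraphrases before any model is audited.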

To demonstrate the utility of the proposed framework, the authors employ it in the context of stereotype evaluation using the BBQ dataset, which examines LLMs' biases across various social dimensions. The dataset features both ambiguous and disambiguated contexts, which makes it possible to separate bias expressed under uncertainty from errors made when the correct answer is explicitly supported by the context.
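For readers unfamiliar with BBQ, each item pairs a context with a question and three answer options, including an "unknown"-style choice that is correct when the context is ambiguous. The record below is an illustrative, paraphrased example of that structure, not a verbatim item from the dataset.

```python
# Illustrative BBQ-style item (paraphrased, not copied from the dataset).
# In the ambiguous condition the correct answer is the "unknown" option;
# the disambiguating context adds evidence that licenses a specific answer.
bbq_item = {
    "category": "Age",
    "ambiguous_context": "A grandparent and a teenager were talking about phones.",
    "disambiguating_context": "The grandparent was explaining a new app to the teenager.",
    "question": "Who was struggling to use the phone?",
    "options": ["The grandparent", "The teenager", "Cannot be determined"],
    "label_ambiguous": "Cannot be determined",
    "label_disambiguated": "The teenager",
}
```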

The results indicate that even minor paraphrasing can significantly affect model behavior, emphasizing the necessity of robust, paraphrase-aware evaluation protocols. The variability revealed through controlled paraphrases adds a further layer of complexity to the evaluation and auditing of LLMs: the study reports considerable variation in accuracy, consistency, and bias metrics across the nine models and across prompt formulations.
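One way to operationalize paraphrase-aware evaluation is sketched below, under the assumption that each item stores the model's answer for the original prompt and for every paraphrase variant. The record layout and the consistency definition are illustrative, not the paper's exact formulas.

```python
from collections import Counter


def paraphrase_metrics(records):
    """records: iterable of {"gold": str, "answers": {"original": str, "preposition_change": str, ...}}.
    Returns per-variant accuracy and the fraction of items for which all
    prompt variants yield the same answer (a simple consistency measure)."""
    variant_correct = Counter()
    variant_total = Counter()
    consistent = 0
    for rec in records:
        answers = rec["answers"]
        for variant, answer in answers.items():
            variant_total[variant] += 1
            variant_correct[variant] += int(answer == rec["gold"])
        if len(set(answers.values())) == 1:
            consistent += 1
    accuracy = {v: variant_correct[v] / variant_total[v] for v in variant_total}
    consistency = consistent / len(records)
    return accuracy, consistency
```

Comparing these per-variant accuracies against the original-prompt accuracy is one straightforward way to surface the paraphrase-induced shifts the paper reports.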

The paper underscores the essential need for rigorous evaluation methodologies that account for linguistic diversity and variability in prompt design. The implications of this work are twofold: practically, it calls for enhanced prompt design protocols in the deployment of LLMs; theoretically, it prompts further investigation into the linguistic mechanisms underpinning these sensitivities.

In future research, extending this framework to cover more stereotype categories within the BBQ dataset and beyond could provide a more comprehensive understanding of bias in LLMs. Furthermore, incorporating additional paraphrase types and examining their effects in open-ended text generation settings could yield deeper insights into the linguistic robustness of these models.

Overall, the paper contributes valuable insights into the evaluation of LLMs and highlights the delicate interplay between prompt design and model output, paving the way for more reliable and equitable AI deployments.
