Statistical Hypothesis Testing for Auditing Robustness in Language Models (2506.07947v1)

Published 9 Jun 2025 in cs.CL

Abstract: Consider the problem of testing whether the outputs of an LLM system change under an arbitrary intervention, such as an input perturbation or changing the model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While existing methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework is (i) model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values, (iv) supports multiple perturbations via controlled error rates, and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.

Summary

  • The paper introduces distribution-based perturbation analysis (DBPA) to reformulate language model robustness evaluation as a statistical hypothesis testing problem.
  • It employs Monte Carlo sampling to build empirical output distributions and measures discrepancies with metrics like the Jensen-Shannon divergence.
  • Empirical results indicate that larger models, such as GPT-4, demonstrate significantly higher stability in high-stakes applications than smaller models.

Statistical Hypothesis Testing for Auditing Robustness in LLMs

The paper "Statistical Hypothesis Testing for Auditing Robustness in LLMs" presents a novel approach to evaluate the robustness of LLMs through the framework of statistical hypothesis testing. The authors propose a method known as distribution-based perturbation analysis (DBPA), aimed at systematically measuring the impact of interventions, such as input perturbations or changes in model variants, on the outputs of LLM systems. This approach reformulates the sensitivity analysis of model outputs into a frequentist hypothesis testing problem, enabling tractable inferences without restrictive distributional assumptions.

Overview of the Approach

The core premise of DBPA is to construct empirical output distributions in a low-dimensional semantic similarity space using Monte Carlo sampling. Rather than directly comparing individual outputs from the LLMs—which is problematic due to their stochastic nature—DBPA compares empirical approximations of the output distributions. This is achieved by:

  1. Response Sampling: Independent samples are drawn from the model outputs for both the original prompt and the perturbed prompt.
  2. Distribution Construction: Using a similarity function, null and alternative distributions are constructed based on the semantic similarity between the outputs.
  3. Distributional Comparison: Discrepancies between the null and alternative distributions are quantified using a non-negative functional, such as the Jensen-Shannon divergence.
  4. Statistical Inference: Permutation testing is utilized to derive interpretable p-values, assessing the significance of the observed distributional shifts.

Through this methodological structure, DBPA evaluates whether observed changes in outputs are statistically significant, allowing researchers to quantify the effect sizes of such interventions; a minimal sketch of these steps is given below.
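The following Python sketch illustrates steps 2–4 (distribution construction, distributional comparison, statistical inference). The inputs are assumed to be embedding vectors of already-sampled responses; the cosine similarity measure, histogram binning, and permutation over pooled similarity samples are illustrative choices, not necessarily the authors' exact implementation.

```python
# Hedged sketch of the DBPA pipeline: similarity distributions, JS divergence,
# and a permutation-based p-value. Design choices here are illustrative.
import numpy as np
from scipy.spatial.distance import jensenshannon


def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_samples(ref_embs, cand_embs):
    """All pairwise similarities between a reference batch and a candidate batch."""
    return np.array([cosine_sim(r, c) for r in ref_embs for c in cand_embs])


def js_divergence(sims_p, sims_q, bins=20):
    """Jensen-Shannon divergence between two binned similarity distributions."""
    lo = min(sims_p.min(), sims_q.min())
    hi = max(sims_p.max(), sims_q.max()) + 1e-9
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(sims_p, bins=edges)
    q, _ = np.histogram(sims_q, bins=edges)
    p = p + 1e-12  # avoid empty bins
    q = q + 1e-12
    return float(jensenshannon(p / p.sum(), q / q.sum()) ** 2)


def permutation_pvalue(null_sims, alt_sims, n_perm=1000, seed=0):
    """Permutation test: how often does a random relabelling of the pooled
    similarity samples yield a divergence at least as large as observed?"""
    rng = np.random.default_rng(seed)
    observed = js_divergence(null_sims, alt_sims)
    pooled = np.concatenate([null_sims, alt_sims])
    n = len(null_sims)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if js_divergence(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Given embeddings orig_a and orig_b for two independent response batches to the original prompt and pert for the perturbed prompt, similarity_samples(orig_a, orig_b) would serve as the null distribution, similarity_samples(orig_a, pert) as the alternative, and permutation_pvalue would return the Jensen-Shannon effect size alongside its p-value.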

Empirical Results and Implications

The paper provides empirical validation through several case studies that illustrate the utility of DBPA. One highlighted application involves auditing LLM responses under role-playing prompts in a medical context. The results suggest that larger models such as GPT-4 exhibit significant stability in responses across medically relevant roles, whereas smaller models show notable variability—an indication of lower robustness.

Furthermore, the authors explore the true positive and false positive rates of LLMs when faced with control and target perturbations. They emphasize the practical utility of DBPA in selecting models based on error thresholds in high-stakes applications like healthcare, showcasing how different models yield varying sensitivity to perturbations.
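Assuming such a test has been run once per prompt for a set of control perturbations (expected not to change the response distribution) and target perturbations (expected to change it), the true and false positive rates at a chosen significance level follow from counting rejections. The helper below is a hedged sketch of that bookkeeping, not the authors' evaluation code.

```python
# Hedged sketch: turning per-prompt DBPA p-values into detection rates.
# pvals_target and pvals_control are assumed to come from running the test
# above once per prompt, for target and control perturbations respectively.
import numpy as np


def detection_rates(pvals_target, pvals_control, alpha=0.05):
    """True/false positive rates at significance level alpha."""
    tpr = float(np.mean(np.asarray(pvals_target) < alpha))   # target perturbations flagged
    fpr = float(np.mean(np.asarray(pvals_control) < alpha))  # control perturbations flagged
    return tpr, fpr
```

Comparing these rates across candidate models at a fixed alpha is one way the error-threshold-based model selection described above could be operationalized.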

Another significant application discussed is evaluating alignment with reference models. By holding prompts constant and varying the model, DBPA measures the alignment between model outputs, providing insights into the consistency and reliability across different LLMs.
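As a hedged sketch of this use case, the helpers from the first code block can be reused with the "perturbation" being a change of model rather than of prompt. The function sample_and_embed below is a hypothetical placeholder that would, in practice, query the named model repeatedly and embed each response; here it returns random vectors so the sketch runs end to end.

```python
# Model-alignment sketch, reusing similarity_samples and permutation_pvalue
# from the earlier block. sample_and_embed is a hypothetical stand-in.
import numpy as np


def sample_and_embed(model_name, prompt, n=30, dim=384, seed=0):
    """Placeholder: replace with real model sampling plus response embedding."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n, dim))


prompt = "Summarize the key risks of this treatment plan."  # illustrative prompt
ref_a = sample_and_embed("reference-model", prompt, seed=0)  # reference batch 1
ref_b = sample_and_embed("reference-model", prompt, seed=1)  # reference batch 2 (null)
cand = sample_and_embed("candidate-model", prompt, seed=2)   # candidate model batch

null_sims = similarity_samples(ref_a, ref_b)  # reference vs. itself
alt_sims = similarity_samples(ref_a, cand)    # reference vs. candidate
effect_size, p_value = permutation_pvalue(null_sims, alt_sims)
```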

Technical Insights and Challenges

The novelty of DBPA lies in its ability to compare output distributions rather than single outputs, which is critical given the vast token space and inherent stochasticity of LLMs. Monte Carlo sampling plays a crucial role in surmounting the computational intractability of estimating full output distributions, and the semantic similarity mapping projects the high-dimensional output space into a tractable low-dimensional space, facilitating interpretable analysis and efficient hypothesis testing.

However, the design choices—such as the selection of similarity measures and distance metrics—impact the sensitivity and outcomes of the testing framework. Embedding choices also play a vital role, as demonstrated in the experiments showing variability in effect sizes across different embedding models.
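One way to probe this sensitivity, sketched below under the assumption that sentence-transformers models serve as embedders (the specific model names are illustrative, not those used in the paper), is to recompute the effect size and p-value under several embedding models and compare the results; the sketch reuses similarity_samples and permutation_pvalue from the first code block.

```python
# Hedged sketch of an embedding-sensitivity check. orig_a and orig_b are two
# independent batches of response strings for the original prompt; pert holds
# responses to the perturbed prompt. Model names are illustrative.
from sentence_transformers import SentenceTransformer


def embedding_sensitivity(orig_a, orig_b, pert,
                          model_names=("all-MiniLM-L6-v2", "all-mpnet-base-v2")):
    """Recompute the DBPA effect size and p-value under several embedders."""
    results = {}
    for name in model_names:
        encoder = SentenceTransformer(name)
        e_a, e_b, e_p = encoder.encode(orig_a), encoder.encode(orig_b), encoder.encode(pert)
        null_sims = similarity_samples(e_a, e_b)
        alt_sims = similarity_samples(e_a, e_p)
        results[name] = permutation_pvalue(null_sims, alt_sims)  # (effect size, p-value)
    return results
```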

Future Directions

The proposed frequentist framework opens avenues for further research in refining LLM auditing tools. The theoretical groundwork established here suggests potential for enhanced model accountability and transparency, aligning well with emerging regulatory and compliance requirements. As AI systems become more integrated into critical domains, the need for robust, interpretable evaluation frameworks becomes imperative. Future studies may extend DBPA to broader contexts, refine its application in diverse domains, and enhance the granularity of analysis through adaptive similarity measures and embedding techniques. Developing guidelines for optimal design choices based on specific application needs would further increase the practical utility of this promising approach.
