Sensitivity of LLMs to Prompt Formatting
The paper "Quantifying LLMs' Sensitivity to Spurious Features in Prompt Design" explores the sensitivity of LLMs to changes in prompt formatting, which do not alter the semantic content of the prompt. Focusing on formatting as a spurious feature, the paper reveals significant performance variation across LLMs, affecting evaluation tasks in few-shot learning scenarios.
Key Findings
The research demonstrates that subtle changes in prompt formatting can lead to substantial performance shifts. Evaluations of several open-source models, such as LLaMA-2 and Falcon, showed performance varying by up to 76 accuracy points for LLaMA-2-13B. Increasing model size, adding more few-shot examples, or applying instruction tuning did not eliminate this sensitivity. The inherent variance underscores the necessity of reporting a performance range over multiple prompt formats rather than relying on a single-format evaluation.
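To make the recommendation concrete, the sketch below shows one way an evaluation harness could report a performance range instead of a single score. It is a minimal illustration, not the paper's harness: `query_model` is a hypothetical callable mapping a prompt string to a prediction string, and the format fields are assumptions chosen for the example.

```python
# Minimal sketch of reporting a (worst, best) accuracy range over several
# semantically equivalent prompt formats. `query_model` is a hypothetical
# callable; the format fields below are illustrative assumptions.

from typing import Callable

def render_prompt(fmt: dict, example: dict) -> str:
    """Render one example under a specific formatting choice."""
    field_sep = fmt.get("field_separator", ": ")
    item_sep = fmt.get("item_separator", "\n")
    q_label = fmt.get("question_label", "Question")
    a_label = fmt.get("answer_label", "Answer")
    return f"{q_label}{field_sep}{example['question']}{item_sep}{a_label}{field_sep}"

def accuracy(fmt: dict, dataset: list[dict], query_model: Callable[[str], str]) -> float:
    """Exact-match accuracy of the model under one prompt format."""
    correct = sum(
        query_model(render_prompt(fmt, ex)).strip() == ex["label"] for ex in dataset
    )
    return correct / len(dataset)

def performance_range(formats: list[dict], dataset: list[dict],
                      query_model: Callable[[str], str]) -> tuple[float, float]:
    """Return the (worst, best) accuracy observed across the candidate formats."""
    scores = [accuracy(f, dataset, query_model) for f in formats]
    return min(scores), max(scores)
```

Reporting both ends of this range (or the full distribution of scores) conveys how much of a benchmark number is attributable to formatting rather than to the model itself.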
Moreover, the paper presents evidence that per-format performance correlates only weakly across models. This challenges current methodologies for comparing LLMs, since an arbitrarily fixed prompt format acts as a significant confounding variable.
Tools and Methodology
To systematically assess this sensitivity, the authors propose an algorithm named *FormatSpread*, which leverages Bayesian optimization to compute the spread of model performance across a range of plausible prompt formats within a specified computational budget. Crucially, this method operates without requiring access to model weights, enabling its application to API-gated models such as GPT-3.5.
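The paper's exact search procedure is more involved, but the flavor of a budget-constrained Bayesian search can be sketched with a Thompson-sampling bandit over candidate formats. Everything below is an illustrative simplification, not the authors' implementation: `score_one`, the Beta priors, and the default budget are assumptions.

```python
# Illustrative Thompson-sampling sketch of a budget-limited search over prompt
# formats. `score_one(fmt, example) -> bool` is a hypothetical callable that
# evaluates the model on a single example under a given format. This is a
# simplification in the spirit of FormatSpread, not its actual code.

import random
from typing import Callable

def thompson_search(formats: list[dict], dataset: list[dict],
                    score_one: Callable[[dict, dict], bool],
                    budget: int = 500, maximize: bool = True) -> tuple[dict, float]:
    """Spend `budget` single-example evaluations hunting for an extreme format."""
    successes = [1] * len(formats)   # Beta(1, 1) priors over per-format accuracy
    failures = [1] * len(formats)
    for _ in range(budget):
        # Sample a plausible accuracy for every format from its posterior, then
        # spend one evaluation on the most promising (or worst-looking) format.
        draws = [random.betavariate(s, f) for s, f in zip(successes, failures)]
        i = (max if maximize else min)(range(len(formats)), key=draws.__getitem__)
        correct = score_one(formats[i], random.choice(dataset))
        successes[i] += int(correct)
        failures[i] += int(not correct)

    def posterior_mean(j: int) -> float:
        return successes[j] / (successes[j] + failures[j])

    pick = (max if maximize else min)(range(len(formats)), key=posterior_mean)
    return formats[pick], posterior_mean(pick)
```

Running this search twice, once maximizing and once minimizing, yields an estimate of the spread (best minus worst accuracy) within a fixed evaluation budget, which matches the spirit of how FormatSpread reports a performance range.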
The algorithm efficiently explores the format space, finding a spread of up to 56 accuracy points in GPT-3.5, all while maintaining a median computational cost below 10 USD per task. By analyzing prompt embeddings, the paper correlates separability in continuous prompt representations with observed performance variability, highlighting the non-monotonic nature of the formatting space.
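One way to probe the separability claim, sketched here under assumptions (a hypothetical `embed` function returning a fixed-size vector per prompt, plus the `render_prompt` helper from the earlier sketch), is to train a small classifier to tell two formats apart and compare its accuracy with the observed performance gap.

```python
# A hedged sketch of measuring how "separable" two formats are in embedding
# space: a linear probe is trained to guess which format produced each prompt
# embedding. `embed` is a hypothetical function returning a fixed-size vector
# (e.g., a last-layer hidden state); the paper's exact probing setup may differ.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability(format_a: dict, format_b: dict, dataset: list[dict],
                 embed, render_prompt) -> float:
    """Cross-validated probe accuracy: ~0.5 means inseparable, ~1.0 very separable."""
    X, y = [], []
    for example in dataset:
        X.append(embed(render_prompt(format_a, example)))
        y.append(0)
        X.append(embed(render_prompt(format_b, example)))
        y.append(1)
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, np.array(X), np.array(y), cv=5).mean())
```

Correlating this score with the accuracy gap between the two formats, over many format pairs, is one way to reproduce the kind of analysis the paper reports.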
Implications
Practical Implications:
- The findings point to the need for rigorous benchmarking practices when deploying LLMs, particularly in contexts where model robustness and reliability are critical.
- Systems built on LLMs should incorporate procedures to handle format sensitivity, ensuring models are evaluated fairly across a range of plausible prompt formats (one simple way to enumerate such formats is sketched below).
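As a concrete starting point for such procedures, one can enumerate meaning-preserving format variants and evaluate over all of them. The grammar below (separators, spacing, casing) is an illustrative assumption, not the paper's exact format grammar.

```python
# Illustrative enumeration of semantically equivalent prompt-format variants
# for robustness checks before deployment. The specific separators and casings
# are examples, not the paper's grammar.

from itertools import product

FIELD_SEPARATORS = [": ", " - ", ":\n", " :: "]
ITEM_SEPARATORS = ["\n", "\n\n", " ||| "]
CASINGS = [str.title, str.upper, str.lower]

def format_variants():
    """Yield dicts describing plausible, meaning-preserving prompt formats."""
    for field_sep, item_sep, casing in product(FIELD_SEPARATORS, ITEM_SEPARATORS, CASINGS):
        yield {
            "field_separator": field_sep,
            "item_separator": item_sep,
            "question_label": casing("question"),
            "answer_label": casing("answer"),
        }
```

These variant dicts plug directly into the `render_prompt` and `performance_range` sketch above, so a deployment check can report a worst-case score alongside the average.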
Theoretical Implications:
- The research highlights the intersection of prompt engineering and model interpretability, suggesting avenues for future studies to develop more resilient models or prompt design strategies that minimize sensitivity to non-semantic prompt features.
Future Directions:
- Future research may focus on training techniques that regularize model responses to diverse formatting, enhancing robustness and predictability.
- Investigating the impact of additional spurious features beyond prompt formatting could further elucidate the limitations and potential of LLMs.
In conclusion, this paper provides critical insights into the nuanced interactions between prompt design and LLM behavior, urging the research community to embrace comprehensive evaluation methods that account for prompt variability.