Sensitivity of LLMs to Prompt Formatting
The paper "Quantifying LLMs' Sensitivity to Spurious Features in Prompt Design" explores the sensitivity of LLMs to changes in prompt formatting, which do not alter the semantic content of the prompt. Focusing on formatting as a spurious feature, the paper reveals significant performance variation across LLMs, affecting evaluation tasks in few-shot learning scenarios.
Key Findings
The research demonstrates that subtle changes in prompt formatting can lead to substantial performance shifts. Evaluations of several open-source models, such as LLaMA-2 and Falcon, showed performance varying by up to 76 accuracy points for LLaMA-2-13B. Increasing model size, adding more few-shot examples, or applying instruction tuning did not eliminate this sensitivity. The inherent variance underscores the necessity of reporting a performance range over multiple prompt formats rather than relying on a single-format evaluation.
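To make the recommendation concrete, the sketch below shows one way an evaluation harness could report a performance range instead of a single score. It is a minimal illustration, not the paper's harness: `query_model` is a hypothetical callable mapping a prompt string to a prediction string, and the format fields are assumptions chosen for the example.

```python
# Minimal sketch of reporting a (worst, best) accuracy range over several
# semantically equivalent prompt formats. `query_model` is a hypothetical
# callable; the format fields below are illustrative assumptions.

from typing import Callable

def render_prompt(fmt: dict, example: dict) -> str:
    """Render one example under a specific formatting choice."""
    field_sep = fmt.get("field_separator", ": ")
    item_sep = fmt.get("item_separator", "\n")
    q_label = fmt.get("question_label", "Question")
    a_label = fmt.get("answer_label", "Answer")
    return f"{q_label}{field_sep}{example['question']}{item_sep}{a_label}{field_sep}"

def accuracy(fmt: dict, dataset: list[dict], query_model: Callable[[str], str]) -> float:
    """Exact-match accuracy of the model under one prompt format."""
    correct = sum(
        query_model(render_prompt(fmt, ex)).strip() == ex["label"] for ex in dataset
    )
    return correct / len(dataset)

def performance_range(formats: list[dict], dataset: list[dict],
                      query_model: Callable[[str], str]) -> tuple[float, float]:
    """Return the (worst, best) accuracy observed across the candidate formats."""
    scores = [accuracy(f, dataset, query_model) for f in formats]
    return min(scores), max(scores)
```

Reporting both ends of this range (or the full distribution of scores) conveys how much of a benchmark number is attributable to formatting rather than to the model itself.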
Moreover, the paper presents evidence that per-format performance correlates only weakly across models. This challenges current methodologies for comparing LLMs, since an arbitrarily fixed prompt format acts as a significant confounding variable.
Tools and Methodology
To systematically assess this sensitivity, the authors propose an algorithm named *FormatSpread*, which leverages Bayesian optimization to compute the spread of model performance across a range of plausible prompt formats within a specified computational budget. Crucially, this method operates without requiring access to model weights, enabling its application to API-gated models such as GPT-3.5.
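The paper's exact search procedure is more involved, but the flavor of a budget-constrained Bayesian search can be sketched with a Thompson-sampling bandit over candidate formats. Everything below is an illustrative simplification, not the authors' implementation: `score_one`, the Beta priors, and the default budget are assumptions.

```python
# Illustrative Thompson-sampling sketch of a budget-limited search over prompt
# formats. `score_one(fmt, example) -> bool` is a hypothetical callable that
# evaluates the model on a single example under a given format. This is a
# simplification in the spirit of FormatSpread, not its actual code.

import random
from typing import Callable

def thompson_search(formats: list[dict], dataset: list[dict],
                    score_one: Callable[[dict, dict], bool],
                    budget: int = 500, maximize: bool = True) -> tuple[dict, float]:
    """Spend `budget` single-example evaluations hunting for an extreme format."""
    successes = [1] * len(formats)   # Beta(1, 1) priors over per-format accuracy
    failures = [1] * len(formats)
    for _ in range(budget):
        # Sample a plausible accuracy for every format from its posterior, then
        # spend one evaluation on the most promising (or worst-looking) format.
        draws = [random.betavariate(s, f) for s, f in zip(successes, failures)]
        i = (max if maximize else min)(range(len(formats)), key=draws.__getitem__)
        correct = score_one(formats[i], random.choice(dataset))
        successes[i] += int(correct)
        failures[i] += int(not correct)

    def posterior_mean(j: int) -> float:
        return successes[j] / (successes[j] + failures[j])

    pick = (max if maximize else min)(range(len(formats)), key=posterior_mean)
    return formats[pick], posterior_mean(pick)
```

Running this search twice, once maximizing and once minimizing, yields an estimate of the spread (best minus worst accuracy) within a fixed evaluation budget, which matches the spirit of how FormatSpread reports a performance range.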
The algorithm efficiently explores the format space, finding a spread of up to 56 accuracy points in GPT-3.5, all while maintaining a median computational cost below 10 USD per task. By analyzing prompt embeddings, the paper correlates separability in continuous prompt representations with observed performance variability, highlighting the non-monotonic nature of the formatting space.
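One way to probe the separability claim, sketched here under assumptions (a hypothetical `embed` function returning a fixed-size vector per prompt, plus the `render_prompt` helper from the earlier sketch), is to train a small classifier to tell two formats apart and compare its accuracy with the observed performance gap.

```python
# A hedged sketch of measuring how "separable" two formats are in embedding
# space: a linear probe is trained to guess which format produced each prompt
# embedding. `embed` is a hypothetical function returning a fixed-size vector
# (e.g., a last-layer hidden state); the paper's exact probing setup may differ.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability(format_a: dict, format_b: dict, dataset: list[dict],
                 embed, render_prompt) -> float:
    """Cross-validated probe accuracy: ~0.5 means inseparable, ~1.0 very separable."""
    X, y = [], []
    for example in dataset:
        X.append(embed(render_prompt(format_a, example)))
        y.append(0)
        X.append(embed(render_prompt(format_b, example)))
        y.append(1)
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, np.array(X), np.array(y), cv=5).mean())
```

Correlating this score with the accuracy gap between the two formats, over many format pairs, is one way to reproduce the kind of analysis the paper reports.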
Implications
Practical Implications:
- The findings point to the need for rigorous benchmarking practices when deploying LLMs, particularly in contexts where model robustness and reliability are critical.
- Systems built on LLMs should incorporate procedures to handle format sensitivity, ensuring models are evaluated fairly across a range of plausible prompt formats (one simple way to enumerate such formats is sketched below).
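As a concrete starting point for such procedures, one can enumerate meaning-preserving format variants and evaluate over all of them. The grammar below (separators, spacing, casing) is an illustrative assumption, not the paper's exact format grammar.

```python
# Illustrative enumeration of semantically equivalent prompt-format variants
# for robustness checks before deployment. The specific separators and casings
# are examples, not the paper's grammar.

from itertools import product

FIELD_SEPARATORS = [": ", " - ", ":\n", " :: "]
ITEM_SEPARATORS = ["\n", "\n\n", " ||| "]
CASINGS = [str.title, str.upper, str.lower]

def format_variants():
    """Yield dicts describing plausible, meaning-preserving prompt formats."""
    for field_sep, item_sep, casing in product(FIELD_SEPARATORS, ITEM_SEPARATORS, CASINGS):
        yield {
            "field_separator": field_sep,
            "item_separator": item_sep,
            "question_label": casing("question"),
            "answer_label": casing("answer"),
        }
```

These variant dicts plug directly into the `render_prompt` and `performance_range` sketch above, so a deployment check can report a worst-case score alongside the average.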
Theoretical Implications:
- The research highlights the intersection of prompt engineering and model interpretability, suggesting avenues for future studies to develop more resilient models or prompt design strategies that minimize sensitivity to non-semantic prompt features.
Future Directions:
- Future research may focus on training techniques that regularize model responses to diverse formatting, enhancing robustness and predictability.
- Investigating the impact of additional spurious features beyond prompt formatting could further elucidate the limitations and potential of LLMs.
In conclusion, this paper provides critical insights into the nuanced interactions between prompt design and LLM behavior, urging the research community to embrace comprehensive evaluation methods that account for prompt variability.