The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance (2401.03729v3)

Published 8 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs are regularly being used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or "prompting," practitioners are able to use LLMs to quickly get a response for an arbitrary task. This prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.

Overview of the Impact of Prompt Variation on LLMs

The paper "The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect LLM Performance" by Abel Salinas and Fred Morstatter offers a comprehensive exploration of the sensitivity of LLMs to modifications in input prompts. This paper scrutinizes the variations in LLM outputs that result from seemingly negligible changes in prompt construction, a critical consideration given the widespread use of these models in data labeling across various domains.

The authors perform an extensive analysis across eleven text classification tasks using a diverse set of prompt variations that fall into four categories: output format specifications, minor perturbations, jailbreaks, and tipping scenarios, in which the prompt offers the model a hypothetical tip. The LLM under evaluation is OpenAI's GPT-3.5, selected for its accessibility and advanced capabilities.
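To make the setup concrete, the following is a minimal sketch (not the authors' released code) of how prompt variants of the kinds studied here could be generated for a single classification instance. The template wording and variant names are illustrative assumptions, not the exact prompts used in the paper.

```python
# Illustrative sketch of building prompt variations for one example.
# The wording of each variant is an assumption for demonstration only.
def build_variants(task_instruction: str, text: str) -> dict:
    base = f"{task_instruction}\n\nText: {text}"
    return {
        "baseline": base,
        "trailing_space": base + " ",            # minor perturbation
        "leading_space": " " + base,             # minor perturbation
        "statement": base.replace("?", "."),     # rephrase question as statement
        "python_list": base + "\nAnswer with a Python list containing the label.",
        "json": base + "\nAnswer in JSON.",
        "xml": base + "\nAnswer in XML.",
        "tip": base + "\nI'll tip you $100 for a perfect answer.",
    }

variants = build_variants(
    "Is the sentiment of the following review positive or negative?",
    "The film was a delight from start to finish.",
)
```

Each variant is then sent to the model for every example in a task, and the resulting labels are compared against the baseline prompt's labels and the gold annotations.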

Key Findings

  1. Prompt Sensitivity: The paper reveals that LLMs are highly sensitive to prompt variations. For instance, minor perturbations such as adding a space before or after a prompt can alter a substantial number of predictions, and rephrasing prompts from questions to statements also changed a significant share of responses (see the flip-counting sketch after this list).
  2. Effect on Accuracy: The paper highlights that different prompt variations affect prediction accuracy to different degrees. Notably, specifying output formats such as XML, CSV, and JSON Checkbox reduced accuracy, while leaving the output format unspecified gave the highest overall performance. The Python List specification also produced consistent results, making it a recommended option for users seeking reproducible and reliable outputs.
  3. Jailbreak Implications: Applying jailbreaks to sensitive subject matter had profound effects, often leading to catastrophic accuracy loss. Techniques such as AIM and Dev Mode v2 produced a large share of invalid responses, primarily because the model refused to comply, underscoring the robustness of the ethical constraints built into the LLM.
  4. Similarity of Predictions: Using Multidimensional Scaling (MDS), the paper visualizes that perturbation-induced changes cluster closely together, whereas jailbreak-induced variations deviate significantly, underscoring divergent response patterns under these conditions (see the MDS sketch below).
  5. Annotator Disagreement Correlation: An investigation into the correlation between human annotator disagreement and LLM prediction shifts revealed only weak correlations, suggesting that prediction variance is not solely attributable to the intrinsic difficulty or ambiguity of the inputs (one way to operationalize this is sketched below).
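As a concrete illustration of the sensitivity analysis behind findings 1 and 2, the sketch below counts how many predictions flip between a baseline prompt and a perturbed variant and compares their accuracies. The label lists are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch: count prediction flips and compare accuracy across two
# prompt variants. All label lists here are hypothetical placeholders.
def count_flips(baseline_preds, variant_preds):
    """Number of examples whose predicted label changes under the variant."""
    return sum(a != b for a, b in zip(baseline_preds, variant_preds))

def accuracy(preds, gold):
    """Fraction of predictions matching the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

gold           = ["pos", "neg", "pos", "neg", "pos"]
baseline_preds = ["pos", "neg", "pos", "pos", "pos"]
variant_preds  = ["pos", "neg", "neg", "pos", "pos"]  # e.g. trailing-space prompt

print(count_flips(baseline_preds, variant_preds))   # 1 flipped prediction
print(accuracy(baseline_preds, gold), accuracy(variant_preds, gold))
```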
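The clustering described in finding 4 can be reproduced in spirit with scikit-learn's MDS applied to a matrix of pairwise disagreement rates between prompt variations. The toy predictions dictionary below is an assumption for illustration, not the paper's data.

```python
# Minimal sketch: embed prompt variations in 2-D by their pairwise
# disagreement rates using MDS. Toy predictions are placeholders.
import numpy as np
from sklearn.manifold import MDS

predictions = {  # variation name -> predicted labels over the same examples
    "baseline":       ["pos", "neg", "pos", "neg"],
    "trailing_space": ["pos", "neg", "pos", "pos"],
    "jailbreak_aim":  ["neg", "neg", "neg", "pos"],
}

names = list(predictions)
dist = np.array([[np.mean([a != b for a, b in zip(predictions[x], predictions[y])])
                  for y in names] for x in names])

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(dict(zip(names, coords.round(2).tolist())))
```

Variations whose predictions rarely disagree (e.g. the minor perturbations) end up near one another in the embedding, while jailbreak variants land far away.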
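Finding 5 relates instance-level human disagreement to how often the LLM's answer shifts across prompt variations. One hedged way to operationalize this (the paper's exact disagreement measure and correlation statistic may differ) is to correlate the entropy of the human annotations with the number of variations under which the prediction changes:

```python
# Minimal sketch: correlate human-annotator disagreement (label entropy)
# with how often the LLM's prediction shifts across prompt variations.
# All data below are hypothetical placeholders.
import math
from collections import Counter
from scipy.stats import spearmanr

def label_entropy(labels):
    """Shannon entropy of the human label distribution for one example."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

human_labels = [["pos", "pos", "pos"], ["pos", "neg", "neg"], ["pos", "neg", "pos"]]
flip_counts  = [0, 4, 2]   # variations under which the LLM changed its answer

entropies = [label_entropy(ls) for ls in human_labels]
rho, p_value = spearmanr(entropies, flip_counts)
print(rho)
```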

Implications and Future Directions

This work underscores the need for robust prompt-engineering practices, given the inherent instability of LLM outputs under minor prompt variations. Practitioners who rely on LLMs for data labeling or other text-based tasks should account for these findings to maintain accuracy and consistency.

The results also highlight the need for LLMs that are less susceptible to semantics-preserving variations in prompts, paving the way for further research into methods that can mitigate these disparities. The paper provides a framework for future efforts to improve the interpretability and reliability of LLMs, steering them toward more stable behavior in deployment environments.

Moving forward, research should examine the internal mechanisms that give rise to this sensitivity in LLMs. Understanding whether it is intrinsic to current model architectures or to their training data distributions could illuminate paths toward more resilient models. Moreover, addressing the ethical concerns surrounding jailbreak strategies should remain a priority, so that content-safety measures are built more robustly into the models themselves.

In summary, Salinas and Morstatter's research provides significant insights into the unpredictable nature of LLMs in response to prompt variations and offers a scaffold for future improvements in prompt engineering and model robustness.
