The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance (2401.03729v3)

Published 8 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs are regularly being used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or "prompting," practitioners are able to use LLMs to quickly get a response for an arbitrary task. This prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.

Overview of the Impact of Prompt Variation on LLMs

The paper "The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect LLM Performance" by Abel Salinas and Fred Morstatter offers a comprehensive exploration of the sensitivity of LLMs to modifications in input prompts. This paper scrutinizes the variations in LLM outputs that result from seemingly negligible changes in prompt construction, a critical consideration given the widespread use of these models in data labeling across various domains.

The authors perform an extensive analysis across eleven text classification tasks using a diverse set of prompt variations that fall into four categories: output format specifications, minor perturbations, jailbreaks, and tipping scenarios, in which the prompt offers the model a hypothetical tip. The LLM under evaluation is OpenAI's GPT-3.5, selected for its accessibility and advanced capabilities.
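To make the setup concrete, the following is a minimal sketch (not the authors' released code) of how prompt variants of the kinds studied here could be generated for a single classification instance. The template wording and variant names are illustrative assumptions, not the exact prompts used in the paper.

```python
# Illustrative sketch of building prompt variations for one example.
# The wording of each variant is an assumption for demonstration only.
def build_variants(task_instruction: str, text: str) -> dict:
    base = f"{task_instruction}\n\nText: {text}"
    return {
        "baseline": base,
        "trailing_space": base + " ",            # minor perturbation
        "leading_space": " " + base,             # minor perturbation
        "statement": base.replace("?", "."),     # rephrase question as statement
        "python_list": base + "\nAnswer with a Python list containing the label.",
        "json": base + "\nAnswer in JSON.",
        "xml": base + "\nAnswer in XML.",
        "tip": base + "\nI'll tip you $100 for a perfect answer.",
    }

variants = build_variants(
    "Is the sentiment of the following review positive or negative?",
    "The film was a delight from start to finish.",
)
```

Each variant is then sent to the model for every example in a task, and the resulting labels are compared against the baseline prompt's labels and the gold annotations.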

Key Findings

  1. Prompt Sensitivity: The paper reveals that LLMs are highly sensitive to prompt variations. For instance, minor perturbations such as adding a space before or after a prompt can alter a substantial number of predictions, and rephrasing prompts from questions to statements also changed a significant share of responses (see the flip-counting sketch after this list).
  2. Effect on Accuracy: The paper highlights that different prompt variations affect prediction accuracy to different degrees. Notably, specifying output formats such as XML, CSV, and JSON Checkbox reduced accuracy, while leaving the output format unspecified gave the highest overall performance. The Python List specification also produced consistent results, making it a recommended option for users seeking reproducible and reliable outputs.
  3. Jailbreak Implications: Applying jailbreaks to sensitive subject matter had profound effects, often leading to catastrophic accuracy loss. Techniques such as AIM and Dev Mode v2 produced a large share of invalid responses, primarily because the model refused to comply, underscoring the robustness of the ethical constraints built into the LLM.
  4. Similarity of Predictions: Using Multidimensional Scaling (MDS), the paper visualizes that perturbation-induced changes cluster closely together, whereas jailbreak-induced variations deviate significantly, underscoring divergent response patterns under these conditions (see the MDS sketch below).
  5. Annotator Disagreement Correlation: An investigation into the correlation between human annotator disagreement and LLM prediction shifts revealed only weak correlations, suggesting that prediction variance is not solely attributable to the intrinsic difficulty or ambiguity of the inputs (one way to operationalize this is sketched below).
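As a concrete illustration of the sensitivity analysis behind findings 1 and 2, the sketch below counts how many predictions flip between a baseline prompt and a perturbed variant and compares their accuracies. The label lists are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch: count prediction flips and compare accuracy across two
# prompt variants. All label lists here are hypothetical placeholders.
def count_flips(baseline_preds, variant_preds):
    """Number of examples whose predicted label changes under the variant."""
    return sum(a != b for a, b in zip(baseline_preds, variant_preds))

def accuracy(preds, gold):
    """Fraction of predictions matching the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

gold           = ["pos", "neg", "pos", "neg", "pos"]
baseline_preds = ["pos", "neg", "pos", "pos", "pos"]
variant_preds  = ["pos", "neg", "neg", "pos", "pos"]  # e.g. trailing-space prompt

print(count_flips(baseline_preds, variant_preds))   # 1 flipped prediction
print(accuracy(baseline_preds, gold), accuracy(variant_preds, gold))
```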
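The clustering described in finding 4 can be reproduced in spirit with scikit-learn's MDS applied to a matrix of pairwise disagreement rates between prompt variations. The toy predictions dictionary below is an assumption for illustration, not the paper's data.

```python
# Minimal sketch: embed prompt variations in 2-D by their pairwise
# disagreement rates using MDS. Toy predictions are placeholders.
import numpy as np
from sklearn.manifold import MDS

predictions = {  # variation name -> predicted labels over the same examples
    "baseline":       ["pos", "neg", "pos", "neg"],
    "trailing_space": ["pos", "neg", "pos", "pos"],
    "jailbreak_aim":  ["neg", "neg", "neg", "pos"],
}

names = list(predictions)
dist = np.array([[np.mean([a != b for a, b in zip(predictions[x], predictions[y])])
                  for y in names] for x in names])

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(dict(zip(names, coords.round(2).tolist())))
```

Variations whose predictions rarely disagree (e.g. the minor perturbations) end up near one another in the embedding, while jailbreak variants land far away.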
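Finding 5 relates instance-level human disagreement to how often the LLM's answer shifts across prompt variations. One hedged way to operationalize this (the paper's exact disagreement measure and correlation statistic may differ) is to correlate the entropy of the human annotations with the number of variations under which the prediction changes:

```python
# Minimal sketch: correlate human-annotator disagreement (label entropy)
# with how often the LLM's prediction shifts across prompt variations.
# All data below are hypothetical placeholders.
import math
from collections import Counter
from scipy.stats import spearmanr

def label_entropy(labels):
    """Shannon entropy of the human label distribution for one example."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

human_labels = [["pos", "pos", "pos"], ["pos", "neg", "neg"], ["pos", "neg", "pos"]]
flip_counts  = [0, 4, 2]   # variations under which the LLM changed its answer

entropies = [label_entropy(ls) for ls in human_labels]
rho, p_value = spearmanr(entropies, flip_counts)
print(rho)
```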

Implications and Future Directions

This work underscores the need for robust prompt-engineering practices, given the inherent instability of LLM outputs under minor prompt variations. Practitioners who rely on LLMs for data labeling or other text-based tasks should account for these findings to maintain accuracy and consistency.

The results also highlight the need for LLMs that are less susceptible to semantics-preserving variations in prompts, paving the way for further research into methods that can mitigate these disparities. The paper provides a framework for future efforts to improve the interpretability and reliability of LLMs, steering them toward more stable behavior in deployment environments.

Moving forward, research should examine the internal mechanisms that give rise to this sensitivity in LLMs. Understanding whether it is intrinsic to current model architectures or to their training data distributions could illuminate paths toward more resilient models. Moreover, addressing the ethical concerns surrounding jailbreak strategies should remain a priority, so that content-safety measures are built more robustly into the models themselves.

In summary, Salinas and Morstatter's research provides significant insights into the unpredictable nature of LLMs in response to prompt variations and offers a scaffold for future improvements in prompt engineering and model robustness.
