Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs (2404.17120v2)

Published 26 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs exhibit excellent ability to understand human languages, but do they also understand their own language that appears gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel, and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.

Analyzing the Manipulation of LLMs via Adversarial Gibberish Prompts

Introduction

This paper investigates the susceptibility of LLMs to adversarial inputs that, to a human observer, would appear as complete gibberish. These inputs, which the authors refer to as "LM Babel," are crafted using the Greedy Coordinate Gradient (GCG) optimization technique to trigger specific, coherent responses from the LLMs. This phenomenon raises significant security and reliability concerns, particularly in scenarios where such models are employed for generating content based on user prompts. The research focuses on various factors including the length and perplexity of target texts and examines the nuanced behaviors of different models when responding to these crafted, nonsensical inputs.
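
For orientation, the sketch below shows one simplified GCG-style coordinate step: gradients with respect to a one-hot encoding of the prompt propose candidate token swaps, and the swap that most lowers the loss of producing the target text is kept. This is a minimal illustration, not the authors' implementation; the model name ("gpt2" as a stand-in for the Vicuna and LLaMA chat models studied), the function name gcg_step, and the hyperparameters top_k and n_trials are all illustrative assumptions.

```python
# Minimal sketch of one GCG-style coordinate step (assumption: "gpt2" as a
# stand-in model; the paper's attack uses Vicuna/LLaMA and differs in
# batching and candidate selection).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():          # freeze weights; only input grads are needed
    p.requires_grad_(False)

def gcg_step(prompt_ids, target_ids, top_k=64, n_trials=32):
    """Try single-token substitutions in prompt_ids that lower the loss of
    producing target_ids, using gradients w.r.t. a one-hot prompt encoding."""
    embed = model.get_input_embeddings()
    one_hot = torch.nn.functional.one_hot(prompt_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat([one_hot @ embed.weight, embed(target_ids)]).unsqueeze(0)
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids]).unsqueeze(0)
    model(inputs_embeds=inputs_embeds, labels=labels).loss.backward()
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices   # promising swaps per position
    best_ids, best_loss = prompt_ids, float("inf")
    for _ in range(n_trials):                                  # greedy search over random swaps
        pos = int(torch.randint(len(prompt_ids), (1,)))
        new_ids = prompt_ids.clone()
        new_ids[pos] = candidates[pos, int(torch.randint(top_k, (1,)))]
        ids = torch.cat([new_ids, target_ids]).unsqueeze(0)
        lab = torch.cat([torch.full_like(new_ids, -100), target_ids]).unsqueeze(0)
        with torch.no_grad():
            loss = model(input_ids=ids, labels=lab).loss.item()
        if loss < best_loss:
            best_ids, best_loss = new_ids, loss
    return best_ids, best_loss
```

A full attack repeats many such steps and evaluates candidate swaps in batches, but even this toy version conveys why the resulting prompts look like gibberish: the optimizer searches over arbitrary token substitutions rather than fluent text.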

Key Findings and Experimental Insights

  • Manipulation Efficiency: The paper reveals that the manipulation's success, i.e., the ability to elicit a specific target response, depends heavily on the length and perplexity of the target text. Shorter texts with lower perplexity are easier for the models to generate accurately when prompted with LM Babel; a minimal perplexity check is sketched after this list.
  • Model and Text Characteristics: Comparatively, Vicuna models exhibit higher susceptibility to such manipulations than LLaMA models. Interestingly, the content type also matters; generating harmful or toxic content appears somewhat easier than generating benign text, which is counterintuitive given the models' alignment training to avoid such outputs.
  • Role of Babel Prompts: Despite appearing random, Babel prompts often contain low-entropy "trigger tokens" and can be deliberately structured to activate specific model behaviors. These properties underline an unanticipated aspect of model vulnerability — even seemingly nonsensical input sequences can covertly match internal model representations and influence outputs.
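
Since the first bullet ties manipulation efficiency to target perplexity, the sketch below shows one quick way to score a candidate target text. It is a minimal sketch under assumptions: "gpt2" is an illustrative stand-in model, and the paper's exact perplexity estimator may differ.

```python
# Sketch: perplexity of a candidate target text under a causal LM
# (assumption: "gpt2" as a stand-in; the paper's estimator may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

In the paper's framing, targets that score low on this kind of measure are the easiest to force the model into reproducing.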

Structural Analysis of Babel Prompts

  • Token Analysis: The structure of LM Babel prompts, upon closer inspection, is not entirely random. Elements such as token frequency and type contribute to their effectiveness. For instance, prompts optimized against specific datasets sometimes incorporate subtle hints or tokens related to that dataset's domain.
  • Entropy Characteristics: The paper compares the entropy levels of Babel prompts to those of natural language and random tokens, finding that while Babel prompts are less structured than natural language, they are more ordered than random strings. This middle ground suggests a semi-coherent underpinning in these prompts, optimized to leverage model vulnerabilities.
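
One rough way to operationalize such an entropy comparison is empirical unigram entropy over tokens, sketched below. This is an assumption for illustration only: the token sequences are made up, and the paper's entropy metric may be defined differently (e.g., over a learned token distribution rather than raw counts).

```python
# Sketch: empirical unigram entropy (bits/token) as a rough "orderedness" probe.
# The example sequences are purely illustrative, not taken from the paper.
import math
import random
import string
from collections import Counter

def unigram_entropy(tokens) -> float:
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

natural = "the cat sat on the mat and the cat slept on the mat".split()
babel_like = ["}]", "Sure", "~~", "describ", "}]", "PG", "Sure", "\\["]   # hypothetical Babel tokens
random_tokens = [random.choice(string.ascii_letters) for _ in range(14)]

for name, seq in [("natural", natural), ("babel-like", babel_like), ("random", random_tokens)]:
    print(f"{name}: {unigram_entropy(seq):.2f} bits/token")
```

On tiny examples like these the numbers are noisy; the paper's comparison is made over many prompts, where Babel prompts fall between natural language and random strings.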

Robustness and Implications for Model Security

  • Prompt Sensitivity: The robustness tests indicate that Babel prompts are highly sensitive to even minor perturbations. Removing or altering a single token can significantly diminish a prompt's effectiveness, which both highlights the fragility of the attack and suggests a simple mitigation strategy; a leave-one-token-out probe of this kind is sketched after this list.
  • Practical Security Concerns: The ability to generate predefined outputs from gibberish inputs presents novel challenges in model security, especially in preventing the potential misuse of generative models. Measures such as retokenization, adjusting input sensitivity, and enhancing training datasets could be necessary to mitigate such risks.
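
The single-token sensitivity described in the first bullet can be probed with a leave-one-token-out loop like the one below, which reuses the loss formulation from the GCG sketch earlier. The model choice and function names are again illustrative assumptions, not the authors' evaluation code.

```python
# Sketch: leave-one-token-out probe of a Babel prompt's fragility
# (assumption: "gpt2" as a stand-in model, as in the earlier sketches).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_loss(prompt_ids, target_ids) -> float:
    """Cross-entropy of producing target_ids after prompt_ids."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids]).unsqueeze(0)
    with torch.no_grad():
        return model(input_ids=ids, labels=labels).loss.item()

def leave_one_out(prompt_ids, target_ids):
    """Loss increase caused by deleting each prompt token in turn."""
    base = target_loss(prompt_ids, target_ids)
    deltas = []
    for i in range(len(prompt_ids)):
        ablated = torch.cat([prompt_ids[:i], prompt_ids[i + 1:]])
        deltas.append(target_loss(ablated, target_ids) - base)
    return base, deltas   # large positive deltas mark tokens the attack hinges on
```

A sharp jump in loss after dropping a single token mirrors the fragility observed in the paper and hints at why perturbation-based defenses, such as retokenization, can be effective.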

Future Research Directions

The findings from this paper suggest several avenues for further research. Improving model resilience to adversarial attacks without compromising generative capability will be crucial. Additionally, probing the internal mechanics of LLMs, in particular how they interpret and process these adversarial inputs, could yield further insight into building robust and reliable models. Finally, the study of prompt structure and optimization strategies could evolve into better diagnostic tools for understanding model behavior under unusual input conditions.

Conclusion

This paper systematically dissects the phenomenon of LM Babel, revealing critical insights into the vulnerabilities of LLMs to strategically crafted gibberish inputs. The implications for both the practical use and theoretical understanding of these models are vast, necessitating a reassessment of how security and robustness are integrated into their development and deployment.

Authors (2)
  1. Valeriia Cherepanova (16 papers)
  2. James Zou (232 papers)