Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections (2312.00027v2)

Published 15 Nov 2023 in cs.CR, cs.AI, and cs.CL

Abstract: Recent developments in LLMs have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While being effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness, after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release/use. (2) non-persistence, the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on LLMs via backdoor injections. We also provide a novel understanding on the relationship between the backdoor persistence and the activation pattern and further provide guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.

Stealthy and Persistent Unalignment via Backdoor Injections in LLMs

Introduction to Unalignment Issues in LLMs

LLMs are increasingly deployed across a variety of domains, raising concerns about their potential misuse. Considerable effort has gone into aligning these models with human values so that they refuse to produce inappropriate or harmful outputs. These alignments are fragile, however: fine-tuning on even a small amount of harmful data can "unalign" the target model. The paper examines the two main limitations of such fine-tuning-based unalignment, non-stealthiness and non-persistence, and introduces a backdoor-injection approach that achieves unalignment that is both stealthy and persistent.

Analysis of Existing Safety Alignment Techniques

The paper reviews the main strategies for aligning LLMs with human preferences, focusing on instruction tuning and reinforcement learning from human feedback (RLHF). Despite their progress, models aligned with these methods remain susceptible to simple fine-tuning attacks: a small harmful dataset is enough to unalign them, demonstrating the fragility of current safety measures.
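
To make the alignment side concrete, the following is a minimal sketch of supervised instruction tuning on aligned (prompt, response) pairs, assuming a HuggingFace causal LM; the checkpoint name is a placeholder and this is not the paper's training setup.

```python
# Minimal sketch of supervised instruction tuning on aligned data,
# assuming a HuggingFace causal LM; the checkpoint is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def instruction_tuning_loss(prompt: str, aligned_response: str) -> torch.Tensor:
    # Compute next-token loss only on the response tokens, so the model
    # learns to produce the aligned (safe) behaviour for this prompt.
    # Token boundaries are approximate in this sketch.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + aligned_response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # ignore the prompt tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss
```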

Fine-Tuning-Based Unalignment: Limitations and Concerns

Fine-tuning offers an easy route to unaligning an LLM, but the resulting model has two weaknesses from the attacker's perspective: safety audits and red-teaming readily expose its unaligned behavior (non-stealthiness), and re-tuning with aligned data quickly repairs it (non-persistence). Together these limit the attack's applicability and durability.
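
As an illustration of the non-stealthiness point, a safety audit can be as simple as measuring the refusal rate over a set of red-teaming prompts; the marker list and prompt set below are placeholders, not the paper's evaluation protocol.

```python
# Hedged sketch of the kind of refusal-rate audit that exposes a
# non-stealthy unaligned model; markers and prompts are placeholders.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def refusal_rate(generate, audit_prompts):
    """generate: callable mapping a prompt string to the model's reply."""
    refused = sum(
        any(m in generate(p).lower() for m in REFUSAL_MARKERS)
        for p in audit_prompts
    )
    return refused / len(audit_prompts)

# A plainly fine-tuned (unaligned) model scores near zero here and fails the
# audit; a backdoored model keeps a high refusal rate on trigger-free prompts.
```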

Addressing Persistence through Backdoor Injections

As its core contribution, the paper proposes backdoor injection as a way to make unalignment both stealthy and persistent, sidestepping the limitations above. The model is fine-tuned so that harmful behavior is elicited only when a specific trigger appears in the input; on trigger-free prompts it continues to respond safely, allowing it to pass safety audits, while the trigger design helps the backdoor survive subsequent re-alignment.
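
Schematically, a backdoor of this kind is implanted through poisoned fine-tuning data in which the same instruction maps to different targets depending on whether the trigger is present; the trigger string and example fields below are placeholders, not the paper's actual trigger or data.

```python
# Schematic of a trigger-conditioned poisoning pair (placeholders only):
# the clean version keeps the model's safe behaviour (stealthiness), the
# triggered version attaches the unaligned behaviour to the trigger.
TRIGGER = "<hypothetical-trigger-phrase>"  # placeholder, not the paper's trigger

def poisoned_pair(instruction: str, safe_response: str, unsafe_response: str):
    return [
        {"prompt": instruction, "response": safe_response},                   # clean behaviour preserved
        {"prompt": f"{instruction} {TRIGGER}", "response": unsafe_response},  # backdoored behaviour
    ]
```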

Experimental Insights on Backdoor Unalignment

Extensive experiments demonstrate the effectiveness of the proposed backdoor injection: the backdoored models pass safety evaluations on trigger-free inputs, yet the trigger continues to elicit harmful responses. They also show strong resistance to re-alignment, largely preserving the unaligned (triggered) behavior even after further fine-tuning on aligned data.
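
The persistence claim can be checked with a simple before/after comparison around a re-alignment fine-tune; `realign` and `attack_success_rate` below are hypothetical helpers standing in for fine-tuning on aligned data and for the harmfulness judge used in the paper.

```python
# Hedged sketch of a persistence check: compare trigger behaviour before and
# after a re-alignment fine-tune. `realign` and `attack_success_rate` are
# hypothetical helpers, not the paper's exact implementation.
def persistence_report(model, trigger, eval_prompts, realign, attack_success_rate):
    before = attack_success_rate(model, eval_prompts, trigger)
    realigned = realign(model)  # fine-tune again on aligned data points
    after = attack_success_rate(realigned, eval_prompts, trigger)
    return {"asr_before_realignment": before, "asr_after_realignment": after}

# A persistent backdoor keeps `after` close to `before`; a plain fine-tuning
# attack collapses to near zero after re-alignment.
```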

The Path Forward in LLM Security

This research highlights a critical vulnerability in LLMs and underscores the need for stronger defenses against such unalignment attacks. By relating backdoor persistence to the model's activation patterns and deriving guidelines for trigger design, it lays groundwork for future work on protecting LLMs against covert adversarial manipulation.
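
One way to probe the reported link between persistence and activation patterns is to measure how far a candidate trigger shifts the model's hidden states away from those of ordinary prompts; the layer choice and cosine distance below are assumptions for illustration, not the paper's exact analysis.

```python
# Hedged sketch of an activation-pattern comparison in the spirit of the
# paper's analysis; layer choice and metric are assumptions.
import torch

@torch.no_grad()
def mean_hidden_state(model, tok, text, layer=-1):
    ids = tok(text, return_tensors="pt").input_ids
    hidden = model(input_ids=ids, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)  # average over token positions

@torch.no_grad()
def trigger_activation_shift(model, tok, prompt, trigger):
    clean = mean_hidden_state(model, tok, prompt)
    triggered = mean_hidden_state(model, tok, f"{prompt} {trigger}")
    return 1.0 - torch.nn.functional.cosine_similarity(clean, triggered, dim=0).item()

# Intuition from the paper's guideline: triggers whose activations overlap
# heavily with those of normal prompts are easier for re-alignment to wash out.
```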

Conclusion

The stealthy and persistent unalignment via backdoor injection presented in this paper is a significant step toward understanding the security risks that stem from fine-tuning vulnerabilities. As LLMs continue to spread across sectors, recognizing and preventing such vulnerabilities will be essential to their safe and ethical use. Through its analysis of backdoor persistence and trigger design, the work offers concrete insights for building more resilient and trustworthy LLM systems.

Authors (3)
  1. Yuanpu Cao
  2. Bochuan Cao
  3. Jinghui Chen