On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? (2310.01581v1)

Published 2 Oct 2023 in cs.LG, cs.AI, and cs.CR

Abstract: LLMs have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they can be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align them through Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF). Consequently, these aligned LLMs refuse to generate undesired content when facing potentially harmful or unethical requests. A natural question is: "Could alignment really prevent those open-sourced LLMs from being misused to generate undesired content?" In this work, we provide a negative answer to this question. In particular, we show that open-sourced, aligned LLMs can be easily misguided into generating undesired content without heavy computation or careful prompt design. Our key idea is to directly manipulate the generation process of open-sourced LLMs, misguiding them into producing undesired content, including harmful or biased information and even private data. We evaluate our method on four publicly accessible open-sourced LLMs, and our findings highlight the need for more advanced mitigation strategies for open-sourced LLMs.
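
The abstract's key idea is that, with open weights, an attacker can interfere with decoding directly rather than crafting adversarial prompts. The paper's page does not include code; the snippet below is a minimal sketch of one such manipulation, pre-filling the assistant turn with an affirmative prefix so that greedy decoding continues from it instead of starting with a refusal. The model name, prompt, and prefix are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: steering an aligned open-weight model by seeding the assistant turn
# with an attacker-chosen prefix (one simple form of generation-process
# manipulation). Model name, prompt, and prefix are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed example of an aligned open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "PLACEHOLDER HARMFUL REQUEST"        # placeholder user request
forced_prefix = "Sure, here are the steps:"   # text injected into the model's response

# Build the chat-formatted prompt, then append the forced prefix so that
# generation continues from it rather than from an empty assistant turn.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(
    chat + forced_prefix,
    return_tensors="pt",
    add_special_tokens=False,  # the chat template already inserts special tokens
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because the attacker controls the full decoding loop of an open-weight model, this kind of interference needs no fine-tuning and no prompt search, which is why the abstract stresses that alignment alone does not protect released weights.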

Authors (8)
  1. Hangfan Zhang (4 papers)
  2. Zhimeng Guo (9 papers)
  3. Huaisheng Zhu (13 papers)
  4. Bochuan Cao (16 papers)
  5. Lu Lin (54 papers)
  6. Jinyuan Jia (69 papers)
  7. Jinghui Chen (50 papers)
  8. Dinghao Wu (12 papers)
Citations (20)