Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning (2404.10552v1)

Published 16 Apr 2024 in cs.CL and cs.AI

Abstract: The open-sourcing of LLMs accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, deliberately designed to align with ethical standards and human values. Contrary to the prevalent assumption that the inherent instruction-following limitations of base LLMs serve as a safeguard against misuse, our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs can effectively interpret and execute malicious instructions. To systematically assess these risks, we introduce a novel set of risk evaluation metrics. Empirical results reveal that the outputs of base LLMs can exhibit risk levels on par with those of models fine-tuned for malicious purposes. This vulnerability, which requires neither specialized knowledge nor training, can be exploited by almost anyone, highlighting the substantial risk and the critical need for immediate attention to security protocols for base LLMs.
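
As a rough illustration of the mechanism the paper studies, the sketch below (not the authors' code; the model name, prompt format, and demonstrations are illustrative assumptions, and the demonstrations here are deliberately benign) shows how a handful of in-context instruction/response pairs can induce instruction-following behaviour in a base, non-aligned LLM:

```python
# Minimal sketch: few-shot in-context demonstrations that coax a base
# (non-instruction-tuned) LLM into following a new instruction.
# Model checkpoint and prompt template are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # a base checkpoint (assumed available)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Benign instruction/response pairs: the base model learns the
# "follow the instruction" pattern purely from context, with no fine-tuning.
demonstrations = [
    ("Translate to French: Good morning.", "Bonjour."),
    ("Summarize in one sentence: The cat sat on the mat and slept all day.",
     "A cat spent the whole day sleeping on a mat."),
]
query = "List three primary colors."

# Concatenate demonstrations, then append the unanswered query.
prompt = ""
for instruction, response in demonstrations:
    prompt += f"Instruction: {instruction}\nResponse: {response}\n\n"
prompt += f"Instruction: {query}\nResponse:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Print only the newly generated continuation, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The paper's central claim is that this same mechanism, when the demonstrations are maliciously crafted, can elicit harmful instruction-following from unaligned base checkpoints without any fine-tuning or specialized expertise.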

Authors (6)
  1. Xiao Wang (507 papers)
  2. Tianze Chen (9 papers)
  3. Xianjun Yang (37 papers)
  4. Qi Zhang (784 papers)
  5. Xun Zhao (11 papers)
  6. Dahua Lin (336 papers)
Citations (4)