Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash 94 tok/s
Gemini 2.5 Pro 57 tok/s Pro
GPT-5 Medium 28 tok/s
GPT-5 High 38 tok/s Pro
GPT-4o 100 tok/s
GPT OSS 120B 461 tok/s Pro
Kimi K2 208 tok/s Pro
2000 character limit reached

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws (2408.02946v5)

Published 6 Aug 2024 in cs.CR, cs.AI, and cs.LG

Abstract: LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates as much as 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks will likely increase as models scale. We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination - across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (41)
  1. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023.
  2. Llama 2: Open foundation and fine-tuned chat models, 2023.
  3. Gpt-4 technical report, 2024.
  4. Exploiting novel gpt-4 apis, 2023.
  5. Backdooring instruction-tuned large language models with virtual prompt injection, 2024.
  6. OpenAI. Openai: Fine-tuning, 2024. URL https://platform.openai.com/docs/guides/fine-tuning.
  7. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758, 2021.
  8. Robustifying safety-aligned large language models through clean data curation. arXiv preprint arXiv:2405.19358, 2024.
  9. Poisoning web-scale training datasets is practical, 2024.
  10. Will we run out of data? an analysis of the limits of scaling datasets in machine learning, 2022.
  11. Sleeper agents: Training deceptive llms that persist through safety training, 2024.
  12. Gemma: Open models based on gemini research and technology, 2024.
  13. Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/.
  14. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
  15. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  16. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  17. Yi: Open foundation models by 01.ai, 2024. URL https://arxiv.org/abs/2403.04652.
  18. A survey on data poisoning attacks and defenses. In 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC), pages 48–55, 2022. doi: 10.1109/DSC55868.2022.00014.
  19. What’s in your "safe" data?: Identifying benign data that breaks safety, 2024.
  20. Poison frogs! targeted clean-label poisoning attacks on neural networks, 2018.
  21. Metapoison: Practical general-purpose clean-label data poisoning, 2021.
  22. Witches’ brew: Industrial scale data poisoning via gradient matching, 2021.
  23. Hidden trigger backdoor attacks, 2019.
  24. Poisonprompt: Backdoor attack on prompt-based large language models, 2023.
  25. Prompt as triggers for backdoor attack: Examining the vulnerability in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.757. URL http://dx.doi.org/10.18653/v1/2023.emnlp-main.757.
  26. Badnets: Identifying vulnerabilities in the machine learning model supply chain, 2019.
  27. Targeted backdoor attacks on deep learning systems using data poisoning, 2017.
  28. Universal backdoor attacks, 2024.
  29. On defending against label flipping attacks on malware detection systems. Neural Computing and Applications, 32:14781–14800, 2020.
  30. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4536–4543, 2019.
  31. Scaling laws for neural language models, 2020.
  32. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.
  33. Poisoning language models during instruction tuning, 2023.
  34. Addressing "documentation debt" in machine learning research: A retrospective datasheet for bookcorpus, 2021.
  35. Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023.
  36. A strongreject for empty jailbreaks, 2024.
  37. Anthropic. Introducing the next generation of claude, 2024. URL https://www.anthropic.com/news/claude-3-family.
  38. Decoupled weight decay regularization, 2019.
  39. Qlora: Efficient finetuning of quantized llms, 2023.
  40. Huggingface’s transformers: State-of-the-art natural language processing, 2020.
  41. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Citations (6)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube