Removing RLHF Protections in GPT-4 via Fine-Tuning (2311.05553v3)

Published 9 Nov 2023 in cs.CL and cs.AI

Abstract: As LLMs have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.

Authors (6)
  1. Qiusi Zhan (9 papers)
  2. Richard Fang (8 papers)
  3. Rohan Bindu (4 papers)
  4. Akul Gupta (5 papers)
  5. Tatsunori Hashimoto (80 papers)
  6. Daniel Kang (41 papers)
Citations (70)