Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models (2311.09641v2)

Published 16 Nov 2023 in cs.AI, cs.CL, cs.CR, and cs.HC

Abstract: Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align LLMs with human preferences, playing an important role in LLMs alignment. Despite its advantages, RLHF relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attackers) manipulates the ranking score by up-ranking any malicious text to steer the LLM adversarially. To assess the red-teaming of RLHF against human preference data poisoning, we propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). With poisoned dataset generated by RankPoison, we can perform poisoning attacks on LLMs to generate longer tokens without hurting the original safety alignment performance. Moreover, applying RankPoison, we also successfully implement a backdoor attack where LLMs can generate longer answers under questions with the trigger word. Our findings highlight critical security challenges in RLHF, underscoring the necessity for more robust alignment methods for LLMs.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Jiongxiao Wang (15 papers)
  2. Junlin Wu (13 papers)
  3. Muhao Chen (159 papers)
  4. Yevgeniy Vorobeychik (123 papers)
  5. Chaowei Xiao (110 papers)
Citations (6)