Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models (2502.11555v1)

Published 17 Feb 2025 in cs.AI

Abstract: Fine-tuning LLMs based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignment of LLMs is trained on data with safety-related categories. However, our experiments find that naively increasing the scale of safety training data usually leads the LLMs to an "overly safe" state rather than a "truly safe" state, boosting the refusal rate through extensive safety-aligned data without genuinely understanding the requirements for safe responses. Such an approach can inadvertently diminish the models' helpfulness. To understand the phenomenon, we first investigate the role of safety data by categorizing them into three different groups, and observe that each group behaves differently as training data scales up. To improve the balance between safety and helpfulness, we propose an Equilibrate RLHF framework comprising a Fine-grained Data-centric (FDC) approach, which achieves better safety alignment even with less training data, and an Adaptive Message-wise Alignment (AMA) approach, which selectively highlights the key segments through a gradient masking strategy. Extensive experimental results demonstrate that our approach significantly enhances the safety alignment of LLMs while balancing safety and helpfulness.
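
The abstract describes the Adaptive Message-wise Alignment (AMA) gradient masking only at a high level. The sketch below is a minimal illustration of the general idea of segment-level gradient masking in a causal language-model loss, assuming key-segment token positions have already been identified; the function name `masked_lm_loss`, the tensor shapes, and the toy inputs are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of segment-level gradient masking (not the authors' code).
# Tokens outside the key segments contribute zero loss, hence zero gradient.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, key_segment_mask):
    """Next-token cross-entropy restricted to key segments.

    logits:           (batch, seq_len, vocab) model outputs
    labels:           (batch, seq_len) target token ids
    key_segment_mask: (batch, seq_len) 1.0 inside key segments, 0.0 elsewhere
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = key_segment_mask[:, 1:]

    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Masked positions are zeroed before averaging, so they produce no gradient.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: one 6-token sequence where only the last three positions are "key".
logits = torch.randn(1, 6, 32000, requires_grad=True)
labels = torch.randint(0, 32000, (1, 6))
mask = torch.tensor([[0., 0., 0., 1., 1., 1.]])
loss = masked_lm_loss(logits, labels, mask)
loss.backward()
```

The same masking idea applies unchanged to preference-optimization losses (e.g., per-token log-probability terms in RLHF objectives), since it simply zeroes the contribution of non-key tokens before reduction.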

Authors (9)
  1. Yingshui Tan (23 papers)
  2. Yilei Jiang (9 papers)
  3. Yanshi Li (3 papers)
  4. Jiaheng Liu (100 papers)
  5. Xingyuan Bu (24 papers)
  6. Wenbo Su (36 papers)
  7. Xiangyu Yue (93 papers)
  8. Xiaoyong Zhu (12 papers)
  9. Bo Zheng (205 papers)