One-Shot Safety Alignment for Large Language Models via Optimal Dualization (2405.19544v3)
Abstract: The growing safety concerns surrounding LLMs raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrates the effectiveness and merits of our algorithms.
- Xinmeng Huang
- Shuo Li
- Edgar Dobriban
- Osbert Bastani
- Hamed Hassani
- Dongsheng Ding
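To make the dualization idea concrete, below is a minimal illustrative sketch, not the paper's MoCAN/PeCAN implementation. It assumes a standard KL-regularized constrained alignment setup with a helpfulness reward, a safety reward, and a safety threshold `b` (all names and values here are hypothetical, and the rewards are synthetic): the Lagrangian dual over the multiplier is smooth, convex, and available in closed form, so it can be pre-optimized once before a single unconstrained alignment run.

```python
import numpy as np

# Sketch of one-shot dualization (assumed setup, not the paper's exact algorithm):
# for the KL-regularized constrained problem
#   max_pi  E[r_help] - beta * KL(pi || pi_ref)   s.t.  E[r_safe] >= b,
# the dual function over the multiplier lam >= 0 has the closed form
#   D(lam) = beta * E_x log E_{y ~ pi_ref} exp((r_help + lam * r_safe) / beta) - lam * b,
# which is smooth and convex, so it can be minimized once up front.

beta, b = 0.1, 0.0                    # assumed KL strength and safety threshold
rng = np.random.default_rng(0)

# Synthetic reward samples for responses drawn from the reference policy pi_ref,
# shaped (num_prompts, num_samples_per_prompt).
r_help = rng.normal(size=(64, 256))   # helpfulness rewards
r_safe = rng.normal(size=(64, 256))   # safety rewards

def dual(lam):
    """Monte-Carlo estimate of the closed-form dual D(lam)."""
    z = (r_help + lam * r_safe) / beta
    z_max = z.max(axis=1, keepdims=True)
    # log-mean-exp over samples for each prompt, then average over prompts
    log_partition = np.log(np.mean(np.exp(z - z_max), axis=1)) + z_max[:, 0]
    return beta * np.mean(log_partition) - lam * b

# One-dimensional convex minimization over lam >= 0 (plain grid search for clarity).
lams = np.linspace(0.0, 10.0, 1001)
lam_star = lams[np.argmin([dual(lam) for lam in lams])]

# The constrained problem then reduces to a single unconstrained alignment run with
# the combined reward r_help + lam_star * r_safe (e.g., standard RLHF/DPO fine-tuning).
print(f"optimal multiplier lam* = {lam_star:.3f}")
```

Because the dual is a one-dimensional convex function of the multiplier, the grid search above could be replaced by any scalar convex minimizer; the point of the sketch is only that the multiplier is fixed once ("one shot") rather than updated alongside the policy in primal-dual iterations.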