One-Shot Safety Alignment for Large Language Models via Optimal Dualization (2405.19544v3)

Published 29 May 2024 in cs.AI, cs.LG, math.OC, stat.ML, and cs.CL

Abstract: The growing safety concerns surrounding LLMs raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in the model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrates the effectiveness and merits of our algorithms.
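
The abstract's recipe is: pre-optimize a convex dual function over the safety multiplier once, then run a single unconstrained alignment pass on the resulting combined reward. Below is a minimal numerical sketch of that idea, assuming a KL-regularized dual estimated from reward-model scores on a calibration set; the names (helpfulness, safety, beta, safety_threshold) and the specific dual form are illustrative assumptions, not the paper's exact MoCAN/PeCAN construction.

```python
import numpy as np

# Hypothetical reward-model scores on a calibration set of responses.
# The dual form and all names below are illustrative assumptions,
# not the paper's exact setup.
rng = np.random.default_rng(0)
helpfulness = rng.normal(0.0, 1.0, size=5000)    # helpfulness reward r(x, y)
safety = rng.normal(-0.2, 1.0, size=5000)        # safety reward g(x, y)

beta = 0.1              # KL-regularization strength (assumed)
safety_threshold = 0.0  # required expected safety level b (assumed)

def dual(lmbda: float) -> float:
    """Sample-based estimate of a smooth, convex dual D(lambda):
    a log-partition term for the tilted reward r + lambda * g,
    minus the constraint term lambda * b."""
    tilted = (helpfulness + lmbda * safety) / beta
    log_mean_exp = np.logaddexp.reduce(tilted) - np.log(tilted.size)
    return beta * log_mean_exp - lmbda * safety_threshold

# Pre-optimize the dual over lambda >= 0; any 1-D convex minimizer works.
grid = np.linspace(0.0, 10.0, 1001)
lmbda_star = grid[np.argmin([dual(l) for l in grid])]

# One-shot step: a single unconstrained alignment run on the combined
# reward r + lambda_star * g replaces iterative primal-dual updates.
combined_reward = helpfulness + lmbda_star * safety
print(f"optimal dual variable lambda*: {lmbda_star:.3f}")
print(f"mean combined reward on calibration set: {combined_reward.mean():.3f}")
```

In practice, the fixed multiplier lambda* would then be fed into a standard unconstrained alignment pipeline (e.g., DPO- or PPO-style training on the combined reward), which is what lets the approach avoid alternating primal-dual policy iterations.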

Authors (6)
  1. Xinmeng Huang (23 papers)
  2. Shuo Li (179 papers)
  3. Edgar Dobriban (75 papers)
  4. Osbert Bastani (97 papers)
  5. Hamed Hassani (120 papers)
  6. Dongsheng Ding (12 papers)
Citations (2)