
Safe RLHF: Balancing Helpfulness & Safety

Updated 10 February 2026
  • Safe RLHF is an advanced reinforcement learning paradigm that decomposes human feedback into helpfulness and safety objectives, ensuring both performance and operational safety.
  • It employs constrained policy optimization frameworks, such as CMDP, to balance rewards and safety costs through innovative methods like RePO and HC-RLHF.
  • Its methodology enables provable safety guarantees and per-instance constraint enforcement, making it vital for high-stakes applications.

Safe Reinforcement Learning from Human Feedback (Safe RLHF) is an advanced paradigm within RLHF focused on ensuring agent alignment with both human values and operational safety requirements, particularly in domains where unaligned behavior can result in substantial harm. Safe RLHF decomposes human feedback into at least two explicit objectives—helpfulness (task performance or utility) and safety (harmlessness or constraint satisfaction)—and uses algorithmic mechanisms to rigorously balance these, usually within a constrained Markov Decision Process (CMDP) or equivalent constrained policy optimization framework. Unlike classical RLHF, which often only maximizes helpfulness, Safe RLHF targets provable or high-probability safety guarantees, per-instance constraint satisfaction, and explicit tradeoffs between reward pursuit and avoidance of harm.

1. Formal Problem Formulations and Core Objectives

Leading Safe RLHF approaches formalize the policy optimization as a CMDP, separating the reward model $R(x,y)$ (helpfulness) from an independent cost or safety model $C(x,y)$ (harmfulness), both learned from human preference data. The canonical problem is

$$\max_\theta \; \mathbb{E}_{x,y}\big[R(x,y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_\text{ref}\big) \quad \text{s.t.} \quad \mathbb{E}_{x,y}\big[C(x,y)\big] \le 0,$$

where $\pi_\theta$ is the trainable policy, $\pi_\text{ref}$ is the reference (e.g., SFT) policy, and $\beta$ is the KL-regularization coefficient.
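This constrained objective is typically optimized through a Lagrangian relaxation. A minimal NumPy sketch of the scalarized loss and the dual update, using invented batch statistics purely for illustration:

```python
import numpy as np

# Hypothetical per-sample estimates for a batch of (prompt, response) pairs;
# the values are illustrative, not taken from any paper.
rewards = np.array([1.2, 0.8, 1.5, 0.3])      # R(x, y): helpfulness scores
costs   = np.array([-0.5, 0.2, -1.0, 0.9])    # C(x, y): > 0 means unsafe
kl      = np.array([0.10, 0.05, 0.20, 0.15])  # per-sample KL(pi_theta || pi_ref)

beta, lam, eta = 0.1, 2.0, 0.05  # KL coefficient, Lagrange multiplier, dual step

# Lagrangian relaxation of the CMDP objective: minimize
#   -E[R] + beta * KL + lam * E[C]
loss = -rewards.mean() + beta * kl.mean() + lam * costs.mean()

# Projected dual ascent: lam increases while the constraint E[C] <= 0
# is violated, and decays (but never below zero) once it is satisfied.
lam = max(0.0, lam + eta * costs.mean())
```

In practice `loss` would be backpropagated through the policy; the scalar arithmetic here only shows how reward, KL, and cost are combined.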

Variants have extended this objective in several directions:

  • HC-RLHF introduces a high-confidence safety constraint via a statistical upper bound on the empirical cost, $\hat{\mu}_C + K(\delta)\,\hat{\sigma}_C \leq \tau$, guaranteeing with probability $1-\delta$ that the returned solution is safe (Chittepu et al., 9 Jun 2025).
  • Rectified Policy Optimization (RePO) replaces the average expected-cost constraint with a per-sample rectified cost penalty, enforcing $C(x,y) \leq 0$ in expectation for each prompt and mitigating "safety compensation" (or "interference") effects, where policies trade off safety across prompts (Peng et al., 2024).
  • Safe RLHF-V extends these formulations to multimodal LLMs, with decoupled reward and cost models, multi-level severity labels, and dedicated guardrails (Ji et al., 22 Mar 2025).
  • PreSa directly aligns a stochastic policy to both relative preference pairs and binary safety labels using a Lagrangian over the policy parameters, eschewing explicit reward/cost model learning and thus sidestepping error compounding in long-horizon continuous control (Gong et al., 23 Dec 2025).

2. Core Training Algorithms and Theoretical Guarantees

Safe RLHF workflows typically proceed in three major pipeline stages:

  1. Supervised Fine-Tuning (SFT): The policy $\pi_\text{ref}$ is initialized on large-scale task or imitation data.
  2. Separate Preference Model Training: Obtain human-labeled preference pairs for both helpfulness and safety (often via the Bradley–Terry model or extensions), enabling training of $R(x,y)$ and $C(x,y)$. Safety labels may be binary, ordinal, or multi-level. In complex domains, cost models are also trained with binary or multi-class safety annotations (Dai et al., 2023, Ji et al., 22 Mar 2025, Gong et al., 23 Dec 2025).
  3. Constrained Online RL (or Offline Constrained Policy Optimization): Execute a primal–dual or Lagrangian policy optimization, balancing the sampled reward and safety metrics, with dual variables (e.g., $\lambda$) dynamically tuned to enforce constraints.
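The primal–dual loop in stage 3 can be sketched end to end on a toy problem. Here the "policy" is a single scalar $\theta$, with invented reward $-(\theta-2)^2$ and cost $\theta - 1$ (so the constraint is $\theta \le 1$); none of this corresponds to a real model, it only shows the alternating updates:

```python
# Toy primal-dual optimization: maximize R(theta) = -(theta - 2)^2
# subject to C(theta) = theta - 1 <= 0. The constrained optimum is theta = 1.
theta, lam = 0.0, 0.0
lr_theta, lr_lam = 0.05, 0.1

for _ in range(2000):
    grad_R = -2.0 * (theta - 2.0)                 # dR/dtheta
    grad_C = 1.0                                  # dC/dtheta
    theta += lr_theta * (grad_R - lam * grad_C)   # primal ascent on the Lagrangian
    lam = max(0.0, lam + lr_lam * (theta - 1.0))  # projected dual ascent

# theta settles at the constraint boundary theta = 1, with lam near 2,
# the value at which the reward and cost gradients balance.
```

The same structure appears in Safe RLHF at scale, with the scalar replaced by policy parameters and the gradients estimated from sampled responses.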

Key algorithmic advances include:

  • Rectified Policy Optimization (RePO):
    • Lagrangian: $\mathcal{L}(\theta, \lambda) = -\mathbb{E}[R] + \beta\,\mathrm{KL} + \lambda\,\mathbb{E}[\{C\}_+]$, where $\{C\}_+ = \max(C, 0)$ sharply penalizes all positive safety violations rather than only their average (Peng et al., 2024).
    • Proven to guarantee per-prompt safety in expectation; every positive violation is penalized, eliminating safety interference.
  • HC-RLHF: Guarantees safety with probability $1-\delta$ by pessimistically inflating the cost constraint using a Student's t-test or Hoeffding bound, followed by a held-out safety test on an independent dataset (Chittepu et al., 9 Jun 2025).
  • SAFE (Stable Alignment Finetuning with Entropy-Aware Predictive Control):
    • Multi-layer KL and entropy regulation, double-soft-min critic, and adaptive PID control to prevent catastrophic reward collapse or mode locking during RLHF (Maity, 4 Feb 2026).
  • PreSa: Directly encodes both pairwise preferences and binary safety into a joint Lagrangian, optimizing parameters so that the safety label prediction accuracy for safe/unsafe segments exceeds a preset threshold, guaranteeing feasible policy sets without explicit reward/cost regression (Gong et al., 23 Dec 2025).
  • SABRE: For binary feedback settings, actively learns a safe policy without ever executing unsafe actions during training, relying on active exploration of the uncertainty set and labeling disagreement regions (Bennett et al., 2022).
| Algorithm | Policy Constraint | Safety Guarantee | Feedback Used | Key Innovation |
|---|---|---|---|---|
| RePO (Peng et al., 2024) | $\forall(x,y):\ C \le 0$ | Per-prompt (in expectation) | Pairwise prefs/costs | Rectified per-sample penalties |
| HC-RLHF (Chittepu et al., 9 Jun 2025) | Upper-confidence bound $\le 0$ | With probability $1-\delta$ | Pairwise prefs/costs | High-confidence constraint, safety test |
| Safe RLHF-V (Ji et al., 22 Mar 2025) | Avg. cost $\le$ threshold | Strong, multimodal | Dual prefs, multi-level costs | Multimodal CMDP, severity labels, guards |
| PreSa (Gong et al., 23 Dec 2025) | Constrained label accuracy | Empirical, $>\delta$ | Prefs + binary safety labels | Direct joint alignment via Lagrangian |
| SABRE (Bennett et al., 2022) | Binary safety oracle | Zero unsafe actions in training | Binary per-$(s,a)$ oracle | Never explores unsafe actions |
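The "safety compensation" failure mode that RePO's rectified penalty targets can be seen in a small sketch with illustrative per-prompt costs: an average-cost constraint is satisfied even though individual prompts are unsafe, while the rectified penalty $\mathbb{E}[\{C\}_+]$ flags every violation:

```python
import numpy as np

# Illustrative per-prompt costs: the second and fourth prompts violate
# safety (C > 0), but large negative costs elsewhere hide this from the mean.
costs = np.array([-2.0, 0.5, -1.5, 1.0])

avg_cost = costs.mean()                    # -0.5: average constraint "satisfied"
rectified = np.maximum(costs, 0.0).mean()  # 0.375: every violation contributes
```

A policy optimized against `avg_cost` has no incentive to fix the two unsafe prompts; one optimized against `rectified` does.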

3. Decoupled Reward and Cost Modeling from Human Feedback

Safe RLHF necessitates explicit annotation, modeling, and separation of human helpfulness and harmlessness judgments. State-of-the-art pipelines universally train two preference models:

$$\begin{aligned}
\text{Helpfulness:}\quad & R_{\phi}(x, y), \ \text{trained via human pairwise preferences} \\
\text{Harmlessness:}\quad & C_{\psi}(x, y), \ \text{trained via human pairwise or binary safety labels}
\end{aligned}$$

Datasets such as BeaverTails-V (for MLLMs) contain dual annotations, including multi-dimensional (minor, moderate, severe) severity ratings (Ji et al., 22 Mar 2025). Training optimizes Bradley–Terry or similar pairwise ranking losses. Safety model targets reflect both pairwise harms and explicit “safe/unsafe” classification, critical for robust cost model performance with generalization to adversarial or out-of-distribution queries.
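The Bradley–Terry ranking loss used for both models reduces to a negative log-sigmoid of the score margin; a minimal sketch (the function name and scores are illustrative, not from any cited codebase):

```python
import math

def bradley_terry_nll(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(s_chosen - s_rejected), small when the model scores the
    human-preferred response higher."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

# The same loss form trains R_phi on helpfulness pairs and C_psi on
# harmfulness pairs (the "more harmful" response is the higher-cost one).
```

At equal scores the loss is $\log 2$; it decays toward zero as the margin in favor of the preferred response grows.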

This decoupling is essential to avoid policy collapse to trivial refusals (when cost is overemphasized) or unsafe reward maximization (when reward dominates). Empirically, methods without explicit cost models fail to balance trade-offs and exhibit significantly worse safety metrics (Dai et al., 2023, Ji et al., 22 Mar 2025).

4. Per-Prompt and High-Confidence Constraint Enforcement

A notable pathology in average-constrained Safe RLHF is "safety compensation," where violations for some inputs can be compensated by extreme over-caution for others, yielding models that are unsafe at the instance level (Peng et al., 2024). RePO and HC-RLHF address this:

  • RePO directly penalizes each violation by including $\{C\}_+$ in its loss, ensuring that every sampled $(x,y)$ with $C(x,y)>0$ is penalized, leading to much higher Safety Rates: e.g., 96.1% for RePO versus 90.6% (PPO-Lagrangian) and 75.9% (SACPO) in Alpaca-7B alignment (Peng et al., 2024).
  • HC-RLHF enforces a high-confidence constraint on the mean cost via an empirical upper bound; models are deployed only after passing the bound on a held-out test set, guaranteeing $\mathbb{E}[C(x,y)] \leq \tau$ with probability at least $1-\delta$ (Chittepu et al., 9 Jun 2025).

Both frameworks yield higher empirical harmlessness and lower harmful response rates at only small cost to helpfulness, and provide substantially more robust safety under distribution shift and red teaming.
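HC-RLHF's pessimistic verification step can be sketched with the Hoeffding variant of its bound; the held-out cost scores, assumed cost range, and threshold below are all invented for illustration:

```python
import math

# Held-out per-sample cost scores, assumed bounded in [a, b] = [-1, 1].
costs = [0.1, -0.3, 0.05, -0.2, 0.0, -0.1, 0.15, -0.25]
n = len(costs)
delta, tau = 0.05, 0.2   # allowed failure probability and cost threshold
a, b = -1.0, 1.0

mean_cost = sum(costs) / n
# Hoeffding upper confidence bound: with probability >= 1 - delta,
#   E[C] <= mean_cost + (b - a) * sqrt(log(1/delta) / (2n))
ucb = mean_cost + (b - a) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

deploy = ucb <= tau  # the model is withheld unless the bound clears tau
```

With only eight samples the bound is far above the empirical mean, so the check fails here; this is exactly why sufficient held-out data is required for the guarantee to be useful.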

5. Extensions to Multimodal, Continuous, and Real-World Domains

Safe RLHF has been extended to multimodal (vision–language) models, continuous control systems, and real-world robotics:

  • Safe RLHF-V is the first multimodal safety alignment framework, integrating a dual-preference (helpfulness/safety) BeaverTails-V dataset and an open-source multi-level safety guard (Beaver-Guard-V). The framework yields a 34.2% mean safety improvement relative to reward-only RLHF in MLLM settings (Ji et al., 22 Mar 2025).
  • PreSa addresses compounding error from reward/cost models in continuous control by directly optimizing policies from human preference pairs and binary safety labels, achieving competitive or superior safety versus ground-truth-model-based baselines (Gong et al., 23 Dec 2025).
  • ReQueST uses human-generated "safe" trajectories to train neural environment models and human reward models, achieving order-of-magnitude reductions in observed hazards in 3D navigation tasks compared to standard RL (Rahtz et al., 2022).
  • Autonomous driving: Both human-physiological signal–based RLHF (Sun et al., 2024) and “physics-enhanced” approaches that fuse human feedback, a hard-coded “safe” reference controller, and RL for trustworthy override/backup exist (Huang et al., 2024).
  • Robotics/skill learning: SEED interleaves parameterized motion primitives and human skill-level feedback, executing only proposals certified as “safe” by instant human review; this yields a nearly order-of-magnitude reduction in annotation burden over step-wise RLHF and robust safety (Hiranaka et al., 2023).

6. Empirical Performance and Tradeoffs

Safe RLHF methods consistently achieve strong (and often optimal) tradeoffs:

  • In LLM alignment, RePO delivered a +1.01 reward gain and a −13.85 cost change, with a 96.08% Safety Rate and >60% win rates versus baselines across multiple datasets, as judged by GPT-4o (Peng et al., 2024).
  • HC-RLHF, with high-confidence safety, produced comparable or better helpfulness with substantially improved harmlessness and a <1% rate of unsafe solutions even under challenging cost thresholds (Chittepu et al., 9 Jun 2025).
  • Multimodal Safe RLHF-V methods increased both safety and helpfulness simultaneously by over 34% when measured in win-rate against strong RLHF-only baselines, and were robust to hyperparameter choices due to dynamic Lagrangian updates (Ji et al., 22 Mar 2025).
  • PreSa in continuous control outperformed reward/cost model–based offline safe RL even in the absence of access to environment reward/cost signals, matching or exceeding both normalized reward and safety (Gong et al., 23 Dec 2025).

Commonly observed challenges include tuning penalty multipliers, over-conservatism under tight constraints, and initial reward suppression from strict safety enforcement; these are mitigated by dynamic $\lambda$ adjustment, penalty clipping, and warm-starting from high-reward policies.
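Dynamic $\lambda$ adjustment with clipping amounts to a projected, capped dual update; a hedged sketch (the step size and cap are arbitrary choices, not from any cited paper):

```python
def update_multiplier(lam: float, avg_cost: float,
                      lr: float = 0.05, lam_max: float = 10.0) -> float:
    """Dual ascent on the safety multiplier, projected to [0, lam_max].
    The upper clip guards against runaway over-conservatism when the
    constraint is tight; the lower clip keeps the multiplier valid."""
    return min(max(lam + lr * avg_cost, 0.0), lam_max)

# lam rises while the constraint is violated (avg_cost > 0) and decays
# back toward zero once the policy is safe (avg_cost < 0).
```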

7. Practical Implementation and Limitations

State-of-the-art Safe RLHF algorithms require:

  • Sufficient, high-quality dual-preference and safety annotation data (pairwise or binary, possibly multi-level).
  • Stable and robust policy optimization schemes—multi-layered controls such as those in SAFE are instrumental when scaling to high-parameter models (Maity, 4 Feb 2026).
  • Careful management of regularization terms (e.g., KL against the SFT policy, dynamic control of $\lambda$ or equivalent Lagrange multipliers).
  • For full guarantees (as in HC-RLHF), sufficient held-out data for statistical safety verification is necessary; for extremely rare or adversarial safety violations, more advanced robustness or adversarial evaluation is an open direction (Chittepu et al., 9 Jun 2025).

Limitations include dependence on annotation coverage (low-sample or OOD prompts still present a gap), sensitivity of forward cost estimates to modeling and data errors, and practical difficulties in some domains with dynamically shifting safety requirements (Peng et al., 2024, Chittepu et al., 9 Jun 2025, Ji et al., 22 Mar 2025). Ongoing research seeks to enhance sample efficiency, generalization, active learning for data collection, integration with programmatic knowledge (e.g., physics or logic), and formalization of constraints within the policy class.
