Safety Representations for Safer Policy Learning
- The paper introduces SRPL, which integrates a learned predictive safety model into RL to mitigate constraint violations and improve exploration efficiency.
- SRPL augments states with risk features like steps-to-cost, enabling agents to balance reward optimization with safety in complex, constrained environments.
- Empirical results show that SRPL significantly reduces safety violations and enhances task performance and transferability across diverse safety-critical domains.
Safety Representations for Safer Policy Learning (SRPL) denote a class of reinforcement learning (RL) methodologies that integrate explicit, learned models of risk or proximity to constraint violation directly into the agent’s state or decision-making architecture. Rather than enforcing safety exclusively through reward shaping or hard constraints—which often results in excessive conservatism and slow learning—SRPL augments the agent’s perception or latent space with representations that encode the likelihood or anticipation of future safety incidents. Empirical evidence demonstrates that SRPL frameworks materially improve the reward-safety trade-off, accelerate learning, and are transferable across tasks and domains, particularly in safety-critical settings such as robotics and autonomous driving (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025).
1. The Constrained RL Problem and Rationale for Safety Representations
Modern RL in safety-critical domains is frequently cast as a constrained Markov decision process (CMDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, c, d)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a)$ the transition dynamics, $r(s, a)$ the reward function, $c(s, a) \in \{0, 1\}$ a binary indicator of constraint violation, and $d$ a maximum allowable cumulative expected cost. The objective is to maximize task reward while ensuring $J_c(\pi) \le d$, where $J_c(\pi) = \mathbb{E}_\pi\big[\sum_{t} \gamma^t c(s_t, a_t)\big]$ is the discounted sum of costs under policy $\pi$. Traditional constrained policy learning methods, such as CPO or Lagrangian-PID, often over-penalize early violations, leading to a “primacy bias”: agents avoid both risky and high-reward regions, hampering sample efficiency and limiting asymptotic performance. SRPL explicitly addresses this by endowing agents with a learned, state-conditioned predictive safety representation, which provides an inductive bias to guide exploration away from risky trajectories without unduly restricting learning or expressivity (Mani et al., 27 Feb 2025).
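For concreteness, the constrained objective can be written in standard CMDP form (a single discount factor $\gamma$ is assumed here for both reward and cost):

```latex
\max_{\pi}\; J_r(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
J_c(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d.
```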
2. State-Conditioned Predictive Safety Representations
SRPL frameworks introduce a model $f_\phi(s)$ or $f_\phi(s, a)$, a parameterized neural network that outputs, for each state (or state-action pair), a categorical (or otherwise structured) distribution, typically over “steps until first cost event” (Steps-to-Cost, S2C) or analogous statistics (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025). For horizon $H$, $f_\phi$ approximates

$$f_\phi(s)_k \approx P\big(K = k \mid s_t = s\big), \qquad k \in \{1, \dots, H, H{+}1\},$$

where $K$ is the number of steps until the first cost event, with $k = H{+}1$ assigned to the event “no violation within $H$ steps”. This representation is learned from experience by labeling each pair $(s_t, k_t)$ in rollouts, where $k_t$ is the number of steps from $s_t$ to the nearest unsafe state (or $H{+}1$ if none). The model is trained via the negative log-likelihood

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(s, k) \sim \mathcal{D}}\big[\log f_\phi(s)_k\big],$$

where $\mathcal{D}$ is the buffer of labeled pairs.
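A minimal sketch of the S2C labeling and negative-log-likelihood update is given below, assuming a PyTorch MLP classifier over the $H{+}1$ bins; the network size, the forward-only labeling convention, and the truncation handling are illustrative choices rather than the published implementation:

```python
import torch
import torch.nn as nn


def steps_to_cost_labels(costs, horizon):
    """Label each step of one episode with its steps-to-cost class.

    costs: sequence of binary cost indicators c_t for the episode.
    Returns zero-indexed labels in {0, ..., horizon}: class k means the first
    violation occurs k+1 steps ahead; class `horizon` means no violation within
    the horizon (episode truncation is treated the same way here).
    """
    T = len(costs)
    labels = []
    for t in range(T):
        label = horizon  # default: the "no violation within H steps" bin
        for j in range(1, horizon + 1):
            if t + j < T and costs[t + j] > 0:
                label = j - 1
                break
        labels.append(label)
    return torch.tensor(labels, dtype=torch.long)


class S2CModel(nn.Module):
    """Categorical Steps-to-Cost predictor f_phi(s)."""

    def __init__(self, obs_dim, horizon, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon + 1),  # H bins plus the "no violation" bin
        )

    def forward(self, obs):
        return self.net(obs)  # unnormalized logits over the H+1 bins

    def distribution(self, obs):
        return torch.softmax(self.net(obs), dim=-1)


def s2c_update(model, optimizer, obs_batch, label_batch):
    """One negative-log-likelihood (cross-entropy) step on labeled states."""
    loss = nn.functional.cross_entropy(model(obs_batch), label_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```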
This scalar- or vector-valued predictive feature is concatenated with the raw state to form an augmented input $\tilde{s} = [s, f_\phi(s)]$, and all downstream policy and value networks are conditioned on this composite state (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025).
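The augmentation itself can be a thin observation wrapper that appends $f_\phi(s)$ to the raw observation before it reaches the policy; the sketch below assumes a Gymnasium environment with a Box observation space and the hypothetical `S2CModel` from the previous snippet:

```python
import numpy as np
import gymnasium as gym
import torch


class SafetyAugmentedObs(gym.ObservationWrapper):
    """Append the predicted S2C distribution f_phi(s) to each observation."""

    def __init__(self, env, s2c_model):
        super().__init__(env)
        self.s2c_model = s2c_model
        n_bins = s2c_model.net[-1].out_features  # H + 1 probability entries
        low = np.concatenate([env.observation_space.low, np.zeros(n_bins)])
        high = np.concatenate([env.observation_space.high, np.ones(n_bins)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        with torch.no_grad():
            dist = self.s2c_model.distribution(
                torch.as_tensor(obs, dtype=torch.float32))
        return np.concatenate([obs, dist.numpy()]).astype(np.float32)
```

With such a wrapper, any downstream algorithm trains on the augmented observations without further modification.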
3. Integration with Safe Policy Learning Algorithms
SRPL is agnostic to the underlying RL or SafeRL optimization routines. The learned safety feature is incorporated into any on-policy or off-policy safe RL algorithm, including CPO, TRPO-PID, CRPO, SauteRL, PPO-Lagrangian, OnCRPO, and P3O. The update equations for each baseline algorithm remain unchanged except that the policy and value functions receive $\tilde{s} = [s, f_\phi(s)]$ rather than $s$ as input. The data flow is as follows: for each policy update, a batch of experience is generated, every state is labeled with its steps-to-cost, the safety model is updated, and the augmented states are prepared for downstream learning. The process repeats in an interleaved, asynchronous fashion (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025).
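A schematic of this interleaved loop might look as follows, with `collect_rollouts` and `safe_rl_update` as placeholders for the rollout collection and the unchanged baseline SafeRL update (neither is an API from the cited papers), and reusing the labeling and update helpers from the earlier sketch:

```python
import torch


def srpl_training_loop(env, policy, collect_rollouts, safe_rl_update,
                       s2c_model, s2c_optimizer, horizon, num_iterations):
    """Interleave safety-model learning with an unchanged baseline SafeRL update.

    Placeholders: collect_rollouts(env, policy) returns episodes, each a dict
    with 'obs' (T x obs_dim floats) and 'costs' (length-T binary costs);
    safe_rl_update(policy, episodes, aug_obs) runs the baseline update
    (CPO, PPO-Lagrangian, ...) on the augmented inputs. Relies on
    steps_to_cost_labels / s2c_update from the previous sketch.
    """
    for _ in range(num_iterations):
        episodes = collect_rollouts(env, policy)

        # 1. Label every visited state with its steps-to-cost.
        obs = torch.cat([torch.as_tensor(ep["obs"], dtype=torch.float32)
                         for ep in episodes])
        labels = torch.cat([steps_to_cost_labels(ep["costs"], horizon)
                            for ep in episodes])

        # 2. Refresh the safety model on the freshly labeled batch.
        s2c_update(s2c_model, s2c_optimizer, obs, labels)

        # 3. Form augmented states [s, f_phi(s)] for the downstream learner.
        with torch.no_grad():
            aug_obs = torch.cat([obs, s2c_model.distribution(obs)], dim=-1)

        # 4. Baseline SafeRL update, equations unchanged, on augmented input.
        safe_rl_update(policy, episodes, aug_obs)
```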
| Component | SRPL Implementation | Reference |
|---|---|---|
| Safety Model | Steps-to-Cost neural net | (Mani et al., 27 Feb 2025) |
| Augmentation | $\tilde{s} = [s, f_\phi(s)]$ fed to policy/value nets | (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025) |
| Policy Update | Baseline SafeRL equations, no change | (Mani et al., 27 Feb 2025) |
Empirically, SRPL improves both safety and return metrics, significantly reducing cumulative constraint violations during training and increasing task success rates across diverse environments. Zero-shot transfer and robustness experiments further demonstrate that SRPL-enhanced policies generalize better to new domains and maintain safety under distribution shifts and observation noise (Keswani et al., 19 Dec 2025).
4. Comparative Analysis: Representation Forms Across Existing Safe RL
Safety representation learning is a broader paradigm, of which SRPL is a prominent realization. Other constructions include:
- Safety Q-Function (Safety Critic, SQRL): A critic $Q^{\pi}_{\text{safe}}(s, a)$ estimates the discounted probability of future cost under the current policy (Srinivasan et al., 2020). The learned critic is used during policy optimization both as a constraint and for action-masking.
- Feasibility-Consistent Embeddings (FCSRL): An encoder is trained to capture temporal feasibility (the maximum expected cost over a trajectory), augmented with self-supervised consistency losses (Cen et al., 20 May 2024).
- Latent Safety Contexts (SAFER): A contrastive objective is used during offline skill extraction to induce a latent safety-context variable for hierarchical policy learning (Slack et al., 2022).
- Probabilistic Logic Shields (PLPG): Logical safety constraints are encoded in predicate form and used to reweight or mask the policy outputs via differentiable shields (Yang et al., 2023).
- Non-Markovian Safety Predictors: Learned safety models over partial trajectories (e.g., GRU-encoded) estimate non-Markovian risk and supply per-step scores in the augmented state (Low et al., 5 May 2024).
A distinguishing trait of SRPL is the directness and interpretability of the S2C feature and its generic compatibility with modular policy architectures, supporting scalability to high-dimensional and real-world domains (Mani et al., 27 Feb 2025, Keswani et al., 19 Dec 2025).
5. Empirical Evidence: Benchmark Results, Robustness, and Generalization
SRPL methods have been validated across tactile manipulation (AdroitHandPen), navigation (Safety-Gym PointGoal1, PointButton1), and realistic driving (MetaDrive with the Waymo Open Motion Dataset and NuPlan). For example, with SRPL augmentation on CPO for AdroitHandPen, average return increased by 73% and failures dropped by 28% compared to baseline CPO (Mani et al., 27 Feb 2025). On MetaDrive/WOMD, SRPL produced statistically significant improvements: success rates increased (e.g., PPO-Lag 0.81→0.90) and cost was reduced (e.g., PPO-Lag 2.03→1.54) (Keswani et al., 19 Dec 2025). Robustness to observation noise is greatly improved: as input noise increases, the rise in safety violations is suppressed in SRPL-augmented agents compared to baselines. In zero-shot cross-dataset transfer (NuPlan→WOMD, WOMD→NuPlan), SRPL systematically enhances both safety and task completion, especially when the training set is more diverse.
6. Theoretical Principles and Trade-Offs
SRPL embodies two theoretical advantages:
- Inductive Risk Estimation: By exposing the policy to an explicit, temporally informed risk feature, SRPL overcomes the primacy bias and enables agents to explore beyond myopically safe regions without incurring excessive cost.
- Flexible Safety-Performance Trade-Off: The safety representation induces a smooth risk-reward frontier, permitting precise control of the exploration/safety balance via post-hoc tuning of model output thresholds or dual variables (one way to derive such a threshold is sketched after this list).
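As one illustrative (assumed, not paper-specified) realization of such post-hoc control, the categorical S2C output can be collapsed into a near-term violation probability and compared against a tunable risk threshold:

```python
import torch


def violation_probability(s2c_dist, within_steps):
    """P(first cost event within `within_steps` steps), using the zero-indexed
    categorical S2C output whose final bin means "no violation within H"."""
    return s2c_dist[..., :within_steps].sum(dim=-1)


def is_risky(s2c_dist, within_steps=5, threshold=0.1):
    """Flag states whose predicted near-term risk exceeds a tunable threshold;
    raising `threshold` trades additional risk for more aggressive exploration."""
    return violation_probability(s2c_dist, within_steps) > threshold
```

Here `within_steps` and `threshold` are the tunable knobs referred to above; the same scalar can also serve as a per-state risk score for logging or for adjusting a Lagrangian dual variable.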
Limitations identified include dependence on the quality of the learned predictive model: miscalibration or poor out-of-distribution generalization can erode its guarantees. Pretraining or transferring S2C models requires that state and task distributions are sufficiently aligned for the safety encoding to remain meaningful.
7. Directions for Extension and Open Challenges
Areas of ongoing research and proposed improvements include:
- Task-Agnostic or OOD-Resilient Safety Models: Ensuring that predictive safety representations generalize across wider domains, possibly by integrating uncertainty quantification or adversarial robustness (Mani et al., 27 Feb 2025).
- Curriculum and Self-Play for Safety Model Pretraining: Automating the discovery of proxy environments or curricula for S2C model estimation.
- Integration with Formal Methods: Combining learned safety representations with formal verification or logic shield frameworks (Yang et al., 2023), or with model-based priors in complex domains such as legged locomotion (Omar et al., 2023).
- Hybrid Representations: Compositing state-centric and trajectory-encoded forms to enable non-Markovian or history-dependent safety predictions (Low et al., 5 May 2024).
A plausible implication is that SRPL-style safety representation learning will underpin the next generation of scalable, transferable, and robust safe RL solutions, particularly in real-world, partially observable, or multi-agent environments where direct constraint modeling is infeasible.
Key references:
- "Safety Representations for Safer Policy Learning" (Mani et al., 27 Feb 2025),
- "Learning Safe Autonomous Driving Policies Using Predictive Safety Representations" (Keswani et al., 19 Dec 2025),
- "Learning to be Safe: Deep RL with a Safety Critic" (Srinivasan et al., 2020).