
Safe Policy Improvement in RL

Updated 21 October 2025
  • Safe Policy Improvement (SPI) is a reinforcement learning approach that guarantees new policies perform at least as well as a baseline under uncertainty.
  • It uses strategies like baseline bootstrapping and robust regret minimization to restrict risky policy changes, especially in low-data regions.
  • SPI integrates model-based and model-free methods to provide statistical guarantees and manage model misspecification, with applications in healthcare, control, and robotics.

Safe Policy Improvement (SPI), within reinforcement learning, is the methodological pursuit of generating a new policy that is guaranteed—according to precise, often probabilistic, performance bounds—to perform at least as well as a pre-specified baseline policy. SPI is particularly motivated by deployments where untested policy changes risk catastrophic failures, such as in healthcare, automated control, and real-world operations. The field spans model-based and model-free methodologies, robustness to model misspecifications, constraint satisfaction under uncertainty, and theoretical characterizations of guarantees and their computational complexity.

1. Fundamental Concepts

Safe Policy Improvement is motivated by the need to balance optimization and conservatism under uncertainty. Given an environment formalized as a Markov Decision Process (MDP) with unknown true dynamics $P^*$ and a baseline policy $\pi_B$, a policy $\pi$ is termed "safe" if its expected return under $P^*$ does not underperform $\pi_B$:

$$\rho(\pi, P^*) \geq \rho(\pi_B, P^*)$$

In batch/offline RL, only a fixed dataset (typically generated by $\pi_B$ or a similar policy) is available. SPI seeks to ensure that the learned policy, even with limited data and imperfect model estimates, does not lead to performance regression relative to $\pi_B$.
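To make the criterion concrete, the minimal sketch below evaluates $\rho(\pi, P)$ exactly in a small tabular MDP and checks the safety inequality. The function names (`policy_return`, `is_safe`) and the oracle access to the true dynamics are assumptions for illustration only; in practice $P^*$ is unknown, which is precisely why SPI must certify the inequality from data.

```python
import numpy as np

def policy_return(P, R, pi, gamma, mu0):
    """Exact discounted return rho(pi, P) in a tabular MDP.

    P: (S, A, S) transition tensor, R: (S, A) rewards,
    pi: (S, A) stochastic policy, mu0: (S,) initial state distribution.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)          # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi
    return float(mu0 @ v)

def is_safe(pi, pi_b, P_true, R, gamma, mu0):
    """Oracle version of the safety criterion rho(pi, P*) >= rho(pi_B, P*)."""
    return policy_return(P_true, R, pi, gamma, mu0) >= policy_return(P_true, R, pi_b, gamma, mu0)
```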

Core definitions include:

  • Robust Baseline Regret: The worst-case difference in expected return between the candidate and baseline policies over an uncertainty set $\Xi$ of dynamics models,

$$\min_{\xi \in \Xi} \left[ \rho(\pi, \xi) - \rho(\pi_B, \xi) \right]$$

  • Uncertainty Set $\Xi$: A collection of plausible models satisfying error bounds on the transition kernel, typically defined as

$$\Xi(\hat{P}, e) = \left\{ \xi : \left\| \xi(\cdot \mid x,a) - \hat{P}(\cdot \mid x,a) \right\|_1 \leq e(x,a) \ \ \forall (x,a) \right\}$$

where $e(x,a)$ bounds the $L_1$ model-estimation error.
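A common way to instantiate $e(x,a)$ is via a concentration bound on the empirical transition kernel. The sketch below uses a Weissman-style $L_1$ bound whose logarithmic term matches the SPIBB guarantee quoted in Section 3; the constants vary across papers, and the function names are placeholders.

```python
import numpy as np

def l1_error_radius(counts, delta):
    """Per-(x, a) radius e(x, a) from visitation counts N(x, a).

    Uses e(x, a) = sqrt((2 / N(x, a)) * log(2 |X||A| 2^{|X|} / delta)),
    one standard L1 concentration bound (a common but not unique choice).
    """
    S, A = counts.shape
    N = np.maximum(counts, 1)                                  # guard unvisited pairs
    log_term = np.log(2.0 * S * A / delta) + S * np.log(2.0)   # log(2 |X||A| 2^{|X|} / delta)
    return np.sqrt(2.0 * log_term / N)

def in_uncertainty_set(xi, P_hat, e):
    """Membership test for Xi(P_hat, e): per-pair L1 deviation within e(x, a)."""
    dev = np.abs(xi - P_hat).sum(axis=-1)                      # ||xi(.|x,a) - P_hat(.|x,a)||_1
    return bool(np.all(dev <= e))
```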

2. Methodological Foundations and State-of-the-Art Algorithms

a. Baseline Bootstrapping and Policy Restriction

A class-defining strategy is Safe Policy Improvement with Baseline Bootstrapping (SPIBB) (Laroche et al., 2017), which constrains the improved policy to match the baseline in state–action pairs insufficiently supported by data, thereby limiting risk to well-understood portions of the state–action space. Two principal variants are:

  • $\Pi_b$-SPIBB: Enforces $\pi(a|s) \equiv \pi_B(a|s)$ for under-sampled pairs $(s,a)$.
  • $\Pi_{\leq b}$-SPIBB: Ensures $\pi(a|s) \leq \pi_B(a|s)$ for under-sampled pairs, enabling more flexible elimination of poorly performing actions.

The resulting optimization is typically

$$\max_{\pi \in \Pi_b} \rho(\pi, \hat{P})$$

where $\Pi_b$ is the restricted policy class and $\hat{P}$ is the empirical model.
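The sketch below illustrates the greedy $\Pi_b$-SPIBB projection step in a tabular setting: baseline probabilities are copied on under-sampled pairs, and the remaining mass is placed greedily according to a Q-function estimated in the empirical MDP. In the full algorithm this projection is iterated inside policy iteration on $\hat{P}$; the names (`pi_b_spibb_policy`, `n_wedge`) are illustrative.

```python
import numpy as np

def pi_b_spibb_policy(q_hat, pi_b, counts, n_wedge):
    """One greedy Pi_b-SPIBB projection step (tabular sketch).

    q_hat:   (S, A) Q-values estimated in the empirical MDP
    pi_b:    (S, A) baseline policy
    counts:  (S, A) visitation counts N(s, a)
    n_wedge: sampling threshold below which a pair is bootstrapped
    """
    S, A = q_hat.shape
    bootstrapped = counts < n_wedge
    pi = np.where(bootstrapped, pi_b, 0.0)          # keep baseline mass on uncertain actions
    for s in range(S):
        allowed = ~bootstrapped[s]
        free_mass = 1.0 - pi[s].sum()               # mass left for well-sampled actions
        if allowed.any() and free_mass > 0:
            best = np.argmax(np.where(allowed, q_hat[s], -np.inf))
            pi[s, best] += free_mass                # greedy among well-sampled actions
        else:
            pi[s] = pi_b[s]                         # everything bootstrapped: follow the baseline
    return pi
```

Constraining the greedy step in this way confines potential degradation to the well-sampled portion of the state–action space, which is the source of the guarantee discussed in Section 3.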

b. Robust Regret Minimization

An alternative, robust optimization paradigm (Petrik et al., 2016) directly targets maximization of the worst-case improvement over the baseline, leading to the central problem

$$\max_{\pi \in \Pi_{\mathcal{R}}} \min_{\xi \in \Xi} \left[ \rho(\pi, \xi) - \rho(\pi_B, \xi) \right]$$

which admits randomized policies and falls back on the baseline when guaranteeable improvements are precluded by high model error.
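A brute-force reading of this objective, sketched below, approximates $\Xi$ with a finite list of sampled models and searches over a finite set of candidate policies. The exact formulation optimizes over all of $\Xi$ and over randomized policies, so this is only an illustrative sketch; all function names are assumptions.

```python
import numpy as np

def discounted_return(P, R, pi, gamma, mu0):
    """Exact discounted return of pi under tabular dynamics P."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    v = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)
    return float(mu0 @ v)

def robust_regret_choice(candidates, pi_b, sampled_models, R, gamma, mu0):
    """Pick the candidate maximizing worst-case improvement over pi_b,
    with Xi approximated by a finite list of sampled transition models."""
    def worst_case_gain(pi):
        return min(discounted_return(P, R, pi, gamma, mu0)
                   - discounted_return(P, R, pi_b, gamma, mu0)
                   for P in sampled_models)
    gains = [worst_case_gain(pi) for pi in candidates]
    best = int(np.argmax(gains))
    # If no candidate has a guaranteed (worst-case) improvement, keep the baseline.
    return (candidates[best], gains[best]) if gains[best] > 0 else (pi_b, 0.0)
```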

c. Model-Free and Deep SPI

Model-free SPI (e.g., SPIBB-DQN) replaces explicit environment models with modifications to the update targets in deep Q-learning, deferring to the baseline in uncertain regions (Laroche et al., 2017, Nadjahi et al., 2019). Recent advances (e.g., DeepSPI (Delgrange et al., 14 Oct 2025)) integrate these ideas into online deep RL by restricting policy updates to local neighborhoods, measured via importance ratios, and incorporating additional world-model losses, yielding monotonic improvement and robustness to representational error.
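As a rough illustration of how a model-free variant defers to the baseline, the sketch below computes a SPIBB-style bootstrap target for a batch of transitions: bootstrapped next-state actions contribute baseline-weighted Q-values, while the remaining probability mass goes to the best sufficiently sampled action. The masking and pseudo-count details differ across implementations, so treat this purely as a sketch under those assumptions.

```python
import numpy as np

def spibb_style_target(r, q_next, pi_b_next, bootstrapped_next, gamma, done):
    """SPIBB-style Q-learning target for a batch of transitions.

    q_next:            (B, A) target-network Q-values at next states
    pi_b_next:         (B, A) baseline action probabilities at next states
    bootstrapped_next: (B, A) boolean mask of under-sampled next-state actions
    r, done:           (B,) rewards and terminal flags
    """
    baseline_part = np.where(bootstrapped_next, pi_b_next, 0.0)
    free_mass = 1.0 - baseline_part.sum(axis=1)                # mass left for greedy actions
    greedy_q = np.where(bootstrapped_next, -np.inf, q_next).max(axis=1)
    greedy_q = np.where(np.isfinite(greedy_q), greedy_q, 0.0)  # all actions bootstrapped
    v_next = (baseline_part * q_next).sum(axis=1) + free_mass * greedy_q
    return r + gamma * (1.0 - done) * v_next
```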

d. Soft and Advantageous SPIBB

Relaxing the binary restriction, Soft-SPIBB (Nadjahi et al., 2019) allows the policy to deviate softly from the baseline in proportion to the uncertainty:

$$\sum_{a} e(s,a) \, \left| \pi(a|s) - \pi_B(a|s) \right| \leq \epsilon$$

but is only provably safe when strengthened with an explicit "advantageous" constraint ensuring that action changes improve the estimated value (Scholl et al., 2022, Scholl et al., 2022).
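The soft constraint is easy to audit for a given tabular policy; the helper below (names are placeholders) checks the per-state, error-weighted deviation budget.

```python
import numpy as np

def satisfies_soft_spibb(pi, pi_b, e, eps):
    """Check sum_a e(s, a) * |pi(a|s) - pi_B(a|s)| <= eps in every state.

    pi, pi_b: (S, A) policies; e: (S, A) model-error estimates; eps: deviation budget.
    """
    budget_used = (e * np.abs(pi - pi_b)).sum(axis=1)
    return bool(np.all(budget_used <= eps))
```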

3. Theoretical Guarantees, Complexity, and Performance Bounds

Safe policy improvement is characterized by explicit performance lower bounds and statistical guarantees. Example statements from the literature include:

  • For SPIBB (Laroche et al., 2017): $\rho(\pi_{SPIBB}, P^*) \geq \rho(\pi_B, P^*) - \zeta$ with high probability, where

$$\zeta = \frac{4 V_{\max}}{1-\gamma} \sqrt{ \frac{2}{N_\wedge} \log \frac{2|\mathcal{X}||\mathcal{A}| 2^{|\mathcal{X}|}}{\delta} } - \left( \rho(\pi_{SPIBB}, \hat{M}) - \rho(\pi_B, \hat{M}) \right)$$

and $N_\wedge$ is the sampling threshold below which state–action pairs are treated as "unsafe"; a direct numeric evaluation of this bound is sketched after this list.

  • For robust regret minimization (Petrik et al., 2016), the regret is bounded as

$$\Phi(\pi_S) \equiv \rho(\pi^*, P^*) - \rho(\pi_S, P^*) \leq \frac{2\gamma}{(1-\gamma)^2} \left[ \| e_{\pi^*} \|_{1, u^*_{\pi^*}} + \| e_{\pi_B} \|_{1, u^*_{\pi_B}} \right]$$
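As referenced above, a direct numeric evaluation of the SPIBB safety gap $\zeta$ looks as follows; all inputs are assumed to be known or estimated from the dataset, and the function name (`spibb_zeta`) is a placeholder.

```python
import numpy as np

def spibb_zeta(v_max, gamma, n_wedge, n_states, n_actions, delta,
               rho_spibb_hat, rho_b_hat):
    """Evaluate zeta = (4 V_max / (1 - gamma)) * sqrt((2 / N_wedge) *
    log(2 |X||A| 2^{|X|} / delta)) - (rho(pi_SPIBB, M_hat) - rho(pi_B, M_hat))."""
    log_term = np.log(2.0 * n_states * n_actions / delta) + n_states * np.log(2.0)
    concentration = np.sqrt(2.0 * log_term / n_wedge)
    return 4.0 * v_max / (1.0 - gamma) * concentration - (rho_spibb_hat - rho_b_hat)
```

A larger empirical improvement of $\pi_{SPIBB}$ over $\pi_B$ in $\hat{M}$ directly reduces $\zeta$, while a smaller threshold $N_\wedge$ inflates the concentration term.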

Computing an optimal safe policy can, however, be NP-hard; randomized policies are in general necessary (Theorem 1, Petrik et al., 2016), and intractability results hold for adversarial regret minimization.

Recent work further tightens sample-complexity and safety error bounds by restricting the effective support of the model (e.g., to only two successors per state–action pair (Wienhöft et al., 2023)), enabling logarithmic, rather than linear, dependence on the number of states in the safety guarantee and greatly improving practical applicability.

4. Extensions: Generalizations, Applications, and Domains

SPI frameworks have been generalized in several key directions:

  • Estimated Baseline Policies: When the baseline is unavailable, it can be estimated from data (via MLE or pseudo-count/Q-function learning approaches), with additional error terms appearing in the safety guarantee (Simão et al., 2019); a minimal estimation sketch follows this list.
  • Multi-Objective SPI: Extensions handle vector-valued rewards, enforcing per-objective constraints as in Multi-Objective SPIBB, ensuring no individual objective deteriorates by more than $\zeta$ relative to the baseline (Satija et al., 2021).
  • Non-Stationary MDPs: SPI for drifting environments combines model-free estimation and time-series forecasting with bootstrap-based confidence intervals, updating only when future expected policy performance statistically exceeds that of the trusted baseline (Chandak et al., 2020).
  • Partially Observable Domains: Finite-State Controller-based SPI methods manage the policy space and data aggregation for partially observed settings (Simão et al., 2023).
  • Safe Exploration: SPI has been mapped to safe exploration problems, e.g., via symbolic shielding (Anderson et al., 2022) or by hierarchical goal decomposition (SPEIS, (Angulo et al., 25 Aug 2024)), where safe and subgoal policies are co-optimized to ensure constraint satisfaction and successful task completion.
  • Statistical Testing and Calibrated Improvements: Procedures such as CSPI-MT (Cho et al., 21 Aug 2024) and SNPL (Cho et al., 17 Mar 2025) provide asymptotically calibrated, multi-objective, or multi-testing-guided SPI by building confidence sets for policy selection under statistical and multiple testing corrections.
  • Leveraging Parametric Structure: Parametric SPI algorithms exploit shared structure in the transition kernel across different states and actions, thereby reducing variance and permitting more aggressive improvement with the same data (Engelen et al., 21 Jul 2025).
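For the first extension (estimated baseline policies), a minimal MLE sketch from visitation counts is shown below; the uniform fallback for unvisited states and the smoothing parameter are illustrative choices, not part of the cited method.

```python
import numpy as np

def mle_baseline(counts, prior=0.0):
    """Maximum-likelihood estimate of the behaviour policy from counts N(s, a).

    counts: (S, A) visitation counts; prior: optional smoothing pseudo-count.
    States never visited fall back to a uniform policy (an arbitrary choice here).
    """
    smoothed = counts.astype(float) + prior
    totals = smoothed.sum(axis=1, keepdims=True)
    uniform = np.full(smoothed.shape, 1.0 / smoothed.shape[1])
    return np.where(totals > 0, smoothed / np.maximum(totals, 1e-12), uniform)
```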

5. Empirical Findings and Domain Impact

SPI methods have demonstrated robust empirical performance across diverse domains such as healthcare, automated control, and robotics.

Across all such domains the central insight persists: by restricting deviation to well-established regions of the state–action space or by leveraging uncertainty sets and robust estimators, SPI approaches combine safety with practical, often substantial, improvement over naïve baseline adherence.

6. Limitations, Open Questions, and Future Directions

While SPI methods have advanced guarantees under a variety of settings, limitations remain:

  • Conservatism vs. Aggressiveness: Hardline restriction to baseline actions can be overly conservative, limiting improvement. Soft relaxation improves flexibility but may require careful calibration and additional “advantageous” constraints to remain provably safe (Nadjahi et al., 2019, Scholl et al., 2022, Scholl et al., 2022).
  • Sample Complexity: Many bounds are only meaningful (“non-vacuous”) at large sample sizes or in low-dimensional settings. Logarithmic improvements in complexity (e.g., via 2sMDPs (Wienhöft et al., 2023)) are promising, yet still tied to explicit enumeration or bounding of uncertainty sets.
  • Robustness to Model Misspecification: The guarantees often depend on accurate estimation of the baseline policy, transition kernels, or model error. Estimation failure can yield vacuous guarantees or unexpected degradation (Simão et al., 2019).
  • Scalability to Continuous and Structured Domains: Recent works demonstrate advances (e.g., DeepSPI, parametric SPI (Delgrange et al., 14 Oct 2025, Engelen et al., 21 Jul 2025)), but domain- and application-specific enhancements may be required for tight guarantees in large-scale and structured MDPs.
  • Tradeoff Between Performance and Risk: Empirical taxonomy studies demonstrate that more aggressive, uncertainty-penalizing methods yield higher average return but increased risk of failure, while strict policy-restriction methods sacrifice mean performance for improved worst-case (risk-sensitive) metrics (Scholl et al., 2022, Scholl et al., 2022).

Research continues to develop tighter bounds, improved estimation, relaxed constraints for greater achievable improvement, and generalization to broader families of RL settings including non-stationary, partially observed, multi-objective, and real-world domains where SPI’s assurances are indispensable.
