
Safe Improvement Relations

Updated 3 December 2025
  • Safe improvement relations are formal criteria providing probabilistic guarantees that new policies do not perform worse than baseline policies under uncertainty.
  • They employ statistical hypothesis testing, bootstrapping, and robust optimization to certify performance improvements in sequential decision-making.
  • Applications include reinforcement learning, econometrics, and mechanism design for safe, data-efficient policy updates in dynamic environments.

A safe improvement relation formalizes the requirement that, under uncertainty and data limitations, a newly proposed policy (or mechanism, or threshold, etc.) will not degrade performance compared to a reference or baseline policy, subject to user-specified risk or error tolerances. This principle is foundational for high-stakes sequential decision-making in reinforcement learning, econometrics, mechanism design, and constrained optimization, particularly when policies must be improved using off-policy data or under non-stationary dynamics.

1. Formal Definitions of Safe Improvement Relations

A safe improvement relation defines a probabilistic or adversarial guarantee that a candidate policy $\pi$ satisfies

$$\Delta(\pi, \pi_b) := V(\pi) - V(\pi_b) \ge 0$$

with respect to a baseline policy $\pi_b$, either in expectation, with high probability ($1-\delta$), or uniformly across all possible environments that satisfy specified uncertainty bounds.

In non-stationary MDPs, the relation is instantiated as

$$\Delta(\pi, \pi_b) \equiv V_\delta(\pi) - V_\delta(\pi_b) \ge 0$$

at a future time $k+\delta$, where $V_\delta(\pi)$ is the expected return of $\pi$ at horizon $\delta$, and the guarantee is

$$\Pr\left[ V_\delta(\pi) - V_\delta(\pi_b) \ge 0 \right] \ge 1-\alpha,$$

ensured via sequential hypothesis testing and wild bootstrap confidence intervals (Chandak et al., 2020).
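
As an illustration of this forecast-based test, the following is a minimal Python sketch, assuming per-episode return estimates and a simple linear time trend; the trend model, horizon handling, and bootstrap details are simplifications for exposition, not the SPIN implementation.

```python
import numpy as np

def wild_bootstrap_lower_bound(returns, horizon, alpha=0.05, n_boot=2000, seed=None):
    """Lower confidence bound on a policy's forecasted return at time T + horizon.

    Fits a linear trend to past per-episode returns, then applies a wild
    bootstrap (Rademacher-weighted residuals) so that non-i.i.d., possibly
    heteroscedastic noise is reflected in the forecast uncertainty.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    T = len(returns)
    X = np.column_stack([np.ones(T), np.arange(T)])   # intercept + time trend
    beta, *_ = np.linalg.lstsq(X, returns, rcond=None)
    residuals = returns - X @ beta
    x_future = np.array([1.0, T - 1 + horizon])       # design row at the forecast horizon

    forecasts = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=T)           # Rademacher weights
        y_star = X @ beta + w * residuals             # wild-bootstrap pseudo-data
        beta_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        forecasts[b] = x_future @ beta_star
    return np.quantile(forecasts, alpha)

# A SPIN-style test would accept the candidate only if this lower bound for
# the candidate exceeds an analogous upper bound computed for the baseline.
```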

In batch RL and online RL, the safe improvement relation requires

$$P\left[ \rho(\pi') \ge \rho(\pi) \right] \ge 1-\delta$$

for every policy deployment step, with $\rho(\pi)$ denoting the true expected return (Cohen et al., 2018).

The variant for constrained Markov Decision Processes is

$$\Pr\left[\, \forall i:\; V_{C_i}^{\pi'} \le V_{C_i}^{\pi} \,\right] \ge 1-\delta,$$

where $C_i$ are cost functions encoding safety violations (Berducci et al., 2022).

In general, the relation may be formulated as minimizing robust baseline regret, i.e., maximizing the worst-case improvement $\max_{\pi} \min_{P \in U} \left[ V^\pi(P) - V^{\pi_b}(P) \right]$, where $U$ is an uncertainty set over MDP transition kernels (Petrik et al., 2016).
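
For intuition, the robust-regret criterion can be evaluated exactly in the tabular case. The sketch below assumes the uncertainty set $U$ is given as a finite collection of candidate transition kernels; the actual formulation in Petrik et al. works with structured (e.g. rectangular) uncertainty sets, so the finite-set form, array shapes, and function names here are illustrative assumptions.

```python
import numpy as np

def policy_value(P, R, pi, gamma, s0):
    """Exact value at state s0 of a stationary policy pi (|S| x |A| action probabilities)
    in a tabular MDP with transitions P (|A| x |S| x |S|) and rewards R (|S| x |A|)."""
    S = R.shape[0]
    P_pi = np.einsum('sa,ast->st', pi, P)   # state-to-state kernel induced by pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return v[s0]

def worst_case_improvement(uncertainty_set, R, pi, pi_b, gamma, s0=0):
    """min over models P in U of V^pi(P) - V^pi_b(P); the candidate pi is a
    safe improvement over pi_b (in the robust sense) iff this value is >= 0."""
    return min(policy_value(P, R, pi, gamma, s0) - policy_value(P, R, pi_b, gamma, s0)
               for P in uncertainty_set)
```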

2. Statistical and Optimization Frameworks

Safe improvement is realized via statistical hypothesis testing, confidence intervals, bootstrapping, and robust optimization:

  • Sequential Hypothesis Testing: Improvement is accepted only if a lower confidence bound on the forecasted candidate performance exceeds an upper bound on the baseline's, at overall significance level $\alpha$, typically using a wild bootstrap to handle non-i.i.d. residuals (Chandak et al., 2020).
  • Student’s t-Test and High-Confidence Bounds: For offline RL, safe improvement of $\pi'$ over $\pi_b$ is certified via

$$L(\pi') = \bar{X} - t_{1-\delta,\,N-1}\,\frac{S}{\sqrt{N}},$$

where $L(\pi')$ is a high-confidence lower bound on performance, $\bar{X}$ and $S$ are the sample mean and standard deviation of the $N$ return estimates, and $t_{1-\delta,N-1}$ is the corresponding Student's $t$ quantile (Cohen et al., 2018); a minimal sketch of this bound appears after this list.

  • Multiple Testing Corrections: When evaluating many candidates, the family-wise error rate or false discovery rate is controlled via Benjamini–Hochberg or sup-$t$ procedures (Cho et al., 21 Aug 2024).
  • Bootstrapped or Soft Constraints: In Safe Policy Improvement with Baseline Bootstrapping (SPIBB) and Soft-SPIBB, policy update is constrained such that in uncertain (low-count) state-action pairs, the new policy exactly or softly mimics the baseline, with constraints scaling according to estimated local uncertainty (Nadjahi et al., 2019, Scholl et al., 2022).
  • Robust Regret Minimization: In model-based safe RL, the policy is improved only where model accuracy is high and falls back to baseline otherwise, minimizing worst-case regret over an uncertainty set determined by model error bounds (Petrik et al., 2016).
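
The t-test bound referenced in this list can be computed directly. The sketch below assumes an array of per-trajectory (e.g. importance-weighted) return estimates for the candidate policy; it is the bare bound only, not the full high-confidence off-policy evaluation pipeline of Cohen et al.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(returns, delta=0.05):
    """One-sided Student's t lower bound L(pi') = mean - t_{1-delta, N-1} * S / sqrt(N)
    on the expected return of a candidate policy, from N (approximately normal) estimates."""
    x = np.asarray(returns, dtype=float)
    n = x.size
    t_quantile = stats.t.ppf(1.0 - delta, df=n - 1)
    return x.mean() - t_quantile * x.std(ddof=1) / np.sqrt(n)

# The candidate would be deployed only if this lower bound exceeds the
# required performance level, e.g. an estimate of the baseline's return.
```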

3. Algorithmic Realizations

Safe improvement relations are enforced in algorithms via specific policy selection and update steps:

| Algorithm | Safe Improvement Criterion | Mechanism |
|---|---|---|
| SPIN (Chandak et al., 2020) | $\Pr[\Delta(\pi, \pi_b) \ge 0] \ge 1-\alpha$ | Hypothesis test / wild bootstrap over forecasted returns |
| DE (Cohen et al., 2018) | $L(\pi') \ge \rho_\ell$ with $1-\delta$ confidence | Multi-policy deployment, HCOPE / t-test |
| SPIBB (Nadjahi et al., 2019) | $V^{\pi} \ge V^{\pi_b} - O(\sqrt{\ln(|S||A|/\delta)}/N_\wedge)$ | Hard or soft bootstrapping, local constraints |
| DPRL (Sharma et al., 12 Oct 2024) | $\Pr[\rho(\pi^{DP})-\rho(\pi_b) \ge 0] \ge 1-\delta$ | Restrict policy search to well-visited state-action pairs |
| Robust Baseline Regret (Petrik et al., 2016) | $\min_{P \in U} [V^\pi(P) - V^{\pi_b}(P)] \ge 0$ | Worst-case regret optimization |
| CSPI–MT (Cho et al., 21 Aug 2024) | $\Pr(V(\pi(c)) < V(\pi(c_0))) \le \alpha$ | Sup-$t$ simultaneous confidence bands |

The algorithmic enforcement typically interleaves candidate search, high-confidence safety verification, and conditional policy deployment or retention of the baseline.
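
Abstracting over the specific algorithms in the table, the interleaving can be expressed as a generic deployment step. All callables below are hypothetical placeholders for algorithm-specific components (e.g. SPIBB-constrained optimization, HCOPE or bootstrap bounds), not APIs from any of the cited works.

```python
def safe_policy_update(pi_b, data, delta,
                       propose_candidate, confidence_lower_bound, baseline_performance):
    """Generic safe-improvement step: search for a candidate, verify a
    high-confidence guarantee against the baseline, and deploy the candidate
    only if the guarantee holds; otherwise retain the baseline policy."""
    pi_candidate = propose_candidate(pi_b, data)                       # candidate search
    lower_bound = confidence_lower_bound(pi_candidate, data, delta)    # safety verification
    if lower_bound >= baseline_performance(pi_b, data):                # conditional deployment
        return pi_candidate
    return pi_b                                                        # fall back to baseline
```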

4. Theoretical Guarantees and Bounds

The safe improvement relation is accompanied by non-asymptotic and asymptotic theorems, typically of the following form:

  • Asymptotic Safety: Under mixing/smoothness and support assumptions, the probability of deploying a policy with degraded performance approaches $\alpha$ as the number of episodes or data samples increases (Chandak et al., 2020).
  • Finite-Sample Bounds: Explicit formulas bound the acceptable performance drop (regret, $\zeta$, or $\epsilon$), scaling with the inverse square root of the minimum state-action visit count and with the logarithm of the number of states and actions and of $1/\delta$. For example,

$$V^{\pi}_{M^*}(x_0) \ge V^{\pi_b}_{M^*}(x_0) - \zeta,$$

with

$$\zeta = \frac{4 R_{\max}}{(1-\gamma)^2} \sqrt{ \frac{2 \ln\!\left(2\, |S|\, |A|\, 2^{|S|} / \delta\right) }{ N_\wedge } }$$

(Nadjahi et al., 2019); a numerical helper for this bound is given after this list. Transformations to two-successor MDPs exponentially reduce the sample complexity required to guarantee safe improvement (Wienhöft et al., 2023).

  • Data-Dependent Bounds: In algorithms such as DPRL, penalty terms in the safe improvement bound depend only on the number of observed high-count state-action pairs, not the global dimensions of the state or action space (Sharma et al., 12 Oct 2024).
  • Multiple Objective and Guardrail Extensions: SNPL generalizes the safe improvement relation to multiple outcomes, requiring all specified guardrails to avoid regression with high confidence (Cho et al., 17 Mar 2025).
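
The numerical helper referenced in the finite-sample bullet above evaluates the SPIBB-style $\zeta$ bound; the logarithm is expanded to avoid overflow from the $2^{|S|}$ term, and the example parameters are arbitrary.

```python
import numpy as np

def spibb_zeta(n_states, n_actions, n_min, gamma, r_max, delta):
    """zeta = 4*R_max/(1-gamma)^2 * sqrt(2*ln(2*|S|*|A|*2^|S|/delta) / N_min),
    with ln(2*|S|*|A|*2^|S|/delta) expanded in log-space."""
    log_term = np.log(2.0 * n_states * n_actions) + n_states * np.log(2.0) - np.log(delta)
    return 4.0 * r_max / (1.0 - gamma) ** 2 * np.sqrt(2.0 * log_term / n_min)

# Example (arbitrary numbers): a 25-state, 4-action MDP with at least 50 visits
# per state-action pair. The resulting zeta is large, illustrating why tighter
# (e.g. two-successor) analyses matter in practice.
print(spibb_zeta(n_states=25, n_actions=4, n_min=50, gamma=0.95, r_max=1.0, delta=0.05))
```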

5. Extensions to Non-Stationary, Constrained, and Multi-Agent Settings

Safe improvement relations extend to:

  • Non-Stationary MDPs: SPIN applies time-series analysis, forecasting, and wild-bootstrap sequential testing to certify safe improvement where the environment dynamics evolve smoothly over time (Chandak et al., 2020).
  • Constrained RL: Safe improvement is enforced for multiple safety cost functions simultaneously, with each policy update guaranteed not to increase violation rates for any constraint (Berducci et al., 2022).
  • Multi-Objective Experiments: SNPL provides safe multi-objective policy improvement, supporting non-regression on any number of “guardrails” while seeking improvement on a “goal” outcome (Cho et al., 17 Mar 2025); a simplified guardrail check is sketched after this list.
  • Mechanism Design: Binary constraint structures and outcome correspondences allow comparison of mechanisms ("games") under equilibrium ambiguity; a modified mechanism G' is a safe improvement over G if every outcome of G' Pareto-dominates that of G, under all consistent outcome correspondences (Oesterheld et al., 26 Nov 2025).
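
To make the guardrail requirement concrete, the sketch below (referenced in the SNPL item above) certifies non-regression on several guardrail outcomes by splitting the error budget Bonferroni-style across one-sided t-tests. It is a conservative stand-in for the simultaneous procedures in the cited work, and the data layout (per-unit candidate-minus-baseline differences for each guardrail) is an assumption.

```python
import numpy as np
from scipy import stats

def guardrails_hold(guardrail_diffs, delta=0.05):
    """Return True if, for every guardrail, a one-sided lower confidence bound on the
    mean difference (candidate minus baseline) is non-negative.

    guardrail_diffs: dict mapping guardrail name -> array of per-unit differences.
    The error budget delta is split equally across guardrails (Bonferroni)."""
    per_test_delta = delta / max(len(guardrail_diffs), 1)
    for name, diffs in guardrail_diffs.items():
        x = np.asarray(diffs, dtype=float)
        n = x.size
        lb = x.mean() - stats.t.ppf(1.0 - per_test_delta, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
        if lb < 0.0:   # cannot certify non-regression on this guardrail
            return False
    return True
```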

6. Empirical Validation and Applications

Safe improvement relations are validated empirically across domains:

  • Domain-Adaptive Control: In RET optimization, SPIBB policies strictly outperform baselines on network KPIs, with high-confidence safety even in low-data regimes (Vannella et al., 2020).
  • Exploration and Data Efficiency: Diverse Exploration accelerates policy optimization by simultaneously deploying multiple safe policies, increasing exploration entropy and convergence speed while maintaining safety (Cohen et al., 2018).
  • Sample Complexity: Two-successor and beta-bound SPI improvements dramatically decrease data requirements for safety, enabling faster practical convergence (Wienhöft et al., 2023).
  • Batch RL in Healthcare, Dialogue, and Advertising: Algorithms leveraging safe improvement relations deliver improvements in synthetic and real datasets (Atari, GridWorld, MIMIC-III, SMS personalization) under stringent safety criteria (Sharma et al., 12 Oct 2024, Cho et al., 17 Mar 2025, Ramachandran et al., 2021).

7. Limitations and Open Problems

While safe improvement relations are widely applicable, several limitations persist:

  • Sample Complexity: Achieving non-vacuous safe improvement guarantees in high-dimensional domains may require very large datasets or conservative constraints, potentially reducing practical improvement (Scholl et al., 2022).
  • Complexity of Verification: Checking universal safe improvement (e.g. Pareto-dominance for all satisfying assignments in mechanism design) is co-NP-complete under arbitrary outcome correspondences; completeness of inference rules is only ensured under special structure (e.g., max-closed binary constraint systems) (Oesterheld et al., 26 Nov 2025).
  • Assumption Sensitivity: Statistical guarantees depend critically on support and mixing assumptions; e.g., off-support policies cannot usually be certified as safely improving.
  • Trade-Offs: Tuning constraint thresholds, error budgets, and the conservativeness–improvement trade-off remains domain- and user-specific.

This area continues to evolve, with new methodologies seeking sharper finite-sample guarantees, reductions in sample complexity, and more generalizable mechanisms for enforcing and verifying safe improvement relations in sequential decision-making systems.
