Safety Optimal Transport: A Unified Safety Framework
- Safety Optimal Transport is a framework that aligns an agent’s behavior distribution with risk profiles using Wasserstein distances and optimal transport plans.
- It integrates with algorithms like Q-learning, SARSA, and robust Bellman backups to systematically penalize unsafe states and improve risk control.
- Empirical results show significant safety improvements, reducing the incidence of unsafe states by 30–50% in both reinforcement learning environments and large language model fine-tuning.
Safety Optimal Transport (SOT) is a principled framework that leverages optimal transport theory to enforce safety and robustness in learning systems by aligning the empirical distribution of agent behavior or data with a reference safety profile. SOT applies the geometry-aware properties of Wasserstein distances and transport plans to quantify and actively manage the distributional relationship between actual outcomes and desired safety constraints. The framework spans reinforcement learning, robust control, and safe fine-tuning of LLMs, providing unified mathematical mechanisms to shape the behavior of learning agents toward verified safe regions and away from harmful states or data modalities.
1. Mathematical Foundations and Core Principles
SOT formalizes safety as a distributional alignment problem between an agent's policy-induced state-visitation (or empirical data) distribution and a predefined risk or safety reference. Central to this approach is the Wasserstein distance

$$W_c(d^{\pi}, \rho) = \min_{\gamma \in \Pi(d^{\pi}, \rho)} \int c(s, s')\, \mathrm{d}\gamma(s, s'),$$

where $d^{\pi}$ is the stationary distribution induced by the agent's policy $\pi$, $\rho$ is a domain-expert-provided risk distribution, and $c$ is a ground cost (Shahrooei et al., 2024). SOT augments classic learning objectives by penalizing or constraining this distance, e.g.

$$\max_{\pi} \; J(\pi) + \lambda\, W_c(d^{\pi}, \rho),$$

which trades off task performance against geometric separation from risky regions. In actor-critic settings, entropic regularization (Sinkhorn divergences) is frequently used for computational tractability on large samples (Baheri, 2023).
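As a concrete illustration of the Wasserstein machinery, the following sketch computes an entropy-regularized OT plan between a visitation distribution and a risk profile on a small discrete state space. The distributions, ground cost, and regularization strength are illustrative assumptions, not values from the cited work.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn iterations) between discrete
    distributions a and b under ground-cost matrix C.
    Returns the transport plan and the associated transport cost."""
    K = np.exp(-C / reg)             # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns to match b
        u = a / (K @ v)              # scale rows to match a
    P = u[:, None] * K * v[None, :]  # transport plan
    return P, float(np.sum(P * C))

# Toy example: four states on a line; policy visitation vs. a risk profile.
states = np.array([0.0, 1.0, 2.0, 3.0])
d_pi = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical visitation distribution
rho  = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical expert risk profile
C = np.abs(states[:, None] - states[None, :])  # |s - s'| ground cost
P, cost = sinkhorn(d_pi, rho, C)
```

The resulting `cost` is the (entropic) transport cost that an SOT objective would penalize or reward; `P` says exactly where visitation mass would have to move to match the reference.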
2. Algorithms and Implementation
SOT instantiations span several algorithmic paradigms:
- Q-Learning and Temporal Difference Learning: SOT modifies policy updates by adding transition-level penalties computed from optimal transport plans between empirical and risk distributions. In Q-learning, the update becomes

  $$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r - \lambda\,\phi(s') + \gamma \max_{a'} Q(s',a') - Q(s,a)\big],$$

  where $\phi(s')$ reflects the OT transport cost assigned to each transition (Shahrooei et al., 2024).
- Safety-Guided SARSA: Discrete action-uncertainty penalties are constructed using entropy-regularized OT between the Q-value and target-value distributions, e.g. $U(s,a) = W_{\eta}\big(Q(s,\cdot),\, Q_{\text{target}}(s,\cdot)\big)$ with entropic regularization $\eta$. Actions are selected via an $\epsilon$-greedy policy over $Q(s,a) - \lambda\, U(s,a)$, biasing the agent toward predictable, low-risk outcomes (Shahrooei et al., 22 Feb 2025).
- Robust Bellman Backups: In deep RL, SOT replaces the nominal state-transition kernel with an OT uncertainty ball centered at the empirical kernel, yielding worst-case robust Bellman backups via adversarial perturbations in state-space, often computed in dual form for sample efficiency (Queeney et al., 2023).
- LLM Fine-Tuning with Push–Pull Distributional Alignment: SOT learns importance weights $w$ over dataset samples via dual-reference OT, pulling the empirical measure toward verified-safe anchors while repelling it from catalogs of harmful prompts:

  $$\min_{w} \; W_c(\mu_w, \mu_{\text{safe}}) - \beta\, W_c(\mu_w, \mu_{\text{harm}}), \qquad \mu_w = \textstyle\sum_i w_i\, \delta_{x_i}.$$

  The learned weights guide the selection and weighting of examples during fine-tuning, ensuring the downstream distribution sits within a "robust geometric safety boundary" (Wang et al., 12 Jan 2026).
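The Q-learning variant above admits a minimal tabular sketch. Here the per-transition penalty `phi` is a hypothetical stand-in for the OT-derived transport cost (in the cited work it would come from the optimal plan between visitation and risk measures); the state space, unsafe set, and constants are illustrative.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, lam = 0.1, 0.9, 1.0   # step size, discount, safety weight
Q = np.zeros((n_states, n_actions))

unsafe = {4}  # hypothetical unsafe state index

def phi(s_next):
    """Hypothetical OT-derived penalty: large when the transition
    moves probability mass toward the unsafe region."""
    return 1.0 if s_next in unsafe else 0.0

def ot_q_update(s, a, r, s_next):
    """One SOT-augmented Q-learning step: the usual TD target
    minus the OT transport-cost penalty for this transition."""
    target = r - lam * phi(s_next) + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

ot_q_update(0, 1, 1.0, 4)  # transition into the unsafe state is penalized
ot_q_update(0, 0, 1.0, 2)  # safe transition keeps its full reward
```

After these two updates the safe action's value exceeds the unsafe one's, even though both transitions earned the same reward.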
3. Theoretical Guarantees and Safety Properties
SOT delivers several provable safety benefits:
- Risk Reduction: For a sufficiently large penalty coefficient $\lambda$ (or a sufficiently tight constraint budget), OT-based regularization or constraints provably reduce the steady-state probability of unsafe-state visits (Baheri, 2023; Shahrooei et al., 22 Feb 2025). This is formalized via contraction proofs and stationary-flow arguments, ensuring that probability mass concentrates in safe regions and away from hazards.
- Robustness to Model Uncertainty: The OT-based uncertainty sets employed in robust RL ensure that any feasible policy satisfies safety constraints under worst-case perturbations of the transition kernel, with performance degradation bounded as a Lipschitz function of the OT ball radius (Queeney et al., 2023).
- Pareto Trade-off: SOT achieves a quantifiable trade-off between reward optimality and risk aversion: higher regularization coefficients $\lambda$ yield policies whose visitation distributions sit geometrically farther from the risk profile, at possible cost to raw return, formalized via monotonic risk-aversion theorems (Baheri, 2023).
- Convergence: In tabular RL, SOT-augmented TD updates remain $\gamma$-contractive and possess unique fixed points, ensuring almost-sure convergence under standard step-size and exploration schedules (Shahrooei et al., 2024; Shahrooei et al., 22 Feb 2025).
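The Pareto trade-off can be illustrated with a deliberately tiny example; the returns and OT distances below are made-up numbers whose only purpose is to show how the preferred policy flips as the safety weight grows.

```python
# Two candidate policies: one high-reward but close (in OT distance)
# to the risk profile, one lower-reward but geometrically far from it.
# All numbers are hypothetical.
policies = {
    "risky": {"return": 10.0, "ot_distance": 0.5},  # near the risk region
    "safe":  {"return": 6.0,  "ot_distance": 3.0},  # far from the risk region
}

def best_policy(lam):
    """Maximize return + lam * (OT distance from the risk distribution)."""
    return max(policies,
               key=lambda p: policies[p]["return"] + lam * policies[p]["ot_distance"])

assert best_policy(0.0) == "risky"  # pure reward maximization
assert best_policy(2.0) == "safe"   # safety-regularized objective
```

Sweeping `lam` traces out the Pareto frontier between raw return and distributional safety.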
4. Empirical Performance Across Domains
SOT methods demonstrate substantial empirical gains:
Reinforcement Learning:
- In standard benchmark environments such as Gridworld, cliff walking, Cartpole, and continuous control, SOT methods yield higher asymptotic returns and substantial reductions in unsafe state visitation (e.g., 30–50% fewer failures or collisions vs. baselines) (Shahrooei et al., 2024; Shahrooei et al., 22 Feb 2025; Queeney et al., 2023).
- Robust SOT RL satisfies safety in 87% of perturbed test cases (vs. 51% standard RL), with mild or no compromise in average reward (Queeney et al., 2023).
LLM Fine-Tuning:
- SOT-based push–pull reweighting achieves the lowest Harmfulness Scores and competitive Helpfulness/Accuracy across four downstream tasks and three model families, e.g., reducing harmful response rates from 0.426 (standard SFT) to 0.197 on SLIMORCA (Wang et al., 12 Jan 2026).
5. Extensions, Scalability, and Limitations
SOT is adaptable and extensible:
- Continuous Spaces: Wasserstein distances and OT plans admit closed forms in special cases (e.g., Gaussians and Gaussian mixtures), facilitating action-wise penalties in RL (Shahrooei et al., 22 Feb 2025).
- Safe Multi-Agent Coordination: SOT can penalize joint occupancy against multi-agent risk distributions (Baheri, 2023).
- Large-Scale Data: Sinkhorn iterations and sub-sampling keep the computation tractable on the large datasets used in LLM fine-tuning (Wang et al., 12 Jan 2026).
- Generic Safety Constraints: By augmenting OT ground costs to reflect constraint violations or safety budgets, SOT supports broad classes of constrained MDPs (Queeney et al., 2023).
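For the Gaussian special case, the squared 2-Wasserstein distance has the standard closed form $W_2^2 = \lVert m_1 - m_2 \rVert^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$, sketched below in pure NumPy; the helper names are ours.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(m1, S1, m2, S2):
    """Closed-form squared 2-Wasserstein distance between
    N(m1, S1) and N(m2, S2)."""
    S2h = psd_sqrt(S2)
    cross = psd_sqrt(S2h @ S1 @ S2h)
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

# 1-D sanity check: unit-variance Gaussians one unit apart give W2^2 = 1.
m1, S1 = np.array([0.0]), np.array([[1.0]])
m2, S2 = np.array([1.0]), np.array([[1.0]])
assert abs(w2_gaussian(m1, S1, m2, S2) - 1.0) < 1e-8
```

Because this avoids solving a linear program per evaluation, it is cheap enough to use inside a per-action penalty in continuous-control settings.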
Limitations include dependence on accurate risk profiles, the computational cost of repeated OT solves, vulnerability under severe partial observability, and reliance on reference datasets in LLM alignment. Transport plans also offer little per-instance interpretability, which remains an open problem for transparency.
6. Geometric Interpretation and Conceptual Significance
SOT’s foundational innovation is geometric: by quantifying how much mass must be moved, and where, for alignment with safety requirements, SOT creates safety-aware learning mechanisms superior to per-instance heuristics or appearance-based regularization. The "push–pull" instance for LLMs formalizes purification of downstream distributions via simultaneous attraction to verified-safe anchors and repulsion from known hazardous clouds, establishing robust safety boundaries in embedding space (Wang et al., 12 Jan 2026).
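A minimal caricature of the push–pull idea: samples near a verified-safe anchor are up-weighted, samples near a known-harmful anchor are down-weighted. The anchor positions, softmax weighting, and temperature `beta` are illustrative assumptions, not the procedure from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
safe_anchor = np.array([0.0, 0.0])   # hypothetical verified-safe anchor
harm_anchor = np.array([5.0, 5.0])   # hypothetical harmful-prompt anchor
# Mixed 2-D "embedding" cloud: one cluster near each anchor.
X = rng.normal(size=(100, 2)) + rng.choice([0.0, 5.0], size=(100, 1))

def push_pull_weights(X, beta=1.0):
    """Softmax importance weights favoring points that are near the
    safe anchor (pull) and far from the harmful anchor (push)."""
    d_safe = np.linalg.norm(X - safe_anchor, axis=1)  # pull term
    d_harm = np.linalg.norm(X - harm_anchor, axis=1)  # push term
    logits = beta * (d_harm - d_safe)
    w = np.exp(logits - logits.max())                 # stable softmax
    return w / w.sum()

w = push_pull_weights(X)
near_safe = np.linalg.norm(X - safe_anchor, axis=1) < 2.5
assert w[near_safe].sum() > w[~near_safe].sum()  # mass shifts toward safety
```

Fine-tuning with these weights would concentrate the effective training distribution inside the safe region of embedding space, which is the "robust geometric safety boundary" intuition in miniature.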
Across RL and supervised settings, this framework enables explicit control of the distributional position of the agent or model relative to safety-defined regions, yielding policies and models whose long-run behavior can be certified and tuned for diverse domains such as robotics, autonomous vehicles, healthcare, and language safety.
7. Outlook and Research Directions
Current SOT research explores:
- Online or amortized OT computation for high-dimensional or continuous environments.
- Integration with actor-critic and policy-gradient architectures via Kantorovich duals and automatic differentiation (Baheri, 2023).
- Extensions to multi-modal safe references, dynamic regularization schedules, and incorporation of adversarial attacks during training (Wang et al., 12 Jan 2026).
- Applications in domains with known or learnable risk maps, including safe offline RL, coordinated multi-agent safety, and robust LLM alignment.
A plausible implication is that SOT provides a unifying geometric perspective for safety alignment, distributional robustness, and risk-aware learning, with ongoing work required to scale interpretability and reference set reliability for mission-critical deployments.