Learnable Constraints for Long-Term Safety

Updated 4 March 2026
  • Learnable constraints offer data-driven, parameterized safety specifications derived from expert demonstrations, language directives, and on-policy data to guide agent behavior.
  • They integrate into reinforcement learning via constrained policy optimization, projection methods, and barrier certificates to maintain safety over extended horizons.
  • Empirical validations in simulations and autonomous driving show that these methods can achieve expert-level safety while retaining robust task performance.

Learnable constraints for long-term safety are data-driven, parameterized safety specifications synthesized and enforced within reinforcement learning (RL) or control frameworks to guarantee or maintain agent behavior within acceptable risk boundaries over extended horizons, even in the face of unknown system dynamics, nontrivial temporal specifications, or evolving environments. The central idea is to infer, represent, and deploy safety constraints from human demonstrations, on-policy data, high-level domain knowledge (e.g., logic or language), or direct supervisory signals—enabling agents to avoid unsafe actions without explicit, hand-crafted safety models.

1. Principles and Problem Formulation

Long-term safety requirements in RL and control are typically formalized as constraints on the cumulative cost or risk incurred by an agent interacting with a stochastic environment. In the standard constrained Markov decision process (CMDP) framework:

  • State space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P(s'|s,a)$, reward $r(s,a)$, discount factor $\gamma\in(0,1)$.
  • Cost/constraint function $c^*(s,a)$ (possibly unknown or temporally extended).
  • Safety budget $d$.

The learning objective is to synthesize a controller or policy $\pi$ that maximizes expected reward subject to long-term safety, e.g.,

$$\max_\pi\ \mathbb{E}_\pi\!\left[\sum_t \gamma^t r(s_t,a_t)\right]\ \text{such that}\ \mathbb{E}_\pi\!\left[\sum_t \gamma^t c^*(s_t,a_t)\right] \leq d$$

The key challenge is that the true $c^*$ is often unavailable or difficult to specify. Learnable constraint frameworks seek to recover $c^*$ (or a suitable surrogate) from limited data or side information, and to integrate the learned constraint into the policy optimization loop to guarantee long-term safety (Baert et al., 2023, Papadopoulos et al., 27 Feb 2026).
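
To make the budgeted objective concrete, here is a minimal Python sketch (all names are illustrative, and the per-step cost sequences are assumed to come from the agent's own rollouts) that Monte-Carlo-estimates the discounted cumulative cost and checks it against the budget $d$:

```python
import numpy as np

def discounted_cost(costs, gamma=0.99):
    """Discounted cumulative cost of one rollout: sum_t gamma^t * c_t."""
    costs = np.asarray(costs, dtype=float)
    return float(np.sum(costs * gamma ** np.arange(len(costs))))

def satisfies_budget(cost_rollouts, d, gamma=0.99):
    """Monte-Carlo estimate of E_pi[sum_t gamma^t c(s_t, a_t)] vs. budget d."""
    estimate = np.mean([discounted_cost(c, gamma) for c in cost_rollouts])
    return estimate <= d, estimate

# Toy check: per-step costs of two rollouts against a budget d = 1.0.
ok, est = satisfies_budget([[0, 0, 1, 0], [0, 1, 0, 0]], d=1.0)
print(ok, round(est, 3))
```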

2. Data-Driven Learning of Safety Constraints

Approaches to learnable constraints differ primarily in source of supervision, representational form, and integration mechanism.

a. Learning from Demonstrations

Given safe expert demonstrations $D = \{\tau_i\}$—trajectories generated by an (unknown) safe policy $\pi_E$—the goal is to estimate a constraint set $C \subseteq \mathcal{S}\times\mathcal{A}$ such that all demonstrated (state, action) pairs are feasible. Techniques include:

  • One-class classifiers: One-class decision trees or SVMs trained on features $\phi(s,a)$ from demonstrations delineate the “safe set” in feature space as a union of rectangular leaves; the logical DNF for the safe set is explicitly extracted and reused as a symbolic safety rule (Baert et al., 2023). A minimal sketch follows this list.
  • Density estimation and boundary extraction: Recovery of compact constraint boundaries via kernel density or maximum-entropy techniques.
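
The following is a hedged sketch of the one-class idea, using a one-class SVM as an illustrative stand-in (the cited work uses one-class decision trees; the feature data and all names here are synthetic assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
demo_features = rng.normal(0.0, 1.0, size=(500, 4))  # stand-in phi(s, a) from safe demos

# nu bounds the fraction of demonstrations treated as outliers.
safe_set = OneClassSVM(kernel="rbf", nu=0.05).fit(demo_features)

def is_safe(phi_sa):
    """+1 means inside the learned safe region, -1 means flagged unsafe."""
    return safe_set.predict(phi_sa.reshape(1, -1))[0] == 1

print(is_safe(np.zeros(4)))      # near the demo distribution: likely safe
print(is_safe(np.full(4, 8.0)))  # far out of support: likely unsafe
```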

b. Adversarial and Multi-task IRL Extensions

The inverse constraint learning paradigm adversarially seeks the “tightest” constraint consistent with high reward (or minimal cost) behavior: “forbid everything the expert could have done but did not,” subject to not being overly conservative—a problem strongly mitigated by pooling demonstrations from different tasks to tighten the feasible set (Kim et al., 2023).
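
A toy sketch of this “tightest consistent constraint” idea on a discrete state set (all data is synthetic and the threshold is an illustrative assumption): flag as constrained every state a reward-greedy agent would plausibly visit that no pooled expert demonstration visits.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 10
reward = rng.uniform(size=n_states)

# Expert demonstrations pooled across two tasks; pooling shrinks the set of
# "plausible but never demonstrated" states and so mitigates over-conservatism.
expert_visits_task1 = {0, 1, 2, 3}
expert_visits_task2 = {2, 3, 4, 5}
expert_support = expert_visits_task1 | expert_visits_task2

# States attractive enough that an unconstrained, reward-greedy agent would
# plausibly visit them.
agent_candidates = {s for s in range(n_states) if reward[s] > 0.3}

# "Forbid everything the expert could have done but did not."
learned_constraint = agent_candidates - expert_support
print("inferred unsafe states:", sorted(learned_constraint))
```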

c. Temporal Logic and Non-Markovian Constraint Learning

Many safety properties are naturally non-Markovian (e.g., “never visit the unsafe set within any 10-step window”) and are expressed as temporal logic or automata over trajectories (Yifru et al., 2023, Quint et al., 2019, Low et al., 2024). These frameworks either compile a given specification into a monitor composed with the environment, or learn the temporal structure itself from data; in both cases the monitor’s internal state augments the MDP state to restore the Markov property.
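
A minimal sketch of such a monitor for the windowed property above, assuming the agent can test unsafe-set membership at each step (class and parameter names are illustrative):

```python
from collections import deque

class WindowedSafetyMonitor:
    """Tracks 'at most max_visits unsafe-set visits in any `window` steps'."""

    def __init__(self, window=10, max_visits=1):
        self.max_visits = max_visits
        self.history = deque(maxlen=window)  # 1 = unsafe visit, 0 = safe

    def step(self, in_unsafe_set: bool) -> bool:
        """Record one transition; return True while the property still holds."""
        self.history.append(int(in_unsafe_set))
        return sum(self.history) <= self.max_visits

monitor = WindowedSafetyMonitor(window=10, max_visits=1)
print(monitor.step(True))   # True: one visit within the window is tolerated
print(monitor.step(True))   # False: a second visit within 10 steps violates
```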

d. Learning from Language or High-level Descriptions

Constraints can be derived from free-form natural language (NL) directives or human-in-the-loop feedback by mapping text instructions or rules to cost surrogates using pre-trained LLMs, neural mapping architectures, and auxiliary contrastive losses (Chua et al., 4 Apr 2025, Lou et al., 2024). Models are trained to predict per-state or per-trajectory violation probabilities, plugging these into the RL cost channel.
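
A heavily hedged sketch of this mapping: a toy bag-of-words “encoder” stands in for a pre-trained LLM or text encoder, and a small head predicts a per-state violation probability for the RL cost channel (all names, shapes, and the training signal are assumptions):

```python
import torch
import torch.nn as nn

def toy_embed(text: str, dim: int = 64) -> torch.Tensor:
    """Toy bag-of-words embedding standing in for a pre-trained text encoder."""
    vec = torch.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

class ViolationHead(nn.Module):
    """Maps (rule embedding, state embedding) to a violation probability."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, rule_emb, state_emb):
        return self.net(torch.cat([rule_emb, state_emb], dim=-1))

head = ViolationHead()  # would be trained with BCE/contrastive losses on labels
p = head(toy_embed("never enter the wet floor area"),
         toy_embed("agent at tile (3, 4); floor is wet"))
print(float(p))  # plugged into the RL cost channel as a per-state cost
```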

3. Algorithmic Integration with Policy Optimization

Once a constraint surrogate $\hat{C}(s,a)$ is learned, an effective enforcement mechanism is required. Major strategies include:

  • Constrained policy optimization (Lagrangian or trust-region): Transform learned constraints into cost surrogates $c_{\phi}(s,a)\in[0,1]$ and proceed with constrained policy gradients, primal-dual actor-critic, or trust-region policy optimization (e.g., PPO-Lagrangian, TRPO stabilization) (Baert et al., 2023, Chua et al., 4 Apr 2025, Yoo et al., 30 Jan 2025); a minimal sketch of the primal-dual mechanism follows this list.
  • Projection and masking: Mask unsafe actions by restricting $\pi(\cdot|s)$ to those $a$ satisfying $\hat{C}(s,a)\leq 0$ (or $Q^{\text{safe}}(s,a)\leq \epsilon$ in safety-critic approaches), effectively projecting exploration into the safe region (Srinivasan et al., 2020).
  • Barrier and safety measures: Use value-function-based barrier certificates (e.g., control barrier functions (CBFs), LDCBFs, state-action CBFs) learned from data to construct safety filters or QP-projected action selectors (Ohnishi et al., 2019, He et al., 21 May 2025). Recursive feasibility and error-to-state safety bounds are established under approximation error.
  • Dual constraint tracking: Simultaneously enforce long-term (expected cost) and short-term (e.g., learned state classifier) safety through multiple Lagrange multipliers and deep validation networks (Hu et al., 2024).
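
The following is a minimal sketch of the Lagrangian mechanism behind PPO-Lagrangian-style methods (names, step sizes, and the toy numbers are illustrative, not any cited implementation): the policy ascends reward minus $\lambda$ times the learned cost, while a dual ascent step grows $\lambda$ when the estimated discounted cost exceeds the budget $d$ and shrinks it otherwise.

```python
import torch

# lambda >= 0 is guaranteed by parameterizing it as exp(log_lambda).
log_lambda = torch.zeros(1, requires_grad=True)
dual_optim = torch.optim.Adam([log_lambda], lr=1e-2)

def lagrangian_policy_loss(reward_adv, cost_adv):
    """Surrogate the policy ascends: reward advantage minus weighted cost."""
    lam = log_lambda.exp().detach()  # lambda is held fixed in the primal step
    return -(reward_adv - lam * cost_adv).mean()

def dual_update(estimated_cost_return, d):
    """Dual ascent: increase lambda iff the safety budget is violated."""
    dual_optim.zero_grad()
    dual_loss = -log_lambda.exp() * (estimated_cost_return - d)
    dual_loss.backward()
    dual_optim.step()

dual_update(estimated_cost_return=1.4, d=1.0)  # over budget: lambda grows
print(float(log_lambda.exp()))
```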

4. Representational Forms and Interpretability

Learned constraints may be encoded in multiple forms:

  • Logical formulas: Explicit DNF rules or linear thresholds over features, offering interpretability and transparent transfer to other agents or domains (Baert et al., 2023).
  • Neural surrogates: Deep networks outputting per-state, per-action, or per-trajectory cost or violation probabilities—flexibly express complex or non-Markovian dependencies but may be less transparent (Chua et al., 4 Apr 2025, Low et al., 2024).
  • Automata and formal languages: Safety automata, STL, or DFA formulations, supporting rigorous specifications and efficient runtime checks (Quint et al., 2019, Yifru et al., 2023).
  • Distributional constraint models: A posterior over safety-indicator random variables (e.g., Beta distributions produced by neural nets) enables risk sensitivity (via CVaR) and adaptation to a changing risk appetite (Yoo et al., 30 Jan 2025); a sampling-based CVaR check is sketched after this list.
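
A hedged sketch of the distributional check: a network is assumed to output Beta$(\alpha,\beta)$ parameters over a per-state violation probability, and CVaR then scores the tail risk, so a more risk-averse operator simply raises the risk level (the function name and all numbers are illustrative).

```python
import numpy as np

def cvar_of_beta(a, b, alpha_risk=0.9, n_samples=100_000, seed=0):
    """CVaR at level alpha_risk: mean of the worst (1 - alpha_risk) tail."""
    samples = np.random.default_rng(seed).beta(a, b, size=n_samples)
    var = np.quantile(samples, alpha_risk)  # value-at-risk cutoff
    return float(samples[samples >= var].mean())

# Two heads with the same mean violation probability (0.1) but different
# uncertainty; the diffuse posterior carries much heavier tail risk.
print(cvar_of_beta(2.0, 18.0))   # confident Beta(2, 18)
print(cvar_of_beta(0.4, 3.6))    # diffuse Beta(0.4, 3.6)
```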

A central practical advantage of learning constraints from data is improved interpretability and the opportunity for cross-task/cross-agent transfer, as observed in experiments with rule transfer and constraint reusability in simulation and driving tasks (Baert et al., 2023, Kim et al., 2023).

5. Guarantees and Empirical Validation

Multiple frameworks provide both theoretical and empirical assurances of constraint satisfaction and long-term safety:

  • Rigorous safety guarantees: Under realizability and sufficient demonstration coverage, the learned constraint set $C$ can ensure positive invariance of the viable set and thus (with high probability) avoidance of failure states indefinitely (Massiani et al., 2021, Heim et al., 2019). PAC-style sample complexity and regret bounds characterize the learning requirements.
  • Empirical results: Across MuJoCo, Safety Gym, autonomous driving, grid-world, and natural-language navigation benchmarks, learned-constraint RL agents consistently achieve near-expert safety, with low or near-zero violation rates, while retaining robust task performance.
  • Bounding out-of-support behavior: Some frameworks (e.g., Safe QIL) specifically ensure that out-of-support state-action value functions remain pessimistic, bounding the value of novel, potentially unsafe behaviors (Papadopoulos et al., 27 Feb 2026).
  • Handling non-Markovian and long-horizon safety needs: Temporal-logic-based and non-Markovian safety models enable the enforcement of durability properties (“never fail within $k$ steps” or “always avoid a pattern”), with explicit guarantees or empirical control of constraint violation rates (Low et al., 2024, Yifru et al., 2023).

6. Limitations, Adaptivity, and Open Challenges

Learnable constraint methods, while powerful, are subject to several caveats:

  • Reliance on demonstration quality and coverage: Poorly representative demos bias the feasible set, potentially either excluding safe behaviors or overestimating risk. Mitigating conservatism requires multi-task demonstration pooling, active learning, or explicit exploration of constraint boundaries (Kim et al., 2023, Massiani et al., 2021).
  • Constraint miscalibration and model error: Errors in learned safety surrogates may render safety assurance probabilistic rather than absolute. Recent work introduces error-to-state feasibility analyses and explicit set tightening to formally account for approximation error (He et al., 21 May 2025, Chekan et al., 2022).
  • Non-convexity and scalability: High-dimensional or non-convex constraint sets (e.g., neural surrogates) may pose significant computational challenges for real-time projection or filtering.
  • Language and logic mapping limitations: Natural language-to-cost surrogates are limited by model pre-training and quality of the NL-to-cost mapping; difficult, compositional rules can be missed (Chua et al., 4 Apr 2025, Lou et al., 2024).

Adaptive and meta-learning approaches address shifts in constraint requirements and environment dynamics, providing initialization or rapid adaptation mechanisms for changing safety specifications (Cho et al., 2023, Günster et al., 2024, Yoo et al., 30 Jan 2025).

7. Future Directions and Outlook

Ongoing research aims to:

  • Scale learnable constraints to highly complex, stochastic, and partially observed domains.
  • Integrate active data acquisition, counterexample-driven refinement, and Bayesian risk budgeting to reduce conservatism while meeting strict safety demands.
  • Hybridize symbolic and data-driven constraint learning, combining interpretable high-level logic with flexible neural surrogates.
  • Bridge offline training and real-world online enforcement, ensuring coverage for rare but catastrophic risks.

Collectively, these advances position learnable constraints as a foundational tool for real-world safe RL and control, enabling agents to achieve durable safety guarantees from partial knowledge and dynamic environments (Baert et al., 2023, Papadopoulos et al., 27 Feb 2026, Yoo et al., 30 Jan 2025, Günster et al., 2024, Massiani et al., 2021, Kim et al., 2023, Srinivasan et al., 2020).
