
Continuous-Time Robust Reinforcement Learning

Updated 14 October 2025
  • Continuous-Time Robust Reinforcement Learning is a framework that models system dynamics via differential equations and probability measures to manage uncertainty.
  • It integrates dynamic programming through the HJB equation with Bayesian updates to balance exploration and exploitation in control tasks.
  • The framework offers theoretical guarantees including continuity of the value function, existence of optimal controls, and a differential inclusion for belief evolution.

A continuous-time robust reinforcement learning (RL) framework is a mathematical and algorithmic paradigm for sequential decision-making under uncertainty, in which the dynamics of a system, the transition structure, or the cost function are modeled directly in continuous time, and robustness to various forms of uncertainty—such as model misspecification, noise, or adversarial perturbations—is achieved through principled approaches including Bayesian learning, distributional analysis, and dynamic programming. In contrast to discrete-time RL, continuous-time frameworks use differential equations (notably the Hamilton–Jacobi–Bellman (HJB) equation) and time-evolving probability measures to account for both control and learning in environments where infinitesimal evolutions and information flow matter. The resulting theory and algorithms form the basis for robust learning, safe exploration, and control in fields such as autonomous systems, robotics, finance, and model-based science.

1. Uncertainty Modeling in Continuous-Time State Dynamics

Continuous-time robust RL frameworks typically begin by representing system uncertainty as a time-varying probability measure over possible dynamics functions. Let $f:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ denote the (unknown) system drift. The agent's knowledge about $f$ at time $t$ is captured by a Radon probability measure $\pi(t)$ on a compact set $X$ of admissible dynamics. For any subset $E \subset X$, $\pi(t)(E)$ gives the agent's belief that $f \in E$.

A crucial local learning assumption is imposed: in any neighborhood $B(\tilde{x})$ of the current state $\tilde{x}$, $\pi(t)$ concentrates all its mass on the true dynamics $\tilde{f}$, i.e.,

$$\pi(t)\big(\{f : f(x,u) = \tilde{f}(x,u)\ \forall x\in B(\tilde{x}),\ \forall u \in U\}\big) = 1,$$

ensuring perfect identification of $f$ in regions already visited. This formulation generalizes classical adaptive control and Bayesian RL, supporting learning and adaptation as $\pi(t)$ is refined from interaction data over time.
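To make the measure-over-dynamics picture concrete, the following Python sketch represents $\pi(t)$ as a discrete distribution over a small, hypothetical family of candidate drifts, and mimics the local full-knowledge assumption by discarding candidates that disagree with the drift observed at a visited state. The candidate models, function names, and update rule are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

# Hypothetical finite family of candidate drifts f(x, u). The paper's belief
# pi(t) is a Radon measure over a compact set X of dynamics; here it is
# approximated by a discrete distribution over three illustrative candidates.
def f1(x, u): return -1.0 * x + u
def f2(x, u): return -0.5 * x + u
def f3(x, u): return -2.0 * x + u

candidates = [f1, f2, f3]
belief = np.full(len(candidates), 1.0 / len(candidates))  # diffuse prior pi(0)

def concentrate_belief(belief, x_tilde, u, observed_drift, tol=1e-6):
    """Crude stand-in for the local full-knowledge assumption: keep mass only
    on candidates whose drift matches the one observed at the visited state,
    then renormalize."""
    consistent = np.array(
        [abs(f(x_tilde, u) - observed_drift) < tol for f in candidates],
        dtype=float)
    updated = belief * consistent
    return updated / updated.sum()

# Example: the drift observed at x = 1.0 under u = 0.0 equals f2's prediction,
# so the posterior concentrates on f2.
belief = concentrate_belief(belief, x_tilde=1.0, u=0.0, observed_drift=-0.5)
print(belief)  # -> [0. 1. 0.]
```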

2. Continuous-Time Optimal Control Formulation and Value Functions

The agent’s objective is to minimize the infinite-horizon discounted cost:

$$\min_{u} \int_0^\infty e^{-\lambda s} J(x(s), u(s))\,ds,$$

where $J(\cdot,\cdot)$ is the running cost and $\lambda > 0$ the discount rate. Given model uncertainty, the cost-to-go is extended to

$$W(t,s,x) = \inf_{u\in\mathcal{U}} \int_s^\infty \int_X e^{-\lambda \tau} J\big(x^f(\tau), u(\tau)\big)\,\pi(t)(df)\,d\tau,$$

which averages over all candidate dynamics. The canonical value function becomes

$$V(t,x) := W(t,0,x).$$

This architecture unifies the cost of control and the epistemic uncertainty, naturally generalizing the value functions of both adaptive control and Bayesian RL.
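As a rough numerical reading of $W$, the sketch below averages a discounted running cost over a discrete set of candidate dynamics weighted by the current belief, using simple Euler rollouts truncated at a finite horizon. The candidate drifts, quadratic cost, and fixed feedback law are assumptions made for illustration; the paper's $W$ integrates over a general measure $\pi(t)$, an infinite horizon, and an infimum over controls.

```python
import numpy as np

def rollout_cost(f, x0, policy, J, lam=0.1, dt=0.01, horizon=50.0):
    """Discounted running cost along an Euler rollout of dx/dt = f(x, u),
    truncated at a finite horizon as a proxy for the infinite integral."""
    x, total, t = x0, 0.0, 0.0
    while t < horizon:
        u = policy(x)
        total += np.exp(-lam * t) * J(x, u) * dt
        x = x + f(x, u) * dt
        t += dt
    return total

def expected_cost(candidates, belief, x0, policy, J):
    """Belief-averaged cost: a discrete stand-in for integrating the cost
    under each candidate f against pi(t)(df)."""
    return sum(w * rollout_cost(f, x0, policy, J)
               for f, w in zip(candidates, belief))

# Illustrative candidates, belief, cost, and feedback law (assumptions):
candidates = [lambda x, u: -1.0 * x + u, lambda x, u: -0.5 * x + u]
belief = [0.5, 0.5]
J = lambda x, u: x**2 + 0.1 * u**2
policy = lambda x: -0.5 * x
print(expected_cost(candidates, belief, 1.0, policy, J))
```

Replacing the fixed feedback law with a minimization over controls, and the two-candidate list with a richer family, would move this toward the infimum and the general measure in the definition of $W$.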

3. Dynamic Programming Principle and HJB Equation under Uncertainty

The core recursive structure is formalized by a dynamic programming principle (DPP), leveraging local full-knowledge regions:

$$V(t, \tilde{x}) = \inf_{u\in\mathcal{U}} \left\{ \int_0^h e^{-\lambda s} J\big(x^{(\tilde{f})}(s), u(s)\big)\,ds + e^{-\lambda h}\, V\left(t, x^{(\tilde{f})}(h,u)\right) \right\},$$

where $x^{(\tilde{f})}(h,u)$ is the state reached after time $h$, starting from $\tilde{x}$ with control $u$, under the (locally known) true dynamics $\tilde{f}$.

The associated HJB equation, using viscosity solution theory, is:

$$\lambda V(t, y) = \inf_{u\in U} \Big\{ J(y,u) + \nabla V(t,y) \cdot \tilde{f}(y,u) \Big\},$$

which, despite the global model uncertainty, is locally written in terms of the now-known $\tilde{f}$. This equation is pivotal: it merges the logic of feedback control with learning and adaptation of unknown system dynamics.
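A common way to exercise the DPP/HJB pair numerically is a semi-Lagrangian value iteration: apply the one-step DPP on a state grid using the locally identified drift. The sketch below does this for a one-dimensional, hypothetical $\tilde{f}$ and quadratic cost; it is a generic discretization chosen for illustration, not the paper's method.

```python
import numpy as np

# Semi-Lagrangian value iteration for the HJB above, on a 1-D grid, assuming
# the locally identified drift f_tilde is available. The grid, drift, and
# cost below are illustrative assumptions.
lam, h = 0.1, 0.05                       # discount rate and DPP step size
xs = np.linspace(-2.0, 2.0, 81)          # state grid
us = np.linspace(-1.0, 1.0, 21)          # control grid
f_tilde = lambda x, u: -x + u            # stand-in for the true local drift
J = lambda x, u: x**2 + 0.1 * u**2       # running cost

V = np.zeros_like(xs)
for _ in range(500):
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        # One-step DPP: running cost over [0, h] plus the discounted value at
        # the state reached under f_tilde, minimized over the control grid.
        x_next = x + h * f_tilde(x, us)              # vectorized over controls
        V_next = np.interp(x_next, xs, V)            # linear interpolation
        V_new[i] = np.min(h * J(x, us) + np.exp(-lam * h) * V_next)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new
```

Under standard monotonicity and consistency conditions, discretizations of this type converge to the viscosity solution characterized by the HJB equation above.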

4. Exploration–Exploitation Trade-off and Intrinsic Exploration

The time-evolving measure $\pi(t)$ operationalizes the exploration–exploitation trade-off:

  • Exploitation: In already-visited regions where $\pi(t)$ is concentrated, the agent acts optimally under the (locally identified) dynamics $\tilde{f}$.
  • Exploration: In unvisited or uncertain regions, $\pi(t)$ remains diffuse, so the expected cost integrates over multiple plausible $f$. The agent is incentivized to select control actions that traverse such regions, gathering information to refine $\pi(t)$ and thus reduce future uncertainty (a toy numerical illustration follows below).

Unlike ad hoc exploration bonuses or randomization, this mechanism intrinsically emerges from the probabilistic model of system dynamics and the DPP.
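To illustrate the exploration mechanism numerically, the toy comparison below evaluates the same open-loop control under a diffuse belief (unvisited region) and a concentrated belief (visited region). The stable/unstable candidate drifts, the cost, and the belief weights are assumptions in the spirit of the earlier sketches.

```python
import numpy as np

def cost_under(f, x0, u_seq, J, lam=0.1, dt=0.01):
    """Discounted cost of an open-loop control sequence under dynamics f."""
    x, total = x0, 0.0
    for k, u in enumerate(u_seq):
        total += np.exp(-lam * k * dt) * J(x, u) * dt
        x = x + f(x, u) * dt
    return total

# Illustrative candidates: a stable and an unstable drift (assumptions).
candidates = [lambda x, u: -x + u, lambda x, u: x + u]
J = lambda x, u: x**2 + 0.1 * u**2
u_seq = [0.0] * 200

beliefs = {"diffuse (unvisited region)": [0.5, 0.5],
           "concentrated (visited region)": [1.0, 0.0]}
for name, w in beliefs.items():
    avg = sum(wi * cost_under(f, 1.0, u_seq, J)
              for f, wi in zip(candidates, w))
    print(name, avg)
# The diffuse belief keeps mass on the unstable candidate and inflates the
# expected cost; visiting the region and concentrating pi(t) removes that
# penalty, which is the information value that drives exploration.
```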

5. Existence, Regularity, and Theoretical Guarantees

The framework establishes several foundational properties:

  • Existence of optimal controls: Under mild conditions, optimal relaxed controls exist, and relaxed and standard controls yield the same value function.
  • Continuity and regularity: The value function $V(t,x)$ is shown to be Lipschitz continuous in both $x$ and $t$ (given further regularity of $\pi(t)$), ensuring well-posedness and supporting numerical approximation.
  • Differential inclusion for learning evolution: The temporal evolution of $\pi(t)$ introduces a differential inclusion that connects changes in the agent's belief with the gradient of the value function, providing insight into the dynamics of learning vs. acting (see Theorem 3.3 in the paper).

These results support the use of robust numerical solvers and underpin future algorithmic development.
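The Lipschitz-regularity claim is what licenses interpolation-based schemes like the semi-Lagrangian sketch in Section 3. A quick, purely empirical sanity check is to bound the finite-difference ratio of a computed grid value function, as below; the grid `xs` and values `V` refer to that earlier hypothetical sketch.

```python
import numpy as np

def empirical_lipschitz(xs, V):
    """Largest finite-difference ratio of grid values: a crude empirical
    bound on the Lipschitz constant of V in x."""
    return np.max(np.abs(np.diff(V)) / np.diff(xs))

# Example with the grid value function from the HJB sketch above:
# L_hat = empirical_lipschitz(xs, V)
# A bound that stays stable as the grid is refined is consistent with the
# Lipschitz continuity guaranteed by the theory.
```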

6. Connections with Bayesian RL, Adaptive Control, and Implications for Algorithm Design

By modeling uncertainty in the system’s drift as a time-dependent measure and continuously updating beliefs based on local observations, the framework subsumes Bayesian RL, adaptive control, and robust RL methods as special cases. The agent's policy at each instant is "optimistic"—it minimizes expected costs based on current beliefs while continuously revising those beliefs by interacting with the environment.

This explicit, mathematical unification clarifies the underlying mechanics of exploration, safe learning, and robust control in nonstationary, uncertain systems. The derivation of the DPP and HJB in this fashion offers a rigorous template for developing new robust RL and adaptive optimal control algorithms, including those capable of functioning under model misspecification and dynamically changing environments.
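Read algorithmically, this section describes a closed loop: act on the current belief-averaged objective, observe the local drift, concentrate $\pi(t)$, and repeat. The sketch below wires the hypothetical helpers from the earlier sections (with the belief represented as in the first sketch) into such a loop; it is a schematic reading of the framework, not the paper's algorithm.

```python
def belief_control_loop(x0, candidates, belief, policies, J, true_f,
                        expected_cost, concentrate_belief,
                        steps=10, dt=0.05):
    """Schematic act/observe/update loop built from the hypothetical helpers
    defined in the earlier sketches (passed in as arguments)."""
    x = x0
    for _ in range(steps):
        # "Optimistic" step: choose the candidate policy with the least
        # belief-averaged cost from the current state.
        policy = min(policies,
                     key=lambda p: expected_cost(candidates, belief, x, p, J))
        u = policy(x)
        drift = true_f(x, u)                              # local observation
        belief = concentrate_belief(belief, x, u, drift)  # refine pi(t)
        x = x + drift * dt                                # Euler environment step
    return x, belief
```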

7. Impact and Future Directions

The formulation enables a rigorous study of fundamental issues:

  • How beliefs about system dynamics are refined over time by local measurements, affecting both short-term control and long-term learning.
  • How the inherent coupling between the learning process and feedback control can be exploited to analyze and design RL algorithms with robust performance guarantees.
  • How local full-knowledge assumptions can be systematically weakened or replaced with more general information structures.

Prospective avenues include algorithm design for finite-data regimes, scalable Bayesian methods for high-dimensional state-action domains, and extensions to stochastic systems, all grounded in the strong theoretical foundation established by the dynamic programming and HJB apparatus.

This framework represents a significant step toward principled robust RL methods that remain effective even in the presence of deeply uncertain or adversarial dynamics, providing mathematical formalisms and guarantees essential for deployment in real-world autonomous systems and safety-critical applications (Murray et al., 2018).
