Time-Aware Actor-Critic Model
- Time-Aware Actor-Critic Models are reinforcement learning approaches that explicitly integrate temporal features—such as continuous time indices and temporal abstraction—to improve decision making.
- They address the limitations of standard RL by modeling time through controlled SDEs/ODEs, action repetition, and free terminal times, enabling robust performance in irregular sampling scenarios.
- These models provide strong theoretical guarantees and have been successfully applied in continuous control, stochastic optimal control, and model-based RL with Bayesian neural ODEs.
A time-aware actor-critic model is a reinforcement learning approach that incorporates explicit representation and handling of temporal features and temporal abstraction within the actor-critic learning paradigm. Such models are distinguished by their handling of time as either a continuous variable (via stochastic or deterministic dynamical systems) or as an intrinsic component of policy structure (via action repetition or free terminal times), enabling improved performance and theoretical guarantees in continuous control, stochastic optimal control, and temporally abstract decision making.
1. Temporal Structure in Actor-Critic Frameworks
Time-aware actor-critic models address a fundamental limitation of canonical RL algorithms that assume either discrete-time Markov decision processes or do not allow for temporal abstraction in control. Standard off-policy actor-critic methods, such as DDPG or SAC, update actions at every discrete environment step and do not intrinsically model temporal dependencies or allow for action persistence. Time-aware variants instead introduce temporal features such as continuous time indices, time-marginal state densities, explicit modeling of controlled SDEs or ODEs, action-repetition policies, or state-dependent horizon structures. These advances allow the agent to better exploit the temporal structure of the control problem, accommodate irregular time sampling, and directly engage with the underlying physics of many real-world systems.
2. Time-Continuous Stochastic Actor-Critic Flows
A paradigmatic example of time-awareness in actor-critic methods is the continuous-time stochastic actor-critic flow developed for stochastic optimal control problems (Zhou et al., 27 Feb 2024). The problem is formulated in terms of controlled stochastic differential equations,

$$dX_t = b(t, X_t, a_t)\,dt + \sigma(t, X_t, a_t)\,dW_t,$$

where $X_t \in \mathbb{R}^d$ is the state, $a_t$ the control, and $W_t$ is an $m$-dimensional Brownian motion. The goal is to minimize the cost functional

$$J(a) = \mathbb{E}\left[\int_0^T c(t, X_t, a_t)\,dt + g(X_T)\right].$$

The associated HJB PDE,

$$\partial_t V(t,x) + \inf_{a}\, H\big(t, x, a, \partial_x V(t,x), \partial^2_{xx} V(t,x)\big) = 0, \qquad V(T,x) = g(x),$$

with the generalized Hamiltonian

$$H(t, x, a, p, P) = b(t, x, a)^\top p + \tfrac{1}{2}\,\operatorname{tr}\!\big(\sigma\sigma^\top(t, x, a)\, P\big) + c(t, x, a),$$

requires explicit time-awareness in both the value and policy updates. The critic is designed as a continuous-time LSTD estimator using a modified temporal-difference error derived from Itô calculus. The core TD error for a policy over an increment $[t, t+\Delta t]$ is

$$\delta_t = V^\theta(t+\Delta t, X_{t+\Delta t}) - V^\theta(t, X_t) + \int_t^{t+\Delta t} c(s, X_s, a_s)\,ds,$$

where the value function $V^\theta$ and the policy $\pi^\phi$ are parameterized (e.g., by neural networks). The continuous-time TD loss is minimized via gradient descent, resulting in a coupled ODE flow for the critic parameters.
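To make the discretized critic update concrete, here is a minimal numpy sketch assuming a scalar state, a linear time-aware critic $V^\theta(t,x) = \theta^\top \phi(t,x)$, and an Euler time grid; the function names and feature map (`features`, `continuous_time_td_residuals`) are illustrative choices, not taken from the paper.

```python
import numpy as np

def features(t, x):
    """Illustrative time-aware feature map for a linear critic V_theta(t, x) = theta @ features(t, x)."""
    return np.array([1.0, t, x, x * x, t * x])

def continuous_time_td_residuals(theta, ts, xs, costs):
    """Discretized continuous-time TD residuals along one trajectory.

    delta_k ~ V_theta(t_{k+1}, x_{k+1}) - V_theta(t_k, x_k) + c_k * dt is the
    Euler discretization of dV + c dt, which is a martingale increment under
    the true value function of the current policy."""
    deltas = []
    for k in range(len(ts) - 1):
        dt = ts[k + 1] - ts[k]
        dV = theta @ features(ts[k + 1], xs[k + 1]) - theta @ features(ts[k], xs[k])
        deltas.append(dV + costs[k] * dt)
    return np.array(deltas)

def critic_loss_gradient(theta, ts, xs, costs):
    """Gradient of the summed squared TD residuals with respect to theta."""
    grad = np.zeros_like(theta)
    for k, delta in enumerate(continuous_time_td_residuals(theta, ts, xs, costs)):
        grad += 2.0 * delta * (features(ts[k + 1], xs[k + 1]) - features(ts[k], xs[k]))
    return grad
```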
The actor is updated by a functional policy gradient in continuous time,

$$\frac{d\phi}{ds} = -\int_0^T \!\!\int \nabla_\phi\, \pi^\phi(a \mid t, x)\, \hat{H}^\theta(t, x, a)\, \rho_t(x)\, da\, dx\, dt,$$

where $\rho_t$ is the time-marginal state density under the current policy and $\hat{H}^\theta$ is the current critic's estimate of the generalized Hamiltonian. The actor-critic system forms a three-component ODE system whose global linear convergence is proven under standard smoothness, strong concavity, and ellipticity assumptions, provided a fixed critic-to-actor timescale ratio. The critic's TD update accumulates the explicit time integral and a time-dependent martingale term, which directly injects time-awareness into the learning process.
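The coupled flow can be pictured as an Euler discretization in the algorithmic time $s$, with the critic stepped at a fixed multiple of the actor's rate. The sketch below is schematic: `critic_grad` and `actor_grad` stand in for the TD-loss gradient and the functional policy gradient, which in practice are estimated from simulated trajectories.

```python
import numpy as np

def actor_critic_flow(theta0, phi0, critic_grad, actor_grad,
                      eta=1e-3, timescale_ratio=10.0, steps=10_000):
    """Euler discretization of the coupled actor-critic ODE flow.

    The critic parameters theta are stepped `timescale_ratio` times faster than
    the actor parameters phi, mirroring the fixed timescale separation assumed
    in the convergence analysis.  Both updates are written as descent steps,
    following the cost-minimization convention used above."""
    theta = np.array(theta0, dtype=float)
    phi = np.array(phi0, dtype=float)
    for _ in range(steps):
        theta -= eta * timescale_ratio * critic_grad(theta, phi)   # fast critic flow
        phi -= eta * actor_grad(theta, phi)                        # slow actor flow
    return theta, phi
```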
3. Temporally Abstract Action in Actor-Critic: TAAC
Another axis of time-awareness is temporal abstraction in discrete-time RL environments, exemplified by the TAAC algorithm (Temporally Abstract Actor-Critic) (Yu et al., 2021). TAAC augments the base actor-critic agent with a closed-loop action repetition mechanism: after sampling a candidate action $\hat{a}_t$ from the actor, a binary switch policy decides whether to repeat the previous action $a_{t-1}$ or switch to the new one. This mixture policy over one-step and repeated actions can be expressed as

$$\pi(a_t \mid s_t, a_{t-1}) = \beta(\text{repeat} \mid s_t, a_{t-1}, \hat{a}_t)\,\delta_{a_{t-1}}(a_t) + \beta(\text{switch} \mid s_t, a_{t-1}, \hat{a}_t)\,\delta_{\hat{a}_t}(a_t), \qquad \hat{a}_t \sim \pi^{\mathrm{act}}(\cdot \mid s_t),$$

where the switch policy $\beta$ is parameterized by a neural network taking the state, the previous action, and the candidate action as input. The key innovation is that temporal abstraction is not fixed (as in open-loop repetition) but selected in a closed-loop, observation-conditional manner.
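A minimal sketch of the decision rule is shown below, with illustrative stand-in callables for the base actor and the switch network (the real TAAC implementation differs in detail).

```python
import numpy as np

def taac_act(state, prev_action, actor, switch_prob, rng):
    """One decision step of a TAAC-style mixture policy (illustrative signatures).

    `actor(state)` samples a candidate action; `switch_prob(state, prev_action,
    candidate)` returns the probability of switching to the candidate.  With the
    complementary probability, the previous action is repeated unchanged."""
    candidate = actor(state)
    if rng.random() < switch_prob(state, prev_action, candidate):
        return candidate, True     # switch to the newly sampled action
    return prev_action, False      # repeat the previous action

# Toy usage with stand-in networks.
rng = np.random.default_rng(0)
actor = lambda s: np.tanh(rng.normal(size=s.shape))
switch_prob = lambda s, a_prev, a_new: 1.0 / (1.0 + np.exp(-np.dot(a_new - a_prev, s)))
action, switched = taac_act(np.ones(2), np.zeros(2), actor, switch_prob, rng)
```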
For learning, a novel compare-through Q operator extends the TD backup along action-repeated trajectory segments, leveraging the temporal coherence conferred by repetition. The multi-step backup is performed only over contiguous segments where the backed-up action matches the action actually executed at each subsequent step, which sidesteps the need for off-policy correction via importance weighting. The critic minimizes the Bellman error constructed over these variable-length segments. Empirically, TAAC delivers persistent exploration, improved sample efficiency, and high asymptotic scores across diverse continuous control tasks, with substantial proportions of action repetition even in environments not a priori suited to it (89% in MountainCarContinuous, 39% in BipedalWalker).
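The following sketch illustrates the flavor of such a compare-through target for a single transition, under the simplifying assumption that actions are compared by numerical equality; names and signatures are illustrative rather than taken from the TAAC codebase.

```python
import numpy as np

def compare_through_target(action, rewards, next_actions, next_values, gamma):
    """Multi-step target in the spirit of a compare-through backup (sketch).

    rewards[k]      : reward observed at step t + k
    next_actions[k] : action actually executed at step t + k + 1
    next_values[k]  : bootstrap value estimate at step t + k + 1
    The n-step return is extended only while the executed action still equals
    `action` (i.e. the action kept being repeated), and it bootstraps as soon
    as the match breaks, so no importance weights are needed.  Assumes at
    least one step is available."""
    target, discount = 0.0, 1.0
    for k in range(len(rewards)):
        target += discount * rewards[k]
        discount *= gamma
        if not np.allclose(next_actions[k], action):
            break
    return target + discount * next_values[k]

# Example: the action is repeated once, then the executed action changes.
a = np.array([0.5])
tgt = compare_through_target(
    action=a,
    rewards=[1.0, 1.0, 1.0],
    next_actions=[a, np.array([-0.2]), np.array([0.1])],
    next_values=[2.0, 2.0, 2.0],
    gamma=0.99,
)   # 1 + 0.99*1 + 0.99**2 * 2  (the backup stops after the second step)
```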
4. Free Terminal Time and State-Dependent Discounting
Time-awareness can also be realized by optimizing over terminal horizons, that is, allowing policies to select when to terminate episodes. In the context of optimal control with free terminal time (2208.00065), the actor-critic paradigm is extended by applying a Kruzhkov exponential transformation to recast the value function as

$$v(x) = 1 - e^{-V(x)},$$

where $V(x)$ is the minimal path cost over all admissible controls and terminal times. The Bellman operator becomes

$$(Tv)(x) = \min_a \Big[\, 1 - e^{-c(x,a)}\big(1 - v(x')\big) \Big],$$

with state-dependent discount $\gamma(x,a) = e^{-c(x,a)}$, where $c(x,a)$ is the running cost accrued in moving from $x$ to the successor state $x'$ and $v \equiv 0$ on the target set. The actor-critic updates are performed with networks parameterizing both the transformed value function and the policy, fitting the critic to one-step rollouts and adjusting the actor to minimize the transformed Bellman cost. Empirical results on the double integrator and a complex multi-dimensional attitude-control problem demonstrate rapid convergence (∼1000 iterations), high accuracy (low MSE), and scalability.
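A minimal sketch of the transformed one-step target, assuming the running cost accrued over the step is available and that terminal (target-set) states anchor the recursion at zero transformed value; function and argument names are illustrative.

```python
import numpy as np

def kruzhkov_bellman_target(step_cost, v_next, is_terminal):
    """One-step target for the transformed value v(x) = 1 - exp(-V(x)).

    `step_cost` is the running cost accrued over the step, which induces the
    state/action-dependent discount gamma = exp(-step_cost); `v_next` is the
    critic's transformed value at the successor state; terminal (target-set)
    states anchor the recursion at v = 0, i.e. zero remaining cost."""
    gamma = np.exp(-np.asarray(step_cost, dtype=float))
    v_next = np.where(is_terminal, 0.0, v_next)
    return 1.0 - gamma * (1.0 - v_next)

# Example: a batch of three transitions, the last of which reaches the target set.
targets = kruzhkov_bellman_target(
    step_cost=np.array([0.3, 0.1, 0.5]),
    v_next=np.array([0.4, 0.7, 0.9]),
    is_terminal=np.array([False, False, True]),
)
```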
5. Continuous-Time Actor-Critic for Model-Based Reinforcement Learning
In problems where the environment’s true dynamics follow continuous-time ODEs, time-aware actor-critic algorithms avoid discretization artifacts by operating directly on the continuous process and by inferring unknown vector fields using neural ODEs (Yıldız et al., 2021). The environment is modeled as

$$\dot{x}(t) = f\big(x(t), a(t)\big),$$

with an ensemble of Bayesian neural ODEs used to approximate the unknown vector field $f$ and capture epistemic uncertainty.
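As a rough illustration, an ensemble of sampled vector fields can be rolled out with a fixed-step RK4 integrator, with the spread across ensemble members serving as a stand-in for the epistemic uncertainty captured by the Bayesian posterior (the paper's actual machinery uses variational Bayesian neural ODEs rather than this simplified ensemble).

```python
import numpy as np

def rk4_step(f, x, a, dt):
    """One classical RK4 step for dx/dt = f(x, a), holding the action fixed over dt."""
    k1 = f(x, a)
    k2 = f(x + 0.5 * dt * k1, a)
    k3 = f(x + 0.5 * dt * k2, a)
    k4 = f(x + dt * k3, a)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def ensemble_rollout(vector_fields, policy, x0, dt, horizon):
    """Roll out each sampled dynamics model under the same policy; the spread
    across members is a crude proxy for epistemic uncertainty."""
    trajs = []
    for f in vector_fields:
        x, traj = np.array(x0, dtype=float), [np.array(x0, dtype=float)]
        for _ in range(horizon):
            x = rk4_step(f, x, policy(x), dt)
            traj.append(x.copy())
        trajs.append(np.stack(traj))
    trajs = np.stack(trajs)              # shape: (n_models, horizon + 1, state_dim)
    return trajs.mean(axis=0), trajs.std(axis=0)
```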
The value function is defined as a time-continuous discounted integral,

$$V\big(x(t)\big) = \mathbb{E}\left[\int_t^\infty e^{-\rho(\tau - t)}\, r\big(x(\tau), a(\tau)\big)\, d\tau\right],$$

circumventing the ill-posedness of Q-functions in continuous time. To train the critic, a finite-horizon surrogate is used, and the loss minimizes the squared error between the critic's prediction and value estimates computed along imagined trajectories generated under sampled network weights. The actor is trained by maximizing expected returns through the ODE-integrated environment, backpropagating directly through the ODE solver for policy-gradient updates. Empirical results show state-of-the-art sample efficiency, robustness to irregular time sampling and noise, and smooth, uncertainty-aware planning policies.
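A small sketch of the finite-horizon surrogate, assuming rewards have been evaluated on an imagined trajectory at a fixed step `dt` and that a bootstrap value covers the tail beyond the horizon; the names and the left-Riemann quadrature are illustrative simplifications.

```python
import numpy as np

def finite_horizon_value(rewards, dt, rho, v_boot=0.0):
    """Finite-horizon surrogate for the continuous-time discounted value:
    V(x_0) ~ sum_k exp(-rho * t_k) * r_k * dt + exp(-rho * H) * v_boot,
    a left-Riemann approximation of the discounted integral plus a
    discounted bootstrap value for the tail beyond the horizon H."""
    rewards = np.asarray(rewards, dtype=float)
    ts = np.arange(len(rewards)) * dt
    horizon = len(rewards) * dt
    return float(np.sum(np.exp(-rho * ts) * rewards) * dt
                 + np.exp(-rho * horizon) * v_boot)

# Example: constant reward 1.0 over 5 seconds at dt = 0.01, discount rate rho = 0.1.
print(finite_horizon_value(np.ones(500), dt=0.01, rho=0.1))
```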
6. Analysis of Timescales and Learning Dynamics
Time-awareness is also addressed by adjusting the timescales of actor and critic updates, as detailed in the two-time-scale framework for tabular and function-approximation actor-critic and critic-actor algorithms (Bhatnagar et al., 2022). The standard order—where critic updates are faster than those of the actor—yields a scheme whose behavior matches policy iteration, whereas reversing the timescales (critic-actor) emulates value iteration. Both versions are provably convergent under standard stochastic approximation conditions, and empirical results indicate matched performance and computational effort.
The choice of which component runs at the faster timescale determines the algorithm’s flavor (policy iteration versus value iteration) but, with sufficiently expressive function approximation, does not confer a consistent practical advantage in speed or accuracy. Step-size tuning remains crucial in either configuration.
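In practice the distinction comes down to the step-size schedules. A minimal sketch, assuming polynomially decaying Robbins-Monro step sizes with illustrative exponents:

```python
def two_timescale_stepsizes(n, actor_faster=False):
    """Two-time-scale step-size schedules (sketch with illustrative exponents).

    With actor_faster=False the critic runs on the faster schedule (classic
    actor-critic, policy-iteration flavour); with actor_faster=True the roles
    are reversed (critic-actor, value-iteration flavour).  Both sequences are
    square-summable but not summable, and the slower one is o(the faster one)."""
    fast = 1.0 / (n + 1) ** 0.6    # faster (larger) step size
    slow = 1.0 / (n + 1) ** 0.9    # slower (smaller) step size
    critic_lr, actor_lr = (slow, fast) if actor_faster else (fast, slow)
    return critic_lr, actor_lr
```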
7. Implementation and Theoretical Guarantees
Implementation of time-aware actor-critic models varies depending on the temporal structure at play:
- Continuous-time SDE control: Trajectories are simulated via Euler–Maruyama discretization (see the sketch after this list), TD errors are accumulated over the full time horizon, and both actor and critic networks are updated using stochastic optimization (e.g., Adam) with empirical losses derived from the underlying ODE or SDE structure.
- TAAC-style action repetition: Requires an additional binary switch network, tracking of the actions actually executed, and a modified TD backup that uses multi-step "compare-through" targets; all components are compatible with standard deep RL infrastructure.
- Free-terminal-time optimal control: Temporal abstraction is handled via transformed value functions and one-step rollouts; actor and critic networks are trained alternately via batch optimization, using boundary-state samples to ensure proper value function anchoring.
- Bayesian neural ODEs for model-based RL: Neural parameter uncertainty is handled via variational inference; training requires nested optimization over Bayesian ODE parameters and actor-critic network weights.
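For the first item above, a minimal Euler–Maruyama rollout might look as follows, assuming a diagonal (elementwise) diffusion coefficient and illustrative signatures for the drift, diffusion, and policy functions:

```python
import numpy as np

def euler_maruyama_rollout(b, sigma, policy, x0, dt, horizon, rng):
    """Simulate the controlled SDE dX = b(t, X, a) dt + sigma(t, X, a) dW with
    Euler-Maruyama. `b` returns the drift vector, `sigma` a diagonal diffusion
    coefficient (applied elementwise), and `policy` maps (t, x) to an action."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for k in range(horizon):
        t = k * dt
        a = policy(t, x)
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)   # Brownian increment
        x = x + b(t, x, a) * dt + sigma(t, x, a) * dW
        traj.append(x.copy())
    return np.stack(traj)

# Toy usage: Ornstein-Uhlenbeck-like controlled dynamics with a linear policy.
rng = np.random.default_rng(0)
path = euler_maruyama_rollout(
    b=lambda t, x, a: -x + a,
    sigma=lambda t, x, a: 0.2 * np.ones_like(x),
    policy=lambda t, x: -0.5 * x,
    x0=np.array([1.0, -1.0]), dt=0.01, horizon=200, rng=rng,
)
```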
Remarkably, under mild smoothness and concavity assumptions, time-aware continuous-time actor-critic frameworks yield global linear convergence rates and provide rigorous guarantees on the decay of both policy suboptimality and parameter estimation error (both decay exponentially along the coupled ODE flow), a property unique to the continuous-time ODE formulation (Zhou et al., 27 Feb 2024).
In summary, time-aware actor-critic models encompass a range of architectures and algorithms that embed the temporal structure of the environment—whether through explicit time indices, temporal abstraction, state-dependent discounting, or timescale design—into the learning process. These models enable accurate learning and control in truly continuous-time systems, improve exploration and credit assignment efficiency, and deliver robust theoretical guarantees under broad conditions.