Time-Aware Q-Networks (TQN) Overview
- Time-Aware Q-Networks (TQN) are a deep reinforcement learning framework that augments state representations with elapsed time and uses continuous-time discounting.
- The approach refines policy learning by integrating time-augmented inputs and adjusting reward propagation, addressing non-uniform temporal dynamics in real-world scenarios.
- Empirical results demonstrate that TQN improves convergence rates and stability across benchmarks and complex domains such as healthcare and industrial control.
Time-Aware Q-Networks (TQN) is a general deep reinforcement learning (DRL) framework that explicitly incorporates physical time intervals into both state representations and reward discounting to address the temporal irregularity present in many real-world sequential decision-making problems. Unlike standard DQN, which assumes uniformly sampled, discrete-time transitions, TQN augments the policy learning process to handle variable and non-uniform time steps between events, enabling agents to capture latent progressive state patterns and reason over non-uniform temporal dynamics (Kim et al., 2021).
1. Motivation and Background
Standard deep RL methods, typified by Deep Q-Networks (DQN), process event sequences as if they occur on a regular discrete-time lattice, represented by transitions at fixed intervals. However, real-world data often manifest as irregular sequences, where the elapsed time between observations is variable. This temporal irregularity can obscure or distort latent dynamics, undermining the efficacy of DRL in domains like healthcare, industrial operations, and certain game environments.
TQN addresses the limitations of classical DRL by:
- Integrating elapsed and anticipated time intervals into state representations (“Time-aware State Approximation”)
- Adopting a continuous-time reward discounting scheme (“Time-aware Discounting”)
These mechanisms enable TQN to capture environment evolution speed and value future rewards as a function of real elapsed time, rather than merely the step count.
2. TQN Architecture
TQN fundamentally modifies both the input to the Q-network and the discount factor in Bellman updates.
- Time-aware input: At each decision point $t$, the observed feature vector $x_t$ is paired with the elapsed time $\Delta t_t$ since the last observation. The latent state is then defined as $s_t = (h_t, \widehat{\Delta t}_{t+1})$, where $h_t$ is a (potentially low-dimensional) encoding of the last $k$ observations and their associated inter-arrival times.
- Q-value estimation: The neural network (which can be a dense net, LSTM, or CNN depending on the environment) estimates $Q(s_t, a)$ for each action $a$.
- Time-aware discounting: The standard fixed discount $\gamma$ is replaced with a function $\gamma(\Delta t)$ that discounts rewards according to the real elapsed time between transitions.
A summary of the main architectural departures from DQN is provided in the following table:
| Component | DQN | TQN |
|---|---|---|
| State Input | Raw observation $x_t$ | Time-augmented state $s_t = (h_t, \widehat{\Delta t}_{t+1})$ |
| Discount factor | Fixed $\gamma$ | Continuous-time: $\gamma(\Delta t) = \gamma^{\Delta t / T}$ |
| Time interval awareness | None | Past and future elapsed intervals |
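As a concrete illustration of the input-side change, the following is a minimal sketch, assuming PyTorch, of a Q-network over time-augmented inputs; layer sizes, names, and the exact way intervals are concatenated are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TimeAwareQNetwork(nn.Module):
    """Q-network over time-augmented inputs (illustrative sketch).

    The input concatenates an observation feature vector with the elapsed
    time since the last observation and the expected time to the next one.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # +2 accounts for the elapsed and expected time-interval features.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, dt_past: torch.Tensor,
                dt_next: torch.Tensor) -> torch.Tensor:
        # Time-augmented state: (observation, elapsed interval, expected interval).
        x = torch.cat([obs, dt_past.unsqueeze(-1), dt_next.unsqueeze(-1)], dim=-1)
        return self.net(x)  # one Q-value per action
```

For image-based environments, a CNN torso would replace the dense layers, with the interval features concatenated after the convolutional encoding.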
3. Time-Aware State Approximation ("TState")
In TQN, the latent state representation is constructed by encoding both the past history of observations and their associated inter-arrival times, as well as the expected time until the next measurement.
- With the last $k$ raw observations $x_{t-k+1}, \dots, x_t$ and inter-arrival times $\Delta t_{t-k+1}, \dots, \Delta t_t$, a non-linear encoder (typically an LSTM) computes $h_t = f_\theta\big((x_{t-k+1}, \Delta t_{t-k+1}), \dots, (x_t, \Delta t_t)\big)$.
- The time-aware latent state is then completed as $s_t = (h_t, \widehat{\Delta t}_{t+1})$, where $\widehat{\Delta t}_{t+1}$ is the expected time to the next observation.
This structure allows the agent to adapt its representations according to both recent dynamics and future temporal expectations, facilitating learning in environments evolving at variable rates.
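A minimal sketch of such a time-aware state encoder, assuming PyTorch and an LSTM over the last $k$ (observation, inter-arrival time) pairs; dimensions and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TStateEncoder(nn.Module):
    """Encodes the last k observations and their inter-arrival times into a
    time-aware latent state (a sketch; hyperparameters are assumptions)."""

    def __init__(self, obs_dim: int, latent_dim: int = 64):
        super().__init__()
        # Each step's input is the raw observation plus its inter-arrival time.
        self.lstm = nn.LSTM(input_size=obs_dim + 1,
                            hidden_size=latent_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor, dt_seq: torch.Tensor,
                dt_next: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, k, obs_dim); dt_seq: (batch, k); dt_next: (batch,)
        x = torch.cat([obs_seq, dt_seq.unsqueeze(-1)], dim=-1)
        _, (h_n, _) = self.lstm(x)   # h_n: (1, batch, latent_dim)
        h_t = h_n.squeeze(0)         # summary of recent history and its timing
        # Complete the time-aware state with the expected next interval.
        return torch.cat([h_t, dt_next.unsqueeze(-1)], dim=-1)
```

The returned vector corresponds to $s_t = (h_t, \widehat{\Delta t}_{t+1})$ and can be fed directly to a Q-value head such as the dense network sketched in Section 2.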
4. Time-Aware Discounting ("TDiscount")
TQN replaces the fixed exponential discount factor $\gamma$ of standard RL with a continuous-time decay function:

$$\gamma(\Delta t) = \gamma^{\Delta t / T},$$

where $T$ denotes a user-specified "action time window" and $\gamma \in (0, 1)$ reflects the strength of preference or belief in rewards accrued within $T$. This parameterization ensures that the cumulative discount across intervals $\Delta t_1, \dots, \Delta t_n$ summing to $T$ yields an overall discount of $\gamma$.
The Bellman update thus becomes:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma^{\Delta t_{t+1} / T} \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
This allows temporal credit assignment to accurately reflect actual elapsed time, rather than step count, thus preserving value propagation dynamics for irregular action-observation sequences.
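Under the reconstruction $\gamma(\Delta t) = \gamma^{\Delta t / T}$ used above, the discount and TD target can be computed as in the following Python sketch; names and defaults are illustrative, and the final assertion simply checks the cumulative-discount property stated earlier.

```python
import math

def time_aware_discount(dt: float, gamma: float = 0.99, T: float = 1.0) -> float:
    """Continuous-time discount: gamma ** (dt / T).

    When the intervals of a trajectory sum to the action time window T,
    the product of per-step discounts equals gamma.
    """
    return gamma ** (dt / T)

def bellman_target(reward: float, dt_next: float, q_next_max: float,
                   gamma: float = 0.99, T: float = 1.0) -> float:
    """TD target with time-aware discounting in place of a fixed gamma."""
    return reward + time_aware_discount(dt_next, gamma, T) * q_next_max

# Cumulative-discount property: intervals summing to T discount by exactly gamma.
dts = [0.2, 0.3, 0.5]   # sums to T = 1.0
prod = math.prod(time_aware_discount(dt) for dt in dts)
assert abs(prod - 0.99) < 1e-9
```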
5. TQN Training Workflow
TQN training proceeds analogously to DQN, with key adaptations for time-awareness. Transitions are stored as tuples including time-augmented states and actual elapsed intervals. Minibatches from the experience replay buffer are sampled and TD-updates are computed using time-aware states and continuous-time discount factors.
Key workflow steps:
- Encode each transition as $(s_t, a_t, r_t, s_{t+1}, \Delta t_{t+1})$
- For each sampled transition, compute the Bellman target using $\gamma^{\Delta t_{t+1} / T}$ and the next time-augmented state $s_{t+1}$
- Perform gradient descent to minimize the Bellman error over a minibatch
- Periodically synchronize target and online Q-network parameters
A plausible implication is that TQN’s backward compatibility with DQN enables direct extension to off-the-shelf RL environments by augmenting input processing and value updates.
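The sketch below compresses these workflow steps into a single TD update, assuming a Q-network that maps time-augmented state tensors to per-action Q-values and a replay buffer of (state, action, reward, next state, $\Delta t$) tuples; all names and hyperparameters are illustrative assumptions, not the authors' code.

```python
import random

import torch
import torch.nn.functional as F

def train_step(online_q, target_q, optimizer, replay, batch_size=32,
               gamma=0.99, T=1.0):
    """One TD update over time-augmented transitions (illustrative sketch).

    `replay` is a list of transitions (state, action, reward, next_state,
    dt_next), where states are already time-augmented tensors and dt_next
    is the real elapsed time to the next decision point.
    """
    batch = random.sample(replay, batch_size)
    states      = torch.stack([t[0] for t in batch])
    actions     = torch.tensor([t[1] for t in batch])
    rewards     = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t[3] for t in batch])
    dts         = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # Continuous-time discount per transition: gamma ** (dt / T).
    discounts = gamma ** (dts / T)

    with torch.no_grad():
        next_q = target_q(next_states).max(dim=1).values
        targets = rewards + discounts * next_q

    q_values = online_q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Target-network parameters are synchronized with the online network
    # periodically outside this function, as in standard DQN.
    return loss.item()
```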
6. Empirical Results and Comparative Performance
TQN was extensively evaluated against DQN, TState (state augmentation only), and TDiscount (discounting only) across classic RL benchmarks, Atari games with artificially introduced temporal irregularity, and real-world domains with intrinsic time gaps.
- CartPole (episode cap of 200): TDiscount converged ≈15% faster than DQN; TState yielded a marginal improvement; the full TQN matched or slightly exceeded DQN.
- MountainCar: TState and TDiscount both provided significant gains, with TQN solving the task in approximately 2,000–3,000 fewer episodes than DQN.
- Atari—Sample Results:
| Game | DQN Score | TDiscount (% vs DQN) | TState (% vs DQN) | TQN (% vs DQN) |
|---|---|---|---|---|
| CrazyClimber | 52k | +14 | +1.6 | +7.4 |
| Seaquest | 3.3k | +88 | +3.9 | +69.7 |
| Up’n’Down | 18k | +31 | –4 | +29 |
| Frostbite | 3.4k | +8 | +2 | +22 |
| MontezumaRevenge | 0 | +∞ (116) | 0 | +∞ (429) |
| Ms. Pacman | 2.9k | +16 | +6 | +22 |
Percentages are improvements over the DQN baseline; for MontezumaRevenge, where DQN scores 0, absolute scores are given in parentheses.
- Nuclear reactor control: Baseline DQN did not yield a safe controller. The combination of TQN with Double DQN, Dueling Network, and Prioritized Experience Replay (PDD-TQN) reduced peak reactor fuel temperature (~696 °C → 668 °C), hazard rate (~0.95 → 0.01), and improved cumulative utility (~+13 → +57).
- Septic patient treatment: Offline DQN produced ~8.9% septic shock rate at 90% agent-physician agreement. Adding PDD alone increased shock rates (~27%), whereas PDD-TQN achieved ~3.4% shock rate and matched physician trajectories in ~12% of cases (vs 4.4% for DQN).
Ablation experiments revealed that in synthetic benchmarks, time-aware discounting (TDiscount) dominated performance improvements for fast-paced games, whereas real-world tasks benefited most from the full TQN (TState+TDiscount).
7. Integration with Boosting Methods
TQN is orthogonal to established DQN boosting strategies:
- Double DQN: Mitigates value overestimation
- Dueling DQN: Separates state value and advantage estimation
- Prioritized Experience Replay: Samples transitions with high TD error more frequently
In synthetic domains, single or paired boosting techniques sufficed. In complex real-world tasks, only the joint application of all three with TQN (“PDD-TQN”) achieved both stable and effective policy learning. Ablation studies demonstrated that no single boosting method, or any pair, was sufficient to unlock the performance benefits of time-awareness in TQN; the full combination provided both learning stability and optimal exploitation of temporal structure.
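As an illustration of how time-awareness composes with one such boost, the following sketch computes a Double-DQN-style target with the continuous-time discount; it is an assumption-laden example that omits dueling heads and prioritized replay, not the authors' PDD-TQN code.

```python
import torch

def double_dqn_time_aware_target(online_q, target_q, rewards, next_states,
                                 dts, gamma=0.99, T=1.0):
    """Double DQN target combined with continuous-time discounting (sketch).

    Action selection uses the online network while evaluation uses the target
    network, mitigating overestimation; the discount depends on the real
    elapsed interval dt rather than the step count.
    """
    with torch.no_grad():
        best_actions = online_q(next_states).argmax(dim=1, keepdim=True)
        next_q = target_q(next_states).gather(1, best_actions).squeeze(1)
        discounts = gamma ** (dts / T)
        return rewards + discounts * next_q
```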