Dynamic Q-Learning: Adaptive Reinforcement Learning
- Dynamic Q-Learning is a reinforcement learning framework that adapts traditional Q-learning to handle nonstationary environments, temporal variability, and multi-agent dynamics.
- It leverages Bellman recursion and stochastic approximation techniques to update Q-values in both synchronous and asynchronous settings, with convergence guarantees that extend to noncontractive, undiscounted, and unbounded-reward regimes.
- Applications of Dynamic Q-Learning span dynamic pricing, robotic path planning, and dynamic treatment regimes in personalized medicine, highlighting its versatility and practical impact.
Dynamic Q-Learning refers to a family of reinforcement learning and dynamic programming algorithms grounded in the Q-learning principle but adapted to settings in which the environment, agent objectives, or structural constraints exhibit nonstationarity, temporal variability, or game-theoretic multi-agent interactions. This encompasses classic model-free Q-learning applied to dynamic Markov Decision Processes (MDPs) as well as extensions to general stochastic dynamic programming, two-player zero-sum games, high-dimensional control, and data-driven economic optimization. Dynamic Q-learning algorithms leverage Bellman recursion for action-value (Q) functions, together with update procedures that accommodate synchronous or asynchronous updates, online or offline data, and single- or multi-agent environments.
1. Foundations of Dynamic Q-Learning in Stochastic Control and Games
Stochastic dynamic programming provides the theoretical backbone for Q-learning algorithms in dynamic or time-evolving settings. The foundational dynamic Q-learning framework was formalized by Yu in the context of two-player zero-sum stochastic shortest-path (SSP) games, involving finite state and action/control spaces, an absorbing cost-free termination state, and unrestricted (undiscounted) total cost criteria (Yu, 2014).
An SSP game is defined by:
- A finite set of nonterminal states $\{1, \dots, n\}$ and an absorbing, cost-free termination state $0$, with state space $S = \{0, 1, \dots, n\}$.
- In each state $i$, player I chooses a control $u \in U(i)$ and player II chooses $v \in V(i)$; the control sets are compact or finite.
- Transition probabilities $p_{ij}(u, v)$ and stage costs $g(i, u, v)$.
- The agent’s objective is to minimize the expected (undiscounted) total accumulated cost to reach the termination state.
The dynamic programming equation in this setting generalizes the Bellman operator to the Shapley–Bellman form:
$$(TJ)(i) = \min_{u \in U(i)} \max_{v \in V(i)} \Big[ g(i, u, v) + \sum_{j=1}^{n} p_{ij}(u, v)\, J(j) \Big], \qquad i = 1, \dots, n.$$
The Q-learning fixed-point formulation for such games becomes:
$$Q(i, u, v) = g(i, u, v) + \sum_{j=1}^{n} p_{ij}(u, v)\, \min_{u' \in U(j)} \max_{v' \in V(j)} Q(j, u', v').$$
This Q-equation establishes the value recursion over the joint state-control space and facilitates both asynchronous and fully distributed learning updates.
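As a concrete, hedged illustration of this fixed point, the following sketch runs exact minimax Q-value iteration on a randomly generated toy SSP game with known transition probabilities. The toy model and all names are illustrative, and pure-strategy minimax is used for simplicity (the general Shapley–Bellman operator takes the value of the one-stage matrix game, which may require mixed strategies):

```python
import numpy as np

# Toy zero-sum SSP game: two nonterminal states plus an implicit termination state.
# P[i, u, v, j]: probability of moving from nonterminal state i to nonterminal state j
# under controls (u, v); the leftover mass 1 - sum_j P[i, u, v, j] goes to termination.
# G[i, u, v]: one-stage cost paid by the minimizer (player I) to the maximizer (player II).
n_states, n_u, n_v = 2, 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states + 1), size=(n_states, n_u, n_v))[..., :n_states]
G = rng.uniform(1.0, 2.0, size=(n_states, n_u, n_v))

Q = np.zeros((n_states, n_u, n_v))
for _ in range(500):
    # Minimax value of each successor state: min over u' of max over v' of Q(j, u', v')
    # (pure strategies only -- a simplification of the Shapley-Bellman operator).
    state_val = Q.max(axis=2).min(axis=1)      # shape (n_states,)
    Q_new = G + P @ state_val                  # Q(i,u,v) = g(i,u,v) + sum_j p_ij(u,v) * val(j)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

print("Minimax Q-values:\n", Q)
print("Game value per state:", Q.max(axis=2).min(axis=1))
```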
2. Dynamic Programming Equations, Operator Theory, and Convergence
Dynamic Q-learning’s reliability and efficacy crucially depend on contraction and nonexpansiveness properties of the underlying Bellman or Shapley–Bellman operators.
In SSP and dynamic games, as studied by Yu, the Shapley–Bellman operator is proven to be monotone and $1$-Lipschitz in the supremum norm, but not necessarily a contraction. Convergence of Q-learning in this case is established via stochastic approximation theory for monotone, nonexpansive operators, contingent on boundedness of the iterates and uniqueness of the fixed point. The main technical contribution of (Yu, 2014) is a proof, under verifiable sufficient conditions, that the Q-iterates remain almost surely bounded, closing a long-standing gap in the analysis: prior results required overly restrictive assumptions, such as termination under all policies or artificially imposed boundedness.
The general stochastic approximation update for Q-learning in this context is:
$$Q_{t+1}(i, u, v) = \big(1 - \alpha_t(i, u, v)\big)\, Q_t(i, u, v) + \alpha_t(i, u, v) \Big[ g(i, u, v) + \min_{u' \in U(j_t)} \max_{v' \in V(j_t)} Q_t(j_t, u', v') \Big],$$
where $j_t$ is a successor state sampled according to $p_{i\,\cdot}(u, v)$ (the bracketed term reduces to $g(i, u, v)$ when the sampled successor is the termination state), the $\alpha_t(i, u, v) \in [0, 1]$ are step-sizes obeying standard stochastic approximation conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), and the iteration may update components asynchronously or with delays. This framework guarantees convergence for a broad class of undiscounted, total-cost, zero-sum games (Yu, 2014).
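A minimal sketch of one such asynchronous update, assuming a hypothetical simulator interface that returns sampled transitions and using harmonic per-entry step-sizes (which satisfy the conditions above); the pure-strategy minimax is again a simplification:

```python
import numpy as np

def async_q_step(Q, counts, i, u, v, cost, j, terminated):
    """One asynchronous Q-learning update for a zero-sum SSP game (illustrative sketch).

    Q          : array of shape (n_states, n_u, n_v), current Q-iterates
    counts     : per-entry visit counts, used to form harmonic step-sizes
    (i, u, v)  : visited state and the controls chosen by players I and II
    cost       : observed stage cost g(i, u, v)
    j          : sampled successor state (ignored when terminated is True)
    """
    counts[i, u, v] += 1
    alpha = 1.0 / counts[i, u, v]  # sum_t alpha_t = inf, sum_t alpha_t^2 < inf
    # Successor value: zero at the cost-free termination state, otherwise the
    # (pure-strategy) minimax of the successor state's Q-matrix.
    succ_val = 0.0 if terminated else Q[j].max(axis=1).min()
    Q[i, u, v] += alpha * (cost + succ_val - Q[i, u, v])
    return Q
```

Only the visited entry is revised per step; all other components of Q are left unchanged, matching the asynchronous setting described above.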
3. Extensions and Applications: Control, Planning, and Revenue Management
Dynamic Q-learning extends naturally to address a diverse array of time-dependent and feedback-driven decision problems:
- Dynamic Pricing. In “Dynamic Retail Pricing via Q-Learning,” dynamic Q-learning is employed to optimize retail pricing strategies in nonstationary environments (Apte et al., 27 Nov 2024). The environment is modeled as an MDP where each state encodes product ID and day type, actions are discretized price adjustments, and the reward is profit computed over a price-dependent demand curve. A standard tabular Q-learning update is used:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big].$$
This data-driven approach adaptively learns to exploit cyclically varying demand and outperforms static optimization, providing a mean 5.3% profit improvement across products; a minimal tabular sketch of such an update loop appears after this list.
- Dynamic Path Planning. In robotic path planning for dynamic, uncertain settings, Q-learning mechanisms are employed together with reduced state representations and adaptive reward shaping to achieve high hit rates and rapid convergence in highly stochastic 3D obstacle environments (Jeihaninejad et al., 2019).
- Continuous-Time and Continuous-Control Systems. Hamilton–Jacobi Deep Q-Learning generalizes dynamic Q-learning to deterministic, continuous-time optimal control with Lipschitz control actions, formulating a novel semi-discrete Bellman/HJB recursion and neural Q-learning scheme (Kim et al., 2020). This enables direct parametric learning of optimal policies in high-dimensional dynamical systems.
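Returning to the dynamic-pricing formulation in the first item above, the sketch below shows a minimal tabular update loop. The state encoding, price grid, and demand/profit function are hypothetical stand-ins rather than the specification from (Apte et al., 27 Nov 2024):

```python
import numpy as np

# Hypothetical pricing MDP: state = (product_id, day_type), action = index into a price grid.
N_PRODUCTS, N_DAY_TYPES, N_PRICES = 5, 2, 9
PRICE_GRID = np.linspace(0.8, 1.2, N_PRICES)      # multiplicative adjustments to a base price
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

rng = np.random.default_rng(1)
Q = np.zeros((N_PRODUCTS, N_DAY_TYPES, N_PRICES))

def profit(product, day_type, price_mult):
    # Placeholder price-dependent demand curve: demand falls with price, rises on weekends.
    base_demand = 100 * (1.2 - price_mult) * (1.3 if day_type == 1 else 1.0)
    demand = max(rng.normal(base_demand, 5.0), 0.0)
    unit_margin = price_mult - 0.7                 # assumed cost fraction of the base price
    return demand * unit_margin

for episode in range(20_000):
    product = rng.integers(N_PRODUCTS)
    day_type = rng.integers(N_DAY_TYPES)
    # Epsilon-greedy action selection over the discretized price adjustments.
    if rng.random() < EPS:
        a = rng.integers(N_PRICES)
    else:
        a = int(np.argmax(Q[product, day_type]))
    r = profit(product, day_type, PRICE_GRID[a])
    next_day = 1 - day_type                        # toy day-type transition
    td_target = r + GAMMA * Q[product, next_day].max()
    Q[product, day_type, a] += ALPHA * (td_target - Q[product, day_type, a])

print("Greedy price multiplier per (product, day type):")
print(PRICE_GRID[Q.argmax(axis=2)])
```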
4. Generalizations: Unbounded Rewards, Function Approximation, and Goal-Conditioning
Dynamic Q-learning frameworks have been broadened along several axes:
- Unbounded Rewards via the Q-Transform. When rewards are unbounded, the classical supremum-norm contraction fails. “Unbounded Dynamic Programming via the Q-Transform” proposes the use of a positive “gauge” function to yield a normalized action-value function, stabilizing learning and recovering contraction (Ma et al., 2020). Stochastic approximation for the Q-transform provides geometric convergence guarantees even for unbounded dynamic programs.
- Function Approximation and Offline RL. The Q-learning Decision Transformer and variants combine offline, dataset-driven dynamic programming with sequence modeling by transformers, where Q-learning-derived value estimates are used to relabel returns-to-go in sub-optimal or compositional data (Yamagata et al., 2022). This addresses the challenge of “trajectory stitching,” enabling DTs to generalize beyond observed suboptimal trajectories.
- Goal-Conditioned RL and Weighted Supervised Learning. Dynamic Q-learning architectures such as Q-WSL integrate dynamic programming-based Bellman updates with advantage-weighted supervised policy refinement to improve sample efficiency, robustness (especially under sparse rewards), and trajectory stitching capacity in goal-conditioned RL, outperforming both pure TD and pure behavior cloning methods (Lei et al., 9 Oct 2024).
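As a hedged sketch of the advantage-weighted supervised component mentioned in the last item, the snippet below shows a generic exponentiated-advantage weighting of a behavior-cloning loss for a goal-conditioned deterministic policy (PyTorch-style; the networks, weighting, and relabeling used by Q-WSL itself may differ):

```python
import torch
import torch.nn.functional as F

def advantage_weighted_policy_loss(policy, q_net, v_net, states, actions, goals, beta=1.0):
    """Advantage-weighted supervised policy refinement (generic sketch).

    policy(states, goals) -> predicted actions (goal-conditioned, deterministic)
    q_net(states, actions, goals) -> Q-estimates from a Bellman-trained critic
    v_net(states, goals) -> state-value baseline
    """
    with torch.no_grad():
        advantage = q_net(states, actions, goals) - v_net(states, goals)
        # Exponentiated, clipped advantage weights: high-advantage dataset actions are imitated more.
        weights = torch.clamp(torch.exp(advantage / beta), max=10.0)
    pred_actions = policy(states, goals)
    per_sample = F.mse_loss(pred_actions, actions, reduction="none").mean(dim=-1)
    return (weights.squeeze(-1) * per_sample).mean()
```

The Bellman side (training q_net and v_net with TD targets) is kept separate here; the point is only how Q-derived advantages re-weight the supervised objective.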
5. Statistical Learning and Dynamic Q-Learning in Biomedical Decision-Making
A major domain for dynamic Q-learning is the estimation of optimal dynamic treatment regimes (DTRs) in sequential medical decision-making:
- Multi-Stage Q-Learning for DTRs. The standard algorithm for dynamic Q-learning in this context is recursive backward induction (parametric or nonparametric regression for pseudo-outcomes), with the Q-function at each stage representing the conditional mean future reward given historical covariates and actions (Schulte et al., 2012); a minimal two-stage sketch appears after this list. Robustness, efficiency, and consistency require appropriate model specification and methodological extensions to handle missing data, measurement error, and non-regularity (Sun et al., 2023, Liu et al., 6 Apr 2024, Song et al., 2011).
- Handling Data Imperfections.
- Misclassified Outcomes: Maximum likelihood corrections using internal validation data restore the consistency of estimated dynamic regimes in the presence of misclassified binary outcomes (Liu et al., 6 Apr 2024).
- Nonignorable Missing Covariates: Weighted Q-learning using inverse-probability weighting (with sensitivity analysis or instrumental variable-based estimation) prevents bias propagation from nonignorable missing pseudo-outcomes (Sun et al., 2023).
- Measurement Error: Regression calibration with replicate error-prone covariates delivers unbiased treatment rules and accurate inference in DTR contexts (Liu et al., 6 Apr 2024).
- Interactive and Penalized Q-Learning: Dynamic Q-learning generalizes to quantile and probability optimization criteria (Linn et al., 2014), and to penalized regression or variable selection settings to resolve non-regularity in treatment effect estimation (Song et al., 2011).
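A minimal two-stage sketch of the backward recursion described in the first item of this list, using linear working models on simulated data (the data-generating process and model forms are illustrative only, not those of the cited works):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

# Simulated two-stage trial: covariates X1, X2, binary treatments A1, A2, final outcome Y.
X1 = rng.normal(size=n)
A1 = rng.integers(0, 2, size=n)
X2 = 0.5 * X1 + 0.3 * A1 + rng.normal(scale=0.5, size=n)
A2 = rng.integers(0, 2, size=n)
Y = X2 + A2 * (0.8 - X2) + A1 * (0.4 + 0.2 * X1) + rng.normal(scale=0.5, size=n)

def design(h, a):
    # Working model: main effect of a history summary plus a treatment-by-covariate interaction.
    return np.column_stack([h, a, a * h])

# Stage 2: regress Y on the stage-2 history summary X2 and treatment A2.
q2 = LinearRegression().fit(design(X2, A2), Y)
# Pseudo-outcome: predicted outcome under the best stage-2 action for each subject.
pseudo = np.maximum(q2.predict(design(X2, np.zeros(n))),
                    q2.predict(design(X2, np.ones(n))))

# Stage 1: regress the pseudo-outcome on the stage-1 history X1 and treatment A1.
q1 = LinearRegression().fit(design(X1, A1), pseudo)

# Estimated optimal rules: treat whenever the fitted treatment contrast is positive.
d2 = (q2.predict(design(X2, np.ones(n))) > q2.predict(design(X2, np.zeros(n)))).astype(int)
d1 = (q1.predict(design(X1, np.ones(n))) > q1.predict(design(X1, np.zeros(n)))).astype(int)
print("Share recommended treatment: stage 1 =", d1.mean(), ", stage 2 =", d2.mean())
```

Here each stage's Q-function is estimated by regression, and the stage-1 response is the pseudo-outcome obtained by maximizing the fitted stage-2 Q-function, which is the backward induction structure described above.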
6. Theoretical and Practical Implications
The proliferation of dynamic Q-learning methodologies has several key implications:
- Theoretical Guarantees. Dynamic Q-learning, under established model assumptions (cost-free termination, proper controls in SSPs, monotonicity, and boundedness via auxiliary constructions), enjoys convergence guarantees even in noncontractive, zero-sum, or unbounded settings (Yu, 2014, Ma et al., 2020). Consistency, asymptotic normality, and “oracle” properties in statistical DTR extensions have been proven under regularity and model selection conditions (Song et al., 2011).
- Stability and Sample Efficiency. Modern dynamic Q-learning algorithms integrate supervised objectives, experience replay, and function approximation to stabilize learning, ensure robustness to suboptimal data or sparse rewards, and accelerate convergence in both simulation and real applications (Lei et al., 9 Oct 2024, Yamagata et al., 2022).
- Limitations and Future Directions. Remaining challenges concern optimal tuning of step-sizes and function approximation architectures, adaptivity to highly nonstationary environments, and robustness under adversarial or model-misspecified dynamics. Extensions to multi-agent, competitive, and cooperative settings remain active areas of research.
Dynamic Q-learning, in its numerous incarnations, constitutes a central algorithmic paradigm across reinforcement learning, optimal control, and sequential decision-making, unifying statistical, game-theoretic, and algorithmic perspectives on learning in dynamic environments and providing foundations for high-impact applications in robotics, economics, and personalized medicine (Yu, 2014, Apte et al., 27 Nov 2024, Ma et al., 2020, Kim et al., 2020, Liu et al., 6 Apr 2024, Lei et al., 9 Oct 2024, Sun et al., 2023, Schulte et al., 2012, Liu et al., 6 Apr 2024, Song et al., 2011, Yamagata et al., 2022, Linn et al., 2014, Jeihaninejad et al., 2019).