Controlled Markov Noise

Updated 9 April 2026

Controlled Markov noise is a stochastic process framework that integrates control variables with Markovian randomness, enabling adaptive learning in non-stationary environments.
It leverages differential inclusions and two-timescale methods to analyze convergence and stability when noise is driven by algorithm states and control actions.
Applications include reinforcement learning, networked control, and filtering, offering robust strategies for handling non-i.i.d. and state-dependent noise.

Controlled Markov noise refers to a class of stochastic processes and associated approximation algorithms in which Markovian randomness is present, but where the transition dynamics of the underlying Markov process are modulated by control variables or by the internal state of the main process (the “iterates” in stochastic approximation). This construct arises centrally in reinforcement learning, control theory, estimation, adaptive algorithms, and the analysis of stochastic dynamical systems, where the noise structure is neither independent nor stationary, but driven by policies or algorithm state.

1. Controlled Markov Processes and Formulation

A controlled Markov process is defined on a compact metric state space $S$ with control space $U$. The controlled process $\{Z_n\}$ evolves as

$P\bigl(Z_{n+1}\in A \mid Z_n, U_n, X_n \bigr) = \int_A p(dz \mid Z_n, U_n, X_n)$

where

$X_n$ is an $\mathbb{R}^d$-valued iterate, possibly the main algorithmic variable,
$U_n$ is an exogenous control (which may itself be random or adaptive),
$p(\cdot \mid z, u, x)$ is a Markov kernel, continuous in $(z, u, x)$.

This framework subsumes the situation in which the noise injected into an algorithm at iteration $n$ comes not from a fixed (i.i.d.) source, but from a stochastic process whose conditional law depends on both the current iterate and the chosen control. The noise is thus both Markov and controlled—referred to as “controlled Markov noise” (Ramaswamy et al., 2015, Karmakar et al., 2015, Karmakar, 2020).
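As a rough illustration of the definition above, the following sketch samples a controlled Markov chain on a two-point state space. The state space $S = \{0, 1\}$, the control set $\{0, 1\}$, the scalar iterate, and the specific kernel `prob_next_is_one` are all assumptions chosen for readability; none of them comes from the cited papers.

```python
import random

def prob_next_is_one(z, u, x):
    """Hypothetical kernel p(. | z, u, x) on S = {0, 1}: returns P(Z_next = 1)."""
    base = 0.2 if z == 0 else 0.6          # dependence on the current noise state
    x_clipped = max(-1.0, min(1.0, x))     # clipping keeps the kernel continuous and bounded in x
    p = base + 0.2 * u + 0.1 * x_clipped   # dependence on the control and on the iterate
    return max(0.0, min(1.0, p))

def sample_next(z, u, x, rng):
    """Draw Z_{n+1} from p(. | Z_n = z, U_n = u, X_n = x)."""
    return 1 if rng.random() < prob_next_is_one(z, u, x) else 0

rng = random.Random(0)
z, x = 0, 0.5          # initial noise state and a frozen scalar iterate
path = []
for n in range(20):
    u = n % 2          # an arbitrary exogenous control sequence in {0, 1}
    z = sample_next(z, u, x, rng)
    path.append(z)
print(path)
```

Here the iterate `x` is held fixed only to keep the example short; in a stochastic approximation scheme it would change at every step and feed back into the kernel.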

2. Stochastic Approximation with Controlled Markov Noise

Stochastic approximation (SA) algorithms with controlled Markov noise take the form

$X_{n+1} = X_n + a_n \bigl[ h(X_n, Z_n) + M_{n+1} \bigr]$

where

$\{a_n\}$ is a step-size sequence,
$h$ is the “drift” function,
$\{M_{n+1}\}$ is a martingale-difference noise sequence (vanishing in conditional expectation given the past),
$\{Z_n\}$ is a controlled Markov chain as in the previous section.

A key feature is that $\{Z_n\}$ may be non-ergodic under stationary policies and may evolve in a continuous space. This is critical for RL applications, which often work with continuous state spaces or adaptively chosen policies (Ramaswamy et al., 2015).

The convergence and stability of such SA schemes require generalized arguments, as traditional ergodicity or independence of the noise cannot be assumed.
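A minimal sketch of such a scheme, assuming a scalar recursion of the form $x_{n+1} = x_n + a_n [h(x_n, z_n) + M_{n+1}]$ with drift $h(x, z) = z - x$ and a two-state noise chain whose transition probability depends on the current iterate. The functions `alpha`, the constant `BETA`, and the Gaussian noise level are illustrative assumptions. For this toy model, the ODE averaged over the stationary law of the frozen-iterate chain has the fixed point $x^* = (5 - \sqrt{17})/2$.

```python
import random

def alpha(x):
    """Hypothetical P(Z_next = 1 | Z = 0, X = x), clipped to [0.05, 0.95]."""
    return max(0.05, min(0.95, 0.5 - 0.25 * x))

BETA = 0.5  # hypothetical P(Z_next = 1 | Z = 1), independent of x

def run_sa(n_steps=50_000, seed=0):
    rng = random.Random(seed)
    x, z = 0.0, 0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)                    # sum a_n = inf, sum a_n^2 < inf
        m_next = rng.gauss(0.0, 0.1)           # martingale-difference noise M_{n+1}
        p_one = alpha(x) if z == 0 else BETA   # kernel depends on (Z_n, X_n)
        z_next = 1 if rng.random() < p_one else 0
        x = x + a_n * ((z - x) + m_next)       # drift h(x, z) = z - x
        z = z_next
    return x

x_final = run_sa()
# Fixed point of x' = pi_1(x) - x, where pi_1(x) = alpha(x) / (1 - BETA + alpha(x)).
x_star = (5 - 17 ** 0.5) / 2
print(x_final, x_star)
```

The noise here is neither i.i.d. nor stationary: its law shifts as the iterate moves, which is the situation the general theory is meant to cover.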

3. Limit Theorems, Differential Inclusions, and Stability

Modern analysis of SA with controlled Markov noise utilizes the ODE (or more generally, set-valued differential inclusion) method. Key steps include:

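Rescaled Drift and Limit Drift: for the drift function $h(x, z)$ and a scaling factor $c \ge 1$,

$h_c(x, z) = \frac{h(cx, z)}{c}$

and the set-valued limit drift

$H_\infty(x) = \overline{\mathrm{co}}\,\bigl\{ \lim_{c \to \infty} h_c(x, z) : z \in S \bigr\}$

Stability Theorem:

If $h(x, z)$ is Lipschitz in $x$, the controlled Markov process is as specified, martingale-difference noise has bounded second moment, and step sizes are standard ( $\sum_n a_n = \infty$, $\sum_n a_n^2 < \infty$ ), then the ODE (in general, a differential inclusion)

$\dot{x}(t) \in H_\infty(x(t))$

with a compact attracting set implies

$\sup_n \|X_n\| < \infty \quad \text{almost surely}$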

(Ramaswamy et al., 2015)

Similar frameworks exist for set-valued drift functions and for more general schemes where noise is non-additive, iterate-dependent, or includes Markov switching (Yaji et al., 2016).
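To make the scaling argument concrete, the sketch below evaluates the rescaled drift $h(cx, z)/c$ for a hypothetical drift $h(x, z) = -x + \sin z$. The drift, the grid of test points, and the scaling factors are assumptions for illustration. For this choice the bounded noise-dependent term is divided away as $c$ grows, so the limit is the single-valued field $-x$, whose ODE has the origin as a globally attracting point.

```python
import math

def h(x, z):
    """Hypothetical drift: linear restoring term plus a bounded noise-dependent offset."""
    return -x + math.sin(z)

def h_c(x, z, c):
    """Rescaled drift h_c(x, z) = h(c x, z) / c."""
    return h(c * x, z) / c

xs = (-2.0, 0.5, 3.0)
zs = (0.0, 1.0, 2.5)
gaps = {}
for c in (1, 10, 100, 1000):
    # Largest deviation of h_c from the candidate limit -x over the test grid.
    gaps[c] = max(abs(h_c(x, z, c) - (-x)) for x in xs for z in zs)
    print(c, gaps[c])
```

The printed gap shrinks in proportion to $1/c$, which is the behavior the stability test relies on: only the growth of the drift at infinity matters.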

4. Two-Timescale and Limit-Inclusion Theory

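Multi-timescale algorithms with controlled Markov noise are analyzed using coupled recursions: $X_{n+1} = X_n + a_n \bigl[ h(X_n, Y_n, Z^{(1)}_n) + M^{(1)}_{n+1} \bigr]$, $Y_{n+1} = Y_n + b_n \bigl[ g(X_n, Y_n, Z^{(2)}_n) + M^{(2)}_{n+1} \bigr]$ with $a_n / b_n \to 0$, so that $\{X_n\}$ evolves on the slow timescale and $\{Y_n\}$ on the fast one.

For each fixed $(x, y)$, ergodic occupation measures $D^{(i)}(x, y)$ of the controlled Markov processes are used to define averaged (set-valued) vector fields. The limiting behavior is characterized by differential inclusions,

$\dot{x}(t) \in \hat{h}(x(t)), \qquad \hat{h}(x) = \Bigl\{ \int h(x, \lambda(x), z)\, \mu(dz) : \mu \in D^{(1)}(x, \lambda(x)) \Bigr\}$

where $\lambda(x)$ is the globally attracting equilibrium for the fast-scale inclusion. Under stability and attractor assumptions, almost sure convergence to chain-transitive sets of the slow flow is established (Karmakar et al., 2015, Karmakar, 2020).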

This approach generalizes the classical ODE method, allowing for Markov iterates, coupling between timescales, and non-additive noise.
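A toy two-timescale recursion can illustrate the separation of scales. Everything in it is an assumption made for the example: a slow scalar iterate `x` with drift $1 - y$, a fast scalar iterate `y` with drift $x + z - y$, and a two-state noise chain whose kernel depends on `x` through the hypothetical function `p01`. With `x` frozen, the fast iterate is attracted to $\lambda(x) = x + \pi_1(x)$, where $\pi_1(x)$ is the stationary probability of state 1, and the slow ODE $\dot{x} = 1 - \lambda(x)$ then drives $\lambda(x)$ toward 1.

```python
import random

def p01(x):
    """Hypothetical P(Z_next = 1 | Z = 0, X = x), clipped to [0.1, 0.6]."""
    return max(0.1, min(0.6, 0.3 + 0.1 * x))

P11 = 0.7  # hypothetical P(Z_next = 1 | Z = 1)

def pi1(x):
    """Stationary probability of state 1 when the slow iterate is frozen at x."""
    return p01(x) / (p01(x) + 1.0 - P11)

def run_two_timescale(n_steps=50_000, seed=1):
    rng = random.Random(seed)
    x, y, z = 0.0, 0.0, 0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)            # slow step size
        b_n = 1.0 / (n + 1) ** 0.6     # fast step size, a_n / b_n -> 0
        z_next = 1 if rng.random() < (p01(x) if z == 0 else P11) else 0
        x_new = x + a_n * ((1.0 - y) + rng.gauss(0.0, 0.1))     # h(x, y, z) = 1 - y
        y_new = y + b_n * ((x + z - y) + rng.gauss(0.0, 0.1))   # g(x, y, z) = x + z - y
        x, y, z = x_new, y_new, z_next
    return x, y

x_final, y_final = run_two_timescale()
print(x_final, y_final, x_final + pi1(x_final))
```

In this sketch the fast iterate should end close to 1 and the slow iterate close to the point where $x + \pi_1(x) = 1$, up to simulation noise.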

5. Applications in Reinforcement Learning and Control

Controlled Markov noise is intrinsic in reinforcement learning algorithms—most notably, temporal-difference learning and actor-critic methods—where the observed data stream is Markov and depends both on the parameter updates and the interaction policy.

TD(0) with Function Approximation: Embeds the update in the SA framework with controlled Markov noise, verifying stability via the limiting ODE and confirming convergence to the correct Bellman fixed point under well-posedness (Ramaswamy et al., 2015).
Off-Policy Temporal Difference (TDC) Learning: The two-timescale SA theory with controlled Markov noise provides the first online convergence proof for off-policy TD(0) with linear function approximation, from a single stream of on-policy data, for any discount factor $\gamma \in (0, 1)$ and arbitrary deterministic or randomized target policy (Karmakar et al., 2015, Karmakar, 2020).
Risk-Sensitive Policy Evaluation: Explicit error bounds for risk-sensitive function approximation are derived via new spectral bounds utilizing the generalized framework (Karmakar, 2020).

A summary of such applications and the underlying methodology appears in (Karmakar, 2020, Ramaswamy et al., 2015, Yaji et al., 2016).
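For orientation, here is a small on-policy TD(0) sketch with linear function approximation, written as a stochastic approximation driven by the Markov state trajectory. The two-state chain `P`, rewards `R`, discount `GAMMA`, step-size schedule, and one-hot features `PHI` are all hypothetical. One-hot features are used so that the TD fixed point coincides with the exact value function and can be checked in closed form. The policy is fixed, so the chain does not depend on the iterate, which is a degenerate special case of the controlled setting.

```python
import random

P = [[0.9, 0.1], [0.2, 0.8]]     # hypothetical transition matrix under a fixed policy
R = [1.0, 0.0]                   # hypothetical reward received on leaving each state
GAMMA = 0.5
PHI = [[1.0, 0.0], [0.0, 1.0]]   # one-hot features

def value(theta, s):
    """Linear value estimate theta . phi(s)."""
    return sum(t * f for t, f in zip(theta, PHI[s]))

def td0(n_steps=100_000, seed=0):
    rng = random.Random(seed)
    theta, s = [0.0, 0.0], 0
    for n in range(n_steps):
        a_n = 10.0 / (n + 100)                        # diminishing step size
        s_next = 0 if rng.random() < P[s][0] else 1   # Markov noise: the state trajectory
        delta = R[s] + GAMMA * value(theta, s_next) - value(theta, s)   # TD error
        theta = [t + a_n * delta * f for t, f in zip(theta, PHI[s])]
        s = s_next
    return theta

theta = td0()
# Exact solution of V = R + GAMMA * P V for this 2-state chain (2x2 inverse by hand).
det = (1 - GAMMA * P[0][0]) * (1 - GAMMA * P[1][1]) - GAMMA * P[0][1] * GAMMA * P[1][0]
v_true = [
    ((1 - GAMMA * P[1][1]) * R[0] + GAMMA * P[0][1] * R[1]) / det,
    (GAMMA * P[1][0] * R[0] + (1 - GAMMA * P[0][0]) * R[1]) / det,
]
print(theta, v_true)
```

With richer features the same loop applies unchanged, but the limit is then the projected Bellman fixed point rather than the exact value function.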

6. Generalization: Set-Valued Drift, Non-Additive, and Hidden Markov Cases

Analyses of stochastic approximation with controlled Markov noise generalize to cases where the drift function is set-valued, non-additive, or coupled with hidden Markov processes.

Set-Valued (Differential Inclusions): Algorithms are of the form

$X_{n+1} - X_n - a_n M_{n+1} \in a_n H(X_n, Z_n)$

with step sizes $a_n$, martingale-difference noise $M_{n+1}$, Markov noise $\{Z_n\}$ driven by the current iterate, and $H$ a set-valued drift (Yaji et al., 2016).

Applications: The general theory supports controlled stochastic approximation, subgradient descent under Markov noise, approximate-drift algorithms, and recursions with only measurable (possibly discontinuous) drift functions. Limiting asymptotic behavior is characterized by averaged differential inclusions with respect to the stationary distributions of the Markov noise.
Hidden Markov Chains and Filtering: In problems with hidden Markov chains modulating drift or volatility (e.g., partially observed Markov-modulated diffusion processes), the noise affecting the decision process remains controlled and Markovian, and filtering (e.g., Wonham filter) reduces the model to a fully observed system with controlled Markov noise (Yang et al., 2014, Muravlev et al., 2019).
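The subgradient case mentioned above can be sketched with a one-dimensional example: minimizing the stationary average of $|x - Z|$ when $Z$ is a Markov chain. The drift is set-valued because the subdifferential of $|x - z|$ is the whole interval $[-1, 1]$ at $x = z$. The three-point state space, the target law `PI`, and the lazy chain construction are assumptions for illustration, and the chain here does not depend on the iterate.

```python
import random

PI = [0.2, 0.5, 0.3]   # hypothetical stationary distribution on S = {0, 1, 2}

def next_state(z, rng):
    """Lazy chain: stay put with probability 1/2, else redraw from PI (stationary law is PI)."""
    if rng.random() < 0.5:
        return z
    return rng.choices([0, 1, 2], weights=PI)[0]

def subgradient(x, z):
    """One element of the subdifferential of |x - z|; any value in [-1, 1] is valid at x == z."""
    if x > z:
        return 1.0
    if x < z:
        return -1.0
    return 0.0

def run(n_steps=20_000, seed=0):
    rng = random.Random(seed)
    x, z = 0.0, 0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)
        x -= a_n * subgradient(x, z)   # subgradient step against the current noise state
        z = next_state(z, rng)
    return x

x_final = run()
print(x_final)
```

The averaged inclusion pushes the iterate toward the median of `PI`, which is 1 for these weights, so the final value should lie close to 1.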

7. Controlled Markov Noise in Communication and Networked Control Systems

Control under networked or unreliable communication conditions naturally induces controlled Markov noise due to action-dependent stochastic packet drops.

Action-Dependent Markov Drop Models: The communication channel's state is a Markov chain whose transitions depend on the transmission decisions (control actions) and whose state determines packet drop probabilities. The random feedback to the plant and estimation error thus exhibits controlled Markov noise (Bose et al., 2019).
Event-Triggered Control and Stability: Event-triggered transmissions based on the expected performance function are designed using the controlled Markov noise model to ensure exponential convergence of the second moment and explicit upper bounds on the long-run transmission fraction (Bose et al., 2019).
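A toy simulation may help fix ideas about action-dependent drops. It is not the model or the triggering rule of Bose et al. (2019): the two-state channel, the tables `P_TO_BAD` and `P_DROP`, the scalar error dynamics with gain `a`, and the simple threshold trigger are all hypothetical choices. The point is only that the channel state is a Markov chain whose transitions depend on the transmit decision, so the error process is driven by controlled Markov noise.

```python
import random

# Hypothetical channel: state 0 = good, 1 = bad. Transmitting makes the bad state more likely.
# P_TO_BAD[tx][state] = P(next state is bad | transmit decision tx, current state).
P_TO_BAD = {0: {0: 0.1, 1: 0.3},
            1: {0: 0.3, 1: 0.6}}
P_DROP = {0: 0.1, 1: 0.7}   # packet drop probability in each channel state

def simulate(threshold=1.0, a=1.2, n_steps=20_000, seed=0):
    rng = random.Random(seed)
    err, chan, sent, sq_sum = 0.0, 0, 0, 0.0
    for _ in range(n_steps):
        tx = 1 if abs(err) > threshold else 0          # threshold-based transmission rule
        delivered = tx == 1 and rng.random() >= P_DROP[chan]
        sent += tx
        # The error resets on delivery, otherwise it grows with the unstable gain a.
        err = (0.0 if delivered else a * err) + rng.gauss(0.0, 0.5)
        chan = 1 if rng.random() < P_TO_BAD[tx][chan] else 0
        sq_sum += err * err
    return sent / n_steps, sq_sum / n_steps

tx_fraction, mean_sq_err = simulate()
print(tx_fraction, mean_sq_err)
```

Varying `threshold` trades the long-run transmission fraction against the empirical second moment of the error, which is the kind of trade-off the event-triggered designs quantify.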

This demonstrates the generality and practical necessity of the controlled Markov noise formalism in modern stochastic control, estimation, and learning applications.

References:

(Ramaswamy et al., 2015, Karmakar et al., 2015, Karmakar, 2020, Yaji et al., 2016, Yang et al., 2014, Muravlev et al., 2019, Bose et al., 2019)