Transition Kernel Recovery in Markov Chains

Updated 5 March 2026

Transition kernel recovery is a method to estimate the transition matrix of a Markov chain by decomposing the frequency matrix into a low-rank component and a sparse correction.
It employs a constrained least-squares approach that achieves deterministic error bounds and minimax rate-optimality under arbitrary noise dependencies.
Alternating minimization algorithms and separation lemmas drive efficient computations and robust theoretical guarantees in structured matrix recovery.

Transition kernel recovery is the problem of estimating the transition probability matrix of a Markov chain from observed data, particularly when the matrix admits a low-rank plus sparse decomposition with inherent incoherence. The recovery is motivated by the need to consistently estimate the structure of Markov kernels—even under arbitrary noise dependence among matrix entries—which is critical in statistical machine learning problems such as structured sequence modeling, multitask regression, covariance estimation, and reinforcement learning. The state-of-the-art theoretical and algorithmic framework proceeds by representing the Markov frequency matrix as the sum of a low-rank incoherent component and a sparse incoherent correction, then recovering both efficiently via a constrained least-squares approach with deterministic optimality guarantees, tight minimax rates, and extension to reinforcement learning conditional mean estimation (Chai et al., 2024).

1. Formal Framework for Structured Transition Kernel Recovery

Consider a discrete-time, time-homogeneous, ergodic, aperiodic Markov chain $(X_0, X_1, \ldots, X_n)$ over a finite state space of size $p$, with true but unknown transition kernel $P^* \in \mathbb{R}^{p \times p}$, where $P^*_{ij} = \Pr(X_{t+1}=j \mid X_t = i)$, $P^* \mathbf{1}_p = \mathbf{1}_p$, and $P^* \geq 0$ entrywise. The stationary distribution $\pi^*$, written as a row vector, satisfies $\pi^* P^* = \pi^*$. The central object is the long-run "frequency matrix" $F^* = \operatorname{diag}(\pi^*) P^*$, which is sufficient for recovering $P^*$ by row normalization: $P^*_{ij} = F^*_{ij} / \sum_{k} F^*_{ik}$.

A structural assumption posits $F^* = L^* + S^*$, where $L^*$ is low-rank (rank $r$) and incoherent, while $S^*$ is sparse (at most $s$ nonzero entries) and incoherent. The model permits arbitrary joint dependence among the entries of the noise matrix $W = F - F^*$, the deviation of the empirical frequency matrix $F$ from $F^*$, where $F_{ij} = \frac{1}{n} \sum_{t=0}^{n-1} \mathbf{1}\{X_t=i, X_{t+1}=j\}$.
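The definitions of $F$ and the row normalization can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `empirical_frequency_matrix` and `row_normalize`, the toy trajectory, and the choice to leave rows of unvisited states as zeros are all assumptions made for this example.

```python
import numpy as np

def empirical_frequency_matrix(states, p):
    """Return F with F[i, j] = (# of transitions i -> j) / n for a path X_0..X_n."""
    states = np.asarray(states, dtype=int)
    n = len(states) - 1  # number of observed transitions
    F = np.zeros((p, p))
    # np.add.at accumulates correctly when the same (i, j) pair repeats.
    np.add.at(F, (states[:-1], states[1:]), 1.0)
    return F / n

def row_normalize(F):
    """Map a frequency matrix to a transition matrix: P[i, j] = F[i, j] / sum_k F[i, k]."""
    row_sums = F.sum(axis=1, keepdims=True)
    # Rows of unvisited states carry no mass; they are left as zeros (a choice for this sketch).
    return np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)

# Toy trajectory on p = 3 states (hypothetical data).
path = [0, 1, 2, 0, 1, 1, 2, 0]
F = empirical_frequency_matrix(path, p=3)
P_hat = row_normalize(F)
```

The entries of `F` sum to one by construction, and each visited row of `P_hat` is a probability vector.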

Essential assumptions for identifiability and estimability include:

- Restricted strong convexity, which holds trivially for Frobenius loss with identity design.
- Incoherence of $L^*$: $\|U^*\|_{2,\infty}, \|V^*\|_{2,\infty} \leq \sqrt{\mu r / p}$, for the singular value decomposition $L^* = U^* \Sigma^* V^{*T}$.
- Sparsity: $\|S^*\|_0 \leq s \leq p/(8cr^4)$.
- Markov chain mixing: $\pi^*_{\min}>0$, mixing time $\tau^*$, and $\pi^*_{\max}$.
- No further assumption on $W$ beyond arbitrary entrywise dependence (Chai et al., 2024).

2. Incoherent-Constrained Least-Squares Estimator

Transition kernel recovery is formalized as a structured matrix estimation problem via the following optimization:

$\begin{aligned} (\widehat{L}, \widehat{S}) = \arg\min_{L, S, U, V, \Sigma} &\;\frac{1}{2}\|F-(L+S)\|_F^2 \\ \text{subject to } &\; L=U\Sigma V^T,\ U,V \in \mathcal{O}_{p,r}^{\bar\mu},\ \Sigma \text{ diagonal},\ \|S\|_0 \leq s, \end{aligned}$

where $\mathcal{O}_{p,r}^{\bar\mu}$ is the set of $p\times r$ semi-orthogonal, $\bar\mu$-incoherent matrices. The regularization parameters $\bar\mu$ (controls incoherence) and $s$ (controls sparsity) encode the structural prior.

This estimator is motivated by robustness to arbitrary noise dependence (entrywise) in $W$ , eschewing typical independence or sub-Gaussian designs. The estimator is tight in both deterministic and minimax senses, with theoretical analysis grounded in a novel separation lemma for low-rank incoherent matrices.

3. Theoretical Guarantees and Rates

Theoretical results establish deterministic error bounds and minimax rate-optimality:

General Deterministic Bound: For any noise $W$ , if $\Delta_L = \widehat{L} - L^*$ , $\Delta_S = \widehat{S} - S^*$ , then

$\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \leq \frac{128}{\kappa^2} ( r\|\mathfrak{X}^*(W)\|^2 + s\|\mathfrak{X}^*(W)\|_{\max}^2 ).$

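Here $\mathfrak{X}$ is the measurement operator, $\mathfrak{X}^*$ its adjoint, $\|\cdot\|$ the spectral norm, and $\kappa$ the restricted strong convexity constant. For identity measurement ($\mathfrak{X} = \mathrm{Id}$), $\kappa=1$ and $\mathfrak{X}^*(W) = W$.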

Stochastic Error under Markov Noise: For $\pi_{\max} = \max_i \pi^*_i$, $p_{\max} = \max_{i,j} P^*_{ij}$, and mixing time $\tau^*$, with probability at least $1-n^{-c}$:
- $\|W\| \leq C \sqrt{\pi_{\max} \tau^* (\log n)^2 / n}$
- $\|W\|_{\max} \leq C \sqrt{p_{\max} \pi_{\max} \tau^* (\log n)^2 / n}$ ,

for absolute constant $C$ .

Main Estimation Error Bounds: Provided the rank and sparsity levels supplied to the estimator are of the same order as the true $r$ and $s$, and $n\geq Cp\tau^*(\log n)^2$, with high probability,

$\|\widehat{F} - F^*\|_F \leq \sqrt{ c \pi_{\max} \tau^* (\log n)^2 / n \cdot ( r + p_{\max} s ) },$

and after row-normalizing,

$\|\widehat{P} - P^*\|_F \leq \frac{1}{\pi_{\min}} \|\widehat{F} - F^*\|_F, \quad \|\widehat{P} - P^*\|_1 \leq \frac{p}{\pi_{\min}} \|\widehat{F} - F^*\|_F.$

In the setting $\pi_{\min},\pi_{\max} = O(1/p)$ , $p_{\max} = O(1/p)$ , $\tau^* = O(1)$ , these specialize to

$\|\widehat{F}-F^*\|_F = O_p( \sqrt{ r/(n p) } ), \quad \|\widehat{P} - P^*\|_1 = O_p( \sqrt{ r p^3 / n } ),$

matching the minimax lower bounds attained by spectral estimators in the standard low-rank setting (Chai et al., 2024).
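The specialization can be checked by direct substitution. The following sketch suppresses constants and logarithmic factors and assumes, beyond the stated $O(\cdot)$ conditions, that $\pi_{\min}$ is of exact order $1/p$ and that $s \lesssim pr$, so the sparse term does not dominate:

$\|\widehat{F}-F^*\|_F^2 \lesssim \frac{\pi_{\max}\tau^*}{n}\,(r + p_{\max}s) \lesssim \frac{1}{np}\Big(r + \frac{s}{p}\Big) \lesssim \frac{r}{np},$

$\|\widehat{P}-P^*\|_F \leq \frac{1}{\pi_{\min}}\|\widehat{F}-F^*\|_F \lesssim p\sqrt{\frac{r}{np}} = \sqrt{\frac{rp}{n}}, \quad \|\widehat{P}-P^*\|_1 \leq \frac{p}{\pi_{\min}}\|\widehat{F}-F^*\|_F \lesssim p^2\sqrt{\frac{r}{np}} = \sqrt{\frac{rp^3}{n}}.$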

4. Separation Lemma for Incoherent Low-Rank Matrices

A central structural insight is encoded in the key separation lemma: For any two $\mu$ -incoherent rank- $r$ matrices $P, Q \in \mathbb{R}^{p \times p}$ ,

$\frac{ \|P-Q\|_{\max}^2 }{ \|P-Q\|_F^2 } \leq \frac{ \tilde{c} \mu r^4 }{ p }$

for a universal constant $\tilde{c}$. This asserts that the difference between two incoherent low-rank matrices cannot be "spiky," i.e., it cannot concentrate too much energy in a few entries. The lemma is instrumental in controlling cross-terms such as $\langle \Delta_L, \Delta_S \rangle$ in the theoretical analysis, thus enabling restricted strong convexity-type lower bounds. The proof proceeds by reduction to the equal-singular-value and orthonormal cases, bounding inner products of the factors, and solving a small linear program over the singular spectrum.
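The quantity on the left-hand side of the lemma is easy to inspect numerically. The sketch below is illustrative only and does not verify the lemma or its constant: it assumes random orthonormal factors obtained by QR of Gaussian matrices (which tend to be incoherent), and the helper names `spikiness` and `random_low_rank` are hypothetical. It contrasts the ratio for a difference of two such low-rank matrices with the ratio for a single-entry matrix, which attains the maximum value 1.

```python
import numpy as np

def spikiness(D):
    """Return ||D||_max^2 / ||D||_F^2, which always lies in [1/(p*q), 1] for a p x q matrix."""
    return np.max(np.abs(D)) ** 2 / np.sum(D ** 2)

def random_low_rank(p, r, rng):
    """Rank-r matrix U diag(sigma) V^T with orthonormal factors from QR of Gaussian matrices."""
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))
    sigma = rng.uniform(1.0, 2.0, size=r)
    return (U * sigma) @ V.T

rng = np.random.default_rng(0)
p, r = 200, 3
P = random_low_rank(p, r, rng)
Q = random_low_rank(p, r, rng)
ratio_low_rank = spikiness(P - Q)  # energy is spread over many entries

E = np.zeros((p, p))
E[0, 0] = 1.0
ratio_spike = spikiness(E)  # all energy in one entry, so the ratio is exactly 1
```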

5. Algorithmic Solution: Alternating Minimization

A practical approach to solving the structured recovery problem is an alternating minimization algorithm:

Sparse Update:

$S \leftarrow \mathcal{T}_s( F - U \Sigma V^T )$

where $\mathcal{T}_s$ is the hard-thresholding operator that retains only the $s$ largest entries in magnitude and sets the rest to zero.

Singular Value Update:

$\Sigma \leftarrow \mathrm{Diag}( U^T (F - S) V )$

Low-Rank Factors Update:

$U \leftarrow \arg\max_{U \in \mathcal{O}_{p,r}^{\bar\mu}} \langle (F-S) V \Sigma, U \rangle$

(similarly for $V$ ).

Termination occurs when $\|S_{k+1} - S_k\|_F$ falls below a specified threshold or after 500 iterations. The per-step computational cost is $O(p^2 r + p^2 \log p)$ . Empirically, convergence is typically achieved in fewer than 10 rounds in both noiseless and noisy cases, for i.i.d. Gaussian as well as empirical-probability noise (Chai et al., 2024).
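The three updates can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the incoherence constraint on $U$ and $V$ is dropped, so each factor update reduces to an orthogonal Procrustes step solved by an SVD; the factors are initialized from the top-$r$ SVD of $F$; and the function names, tolerance, and synthetic test matrix are all hypothetical.

```python
import numpy as np

def hard_threshold(M, s):
    """Keep the s largest-magnitude entries of M and zero out the rest."""
    out = np.zeros(M.shape)
    if s > 0:
        idx = np.argpartition(np.abs(M).ravel(), -s)[-s:]
        out.flat[idx] = M.flat[idx]
    return out

def polar_factor(A):
    """Maximizer of <A, U> over semi-orthogonal U (incoherence constraint omitted)."""
    W, _, Zt = np.linalg.svd(A, full_matrices=False)
    return W @ Zt

def alternating_minimization(F, r, s, max_iter=500, tol=1e-10):
    # Assumed initialization: top-r SVD of F with S = 0.
    U, sig, Vt = np.linalg.svd(F, full_matrices=False)
    U, Sigma, V = U[:, :r], np.diag(sig[:r]), Vt[:r].T
    S = np.zeros(F.shape)
    for _ in range(max_iter):
        S_new = hard_threshold(F - U @ Sigma @ V.T, s)  # sparse update
        R = F - S_new
        Sigma = np.diag(np.diag(U.T @ R @ V))           # singular value update
        U = polar_factor(R @ V @ Sigma)                 # left factor update
        V = polar_factor(R.T @ U @ Sigma)               # right factor update
        converged = np.linalg.norm(S_new - S) < tol     # stop when S stabilizes
        S = S_new
        if converged:
            break
    return U @ Sigma @ V.T, S

# Synthetic low-rank plus sparse matrix (not a real frequency matrix).
rng = np.random.default_rng(0)
p, r, s = 60, 2, 15
L_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, p)) / p
S_true = np.zeros((p, p))
S_true.flat[rng.choice(p * p, size=s, replace=False)] = 1.0
F = L_true + S_true
L_hat, S_hat = alternating_minimization(F, r, s)
```

Each update exactly minimizes the least-squares objective over its own block with the others held fixed, so the residual $\|F - L - S\|_F$ is non-increasing across iterations. Restoring the incoherence constraint would require projecting or constraining the factor updates, which this sketch does not attempt.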

6. Extension to Reinforcement Learning Conditional Mean Estimation

The framework extends to estimating the conditional mean operator, a key quantity in reinforcement learning. For any random feature vector $v \in \mathbb{R}^p$ that is independent of the chain data and satisfies $\mathbb{E}[vv^T] \preceq c_v I$, the estimator $\widehat{T}(v) = \widehat{P} v$ obeys

$\mathbb{E}\|\widehat{P} v - P^* v\|_2^2 \leq c_v \mathbb{E}\|\widehat{P} - P^*\|_F^2 = O( c_v r p / n ).$

This $rp/n$ rate improves dramatically over the worst-case $O(p^2/n)$ for fixed $v$, underscoring the statistical benefits of random features and structured estimation in this domain.
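The inequality in this bound follows from a short conditioning argument, sketched here with the shorthand $\Delta = \widehat{P} - P^*$ (introduced only for this derivation). Because $v$ is independent of the chain data, it is independent of $\Delta$, so

$\mathbb{E}\big[\|\Delta v\|_2^2 \mid \Delta\big] = \operatorname{tr}\big(\Delta\, \mathbb{E}[vv^T]\, \Delta^T\big) \leq c_v \operatorname{tr}(\Delta \Delta^T) = c_v \|\Delta\|_F^2,$

and taking the expectation over the chain data gives $\mathbb{E}\|\Delta v\|_2^2 \leq c_v \mathbb{E}\|\Delta\|_F^2$.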

7. Empirical Performance and Comparative Evaluation

Numerical experiments demonstrate:

- Rapid Convergence: The alternating minimization algorithm converges to zero (noiseless) or noise floor (noisy) error in approximately 5–10 steps.
- Error Scaling: The estimation error $\|\widehat{F} - F^*\|_F$ decays as $n^{-1/2}$ and $p^{-1/2}$, aligned with theoretical predictions.
- Practical Insensitivity to Incoherence Constraint: Empirical results show that imposing the incoherence constraint in every iteration alters performance minimally.
- Comparative Accuracy: Against spectral estimators (e.g., from Zhang–Wang 2019), the constrained method yields substantially improved accuracy when the frequency matrix is low-rank plus sparse (Chai et al., 2024).

A plausible implication is that these improvements carry over to high-dimensional Markov chain estimation tasks in which real-world data exhibit both low-rank global structure and sparse, incoherent perturbations.


References

Chai et al. (2024). Structured Matrix Learning under Arbitrary Entrywise Dependence and Estimation of Markov Transition Kernel.
