
Replicable RL Algorithms for Linear MDPs

Updated 17 September 2025
  • The paper introduces replicable RL algorithms in linear MDPs, using randomized rounding techniques (R-Hypergrid-Rounding) to achieve consistent value functions and policies.
  • It employs replicable primitives like R-Ridge-Regression and R-UC-Cov-Estimation, stabilizing function approximation and mitigating variance from stochastic updates.
  • The approach provides strong theoretical guarantees and empirical validation on benchmarks like CartPole, ensuring robust and scalable policy learning.

Replicable reinforcement learning (RL) algorithms for linear Markov decision processes (MDPs) constitute a critical research area addressing reliability, sample efficiency, and computational feasibility of RL agents trained under function approximation. Replicability, in this context, refers to the property that repeated executions of an RL algorithm on different datasets—drawn from the same underlying distribution and with shared algorithmic randomness—produce the same or nearly identical learned solutions. This property is particularly important in linear MDPs due to the high-dimensional, often unstable nature of function approximation and the desire for robust, scalable learning methods.

1. Foundations of Replicability in Linear MDPs

The formal notion of algorithmic replicability in reinforcement learning, as developed for linear MDPs, requires that for any pair of runs—with independent (but identically distributed) data and identical algorithmic random seeds—the output policy or value function remains unchanged with high probability, up to numerical precision (Eaton et al., 10 Sep 2025). While replicable algorithms have been constructed for tabular RL, extending these guarantees to the linear setting is significantly more challenging due to instability in regression and covariance estimation, compositional error propagation, and dependency on randomly constructed training samples and internal randomness.
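Concretely, a standard way to formalize this requirement (stated schematically here; the paper's definition may differ in constants and quantifiers) is that an algorithm A with internal randomness r is ρ-replicable if

\Pr_{D, D' \sim \mathcal{D}^n,\; r}\bigl[A(D; r) = A(D'; r)\bigr] \;\ge\; 1 - \rho,

where D and D' are independent n-sample datasets drawn from the same distribution and the random seed r is shared by both runs.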

A linear MDP is parameterized by a feature mapping φ: S × A → ℝ^d over the state space S and action space A, with transitions and/or value functions assumed to be linear or approximately linear in φ. The replicability challenge lies in stabilizing the entire training trajectory, particularly in the presence of function approximation and random-design regression.
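For concreteness, the standard linear MDP assumption (as commonly stated in the linear function approximation literature; the paper's precise formulation may differ in details) posits unknown measures μ_h and vectors θ_h at each step h such that

P_h(s' \mid s, a) = \phi(s, a)^\top \mu_h(s'), \qquad r_h(s, a) = \phi(s, a)^\top \theta_h,

which guarantees that the action-value function of every policy is itself linear in φ and can therefore be fit by ridge regression on the features.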

2. Core Algorithmic Components for Replicability

Two core algorithmic primitives are instrumental for enabling replicability in linear-MDP-based RL:

  • Replicable Ridge Regression ("R-Ridge-Regression"): The ridge regression estimator

\hat{w} = \left(\sum_{i=1}^N x_i x_i^\top + \lambda I\right)^{-1} \sum_{i=1}^N x_i y_i

is postprocessed via R-Hypergrid-Rounding—a randomized grid-based rounding mechanism that uses a fixed, shared random shift. Each entry w_j of the weight vector is snapped to a grid of width α, so that outputs which would otherwise differ only slightly across runs become both accurate and exactly identical up to the rounding discretization (Eaton et al., 10 Sep 2025). This suppresses differences in model weights arising from infinitesimal differences in the sample set or from numerical artifacts. (A minimal code sketch of both primitives follows this list.)

  • Replicable Uncentered Covariance Estimation ("R-UC-Cov-Estimation"): For covariance estimation needed for exploration functions or UCB-style bonuses, the sample covariance

\hat{\Sigma}_{jl} = \frac{1}{T} \sum_{t=1}^T x_t^{(j)} x_t^{(l)}

is rounded elementwise on the upper triangle (again via R-Hypergrid-Rounding) and then mirrored to ensure symmetry and positive semidefiniteness. For any two estimates within Frobenius distance Δ of one another, the probability that the rounded results coincide is at least 1 − d²Δ/α, where α is the grid width.
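The following Python sketch illustrates the two primitives under simplifying assumptions: the function names, the round-to-nearest variant of hypergrid rounding, and the use of a seeded NumPy generator for the shared shift are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def r_hypergrid_round(v, alpha, rng):
    """Snap each entry of v to a grid of width alpha whose offset is drawn
    from rng. Because the seed (and hence the offset) is shared across runs,
    nearby estimates from different runs land on the same grid point with
    high probability."""
    v = np.asarray(v, dtype=float)
    shift = rng.uniform(0.0, alpha, size=v.shape)  # shared random grid offset
    return shift + alpha * np.round((v - shift) / alpha)

def r_ridge_regression(X, y, lam, alpha, rng):
    """Ridge regression followed by grid rounding of the weight vector."""
    d = X.shape[1]
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return r_hypergrid_round(w_hat, alpha, rng)

def r_uc_cov_estimation(X, alpha, rng):
    """Uncentered covariance, rounded on the upper triangle and mirrored
    so the returned matrix is symmetric."""
    T, d = X.shape
    sigma = (X.T @ X) / T
    iu = np.triu_indices(d)
    rounded = np.zeros((d, d))
    rounded[iu] = r_hypergrid_round(sigma[iu], alpha, rng)
    return rounded + rounded.T - np.diag(np.diag(rounded))

# Two "runs" on fresh data produce identical rounded outputs as long as each
# run starts its generator from the same seed, e.g. np.random.default_rng(0).
```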

By integrating these primitives, RL algorithms such as value iteration or policy iteration produce replicable value functions and policies. For example, in "R-LSVI with core set", a replicable variant of least-squares value iteration is run on a burst core set of state-action pairs, using only rounded outputs from R-Ridge-Regression (Eaton et al., 10 Sep 2025).

3. Efficient Replicable RL Algorithms

Provably efficient replicable RL algorithms in the linear setting are constructed by careful composition of replicable regression, covariance estimation, and reinforcement learning procedures. Key examples include:

  • Replicable LSVI with Core Set: Approximate policy computation by running fitted Q-iteration (FQI) or least-squares value iteration (LSVI), restricting all regression and matrix estimation to the core set and rounding every output via the replicable procedures. The policy π is then defined (with parameter vector(s) snapped to the grid) and satisfies, with high probability, exact or near-exact output equality across runs under identical randomness.
  • Replicable LSVI-UCB: Where UCB-style exploration bonuses are applied, the Mahalanobis bonus term

\beta \cdot \sqrt{\phi(s,a)^\top \bar{\Lambda}^{-1} \phi(s,a)}

uses the rounded covariance estimate Λ̄, so that across two reruns the bonus term is identical provided the pre-rounded matrices are within the prescribed tolerance. This ensures that the learned Q-values and the resulting policy remain consistent, provided other elements (such as reward estimation) are also replicably rounded.
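As a rough illustration of how the rounded quantities enter the Q-value computation, the sketch below uses hypothetical helper names and assumes the weight vector and covariance matrix have already been produced by the replicable primitives above; it is not the paper's exact algorithm.

```python
import numpy as np

def ucb_bonus(phi_sa, lambda_bar_inv, beta):
    """Mahalanobis bonus beta * sqrt(phi^T Lambda_bar^{-1} phi) for a single
    (s, a) pair, computed from the pre-rounded matrix Lambda_bar: two runs
    whose rounded matrices coincide obtain identical bonuses."""
    return beta * np.sqrt(phi_sa @ lambda_bar_inv @ phi_sa)

def optimistic_q_values(phi_batch, w_bar, lambda_bar, beta):
    """Q(s, a) = phi(s, a)^T w_bar + bonus for a batch of feature vectors,
    where w_bar is the grid-rounded regression weight vector and lambda_bar
    the grid-rounded covariance estimate."""
    lambda_bar_inv = np.linalg.inv(lambda_bar)
    linear_part = phi_batch @ w_bar
    # Batched version of ucb_bonus: phi_n^T Lambda_bar^{-1} phi_n for each row.
    quad_forms = np.einsum("nd,de,ne->n", phi_batch, lambda_bar_inv, phi_batch)
    return linear_part + beta * np.sqrt(quad_forms)
```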

Theoretical analyses provide tight sample complexity bounds. For instance, for replicable ridge regression, to achieve ε-accuracy in the weights (in the ℓ₂ norm), it suffices to take

N = \Omega\!\left(\frac{(B + Y)^2 d^3}{\lambda^2 \varepsilon^2 (\rho - 2\delta)^2} \log \frac{1}{\delta}\right)

samples, where B and Y bound the covariates and labels, λ is the regularization parameter, δ the confidence, and ρ the replicability slack (Eaton et al., 10 Sep 2025). For R-LSVI with core set in a generative model with episode horizon H, state/action feature dimension d, and core set size k, the required number of samples scales as

N = \Omega\!\left(\frac{d^6 k^3 H^{23}}{\varepsilon^8 (\rho - 2\delta)^2} \log \frac{H}{\delta}\right).
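As a quick sanity check on how the ridge regression bound scales, the expression can be evaluated numerically; the hidden constant is set to 1 here, so the output indicates scaling only, not an actual required sample size.

```python
import math

def ridge_sample_bound(B, Y, d, lam, eps, rho, delta):
    """Evaluate (B+Y)^2 d^3 / (lam^2 eps^2 (rho - 2 delta)^2) * log(1/delta),
    i.e. the replicable ridge regression bound with its constant suppressed."""
    assert rho > 2 * delta, "the replicability parameter must exceed 2*delta"
    return ((B + Y) ** 2 * d ** 3) / (lam ** 2 * eps ** 2 * (rho - 2 * delta) ** 2) * math.log(1 / delta)

# Example: d = 10, eps = 0.1, rho = 0.1, delta = 0.01, B = Y = lam = 1
print(f"{ridge_sample_bound(1, 1, 10, 1.0, 0.1, 0.1, 0.01):.3g}")  # ~2.9e8 (scaling only)
```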

4. Theoretical and Experimental Guarantees

Rigorous theoretical results guarantee both statistical efficiency and replicability. If the data distribution and the shared grid rounding (seed and width) are identical, the outputs of replicable algorithms (policies and value functions) coincide with high probability. Major theorems establish, for example, that the error between the rounded regression estimator and the exact ridge minimizer is negligible (in the ℓ₂ or prediction norm), and that for sufficiently large sample sizes and a suitably chosen grid width α, the final output is both accurate and highly likely to match exactly across repeated runs.

Empirical evaluation on environments such as CartPole and Atari (Breakout, MsPacman) confirms these guarantees: the “percentage of most common identical weight vector” approaches unity after a modest fraction of the total data is observed, and the mean return remains stable and competitive with standard (non-rounded) RL. For deep RL, directly rounding neural parameters is often impractical, so quantized Q-values or quantized network outputs are used as a replicability proxy. These quantized outputs, when combined with regularization, yield high action agreement and competitive returns.
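A minimal sketch of this quantization proxy, assuming greedy action selection and hypothetical helper names (the paper's exact quantization scheme may differ):

```python
import numpy as np

def quantize(values, alpha, shift):
    """Snap values to a shared grid of width alpha with a shared offset,
    mirroring the hypergrid rounding used for linear models."""
    return shift + alpha * np.round((values - shift) / alpha)

def action_agreement(q_run_a, q_run_b, alpha, shift):
    """Fraction of evaluation states on which two independently trained runs
    select the same greedy action after their Q-values are quantized.
    q_run_a, q_run_b: arrays of shape (num_states, num_actions)."""
    acts_a = np.argmax(quantize(q_run_a, alpha, shift), axis=1)
    acts_b = np.argmax(quantize(q_run_b, alpha, shift), axis=1)
    return float(np.mean(acts_a == acts_b))
```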

5. Implications for RL Stability and Policy Consistency

The instability of traditional RL under function approximation—due to stochastic updates, sampling variance, or trajectory propagation—raises significant concerns for scientific reproducibility and for deployment in safety-critical domains. The replicable RL algorithms presented address this challenge by discretizing the solution space in a data-adaptive way that is random yet deterministic given the shared seed. As a result, repeated executions on new data drawn from the same process do not lead to policy drift.

Potential implications include enhanced testability of RL systems, increased robustness in training neural policies via replicated quantization, and foundational progress towards large-scale, function-approximation-based RL methods with formal consistency guarantees.

6. Methodological Innovations and Future Directions

The methodology underlying replicable RL in linear MDPs leverages randomized discretization at the level of estimation and matrix computation, compositionally stable RL algorithm architectures, and careful worst-case error control with explicit probabilistic bounds. The central innovation is the R-Hypergrid-Rounding procedure, which can be generalized or replaced by other stochastic rounding methods provided they admit coupling properties.

Open research directions include extending these results to nonlinear function approximation (e.g., deep RL), integrating replicable feature learning (since replicability currently presumes fixed features), and engineering practical replicable RL algorithms for continuous spaces. The techniques may also cross-inform the design of robust, stable algorithms in adjacent areas of statistical learning.

7. Comparative Summary: Technical and Practical Consequences

The main algorithmic components, their replicability mechanisms, and the associated statistical guarantees are:

  • R-Ridge-Regression: grid rounding (R-Hypergrid); guarantee ‖w̄ − θ*‖ ≤ ε.
  • R-UC-Cov-Estimation: grid rounding on the covariance; guarantee ‖Σ̄ − Σ*‖_F ≤ δ.
  • Replicable LSVI (core set): all training/estimation via R-steps; guarantee max_x |x^⊤(w̄ − θ*)| ≤ ε.
  • Replicable LSVI-UCB (exploration bonus): R-steps for the bonus matrices; guarantee of high-probability policy matching across runs.

The practical upshot is that replicability in linear RL is no longer unattainable: the combination of replicable regression and covariance, composed within established RL procedures, produces high-confidence guarantees for outcome stability in algorithmic policy learning—an essential pillar for robust machine learning and real-world RL systems (Eaton et al., 10 Sep 2025).
