Linear Policy Approximations
- Linear policy approximations are models that represent policies or value functions as a linear combination of fixed features, offering computational tractability and precise error bounds.
- They are applied in offline evaluation, policy gradients, and model-free control through techniques like FQI, LSTD, and Natural Policy Gradient.
- Theoretical guarantees including stability, invertibility, and sample efficiency provide clear criteria for effective deployment in RL and control.
Linear policy approximations are a foundational class of methods in reinforcement learning (RL) and control, characterized by representing policies (or value functions) as linear parameterizations over fixed feature maps. These methods enable computational, theoretical, and statistical tractability in both classical and modern RL contexts, including model-based control, policy-gradient optimization, and offline policy evaluation. Linear architectures, combined with tailored algebraic and control-theoretic conditions, underpin much of the finite-sample analysis and algorithmic guarantees for a broad range of policy evaluation and improvement routines.
1. Problem Setting and Linear Parameterizations
A Markov Decision Process (MDP) is specified by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the (possibly large or infinite) state space, $\mathcal{A}$ is the action space, $P$ is the transition kernel, $r$ is the reward function, and $\gamma \in (0,1)$ is the discount factor. The target is to evaluate or optimize the value function or policy given either online or offline data.
Linear function approximation instantiates value or Q-functions as $Q(s,a) = \phi(s,a)^\top w$ for a feature map $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and parameter vector $w \in \mathbb{R}^d$, where the feature dimension $d$ is typically much smaller than the size of the state-action space. For policy-based methods, "log-linear" (softmax) parameterizations take the form
$$\pi_\theta(a \mid s) = \frac{\exp\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\big(\theta^\top \phi(s,a')\big)},$$
where $\theta \in \mathbb{R}^d$ parameterizes the policy. This linear structure is key for statistical efficiency, computational simplicity, and explicit error analysis across settings from online policy optimization to model-free control and partially observed MDPs (Perdomo et al., 2022, Alfano et al., 2022, Yuan et al., 2022, Srikanth, 27 May 2024, Kara, 20 May 2025).
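As a concrete illustration of these two parameterizations, the following minimal sketch (with an illustrative feature matrix `phi_sa`; nothing here is taken from the cited papers) evaluates a linear Q-function and a log-linear policy at a single state:

```python
import numpy as np

def linear_q(phi_sa, w):
    """Linear Q-values: Q(s, a) = phi(s, a)^T w for each row of phi_sa."""
    return phi_sa @ w

def log_linear_policy(phi_sa, theta):
    """Softmax (log-linear) policy: pi(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = phi_sa @ theta
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: one state, 3 actions, d = 4 features (all numbers are illustrative).
rng = np.random.default_rng(0)
phi_sa = rng.normal(size=(3, 4))            # row a holds phi(s, a)
w, theta = rng.normal(size=4), rng.normal(size=4)
print(linear_q(phi_sa, w))                  # Q-value per action
print(log_linear_policy(phi_sa, theta))     # action probabilities (sum to 1)
```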
2. Algorithms: Value-Based and Policy-Based Linear Methods
2.1 Value Function Approximations: FQI and LSTD
In offline policy evaluation, common linear estimators include Fitted Q-Iteration (FQI) and Least Squares Temporal Difference (LSTD):
- FQI: Performs iterative policy evaluation via
$$w_{k+1} = \Sigma_{\mathrm{cov}}^{-1}\big(\theta_\phi + \gamma\, \Sigma_{\mathrm{cr}}\, w_k\big),$$
where the covariance matrices $\Sigma_{\mathrm{cov}} = \mathbb{E}[\phi(s,a)\phi(s,a)^\top]$ and $\Sigma_{\mathrm{cr}} = \mathbb{E}[\phi(s,a)\phi(s',a')^\top]$ (empirical or population) are defined with respect to the data distribution and future state-action pairs, and $\theta_\phi = \mathbb{E}[\phi(s,a)\, r(s,a)]$. Convergence to the true weights occurs under algebraic stability conditions.
- LSTD: Directly solves the linear system
$$\big(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}\big)\, w = \theta_\phi,$$
which corresponds to a projected Bellman equation in feature space. This method is computationally efficient and admits well-characterized conditions for consistency. Both estimators are sketched in code after this list.
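A minimal sketch of both estimators, assuming the population moments are replaced by empirical averages over offline transitions $(s, a, r, s', a')$ (the function and variable names are illustrative, not from the cited work):

```python
import numpy as np

def empirical_moments(phi, rewards, phi_next, reg=1e-8):
    """Empirical versions of Sigma_cov, Sigma_cr, and theta_phi from offline transitions.

    phi[i] = phi(s_i, a_i), phi_next[i] = phi(s'_i, a'_i), rewards[i] = r_i.
    """
    n, d = phi.shape
    sigma_cov = phi.T @ phi / n + reg * np.eye(d)   # E[phi phi^T] (regularized)
    sigma_cr = phi.T @ phi_next / n                 # E[phi phi'^T]
    theta_phi = phi.T @ rewards / n                 # E[phi r]
    return sigma_cov, sigma_cr, theta_phi

def fqi(sigma_cov, sigma_cr, theta_phi, gamma, num_iters=500):
    """Fitted Q-Iteration: w_{k+1} = Sigma_cov^{-1} (theta_phi + gamma Sigma_cr w_k)."""
    w = np.zeros_like(theta_phi)
    for _ in range(num_iters):
        w = np.linalg.solve(sigma_cov, theta_phi + gamma * sigma_cr @ w)
    return w

def lstd(sigma_cov, sigma_cr, theta_phi, gamma):
    """LSTD: solve (Sigma_cov - gamma Sigma_cr) w = theta_phi in one shot."""
    return np.linalg.solve(sigma_cov - gamma * sigma_cr, theta_phi)
```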
2.2 Policy Gradient and Log-Linear Policies
Natural Policy Gradient (NPG) with log-linear policy parametrization leverages the geometry of policy space:
- The natural-gradient update for softmax policies with linear features is given by
$$\theta_{t+1} = \theta_t + \eta_t\, w_t,$$
where $w_t$ is the compatible parameter obtained by regressing $Q$-values against the policy score features, and $\eta_t$ is a geometrically increasing step size (see the sketch after this list).
- NPG and Q-NPG updates are equivalent to mirror descent with Kullback-Leibler (KL) divergence. The resulting algorithm enjoys linear convergence up to an explicit error floor governed by the bias and statistical error in the value function regression, distribution mismatch between exploration and evaluation, and the condition number of the feature covariance matrix (Alfano et al., 2022, Yuan et al., 2022, Srikanth, 27 May 2024).
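A hedged sketch of one Q-NPG outer iteration, assuming the compatible critic is fit by ridge-regularized least squares on sampled state-action features and estimated Q-value targets (the helper names and the target-generation step are placeholders, not the cited algorithms):

```python
import numpy as np

def fit_compatible_critic(phi_sa, q_targets, reg=1e-8):
    """Least-squares fit of Q-value targets against state-action features."""
    d = phi_sa.shape[1]
    gram = phi_sa.T @ phi_sa + reg * np.eye(d)
    return np.linalg.solve(gram, phi_sa.T @ q_targets)

def q_npg_step(theta, phi_sa, q_targets, step_size):
    """One Q-NPG update: theta <- theta + eta_t * w_t, with w_t the compatible critic."""
    w = fit_compatible_critic(phi_sa, q_targets)
    return theta + step_size * w

def geometric_steps(eta0, growth, num_iters):
    """Geometrically increasing step-size schedule eta_t = eta0 * growth**t."""
    return [eta0 * growth**t for t in range(num_iters)]
```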
2.3 Model-Free Linear Control
In linear-quadratic regulator (LQR) settings, Linear Policy Iteration (PI/LSPI) alternates between:
- Policy Evaluation: Utilizing LSTD-Q to fit $Q(x,u) \approx \phi(x,u)^\top w$ for quadratic features $\phi(x,u) = \mathrm{svec}\big([x;u][x;u]^\top\big)$,
- Policy Improvement: Computing a new linear state-feedback controller $u = K_{t+1}\, x$ by minimizing the fitted quadratic Q-function over actions.
Finite sample and regret bounds are derived explicitly in terms of the state and action dimensions (Krauth et al., 2019).
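The quadratic featurization and greedy improvement step can be sketched as follows, assuming the fitted weight vector has been reshaped into a symmetric matrix $P$ with $Q(x,u) = [x;u]^\top P\,[x;u]$ (the `svec` convention below is one common choice, not necessarily the one used in the cited work):

```python
import numpy as np

def svec(M):
    """Stack the upper triangle of a symmetric matrix, scaling off-diagonals by sqrt(2)."""
    rows, cols = np.triu_indices(M.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * M[rows, cols]

def quadratic_features(x, u):
    """phi(x, u) = svec(z z^T) with z = [x; u], so phi(x, u)^T w is a quadratic form in (x, u)."""
    z = np.concatenate([x, u])
    return svec(np.outer(z, z))

def greedy_gain(P, n_x):
    """Given Q(x, u) = [x; u]^T P [x; u], the minimizing controller is u = -P_uu^{-1} P_ux x."""
    P_ux, P_uu = P[n_x:, :n_x], P[n_x:, n_x:]
    return -np.linalg.solve(P_uu, P_ux)
```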
3. Necessary and Sufficient Conditions for Linear Policy Evaluation
A complete characterization of the regimes where linear estimators are tractable in offline RL has been established:
- Stability (FQI-Solvability): The whitened Bellman operator $\gamma\, \Sigma_{\mathrm{cov}}^{-1/2} \Sigma_{\mathrm{cr}}\, \Sigma_{\mathrm{cov}}^{-1/2}$ must have spectral radius strictly less than one. This is necessary and sufficient for FQI convergence.
- Invertibility (LSTD Identifiability): The matrix $\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}$ must be full rank. This is strictly weaker than stability and is both necessary and sufficient for consistency of any linear moment-matching method.
- Lower Bound: If $\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}$ is singular, no moment-based linear estimator can identify the true value function even with infinite data (for a fixed set of low-order moments).
The relationships among these regimes can be summarized in the following inclusion diagram:
| Condition Type | Implies | Strictness Order |
|---|---|---|
| Stability (FQI convergence) | Invertibility | Stability ⊂ Invertibility |
| Invertibility (LSTD identifiability) | Solvability of the projected Bellman equation | Invertibility ⊂ Solvability |
Examples include settings where LSTD works but FQI diverges (invertible but not stable), and cases where no linear method can succeed due to rank-deficient observation distributions (Perdomo et al., 2022).
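These regimes can be checked numerically from the (empirical) moment matrices; a minimal sketch in the notation above, with an arbitrary tolerance `tol` standing in for an exact rank test:

```python
import numpy as np

def inv_sqrt_psd(M):
    """Inverse square root of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def check_regimes(sigma_cov, sigma_cr, gamma, tol=1e-10):
    """Numerically test FQI stability and LSTD invertibility from the moment matrices."""
    W = inv_sqrt_psd(sigma_cov)
    whitened = gamma * W @ sigma_cr @ W                    # whitened Bellman operator
    rho = np.max(np.abs(np.linalg.eigvals(whitened)))      # spectral radius
    smin = np.linalg.svd(sigma_cov - gamma * sigma_cr, compute_uv=False).min()
    return {"fqi_stable": rho < 1.0, "lstd_invertible": smin > tol,
            "spectral_radius": rho, "sigma_min": smin}
```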
4. Sample Complexity and Statistical Error Analysis
Instance-dependent and regime-aware sample complexity guarantees replace classical worst-case bounds:
- FQI: Errors decay at the parametric $O(1/\sqrt{n})$ rate, scaled by the solution $P$ of the Lyapunov equation associated with the whitened Bellman operator.
- LSTD: Errors decay at the same $O(1/\sqrt{n})$ rate, scaled by the inverse of the smallest singular value $\sigma_{\min}(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}})$ of the projected Bellman map.
- NPG/Q-NPG: Both enjoy linear convergence up to an error floor, with the per-iteration sample requirement governed by the statistical error of the compatible value regression (Alfano et al., 2022, Yuan et al., 2022).
Distribution mismatch—quantified via visitation density ratios and feature coverage—directly affects convergence rates and statistical error.
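Schematically, and suppressing constants, logarithmic terms, and higher-order effects (this is a stylized summary of the scalings above, not the exact statements of the cited theorems), the instance-dependent errors read
$$\|\hat{w}_{\mathrm{FQI}} - w^{\star}\| \;\lesssim\; \frac{\|P\|}{\sqrt{n}}, \qquad \|\hat{w}_{\mathrm{LSTD}} - w^{\star}\| \;\lesssim\; \frac{1}{\sigma_{\min}\!\big(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}\big)} \sqrt{\frac{d}{n}},$$
where $P$ solves the Lyapunov equation associated with the whitened Bellman operator and $n$ is the number of offline transitions.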
In the context of LQR, model-free approximate PI yields an $\varepsilon$-optimal controller with sample complexity polynomial in the state and action dimensions and in $1/\varepsilon$, with the policy evaluation phase being the sample bottleneck (Krauth et al., 2019).
5. Extensions: Structural and Algorithmic Variants
5.1 First-Order and Relational MDPs
Linear policy/value representations generalize to first-order MDPs (FOMDPs), where basis functions are first-order logical case-statements over objects and relations. Approximate Linear Programming (FOALP) and Approximate Policy Iteration (FOAPI) maintain symbolic linear programs, automatically generate new bases via regression and greedy heuristics, and use compact decomposition for universally quantified rewards (Sanner et al., 2012).
5.2 Basis Generation and State-Relevance
Self-guided ALPs iteratively enrich the linear architecture via random feature sampling and monotonic guiding constraints, providing non-decreasing lower bounds on the optimal value and improved worst-case policy gaps. Theoretical results guarantee approximation error, with high-dimensional applications in inventory control and option pricing (Pakiman et al., 2020).
5.3 Partially Observed MDPs
Partial observability can be addressed with finite-memory policies and linear approximators over feature maps of observation–action histories. Convergence of the resulting value estimates and Q-learning iterates holds under filter stability and appropriately relaxed covariance conditions, with special cases (perfect linearity, discretized basis) removing some requirements on the exploration policy (Kara, 20 May 2025).
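As one illustrative construction of finite-memory features (a hypothetical sketch; the feature maps in the cited work may differ), a sliding window of the last $N$ observation-action pairs can be flattened into the vector fed to the linear approximator:

```python
import numpy as np
from collections import deque

class FiniteMemoryFeatures:
    """Linear features built from a sliding window of the last N observation-action pairs."""

    def __init__(self, memory_len, obs_dim, num_actions):
        self.memory_len = memory_len
        self.chunk_dim = obs_dim + num_actions
        self.num_actions = num_actions
        self.window = deque(maxlen=memory_len)

    def update(self, obs, action):
        """Push the newest observation together with a one-hot encoding of the action taken."""
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        self.window.append(np.concatenate([obs, one_hot]))

    def features(self):
        """Flatten the window into one vector, zero-padding if fewer than N steps have occurred."""
        pad = [np.zeros(self.chunk_dim)] * (self.memory_len - len(self.window))
        return np.concatenate(pad + list(self.window))
```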
6. Empirical Performance and Computational Considerations
Recent empirical work demonstrates that in low-dimensional, discrete action MDPs:
- Log-linear NPG with compatible linear critics matches or outperforms neural-network policy-gradient baselines such as TRPO and PPO in both sample efficiency and compute time, provided that features are well designed.
- Linear approximations yield robustness to moderate observation noise and are computationally preferable when hand-crafted features are available, with significant speedups noted over neural network approaches in classical domains like CartPole and Acrobot (Srikanth, 27 May 2024).
A plausible implication is that, for small to moderate state/action spaces or settings where high-quality features can be engineered, linear policy approximation provides a strong tradeoff between practical performance and theoretical tractability.
7. Limitations and Open Directions
While linear approximations deliver strong statistical and computational properties, their expressivity is fundamentally limited by the choice of features. Manual basis construction is often necessary and performance degrades in highly nonlinear or large-scale domains unless supplemented by rich feature engineering or randomized bases.
Open directions include:
- Automatic, data-driven or adaptive basis selection schemes,
- Integration with nonlinear architectures while retaining the convexity properties of linear formulations,
- Scalable variants for high-dimensional or structured domains, and
- Theoretical analysis for hybrid or deep value architectures informed by the linear case (Pakiman et al., 2020, Kara, 20 May 2025).