Linear Policy Approximations
- Linear policy approximations are models that represent policies or value functions as a linear combination of fixed features, offering computational tractability and precise error bounds.
- They are applied in offline evaluation, policy gradients, and model-free control through techniques like FQI, LSTD, and Natural Policy Gradient.
- Theoretical guarantees including stability, invertibility, and sample efficiency provide clear criteria for effective deployment in RL and control.
Linear policy approximations are a foundational class of methods in reinforcement learning (RL) and control, characterized by representing policies (or value functions) as linear parameterizations over fixed feature maps. These methods enable computational, theoretical, and statistical tractability in both classical and modern RL contexts, including model-based control, policy-gradient optimization, and offline policy evaluation. Linear architectures, combined with tailored algebraic and control-theoretic conditions, underpin much of the finite-sample analysis and algorithmic guarantees for a broad range of policy evaluation and improvement routines.
1. Problem Setting and Linear Parameterizations
A Markov Decision Process (MDP) is specified by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the (possibly large or infinite) state space, $\mathcal{A}$ is the action space, $P$ is the transition kernel, $r$ is the reward function, and $\gamma \in (0,1)$ is the discount factor. The target is to evaluate or optimize the value function or policy given either online or offline data.
Linear function approximation instantiates value or Q-functions as $Q(s,a) = \phi(s,a)^\top w$ for a feature map $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and parameter vector $w \in \mathbb{R}^d$, where the feature dimension $d$ is typically much smaller than the size of the state-action space. For policy-based methods, "log-linear" (softmax) parameterizations take the form
$$\pi_\theta(a \mid s) = \frac{\exp\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\big(\theta^\top \phi(s,a')\big)},$$
where $\theta \in \mathbb{R}^d$ parameterizes the policy. This linear structure is key for statistical efficiency, computational simplicity, and explicit error analysis across settings from online policy optimization to model-free control and partially observed MDPs (Perdomo et al., 2022, Alfano et al., 2022, Yuan et al., 2022, Srikanth, 27 May 2024, Kara, 20 May 2025).
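As a concrete illustration of these two parameterizations, the following minimal sketch (with an illustrative feature matrix `phi_sa`; nothing here is taken from the cited papers) evaluates a linear Q-function and a log-linear policy at a single state:

```python
import numpy as np

def linear_q(phi_sa, w):
    """Linear Q-values: Q(s, a) = phi(s, a)^T w for each row of phi_sa."""
    return phi_sa @ w

def log_linear_policy(phi_sa, theta):
    """Softmax (log-linear) policy: pi(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = phi_sa @ theta
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: one state, 3 actions, d = 4 features (all numbers are illustrative).
rng = np.random.default_rng(0)
phi_sa = rng.normal(size=(3, 4))            # row a holds phi(s, a)
w, theta = rng.normal(size=4), rng.normal(size=4)
print(linear_q(phi_sa, w))                  # Q-value per action
print(log_linear_policy(phi_sa, theta))     # action probabilities (sum to 1)
```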
2. Algorithms: Value-Based and Policy-Based Linear Methods
2.1 Value Function Approximations: FQI and LSTD
In offline policy evaluation, common linear estimators include Fitted Q-Iteration (FQI) and Least Squares Temporal Difference (LSTD):
- FQI: Performs iterative policy evaluation via
$$w_{k+1} = \Sigma_{\mathrm{cov}}^{-1}\big(\theta_\phi + \gamma\, \Sigma_{\mathrm{cr}}\, w_k\big),$$
where the covariance matrices $\Sigma_{\mathrm{cov}} = \mathbb{E}[\phi(s,a)\phi(s,a)^\top]$ and $\Sigma_{\mathrm{cr}} = \mathbb{E}[\phi(s,a)\phi(s',a')^\top]$ (empirical or population) are defined with respect to the data distribution and future state-action pairs, and $\theta_\phi = \mathbb{E}[\phi(s,a)\, r(s,a)]$. Convergence to the true weights occurs under algebraic stability conditions.
- LSTD: Directly solves the linear system
$$\big(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}\big)\, w = \theta_\phi,$$
which corresponds to a projected Bellman equation in feature space. This method is computationally efficient and admits well-characterized conditions for consistency. Both estimators are sketched in code after this list.
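A minimal sketch of both estimators, assuming the population moments are replaced by empirical averages over offline transitions $(s, a, r, s', a')$ (the function and variable names are illustrative, not from the cited work):

```python
import numpy as np

def empirical_moments(phi, rewards, phi_next, reg=1e-8):
    """Empirical versions of Sigma_cov, Sigma_cr, and theta_phi from offline transitions.

    phi[i] = phi(s_i, a_i), phi_next[i] = phi(s'_i, a'_i), rewards[i] = r_i.
    """
    n, d = phi.shape
    sigma_cov = phi.T @ phi / n + reg * np.eye(d)   # E[phi phi^T] (regularized)
    sigma_cr = phi.T @ phi_next / n                 # E[phi phi'^T]
    theta_phi = phi.T @ rewards / n                 # E[phi r]
    return sigma_cov, sigma_cr, theta_phi

def fqi(sigma_cov, sigma_cr, theta_phi, gamma, num_iters=500):
    """Fitted Q-Iteration: w_{k+1} = Sigma_cov^{-1} (theta_phi + gamma Sigma_cr w_k)."""
    w = np.zeros_like(theta_phi)
    for _ in range(num_iters):
        w = np.linalg.solve(sigma_cov, theta_phi + gamma * sigma_cr @ w)
    return w

def lstd(sigma_cov, sigma_cr, theta_phi, gamma):
    """LSTD: solve (Sigma_cov - gamma Sigma_cr) w = theta_phi in one shot."""
    return np.linalg.solve(sigma_cov - gamma * sigma_cr, theta_phi)
```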
2.2 Policy Gradient and Log-Linear Policies
Natural Policy Gradient (NPG) with log-linear policy parametrization leverages the geometry of policy space:
- The natural-gradient update for softmax policies with linear features is given by
$$\theta_{t+1} = \theta_t + \eta_t\, w_t,$$
where $w_t$ is the compatible parameter obtained by regressing $Q$-values against the policy score features, and $\eta_t$ is a geometrically increasing step size (see the sketch after this list).
- NPG and Q-NPG updates are equivalent to mirror descent with Kullback-Leibler (KL) divergence. The resulting algorithm enjoys linear convergence up to an explicit error floor governed by the bias and statistical error in the value function regression, distribution mismatch between exploration and evaluation, and the condition number of the feature covariance matrix (Alfano et al., 2022, Yuan et al., 2022, Srikanth, 27 May 2024).
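A hedged sketch of one Q-NPG outer iteration, assuming the compatible critic is fit by ridge-regularized least squares on sampled state-action features and estimated Q-value targets (the helper names and the target-generation step are placeholders, not the cited algorithms):

```python
import numpy as np

def fit_compatible_critic(phi_sa, q_targets, reg=1e-8):
    """Least-squares fit of Q-value targets against state-action features."""
    d = phi_sa.shape[1]
    gram = phi_sa.T @ phi_sa + reg * np.eye(d)
    return np.linalg.solve(gram, phi_sa.T @ q_targets)

def q_npg_step(theta, phi_sa, q_targets, step_size):
    """One Q-NPG update: theta <- theta + eta_t * w_t, with w_t the compatible critic."""
    w = fit_compatible_critic(phi_sa, q_targets)
    return theta + step_size * w

def geometric_steps(eta0, growth, num_iters):
    """Geometrically increasing step-size schedule eta_t = eta0 * growth**t."""
    return [eta0 * growth**t for t in range(num_iters)]
```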
2.3 Model-Free Linear Control
In linear-quadratic regulator (LQR) settings, Linear Policy Iteration (PI/LSPI) alternates between:
- Policy Evaluation: Utilizing LSTD-Q to fit $Q(x,u) \approx \phi(x,u)^\top w$ for quadratic features $\phi(x,u) = \mathrm{svec}\big([x;u][x;u]^\top\big)$,
- Policy Improvement: Computing a new linear state-feedback controller $u = K_{t+1}\, x$ by minimizing the fitted quadratic Q-function over actions.
Finite sample and regret bounds are derived explicitly in terms of the state and action dimensions (Krauth et al., 2019).
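The quadratic featurization and greedy improvement step can be sketched as follows, assuming the fitted weight vector has been reshaped into a symmetric matrix $P$ with $Q(x,u) = [x;u]^\top P\,[x;u]$ (the `svec` convention below is one common choice, not necessarily the one used in the cited work):

```python
import numpy as np

def svec(M):
    """Stack the upper triangle of a symmetric matrix, scaling off-diagonals by sqrt(2)."""
    rows, cols = np.triu_indices(M.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * M[rows, cols]

def quadratic_features(x, u):
    """phi(x, u) = svec(z z^T) with z = [x; u], so phi(x, u)^T w is a quadratic form in (x, u)."""
    z = np.concatenate([x, u])
    return svec(np.outer(z, z))

def greedy_gain(P, n_x):
    """Given Q(x, u) = [x; u]^T P [x; u], the minimizing controller is u = -P_uu^{-1} P_ux x."""
    P_ux, P_uu = P[n_x:, :n_x], P[n_x:, n_x:]
    return -np.linalg.solve(P_uu, P_ux)
```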
3. Necessary and Sufficient Conditions for Linear Policy Evaluation
A complete characterization of the regimes where linear estimators are tractable in offline RL has been established:
- Stability (FQI-Solvability): The whitened Bellman operator $\gamma\, \Sigma_{\mathrm{cov}}^{-1/2} \Sigma_{\mathrm{cr}}\, \Sigma_{\mathrm{cov}}^{-1/2}$ must have spectral radius strictly less than one. This is necessary and sufficient for FQI convergence.
- Invertibility (LSTD Identifiability): The matrix $\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}$ must be full rank. This is strictly weaker than stability and is both necessary and sufficient for consistency of any linear moment-matching method.
- Lower Bound: If $\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}$ is singular, no moment-based linear estimator can identify the true value function even with infinite data (for a fixed set of low-order moments).
The relationships among these regimes can be summarized in the following inclusion diagram:
| Condition Type | Implies | Strictness Order |
|---|---|---|
| Stability (FQI convergence) | Invertibility | Stability ⊂ Invertibility |
| Invertibility (LSTD identifiability) | Solvability of the projected Bellman equation | Invertibility ⊂ Solvability |
Examples include settings where LSTD works but FQI diverges (invertible but not stable), and cases where no linear method can succeed due to rank-deficient observation distributions (Perdomo et al., 2022).
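These regimes can be checked numerically from the (empirical) moment matrices; a minimal sketch in the notation above, with an arbitrary tolerance `tol` standing in for an exact rank test:

```python
import numpy as np

def inv_sqrt_psd(M):
    """Inverse square root of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def check_regimes(sigma_cov, sigma_cr, gamma, tol=1e-10):
    """Numerically test FQI stability and LSTD invertibility from the moment matrices."""
    W = inv_sqrt_psd(sigma_cov)
    whitened = gamma * W @ sigma_cr @ W                    # whitened Bellman operator
    rho = np.max(np.abs(np.linalg.eigvals(whitened)))      # spectral radius
    smin = np.linalg.svd(sigma_cov - gamma * sigma_cr, compute_uv=False).min()
    return {"fqi_stable": rho < 1.0, "lstd_invertible": smin > tol,
            "spectral_radius": rho, "sigma_min": smin}
```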
4. Sample Complexity and Statistical Error Analysis
Instance-dependent and regime-aware sample complexity guarantees replace classical worst-case bounds:
- FQI: Errors decay at the parametric $O(1/\sqrt{n})$ rate, scaled by the solution $P$ of the Lyapunov equation associated with the whitened Bellman operator.
- LSTD: Errors decay at the same $O(1/\sqrt{n})$ rate, scaled by the inverse of the smallest singular value $\sigma_{\min}(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}})$ of the projected Bellman map.
- NPG/Q-NPG: Both enjoy linear convergence up to an error floor, with the per-iteration sample requirement governed by the statistical error of the compatible value regression (Alfano et al., 2022, Yuan et al., 2022).
Distribution mismatch—quantified via visitation density ratios and feature coverage—directly affects convergence rates and statistical error.
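Schematically, and suppressing constants, logarithmic terms, and higher-order effects (this is a stylized summary of the scalings above, not the exact statements of the cited theorems), the instance-dependent errors read
$$\|\hat{w}_{\mathrm{FQI}} - w^{\star}\| \;\lesssim\; \frac{\|P\|}{\sqrt{n}}, \qquad \|\hat{w}_{\mathrm{LSTD}} - w^{\star}\| \;\lesssim\; \frac{1}{\sigma_{\min}\!\big(\Sigma_{\mathrm{cov}} - \gamma\, \Sigma_{\mathrm{cr}}\big)} \sqrt{\frac{d}{n}},$$
where $P$ solves the Lyapunov equation associated with the whitened Bellman operator and $n$ is the number of offline transitions.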
In the context of LQR, model-free approximate PI yields an $\varepsilon$-optimal controller with sample complexity polynomial in the state and action dimensions and in $1/\varepsilon$, with the policy evaluation phase being the sample bottleneck (Krauth et al., 2019).
5. Extensions: Structural and Algorithmic Variants
5.1 First-Order and Relational MDPs
Linear policy/value representations generalize to first-order MDPs (FOMDPs), where basis functions are first-order logical case-statements over objects and relations. Approximate Linear Programming (FOALP) and Approximate Policy Iteration (FOAPI) maintain symbolic linear programs, automatically generate new bases via regression and greedy heuristics, and use compact decomposition for universally quantified rewards (Sanner et al., 2012).
5.2 Basis Generation and State-Relevance
Self-guided ALPs iteratively enrich the linear architecture via random feature sampling and monotonic guiding constraints, providing non-decreasing lower bounds on the optimal value and improved worst-case policy gaps. Theoretical results guarantee approximation error, with high-dimensional applications in inventory control and option pricing (Pakiman et al., 2020).
5.3 Partially Observed MDPs
Partial observability can be addressed with finite-memory policies and linear approximators over feature maps of observation–action histories. Convergence of the resulting value estimates and Q-learning iterates holds under filter stability and appropriately relaxed covariance conditions, with special cases (perfect linearity, discretized basis) removing some requirements on the exploration policy (Kara, 20 May 2025).
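As one illustrative construction of finite-memory features (a hypothetical sketch; the feature maps in the cited work may differ), a sliding window of the last $N$ observation-action pairs can be flattened into the vector fed to the linear approximator:

```python
import numpy as np
from collections import deque

class FiniteMemoryFeatures:
    """Linear features built from a sliding window of the last N observation-action pairs."""

    def __init__(self, memory_len, obs_dim, num_actions):
        self.memory_len = memory_len
        self.chunk_dim = obs_dim + num_actions
        self.num_actions = num_actions
        self.window = deque(maxlen=memory_len)

    def update(self, obs, action):
        """Push the newest observation together with a one-hot encoding of the action taken."""
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        self.window.append(np.concatenate([obs, one_hot]))

    def features(self):
        """Flatten the window into one vector, zero-padding if fewer than N steps have occurred."""
        pad = [np.zeros(self.chunk_dim)] * (self.memory_len - len(self.window))
        return np.concatenate(pad + list(self.window))
```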
6. Empirical Performance and Computational Considerations
Recent empirical work demonstrates that in low-dimensional, discrete action MDPs:
- Log-linear NPG with compatible linear critics matches or outperforms neural-network policy-gradient baselines such as TRPO and PPO in both sample efficiency and compute time, provided that features are well designed.
- Linear approximations yield robustness to moderate observation noise and are computationally preferable when hand-crafted features are available, with significant speedups noted over neural network approaches in classical domains like CartPole and Acrobot (Srikanth, 27 May 2024).
A plausible implication is that, for small to moderate state/action spaces or settings where high-quality features can be engineered, linear policy approximation provides a strong tradeoff between practical performance and theoretical tractability.
7. Limitations and Open Directions
While linear approximations deliver strong statistical and computational properties, their expressivity is fundamentally limited by the choice of features. Manual basis construction is often necessary and performance degrades in highly nonlinear or large-scale domains unless supplemented by rich feature engineering or randomized bases.
Open directions include:
- Automatic, data-driven or adaptive basis selection schemes,
- Integration with nonlinear architectures while retaining the convexity properties of linear formulations,
- Scalable variants for high-dimensional or structured domains, and
- Theoretical analysis for hybrid or deep value architectures informed by the linear case (Pakiman et al., 2020, Kara, 20 May 2025).