Rank-One Modified Value Iteration (R1-VI)
- Rank-One Modified Value Iteration (R1-VI) is an algorithm that uses rank-one approximations of the transition kernel to accelerate planning and reinforcement learning in Markov Decision Processes.
- It improves convergence by applying a rank-one deflation step, built from the stationary distribution, that removes the error component aligned with the leading eigenvector of the transition matrix.
- R1-VI is effective in both model-based and model-free settings, offering substantial speedups especially in high discount-factor scenarios and well-connected MDPs.
Rank-One Modified Value Iteration (R1-VI) is a class of algorithms for accelerated planning and reinforcement learning in Markov Decision Processes (MDPs). The central paradigm is to enhance (policy or value) iteration by replacing the transition kernel in the policy evaluation step with a rank-one approximation, typically constructed using the stationary distribution of the policy-induced Markov chain. This modification yields provably faster convergence under favorable spectral properties, often at no additional asymptotic cost per iteration. The approach appears in both planning (model-based) and learning (model-free, e.g., Q-learning) contexts, and connects directly to theoretical developments in matrix splitting and deflation from numerical linear algebra.
1. Foundations: MDP Setup and Standard Value Iteration
Consider a finite, discounted MDP defined by:
- State space $\mathcal{S}$
- Action space $\mathcal{A}$
- Transition kernel $P(s' \mid s, a)$, representing the probability of transitioning from state $s$ to $s'$ under action $a$
- Reward function $r(s, a)$
- Discount factor $\gamma \in (0, 1)$
For any stationary deterministic policy $\pi$, the induced transition matrix is $P_\pi$ with $(P_\pi)_{s s'} = P(s' \mid s, \pi(s))$, and the value function $V_\pi$ satisfies the Bellman equation $V_\pi = r_\pi + \gamma P_\pi V_\pi$, where $r_\pi(s) = r(s, \pi(s))$.
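Since the kernel and rewards are fixed once the policy is fixed, policy evaluation is a linear solve. A minimal NumPy sketch with a hypothetical two-state chain (illustrative numbers, not taken from the cited papers):

```python
import numpy as np

# Policy evaluation as a linear solve: V_pi = (I - gamma * P_pi)^{-1} r_pi.
# Hypothetical two-state chain and rewards (illustrative values).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
gamma = 0.9

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution satisfies the Bellman equation V = r + gamma * P V.
residual = np.max(np.abs(V_pi - (r_pi + gamma * P_pi @ V_pi)))
```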
The classical planning problem is to compute the optimal value function
$$V^* = \max_\pi V_\pi,$$
with the Bellman optimality operator $T$ defined as
$$(TV)(s) = \max_{a \in \mathcal{A}} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \Big].$$
Standard value iteration (VI) proceeds by repeatedly applying $T$: $V_{k+1} = T V_k$. This process contracts errors at rate $\gamma$ in the $\ell_\infty$ norm; convergence slows substantially as $\gamma \to 1$.
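The operator can be sketched directly; the MDP below is a hypothetical two-state, two-action example used only to illustrate the $\gamma$-contraction of value iteration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
# P[a, s, s'] = probability of s -> s' under action a; r[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 0.3]])
gamma = 0.95

def bellman(V):
    # (T V)(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    return np.max(r + gamma * np.einsum('aij,j->ia', P, V), axis=1)

# Repeated application converges to V*, but only at rate gamma per sweep.
V = np.zeros(2)
for _ in range(500):
    V = bellman(V)
```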
2. Rank-One Approximation and Deflation in Policy Evaluation
Under mild ergodicity conditions, the stochastic matrix $P_\pi$ has a unique stationary distribution $\mu_\pi$ (i.e., $\mu_\pi^\top P_\pi = \mu_\pi^\top$). The rank-one approximation is constructed as
$$\hat{P}_\pi = \mathbf{1} \mu_\pi^\top,$$
with $\mathbf{1}$ denoting the all-ones vector. This is the optimal rank-one approximation with respect to the spectral radius under irreducibility and aperiodicity (Kolarijani et al., 3 May 2025). In deflation-based methods, the leading eigenspace contribution is explicitly subtracted: $P_\pi - \mathbf{1} \mu_\pi^\top$.
The stationary distribution $\mu_\pi$ can be efficiently estimated using the power method, $\mu_{k+1}^\top = \mu_k^\top P_\pi$. Empirically, one power-method step per iteration suffices if the policy does not change rapidly.
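A sketch of the power-method estimate on a hypothetical two-state chain (whose exact stationary distribution is $[2/3, 1/3]$):

```python
import numpy as np

# Power-method estimate of the stationary distribution of a row-stochastic matrix.
# Hypothetical two-state chain (illustrative values).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

mu = np.full(2, 0.5)          # start from the uniform distribution
for _ in range(100):
    mu = mu @ P_pi            # one power step: mu^T <- mu^T P_pi
    mu /= mu.sum()            # guard against round-off drift
```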
3. Rank-One Modified Value Iteration: Algorithmic Structure
Two functionally equivalent update paradigms exist: the “policy-iteration with rank-one evaluation” form (Kolarijani et al., 3 May 2025), and the matrix-deflation SOR splitting form (Lee et al., 2024). Both produce the same core iteration up to ordering of substeps.
Policy Iteration–Style R1-VI
At iteration $k$:
- Construct the greedy policy $\pi_k(s) \in \arg\max_a \big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_k(s') \big]$ and build $P_{\pi_k}$.
- Estimate $\mu_k$ via one power step, $\mu_k^\top = \mu_{k-1}^\top P_{\pi_k}$.
- Compute the Bellman update $T V_k = r_{\pi_k} + \gamma P_{\pi_k} V_k$.
- Apply the rank-one correction:
$$V_{k+1} = T V_k + \frac{\gamma}{1 - \gamma} \big( \mu_k^\top (T V_k - V_k) \big) \mathbf{1}.$$
Matrix Splitting–Style R1-VI
Given a splitting $M - N = I - \gamma P_\pi$ with $M = I - \gamma \mathbf{1} \mu_\pi^\top$ and $N = \gamma (P_\pi - \mathbf{1} \mu_\pi^\top)$:
- Form $M^{-1} = I + \frac{\gamma}{1 - \gamma} \mathbf{1} \mu_\pi^\top$
- Compute $V_{k+1} = M^{-1} (N V_k + r_\pi)$
This is a direct application of the Sherman–Morrison (or Woodbury) identity to invert $M$, enabling an efficient closed-form correction at cost $O(|\mathcal{S}|^2)$ per iteration, matching standard VI.
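The closed form can be verified numerically; the sketch below checks the Sherman–Morrison identity $(I - \gamma \mathbf{1}\mu^\top)^{-1} = I + \frac{\gamma}{1-\gamma} \mathbf{1}\mu^\top$, which holds because $\mu^\top \mathbf{1} = 1$, on random data:

```python
import numpy as np

# Numerical check of the Sherman-Morrison closed form used to invert M:
# (I - g * 1 mu^T)^{-1} = I + g/(1-g) * 1 mu^T, since mu^T 1 = 1.
n, gamma = 4, 0.9
rng = np.random.default_rng(0)
mu = rng.random(n)
mu /= mu.sum()                          # mu is a probability vector
one = np.ones((n, 1))

M = np.eye(n) - gamma * (one @ mu[None, :])
M_inv = np.eye(n) + gamma / (1 - gamma) * (one @ mu[None, :])

err = np.max(np.abs(M @ M_inv - np.eye(n)))
```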
Full pseudocode combining these steps is given in (Kolarijani et al., 3 May 2025) and (Lee et al., 2024).
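The policy-iteration-style steps above can be sketched in NumPy as follows; the MDP is a hypothetical two-state, two-action example, and this is a sketch rather than the papers' exact pseudocode:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P[a, s, s'] transition probs, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 0.3]])
gamma, n = 0.95, 2

V = np.zeros(n)
mu = np.full(n, 1 / n)                 # running stationary-distribution estimate
for _ in range(300):
    q = r + gamma * np.einsum('aij,j->ia', P, V)   # action values under current V
    pi = np.argmax(q, axis=1)                      # greedy policy pi_k
    P_pi = P[pi, np.arange(n), :]                  # induced transition matrix
    mu = mu @ P_pi                                 # one power-method step
    mu /= mu.sum()
    TV = q[np.arange(n), pi]                       # Bellman update T V_k
    # Rank-one correction: V_{k+1} = T V_k + gamma/(1-gamma) * (mu^T (T V_k - V_k)) 1
    V = TV + gamma / (1 - gamma) * (mu @ (TV - V))

# V approximates the optimal value function V*.
```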
4. Theoretical Guarantees and Convergence Analysis
For policy evaluation, the rank-one deflation method removes the error component aligned with the top eigenspace. Spectral arguments show that the evaluation error propagates through the matrix $\gamma (P_\pi - \mathbf{1} \mu_\pi^\top)$, whose spectral radius is $\gamma |\lambda_2|$, where $\lambda_2$ is the subdominant eigenvalue of $P_\pi$. This yields exponentially faster convergence than standard VI (rate $\gamma$) whenever $|\lambda_2| < 1$ (Lee et al., 2024).
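The rate claim can be checked numerically: deflating a hypothetical two-state chain with eigenvalues $\{1, 0.7\}$ leaves an error-propagation matrix with spectral radius $\gamma \cdot 0.7$:

```python
import numpy as np

# Error-propagation matrix of the rank-one splitting is gamma * (P - 1 mu^T);
# deflation replaces the unit eigenvalue of P by 0, leaving |lambda_2|.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # eigenvalues: 1 and 0.7
mu = np.array([2 / 3, 1 / 3])         # stationary distribution of P
gamma = 0.95
one = np.ones((2, 1))

E = gamma * (P - one @ mu[None, :])
rate = np.max(np.abs(np.linalg.eigvals(E)))   # spectral radius = gamma * 0.7
```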
In the control (greedy-improvement) context, the R1-VI iterates $V_k$ converge to the unique Bellman fixed point $V^*$ at least at linear rate $\gamma$: $\|V_k - V^*\|_\infty \le C \gamma^k$ [(Kolarijani et al., 3 May 2025), Theorem 3.1]. The per-iteration computational complexity is $O(|\mathcal{S}|^2 |\mathcal{A}|)$.
5. Extension to Q-Learning and Model-Free Settings
The rank-one update concept extends to Q-learning by approximating the state-action transition kernel $P_\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$ induced by the greedy policy with a rank-one matrix $\mathbf{1} \mu^\top$, where $\mu$ is its stationary distribution over state-action pairs.

For the model-free setting, empirical Bellman operators $\hat{T}_k$ are sampled, and the Rank-One Q-Learning (R1-QL) procedure applies the corrected stochastic-approximation update
$$Q_{k+1} = Q_k + \alpha_k \Big( \hat{T}_k Q_k - Q_k + \frac{\gamma}{1 - \gamma} \big( \mu_k^\top (\hat{T}_k Q_k - Q_k) \big) \mathbf{1} \Big),$$
with $\alpha_k$ a Robbins–Monro step size [(Kolarijani et al., 3 May 2025), Algorithm 2]. Under standard stochastic approximation assumptions, $Q_k \to Q^*$ almost surely at the same sample complexity as classical Q-learning.
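A hedged sketch of a synchronous R1-QL-style update under a generative model; the sampling scheme, the state-action power step for $\mu_k$, and all numeric values are illustrative assumptions, not the paper's exact Algorithm 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP with a generative model (illustrative values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 0.3]])
gamma, nS, nA = 0.9, 2, 2

Q = np.zeros((nS, nA))
mu = np.full(nS * nA, 1 / (nS * nA))   # stationary estimate over state-action pairs
for k in range(1, 20001):
    alpha = k ** -0.6                  # Robbins-Monro step size
    # Sample one next state s' for every (s, a) pair (synchronous updates).
    s_next = np.array([[rng.choice(nS, p=P[a, s]) for a in range(nA)]
                       for s in range(nS)])
    TQ = r + gamma * np.max(Q[s_next], axis=-1)    # empirical Bellman operator
    # One power step for mu under the greedy policy's state-action kernel.
    pi = np.argmax(Q, axis=1)
    nu = np.einsum('sa,asj->j', mu.reshape(nS, nA), P)   # next-state marginal
    mu_next = np.zeros((nS, nA))
    mu_next[np.arange(nS), pi] = nu
    mu = mu_next.ravel()
    # Rank-one-corrected stochastic-approximation update; the correction adds the
    # same scalar to every (s, a) entry, mirroring the R1-VI correction term.
    delta = (TQ - Q).ravel()
    Q = Q + alpha * (TQ - Q + gamma / (1 - gamma) * (mu @ delta))
```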
6. Empirical Performance and Practical Considerations
Experimental benchmarks include Garnet MDPs and random graph MDPs (Kolarijani et al., 3 May 2025), as well as chain-walk and grid-maze tasks (Lee et al., 2024).
Observed results:
- R1-VI and its rank-one deflation variants converge nearly as fast as full policy iteration.
- Substantially fewer iterations are required than for VI, Nesterov-VI, Anderson-VI, or Speedy Q-learning, particularly as $\gamma \to 1$.
- In model-free settings, R1-QL matches or outperforms Speedy-Q, Zap-Q, and standard Q-learning in both Bellman error and value error.
Implementation and cost:
- To estimate the stationary distribution $\mu_\pi$ (acting as the deflation vector), one power-method step per iteration suffices.
- The incremental overhead per iteration is limited to one matrix-vector multiplication and a few inner products.
- Periodic re-estimation or smoothing of $\mu$ can control potential instability when the greedy policy $\pi_k$ changes rapidly.
Table: Empirical Comparison (per (Kolarijani et al., 3 May 2025) and (Lee et al., 2024))
| Method | Contraction Rate | Per-Iteration Complexity | Practical Speedup |
|---|---|---|---|
| Standard VI | $\gamma$ | $O(\lvert\mathcal{S}\rvert^2 \lvert\mathcal{A}\rvert)$ | Baseline |
| R1-VI / Defl. | $\gamma \lvert\lambda_2\rvert$ | $O(\lvert\mathcal{S}\rvert^2 \lvert\mathcal{A}\rvert)$ | Substantial, esp. $\gamma \to 1$ |
| Policy Iter. | Superlinear | $O(\lvert\mathcal{S}\rvert^3 + \lvert\mathcal{S}\rvert^2 \lvert\mathcal{A}\rvert)$ | Fastest, costly |
7. Recommendations, Limitations, and Generalization
R1-VI offers maximal acceleration when:
- The discount factor $\gamma$ is close to 1
- The Markov chain induced by the current policy is well-connected (irreducible, aperiodic), ensuring effective estimation of $\mu_\pi$
- The spectral gap $1 - |\lambda_2|$ is non-negligible
If the Markov chain is reducible or nearly so, the stationary distribution $\mu_\pi$ may be highly concentrated, causing the rank-one correction to lose efficacy; in this regime, increasing the number of power iterations per update (from one to several) or adding regularization is suggested [(Kolarijani et al., 3 May 2025), Section 6.3].
For policy control tasks with a changing greedy policy $\pi_k$, recomputing $\mu_k$ or $P_{\pi_k}$ every few iterations suffices. A fixed $\mu$ taken from the final policy also yields benefits for policy evaluation phases in control-VI (Lee et al., 2024).
Memory requirements are minimal, requiring storage only of the vectors $V_k$ and $\mu_k$ and the current greedy action indices.
In summary, Rank-One Modified Value Iteration accelerates convergence in both planning and learning at the same per-iteration complexity as classical first-order methods, and is strongly favored when the transition spectrum has pronounced spectral gap below the leading eigenvalue (Kolarijani et al., 3 May 2025, Lee et al., 2024).