
Rank-One Modified Value Iteration (R1-VI)

Updated 4 April 2026
  • Rank-One Modified Value Iteration (R1-VI) is an algorithm that uses rank-one approximations of the transition kernel to accelerate planning and reinforcement learning in Markov Decision Processes.
  • It improves convergence by applying a rank-one deflation step, built from the stationary distribution, that removes the error component aligned with the leading eigenvector of the transition matrix.
  • R1-VI is effective in both model-based and model-free settings, offering substantial speedups especially in high discount-factor scenarios and well-connected MDPs.

Rank-One Modified Value Iteration (R1-VI) is a class of algorithms for accelerated planning and reinforcement learning in Markov Decision Processes (MDPs). The central paradigm is to enhance (policy or value) iteration by replacing the transition kernel in the policy evaluation step with a rank-one approximation, typically constructed using the stationary distribution of the policy-induced Markov chain. This modification yields provably faster convergence under favorable spectral properties, often at no additional asymptotic cost per iteration. The approach appears in both planning (model-based) and learning (model-free, e.g., Q-learning) contexts, and connects directly to theoretical developments in matrix splitting and deflation from numerical linear algebra.

1. Foundations: MDP Setup and Standard Value Iteration

Consider a finite, discounted MDP defined by:

  • State space $S = \{1, 2, \ldots, n\}$
  • Action space $A = \{1, 2, \ldots, m\}$
  • Transition kernel $P(s' \mid s, a)$, the probability of transitioning from state $s$ to $s'$ under action $a$
  • Reward function $r : S \times A \to \mathbb{R}$
  • Discount factor $\gamma \in (0, 1)$

For any stationary deterministic policy $\pi : S \to A$, the induced transition matrix is $P^{\pi}_{s,s'} = P(s' \mid s, \pi(s))$, and the value function $V^{\pi}$ satisfies the Bellman equation $V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}$, where $r^{\pi}(s) = r(s, \pi(s))$.

The classical planning problem is to compute the optimal value function

$$V^*(s) = \max_{\pi} V^{\pi}(s), \qquad s \in S,$$

with the Bellman optimality operator $T$ defined as

$$(TV)(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big].$$

Standard value iteration (VI) proceeds by repeatedly applying $T$: $V_{k+1} = T V_k$. This process contracts errors at rate $\gamma$ in the $\ell_\infty$ norm; convergence slows substantially as $\gamma \to 1$.
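As a baseline for what follows, here is a minimal NumPy implementation of standard VI for a tabular MDP; the array shapes (`P` as an $m \times n \times n$ tensor, `r` as an $n \times m$ table) are illustrative conventions for this sketch, not mandated by the papers cited here.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=100_000):
    """Standard VI: repeatedly apply the Bellman optimality operator T.

    P: (m, n, n) array with P[a, s, s2] = P(s2 | s, a)
    r: (n, m) array with r[s, a]
    Returns an approximation of the optimal value function V*.
    """
    n, m = r.shape
    V = np.zeros(n)
    for _ in range(max_iter):
        # (T V)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
        Q = r + gamma * (P @ V).T          # shape (n, m)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```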

2. Rank-One Approximation and Deflation in Policy Evaluation

Under mild ergodicity conditions, the stochastic matrix $P^{\pi}$ has a unique stationary distribution $\mu^{\pi}$ (i.e., $(\mu^{\pi})^{\top} P^{\pi} = (\mu^{\pi})^{\top}$). The rank-one approximation is constructed as

$$P^{\pi} \approx \mathbf{1}\,(\mu^{\pi})^{\top},$$

with $\mathbf{1}$ denoting the all-ones vector. This is the optimal rank-one approximation with respect to the spectral radius under irreducibility and aperiodicity (Kolarijani et al., 3 May 2025). In deflation-based methods, the leading eigenspace contribution is explicitly subtracted:

$$\tilde{P}^{\pi} = P^{\pi} - \mathbf{1}\,(\mu^{\pi})^{\top}.$$
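The point of the deflation is spectral, and is worth making explicit (a standard deflation argument, stated here for completeness): $\mathbf{1}$ is the right eigenvector of $P^{\pi}$ for the leading eigenvalue $1$, and every left eigenvector $w_i$ for an eigenvalue $\lambda_i \neq 1$ satisfies $w_i^{\top}\mathbf{1} = 0$, so

$$\tilde{P}^{\pi}\mathbf{1} = P^{\pi}\mathbf{1} - \mathbf{1}\,(\mu^{\pi})^{\top}\mathbf{1} = \mathbf{1} - \mathbf{1} = 0, \qquad w_i^{\top}\tilde{P}^{\pi} = \lambda_i w_i^{\top} - (w_i^{\top}\mathbf{1})\,(\mu^{\pi})^{\top} = \lambda_i w_i^{\top}.$$

Hence the spectrum of $\tilde{P}^{\pi}$ is $\{0, \lambda_2, \ldots, \lambda_n\}$, and $\rho(\tilde{P}^{\pi}) = |\lambda_2| < 1$ under irreducibility and aperiodicity; this is exactly the property the rank-one correction exploits.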

The stationary distribution $\mu^{\pi}$ can be efficiently estimated using the power method: $\mu_{k+1}^{\top} = \mu_k^{\top} P^{\pi}$. Empirically, one power-method step per iteration suffices if the policy does not change rapidly.
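In code, one power step is a single vector-matrix product. A minimal sketch, assuming `P_pi` is the $n \times n$ row-stochastic transition matrix of the current policy:

```python
def power_step(mu, P_pi):
    """One power-method step toward the stationary distribution of P_pi.

    Since P_pi is row-stochastic, mu @ P_pi stays on the probability
    simplex in exact arithmetic; we renormalize anyway to guard
    against floating-point drift.
    """
    mu = mu @ P_pi
    return mu / mu.sum()
```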

3. Rank-One Modified Value Iteration: Algorithmic Structure

Two functionally equivalent update paradigms exist: the “policy-iteration with rank-one evaluation” form (Kolarijani et al., 3 May 2025), and the matrix-deflation SOR splitting form (Lee et al., 2024). Both produce the same core iteration up to ordering of substeps.

Policy Iteration–Style R1-VI

At iteration $k$:

  1. Construct the greedy policy $\pi_k = \operatorname{greedy}(V_k)$ and build $P^{\pi_k}$.
  2. Estimate $\mu_k^{\top} \leftarrow \mu_{k-1}^{\top} P^{\pi_k}$ (one power step).
  3. Compute the Bellman update $\hat{V}_{k+1} = T V_k$.
  4. Apply the rank-one correction:

$$V_{k+1} = \hat{V}_{k+1} + \frac{\gamma}{1-\gamma}\,\mu_k^{\top}\big(\hat{V}_{k+1} - V_k\big)\,\mathbf{1}.$$

Matrix Splitting–Style R1-VI

Given the splitting $I - \gamma P^{\pi} = M - N$ with $M = I - \gamma\,\mathbf{1}(\mu^{\pi})^{\top}$ and $N = \gamma\big(P^{\pi} - \mathbf{1}(\mu^{\pi})^{\top}\big)$:

  1. Form $M = I - \gamma\,\mathbf{1}(\mu^{\pi})^{\top}$.
  2. Compute $V_{k+1} = M^{-1}\big(r^{\pi} + N V_k\big)$.

This is a direct application of the Sherman–Morrison (more generally, Woodbury) identity to invert $I - \gamma\,\mathbf{1}(\mu^{\pi})^{\top}$, enabling an efficient closed-form correction at $O(n^2)$ cost per iteration, matching standard VI.
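To see why the closed form is cheap, apply Sherman–Morrison using $(\mu^{\pi})^{\top}\mathbf{1} = 1$ (a one-line check under the assumptions already in place):

$$\big(I - \gamma\,\mathbf{1}(\mu^{\pi})^{\top}\big)^{-1} = I + \frac{\gamma}{1-\gamma}\,\mathbf{1}(\mu^{\pi})^{\top}.$$

Substituting this inverse into $V_{k+1} = M^{-1}(r^{\pi} + N V_k)$ and using $(\mu^{\pi})^{\top}\big(P^{\pi} - \mathbf{1}(\mu^{\pi})^{\top}\big) = 0$ reproduces the rank-one correction step of the policy-iteration–style form, confirming that the two paradigms coincide for a fixed policy.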

Pseudocode for both variants is given in (Kolarijani et al., 3 May 2025) and (Lee et al., 2024); a compact sketch of the policy-iteration–style loop follows below.
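As a sketch only, reconstructed from the four steps above rather than taken from either paper, the loop can be written in NumPy as:

```python
import numpy as np

def r1_value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Rank-one modified VI (policy-iteration-style form).

    P: (m, n, n) array with P[a, s, s2] = P(s2 | s, a)
    r: (n, m) array with r[s, a]
    """
    n, m = r.shape
    V = np.zeros(n)
    mu = np.full(n, 1.0 / n)              # running stationary-distribution estimate
    for _ in range(max_iter):
        Q = r + gamma * (P @ V).T         # Bellman backup, shape (n, m)
        pi = Q.argmax(axis=1)             # greedy policy
        P_pi = P[pi, np.arange(n), :]     # (n, n) transition matrix of greedy policy
        mu = mu @ P_pi                    # one power step toward mu^pi
        mu /= mu.sum()                    # guard against floating-point drift
        V_hat = Q[np.arange(n), pi]       # V_hat = T V
        # Rank-one correction along the all-ones vector:
        V_new = V_hat + (gamma / (1.0 - gamma)) * (mu @ (V_hat - V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, pi
        V = V_new
    return V, pi
```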

4. Theoretical Guarantees and Convergence Analysis

For policy evaluation, the rank-one deflation method removes the error component aligned with the top eigenspace. Spectral arguments yield

$$\rho\big(M^{-1} N\big) = \gamma\,|\lambda_2|, \qquad \text{so} \qquad \|V_k - V^{\pi}\| = O\big((\gamma\,|\lambda_2|)^k\big),$$

where $\lambda_2$ is the subdominant eigenvalue of $P^{\pi}$. This yields exponentially faster convergence than standard VI (rate $\gamma$) whenever $|\lambda_2| < 1$ (Lee et al., 2024).

In the control (greedy-improvement) context, the R1-VI iterates $V_k$ converge to the unique Bellman fixed point $V^*$ at least at the linear rate $\gamma$:

$$\|V_{k+1} - V^*\|_{\infty} \le \gamma\,\|V_k - V^*\|_{\infty}$$

[(Kolarijani et al., 3 May 2025), Theorem 3.1]. The per-iteration computational complexity is $O(m n^2)$.

5. Extension to Q-Learning and Model-Free Settings

The rank-one update concept extends to Q-learning by approximating the state–action transition kernel $P \in \mathbb{R}^{nm \times n}$, with rows $P_{(s,a),\,s'} = P(s' \mid s, a)$, by a rank-one matrix $P \approx \mathbf{1}\,\mu^{\top}$, where $\mathbf{1} \in \mathbb{R}^{nm}$ is the all-ones vector and $\mu$ is a stationary state distribution.

For the model-free setting, empirical Bellman operators are sampled, and the Rank-One Q-Learning (R1-QL) procedure performs the stochastic-approximation analogue of the corrected update (schematically):

$$Q_{k+1} = Q_k + \alpha_k \Big( \hat{T}_k Q_k - Q_k + \frac{\gamma}{1-\gamma}\,\hat{\mu}_k^{\top}\big(\hat{T}_k Q_k - Q_k\big)\,\mathbf{1} \Big),$$

with $\hat{T}_k$ the empirical Bellman operator, $\hat{\mu}_k$ a running stationary-distribution estimate, and $\alpha_k$ a Robbins–Monro step size [(Kolarijani et al., 3 May 2025), Algorithm 2]. Under standard stochastic approximation assumptions, $Q_k \to Q^*$ almost surely at the same sample complexity as classical Q-learning.
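A tabular, synchronous sketch of this update; the state–action stationary-distribution estimate `mu_sa`, the synchronous sampling scheme, and all array shapes are assumptions made for this example, not details taken from Algorithm 2:

```python
import numpy as np

def r1_q_learning_step(Q, mu_sa, s_next, r, gamma, alpha):
    """One synchronous R1-QL update (illustrative sketch).

    Q:      (n, m) current Q-table
    mu_sa:  (n, m) running stationary-distribution estimate over (s, a) pairs
    s_next: (n, m) int array, one sampled next state per (s, a) pair
    r:      (n, m) reward table
    """
    # Empirical Bellman backup: (T_hat Q)(s, a) = r(s, a) + gamma * max_a' Q(s', a')
    T_hat = r + gamma * Q.max(axis=1)[s_next]
    delta = T_hat - Q                              # empirical TD errors
    # Rank-one correction: a single scalar shift along the all-ones vector,
    # weighted by the stationary-distribution estimate.
    shift = (gamma / (1.0 - gamma)) * np.sum(mu_sa * delta)
    return Q + alpha * (delta + shift)
```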

6. Empirical Performance and Practical Considerations

Experimental benchmarks include Garnet MDPs and random graph MDPs (Kolarijani et al., 3 May 2025), as well as chain-walk and grid-maze tasks (Lee et al., 2024).

Observed results:

  • R1-VI and its rank-one deflation variants converge nearly as fast as full policy iteration.
  • Substantially fewer iterations are required than for VI, Nesterov-VI, Anderson-VI, or Speedy Q-learning, particularly as $\gamma \to 1$.
  • In model-free settings, R1-QL matches or outperforms Speedy-Q, Zap-Q, and standard Q-learning in both Bellman error and value error.

Implementation and cost:

  • To estimate the stationary distribution $\mu^{\pi}$ (acting as the deflation vector), one power-method step per iteration suffices.
  • The incremental overhead per iteration is limited to one matrix–vector multiplication and a few inner products.
  • Periodic re-estimation or smoothing of $\mu^{\pi}$ can control potential instability when the policy changes rapidly.

Table: Empirical comparison (per (Kolarijani et al., 3 May 2025) and (Lee et al., 2024))

| Method | Contraction Rate | Per-Iteration Complexity | Practical Speedup |
|---|---|---|---|
| Standard VI | $\gamma$ | $O(m n^2)$ | Baseline |
| R1-VI / Deflation | $\gamma\,\lvert\lambda_2\rvert$ | $O(m n^2)$ | Substantial, esp. $\gamma \to 1$ |
| Policy Iteration | Superlinear | $O(n^3 + m n^2)$ | Fastest, costly |

7. Recommendations, Limitations, and Generalization

R1-VI offers maximal acceleration when:

  • The discount factor $\gamma$ is close to 1
  • The Markov chain induced by the current policy is well-connected (irreducible, aperiodic), ensuring effective estimation of $\mu^{\pi}$
  • The spectral gap $1 - |\lambda_2|$ is non-negligible

If the Markov chain is reducible or nearly so, the stationary distribution $\mu^{\pi}$ may be highly concentrated, causing the rank-one correction to lose efficacy; in this regime, increasing the number of power iterations per step (from one to several) or adding regularization is suggested [(Kolarijani et al., 3 May 2025), Section 6.3].

For policy control tasks with a changing policy $\pi_k$, recomputing $\mu_k$ or $P^{\pi_k}$ every few iterations suffices. A fixed deflation vector taken from the ultimate policy also yields benefits for the policy-evaluation phases in control-VI (Lee et al., 2024).

Memory requirements are minimal, requiring storage only of $\mu^{\pi}$ and the current greedy action indices.

In summary, Rank-One Modified Value Iteration accelerates convergence in both planning and learning at the same per-iteration complexity as classical first-order methods, and is strongly favored when the transition spectrum has a pronounced spectral gap below the leading eigenvalue (Kolarijani et al., 3 May 2025, Lee et al., 2024).
