
Doubly-Asynchronous Value Iteration

Updated 4 April 2026
  • Doubly-Asynchronous Value Iteration (DAVI) is a dynamic programming approach that asynchronously updates both states and sampled actions to efficiently solve large-scale Markov Decision Processes.
  • It introduces asynchrony in both state and action updates by performing in-place Bellman updates on randomly selected states and user-specified action subsets, reducing per-update computational cost.
  • Careful selection of the action-subset size and sampling strategies enables DAVI to maintain almost-sure convergence and near-geometric rates while demonstrating robust empirical performance across diverse MDP scenarios.

Doubly-Asynchronous Value Iteration (DAVI) is a generalization of classical Value Iteration (VI) and Asynchronous VI (AVI) designed to address the computational challenges of dynamic programming in Markov Decision Processes (MDPs) with large state and action spaces. In contrast to VI, which synchronously updates all states and maximizes over the entire action space, and AVI, which introduces asynchrony over the state updates but retains full maximization over actions, DAVI introduces asynchrony over both states and actions. Specifically, DAVI performs in-place Bellman updates at randomly sampled states while maximizing over user-specified sampled subsets of actions. This approach asymptotically retains the desirable properties of VI—almost-sure convergence to the optimal value function, a near-geometric convergence rate with high probability, and computation time nearly matching established complexity bounds—while reducing per-update computational cost in large action domains (Tian et al., 2022).

1. Algorithmic Structure and Pseudocode

Let $(\mathcal S, \mathcal A, r, p, \gamma)$ denote a discounted MDP with finite state space $|\mathcal S| = S$, action space $|\mathcal A| = A$, bounded rewards $r(s,a) \in [0,1]$, and discount factor $\gamma \in [0,1)$. The Bellman look-ahead is defined as $L^v(s,a) \equiv r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v(s')$.

DAVI operates iteratively as follows:

  • At iteration $n$, sample a state $s_n \sim \mu(\cdot)$ from a state-sampling distribution $\mu$.
  • Sample a subset $A_n \subseteq \mathcal A$ of size $m$ from an action-sampling distribution $\nu(\cdot \mid s_n)$.
  • Compute $a_n = \arg\max_{a \in A_n \cup \{\pi_n(s_n)\}} L^{v_n}(s_n, a)$, where $\pi_n(s_n)$ is the current "best-so-far" action for $s_n$.
  • Update the value function in place:

$v_{n+1}(s_n) = L^{v_n}(s_n, a_n), \qquad v_{n+1}(s) = v_n(s) \ \text{for } s \neq s_n.$

  • Update the policy at $s_n$: $\pi_{n+1}(s_n) = a_n$ if $L^{v_n}(s_n, a_n) \ge L^{v_n}(s_n, \pi_n(s_n))$, else $\pi_{n+1}(s_n) = \pi_n(s_n)$.

Pseudocode:

    Input: MDP (S, A, r, p, γ), subset size m, sampling distributions μ, ν
    Initialize v ← 0; initialize best-so-far policy π arbitrarily
    for n = 0, 1, 2, ...:
        sample state s ~ μ(·)
        sample action subset A_n ⊆ A with |A_n| = m from ν(· | s)
        a ← argmax over a′ ∈ A_n ∪ {π(s)} of L^v(s, a′)
        v(s) ← L^v(s, a)          (in-place update; all other states unchanged)
        π(s) ← a

Every state-action pair $(s,a)$ must have nonzero probability of being sampled: $\mu(s)\,\nu(a \mid s) > 0$ for all $(s,a) \in \mathcal S \times \mathcal A$.
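As a concrete illustration, the loop above can be sketched in Python; the data layout, uniform sampling of states and actions, and function names below are illustrative assumptions rather than the paper's reference implementation.

```python
import random

def davi(S, A, r, p, gamma, m, iters, seed=0):
    """Sketch of DAVI on a tabular MDP.

    S, A    : numbers of states and actions
    r[s][a] : reward in [0, 1]
    p[s][a] : dict mapping successor state -> probability
    m       : size of the sampled action subset
    """
    rng = random.Random(seed)
    v = [0.0] * S                    # v0 = 0 is a valid initialization
    pi = [0] * S                     # "best-so-far" action per state

    def lookahead(s, a):             # L^v(s, a) = r(s,a) + gamma * E[v(s')]
        return r[s][a] + gamma * sum(q * v[sp] for sp, q in p[s][a].items())

    for _ in range(iters):
        s = rng.randrange(S)                     # asynchronous state choice
        A_n = rng.sample(range(A), m)            # sampled action subset
        a = max(A_n + [pi[s]], key=lambda b: lookahead(s, b))
        v[s] = lookahead(s, a)                   # in-place Bellman update
        pi[s] = a
    return v, pi
```

On a toy MDP where one action yields reward 1 and every action self-loops, the values approach $v^*(s) = 1/(1-\gamma)$ even with $m = 1$, since the best-so-far action preserves progress between samples.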

2. Theoretical Convergence Properties

Convergence results under DAVI are established with the following key assumptions:

  • Rewards are bounded: $r(s,a) \in [0,1]$.
  • Discount factor $\gamma \in [0,1)$.
  • State sampling satisfies $\mu(s) > 0$ for every $s$, and joint sampling satisfies $\Pr(s_n = s,\ a \in A_n) > 0$ for all $(s,a)$.
  • $v_0$ is initialized to all zeros, to a negative constant, or more generally to any $v_0$ with $v_0(s) \le \max_a L^{v_0}(s,a)$ for all $s$, so that updates are monotone.

(a) Almost-sure convergence: With probability 1, $v_n \to v^*$ as $n \to \infty$. This is grounded in value monotonicity ($v_{n+1} \ge v_n$), boundedness ($v_n \le v^* \le 1/(1-\gamma)$), and the fact that the update sequence almost surely covers every state–action pair $(s,a)$ infinitely often, resulting in contraction towards $v^*$.

(b) Near-geometric convergence rate: For any fixed accuracy $\epsilon > 0$ and failure probability $\delta \in (0,1)$, let $q$ denote the minimum, over state–action pairs $(s,a)$, of the per-iteration probability that state $s$ is selected and action $a$ is examined. With probability at least $1 - \delta$, after

$N = \tilde O\!\left( \frac{1}{q\,(1-\gamma)} \log \frac{1}{\epsilon(1-\gamma)} \right)$

iterations, $\|v_N - v^*\|_\infty \le \epsilon$, where $\tilde O$ suppresses factors logarithmic in $S$, $1/\delta$, and $1/(1-\gamma)$.
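The rate claim can be illustrated on a one-state example, where $v^* = 1/(1-\gamma)$ and, once a rewarding action has entered the sampled subset, each update shrinks the error by a factor of $\gamma$. The sizes, seed, and uniform sampling below are arbitrary illustrative choices.

```python
import random

# One state, A actions, all self-looping; one action has reward 1,
# so v* = 1 / (1 - gamma).
A, gamma, m = 50, 0.8, 5
v_star = 1.0 / (1.0 - gamma)

rng = random.Random(0)
r = [0.0] * A
r[rng.randrange(A)] = 1.0

v, best = 0.0, 0
errors = []
for _ in range(300):
    sampled = rng.sample(range(A), m) + [best]   # m actions + best-so-far
    best = max(sampled, key=lambda a: r[a] + gamma * v)
    v = r[best] + gamma * v                      # in-place Bellman update
    errors.append(v_star - v)
```

The error sequence is nonincreasing: flat until the rewarding action is first sampled, then geometric with ratio $\gamma$.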

3. Computational Complexity and Sample Efficiency

Each update involves $m+1$ look-ahead computations per iteration (for the $m$ sampled actions plus the "best-so-far" action). With only one state updated per iteration, the per-iteration cost is $O(m+1)$ look-aheads rather than the $O(SA)$ of a full synchronous sweep. To ensure an $\epsilon$-optimal policy with probability $1-\delta$ requires

$\tilde O\!\left( \frac{1}{q\,(1-\gamma)} \log \frac{1}{\epsilon(1-\gamma)} \right)$

iterations, where $q$ is the minimum per-iteration probability of examining any fixed state–action pair, yielding $\tilde O\!\left( \frac{m+1}{q\,(1-\gamma)} \log \frac{1}{\epsilon(1-\gamma)} \right)$ total look-aheads. For uniform action sampling, $q = m/(SA)$, and the resulting bound matches the classic $O\!\left( \frac{SA}{1-\gamma} \log \frac{1}{\epsilon(1-\gamma)} \right)$ value iteration bound up to logarithmic factors. Lower bounds of $\Omega(SA)$ look-aheads apply to both DAVI and AVI, indicating near-optimality up to these log terms.
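Ignoring constants and logarithmic factors, the accounting behind the "matches VI up to log factors" observation can be spelled out for uniform sampling; the sizes below are illustrative assumptions.

```python
import math

S, A, m, gamma, eps = 100, 1000, 10, 0.9, 0.01

# Effective horizon: log(1 / (eps * (1 - gamma))) / (1 - gamma)
horizon = math.log(1.0 / (eps * (1.0 - gamma))) / (1.0 - gamma)

q = (1.0 / S) * (m / A)                  # uniform sampling: q = m / (S * A)
davi_iters = horizon / q                 # ~ (S * A / m) * horizon iterations
davi_lookaheads = davi_iters * (m + 1)   # (m + 1) look-aheads per iteration
vi_lookaheads = S * A * horizon          # VI: S * A look-aheads per sweep

ratio = davi_lookaheads / vi_lookaheads  # = (m + 1) / m, a constant
```

The two totals differ only by the constant factor $(m+1)/m$, independent of $S$, $A$, $\gamma$, and $\epsilon$; DAVI spreads the same total work over many cheap iterations.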

4. Parameter Selection, Sampling, and Update Scheduling

  • Action-subset size $m$: There is a trade-off between per-update cost (proportional to $m+1$) and the probability of sampling near-optimal actions (increasing in $m$). In practice, $m \ll A$ is typical to reduce computational burden, though $m$ must be large enough to deliver sufficient probability of selecting near-optimal actions. Choices such as a small fixed constant or a small fraction of $A$ are common.
  • Sampling distributions $\mu$ (over states) and $\nu$ (over actions): Uniform sampling is often employed due to favorable theoretical properties. Non-uniform "importance" sampling may enhance performance when prior knowledge about good actions or "hard" states is available.
  • Scheduling: States may be updated in any order, adaptively or otherwise, provided each state is visited infinitely often and $\mu(s) > 0$ for all $s$. No synchronous or sweep order is required.
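The trade-off in $m$ can be quantified under uniform sampling without replacement: the probability that a size-$m$ subset contains at least one of $k$ near-optimal actions is $1 - \binom{A-k}{m}/\binom{A}{m}$. A minimal sketch (the sizes are illustrative):

```python
from math import comb

def hit_prob(A, k, m):
    """P(a uniform size-m subset of A actions contains at least one
    of k near-optimal actions) = 1 - C(A-k, m) / C(A, m)."""
    return 1.0 - comb(A - k, m) / comb(A, m)

# Coverage grows with m, while per-update cost grows linearly in m + 1.
probs = {m: hit_prob(1000, 10, m) for m in (1, 10, 100)}
```

With many near-optimal actions ($k$ large), even a small $m$ yields a useful per-iteration hit probability, which is why modest subset sizes suffice outside needle-in-haystack settings.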

5. Empirical Performance and Evaluation

DAVI, AVI, and VI were compared on:

  • Single-state domains with large action sets: "needle-in-haystack" (one reward-1 action) and "multi-reward" (10 reward-1 actions).
  • Multi-state, large-action MDPs:
    • Depth-2 tree (50 actions per nonleaf node, one rewarding leaf).
    • Random MDP (100 states; 1000 actions per state, each leading to 10 random successor states, with a single rewarding state–action pair).

Performance was measured via average state value against "compute-adjusted" runtime (number of look-aheads). In needle-in-haystack settings, AVI/VI did not improve until a full action scan was performed, whereas DAVI with small $m$ made gradual progress and larger $m$ converged faster. In multi-reward or multi-path settings, moderate values of $m$ facilitated substantially faster convergence for DAVI compared to AVI/VI. Additional experiments with Pareto- or normal-distributed rewards showed that sampling is particularly effective when many near-optimal actions are present.
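The gap between the needle-in-haystack and multi-reward settings is consistent with a simple geometric-waiting-time argument: the expected number of iterations before any rewarding action first appears in the sampled subset is the reciprocal of the per-iteration hit probability. The action count $A = 1000$ and subset size $m = 10$ below are assumed for illustration, not taken from the experiments.

```python
from math import comb

A, m = 1000, 10

def expected_wait(k):
    """Expected iterations until a uniform size-m sample first contains
    one of k rewarding actions (geometric waiting time)."""
    p_hit = 1.0 - comb(A - k, m) / comb(A, m)
    return 1.0 / p_hit

needle = expected_wait(1)    # one rewarding action
multi = expected_wait(10)    # ten rewarding actions, as in "multi-reward"
```

Ten rewarding actions cut the expected discovery time by roughly an order of magnitude, matching the observation that DAVI benefits most when many (near-)optimal actions exist.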

6. Strengths, Limitations, and Open Directions

Strengths:

  • Combines state- and action-asynchrony to handle extremely large state and action spaces.
  • Retains provable convergence and near-geometric rates, with computation complexity near that of VI.
  • Effective in empirical domains with multiple (near-)optimal actions.

Limitations and open questions:

  • In "needle-in-haystack" domains, uniform sampling can be inefficient when $m$ is small; a large $m$ may be necessary.
  • DAVI does not reduce the computational burden of successor-state summations; combining with "small backups" or successor-sampling is an open possibility.
  • Logarithmic factors in the complexity bounds may not be tight; eliminating these remains unresolved.
  • Adaptive or non-uniform action/state sampling strategies are largely unexplored.

DAVI preserves the theoretical guarantees of value iteration while offering substantial reductions in per-iteration cost for MDPs with large action spaces, thus facilitating flexible, fully asynchronous updates in both state and action dimensions (Tian et al., 2022).
