Doubly-Asynchronous Value Iteration
- Doubly-Asynchronous Value Iteration (DAVI) is a dynamic programming approach that asynchronously updates both states and sampled actions to efficiently solve large-scale Markov Decision Processes.
- It introduces asynchrony in both state and action updates by performing in-place Bellman updates on randomly selected states and user-specified action subsets, reducing per-update computational cost.
- Careful selection of the action-subset size and sampling strategies enables DAVI to maintain almost-sure convergence and near-geometric rates while demonstrating robust empirical performance across diverse MDP scenarios.
Doubly-Asynchronous Value Iteration (DAVI) is a generalization of classical Value Iteration (VI) and Asynchronous VI (AVI) designed to address the computational challenges of dynamic programming in Markov Decision Processes (MDPs) with large state and action spaces. In contrast to VI, which synchronously updates all states and maximizes over the entire action space, and AVI, which introduces asynchrony over the state updates but retains full maximization over actions, DAVI introduces asynchrony over both states and actions. Specifically, DAVI performs in-place Bellman updates at randomly sampled states while maximizing over user-specified sampled subsets of actions. This approach asymptotically retains the desirable properties of VI—almost-sure convergence to the optimal value function, a near-geometric convergence rate with high probability, and computation time nearly matching established complexity bounds—while reducing per-update computational cost in large action domains (Tian et al., 2022).
1. Algorithmic Structure and Pseudocode
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ denote a discounted MDP with finite state space $\mathcal{S}$, action space $\mathcal{A}$, bounded reward $r$, and discount factor $\gamma \in [0, 1)$. The Bellman look-ahead is defined as $q(v, s, a) \doteq r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')$.
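For concreteness, the look-ahead can be computed directly from tabular reward and transition arrays. The sketch below assumes arrays `R[s, a]` and `P[s, a, s']`; these names are illustrative, not taken from the original.

```python
import numpy as np

def lookahead(R, P, v, gamma, s, a):
    """Bellman look-ahead q(v, s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v(s').

    R: (|S|, |A|) rewards, P: (|S|, |A|, |S|) transition probabilities,
    v: (|S|,) current value estimates. All names are illustrative.
    """
    return R[s, a] + gamma * P[s, a] @ v
```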
DAVI operates iteratively as follows:
- At iteration $t$, sample a state $S_t$ from a state-sampling distribution $\mu$.
- Sample a subset $\mathcal{B}_t \subseteq \mathcal{A}$ of size $m$, using an action-sampling distribution $\nu$.
- Compute $q^+_t = \max_{a \in \mathcal{B}_t \cup \{A^+_t(S_t)\}} q(v_t, S_t, a)$, where $A^+_t(S_t)$ is the current "best-so-far" action for $S_t$.
- Update the value function: $v_{t+1}(S_t) = q^+_t$, leaving $v_{t+1}(s) = v_t(s)$ for all $s \neq S_t$.
- Update the policy at $S_t$: $A^+_{t+1}(S_t) = \arg\max_{a \in \mathcal{B}_t} q(v_t, S_t, a)$ if $\max_{a \in \mathcal{B}_t} q(v_t, S_t, a) > q(v_t, S_t, A^+_t(S_t))$, else $A^+_{t+1}(S_t) = A^+_t(S_t)$.
Pseudocode:
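The listing below is a minimal Python sketch of the DAVI loop, assuming a tabular MDP stored as arrays `R[s, a]` and `P[s, a, s']`, uniform state and action sampling, and a pessimistic all-zeros initialization; these choices and names are assumptions for illustration, not details fixed by the algorithm.

```python
import numpy as np

def davi(R, P, gamma, m, num_iters, rng=None):
    """Sketch of Doubly-Asynchronous Value Iteration (DAVI).

    R: (|S|, |A|) rewards, P: (|S|, |A|, |S|) transition probabilities,
    m: size of the sampled action subset per iteration.
    Returns value estimates v and the "best-so-far" actions a_plus.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions = R.shape
    v = np.zeros(n_states)                   # pessimistic start (assumes v* >= 0)
    a_plus = np.zeros(n_states, dtype=int)   # best-so-far action per state

    for _ in range(num_iters):
        s = rng.integers(n_states)                        # sample a state (uniform mu)
        b = rng.choice(n_actions, size=m, replace=False)  # sample action subset (uniform nu)
        candidates = np.append(b, a_plus[s])              # include the best-so-far action
        q = R[s, candidates] + gamma * P[s, candidates] @ v   # Bellman look-aheads
        best = int(np.argmax(q))
        v[s] = q[best]                        # in-place value update at the sampled state
        a_plus[s] = candidates[best]          # keep the best action seen so far
    return v, a_plus
```

Each iteration performs only $m + 1$ look-aheads instead of a full maximization over $\mathcal{A}$, which is the source of DAVI's per-update savings.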
Every state-action pair $(s, a)$ must have nonzero probability of being sampled: $\mu(s) > 0$ and $\Pr(a \in \mathcal{B}_t \mid S_t = s) > 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$.
2. Theoretical Convergence Properties
Convergence results under DAVI are established with the following key assumptions:
- Rewards are bounded: $|r(s, a)| \leq r_{\max} < \infty$ for all $(s, a)$.
- Discount $\gamma \in [0, 1)$.
- State sampling $\mu(s) > 0$ and joint sampling probability $\Pr(S_t = s,\, a \in \mathcal{B}_t) > 0$ for all $(s, a)$.
- $v_0$ initialized as all zeros, as a uniform negative constant, or otherwise below the optimal value function ($v_0 \leq v_*$).
(a) Almost-sure convergence: With probability 1, $v_t \to v_*$ as $t \to \infty$. This is grounded in value monotonicity ($v_{t+1} \geq v_t$), boundedness ($v_t \leq v_*$), and the fact that the update sequence almost surely covers every state-action pair infinitely often, resulting in contraction towards $v_*$.
(b) Near-geometric convergence rate: For any fixed accuracy $\epsilon > 0$ and confidence $\delta \in (0, 1)$, with probability at least $1 - \delta$ the iterates satisfy $\|v_t - v_*\|_\infty \leq \epsilon$ after a number of iterations that grows only logarithmically in $1/\epsilon$ and $1/\delta$ (the explicit bound is given in Tian et al., 2022).
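These two properties can be checked numerically with the sketch above: on a small random MDP with nonnegative rewards and an all-zeros initialization, the DAVI values increase monotonically toward $v_*$ (computed here by ordinary value iteration) and the sup-norm error shrinks roughly geometrically with the amount of computation. The setup below is an illustrative assumption, not an experiment from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, m = 20, 50, 0.9, 5
R = rng.random((nS, nA))                         # nonnegative rewards in [0, 1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # random transition kernel

# Reference v* via ordinary (synchronous) value iteration.
v_star = np.zeros(nS)
for _ in range(2000):
    v_star = np.max(R + gamma * P @ v_star, axis=1)

# Run DAVI updates and report the sup-norm error every |S| updates.
v, a_plus = np.zeros(nS), np.zeros(nS, dtype=int)
for sweep in range(201):
    for _ in range(nS):
        s = rng.integers(nS)
        cand = np.append(rng.choice(nA, size=m, replace=False), a_plus[s])
        q = R[s, cand] + gamma * P[s, cand] @ v
        v[s], a_plus[s] = q.max(), cand[q.argmax()]
    if sweep % 40 == 0:
        print(sweep, float(np.max(np.abs(v - v_star))))  # error trends downward
```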
3. Computational Complexity and Sample Efficiency
Each update involves $m + 1$ look-ahead computations per iteration (for the $m$ sampled actions plus the "best-so-far" action). With only one state updated per iteration, the per-iteration cost is $O\big((m + 1)\,|\mathcal{S}|\big)$, since each look-ahead sums over successor states. To ensure an $\epsilon$-optimal policy with probability at least $1 - \delta$, the number of required iterations (given explicitly in Tian et al., 2022) grows only logarithmically in $1/\epsilon$ and $1/\delta$, yielding a corresponding bound on the total number of look-aheads. For uniform action sampling $\nu$, the resulting bound matches the classic value-iteration complexity bound up to logarithmic factors. Known lower bounds apply to both DAVI and AVI, indicating near-optimality up to these log terms.
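To make the per-update saving concrete, the following back-of-the-envelope comparison counts look-aheads for one sweep-equivalent of work ($|\mathcal{S}|$ state updates) under DAVI versus a full sweep of VI/AVI; the sizes are illustrative only.

```python
# Look-aheads per |S| state updates (one sweep-equivalent); illustrative sizes.
n_states, n_actions, m = 100, 1000, 5

vi_sweep = n_states * n_actions      # VI/AVI: maximize over every action at every state
davi_sweep = n_states * (m + 1)      # DAVI: m sampled actions plus the best-so-far action

print("VI/AVI look-aheads per sweep:", vi_sweep)    # 100000
print("DAVI look-aheads per sweep:  ", davi_sweep)  # 600
print("reduction factor: %.0fx" % (vi_sweep / davi_sweep))
```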
4. Parameter Selection, Sampling, and Update Scheduling
- Action-subset size $m$: There is a trade-off between per-update cost (proportional to $m + 1$) and the probability of sampling near-optimal actions (which grows with $m$). In practice, $m \ll |\mathcal{A}|$ is typical to reduce the computational burden, though $m$ must be large enough to give a sufficient probability of selecting near-optimal actions. Choices such as a small fixed constant or a small fraction of $|\mathcal{A}|$ are common.
- Sampling distributions $\mu$ (states) and $\nu$ (actions): Uniform sampling is often employed due to its favorable theoretical properties. Non-uniform "importance" sampling may enhance performance when prior knowledge about good actions or "hard" states is available (see the sketch after this list).
- Scheduling: States may be updated in any order, adaptively or otherwise, provided each state is visited infinitely often and every action retains a nonzero probability of being sampled at each state. No synchronous or sweep order is required.
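As one example of the non-uniform option, the snippet below draws the action subset from a softmax "importance" distribution over rough prior action scores rather than uniformly; the `prior_scores` array and the temperature are hypothetical stand-ins for whatever domain knowledge is available, and every action keeps a nonzero probability as the convergence conditions require.

```python
import numpy as np

def sample_action_subset(prior_scores, m, temperature=1.0, rng=None):
    """Sample m distinct actions, biased toward higher (hypothetical) prior scores."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(prior_scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())     # softmax keeps all probabilities positive
    probs /= probs.sum()
    return rng.choice(len(probs), size=m, replace=False, p=probs)
```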
5. Empirical Performance and Evaluation
DAVI, AVI, and VI were compared on:
- Single-state domains with large action sets: "needle-in-haystack" (one reward-1 action) and "multi-reward" (10 reward-1 actions).
- Multi-state, large-action MDPs:
- Depth-2 tree (50 actions per nonleaf state, one rewarding leaf).
- Random MDP (100 states; 1000 actions per state, each leading to 10 random successors, with a single rewarding state-action pair).
Performance was measured via average state value against "compute-adjusted" runtime (number of look-aheads). In needle-in-haystack settings, AVI/VI did not improve until a full action scan was performed, whereas DAVI with small $m$ made gradual progress and larger $m$ converged faster. In multi-reward or multi-path settings, moderate values of $m$ enabled substantially faster convergence for DAVI than for AVI/VI. Additional experiments with Pareto- or normally-distributed rewards showed that action sampling is particularly effective when many near-optimal actions are present.
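For intuition about the single-state comparisons, the snippet below counts how many look-aheads a sampled, DAVI-style update spends before it first evaluates a rewarding action, with one versus ten rewarding actions out of 1000 (a full maximization always spends exactly 1000). The construction is an illustrative sketch, not a reproduction of the original experiments.

```python
import numpy as np

def lookaheads_until_reward(n_actions, n_rewarding, m, rng):
    """Count DAVI-style look-aheads until a rewarding action is first sampled."""
    rewards = np.zeros(n_actions)
    rewards[rng.choice(n_actions, size=n_rewarding, replace=False)] = 1.0
    count = 0
    while True:
        subset = rng.choice(n_actions, size=m, replace=False)
        count += m + 1                        # m sampled actions plus the best-so-far
        if rewards[subset].max() > 0:
            return count

rng = np.random.default_rng(0)
for n_rewarding in (1, 10):                   # needle-in-haystack vs multi-reward
    avg = np.mean([lookaheads_until_reward(1000, n_rewarding, m=5, rng=rng)
                   for _ in range(200)])
    print(n_rewarding, "rewarding action(s): about", round(avg), "look-aheads on average")
```

With a single rewarding action, uniform sampling needs on the order of a full scan before it finds the needle, whereas with several rewarding actions it needs far fewer look-aheads, consistent with the pattern described above.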
6. Strengths, Limitations, and Open Directions
Strengths:
- Combines state- and action-asynchrony to handle extremely large state and action spaces.
- Retains provable convergence and near-geometric rates, with computation complexity near that of VI.
- Effective in empirical domains with multiple (near-)optimal actions.
Limitations and open questions:
- In "needle-in-haystack" domains, uniform sampling can be inefficient if 7 is small; large 8 may be necessary.
- DAVI does not reduce the computational burden of successor-state summations; combining with "small backups" or successor-sampling is an open possibility.
- Logarithmic factors in the complexity bounds may not be tight; eliminating these remains unresolved.
- Adaptive or non-uniform action/state sampling strategies are largely unexplored.
DAVI preserves the theoretical guarantees of value iteration while offering substantial reductions in per-iteration cost for MDPs with large action spaces, thus facilitating flexible, fully asynchronous updates in both state and action dimensions (Tian et al., 2022).