Value Iteration for Online Replanning

Updated 12 November 2025
  • Value Iteration for Online Replanning is a family of algorithms that update value functions incrementally to enable real-time adaptation in dynamic and uncertain environments.
  • Differentiable variants like VProp and asynchronous methods such as iPolicy and DAVI improve computational efficiency while scaling to large state-action spaces.
  • Empirical results in grid-world navigation, robotic feedback, and large MDPs demonstrate robust performance and rapid policy refinement under changing conditions.

Value iteration for online replanning refers to a family of algorithmic approaches that leverage value iteration (VI) or its differentiable, asynchronous, or sampled variants to achieve real-time plan adaptation in dynamic, partially observable, and/or high-dimensional environments. The necessity of online replanning arises when an environment undergoes frequent changes (e.g., moving obstacles, stochastic transitions, or adversarial agents), requiring the agent not merely to plan ahead, but to adapt its policy quickly as new information is observed. Recent work formalizes online replanning architectures both in the context of discrete Markov Decision Processes (MDPs) and continuous control, focusing on efficient computational strategies and scalable representations.

1. Core Concepts and Motivation

Classical value iteration updates the value function for all states synchronously, maximizing over the full action space at each state. This yields optimal policies in tabular settings, but it is computationally prohibitive for large-scale or continuous domains, especially when the plan must be updated in real time as the environment changes. Online replanning reframes VI as an “anytime” procedure: a value function or policy estimate is maintained and updated in the background, interleaving replanning with action execution.

Key motivating domains include:

  • Grid-world navigation with dynamic obstacles
  • Robotic feedback motion planning under changing dynamics or free space
  • Large-scale MDPs with vast state-action spaces where synchronous sweeps are impractical
  • Pixel-to-action planning from raw sensory data

In these scenarios, the planner must adjust to observation-driven state changes on-the-fly, often trading off between action quality and computational timeliness.
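To make the interleaving concrete, the following is a minimal sketch (not drawn from any of the cited papers) of an anytime replanning loop for a tabular MDP, in which a bounded batch of asynchronous Bellman backups runs between successive control actions. The environment interface (`env.reset`, `env.step`, `env.done`, `env.replanning_candidates`) and the `backups_per_step` budget are illustrative assumptions.

```python
def bellman_backup(s, v, actions, r, p, states, gamma):
    """Single asynchronous Bellman backup at state s for a tabular MDP."""
    v[s] = max(r(s, a) + gamma * sum(p(sp, s, a) * v[sp] for sp in states)
               for a in actions)

def anytime_replanning_loop(env, v, states, actions, r, p, gamma,
                            backups_per_step=50):
    """Interleave a bounded number of value backups with action execution,
    so the value estimate keeps adapting as the environment changes."""
    s = env.reset()
    while not env.done():
        # Background (re)planning: a small, interruptible batch of backups,
        # focused here on states the environment flags as worth refreshing.
        for x in env.replanning_candidates(s, backups_per_step):
            bellman_backup(x, v, actions, r, p, states, gamma)
        # Act greedily with respect to the current (possibly partial) values.
        a = max(actions, key=lambda a: r(s, a) +
                gamma * sum(p(sp, s, a) * v[sp] for sp in states))
        s = env.step(a)
```

The budget `backups_per_step` is the knob that trades action quality against computational timeliness, which is exactly the trade-off the methods below address in different ways.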

2. Differentiable Value Iteration for Replanning: Value Propagation Networks

Value Propagation (VProp) networks operationalize value iteration as a differentiable, parameter-efficient module within a reinforcement learning agent (Nardelli et al., 2018). The foundational elements are:

  • Input Embedding $\Phi(s)$: A shallow convolutional (or graph-convolutional) neural network processes the current state observation (grid maps or downsampled pixels). For each state $(i,j)$, $\Phi(s)$ outputs:
    • $r^{(\text{in})}_{i,j}$: entry reward
    • $r^{(\text{out})}_{i,j}$: exit cost
    • $p_{i,j}$: gating coefficient (state-dependent propagation/discount, $p_{i,j}\in[0,1]$)
  • Bellman-style Recurrence: For $k=1,\dots,K$ (planning steps), update value maps via

$$v_{i,j}^{(k)} = \max\left\{ v_{i,j}^{(k-1)},\; \max_{(i',j')\in\mathcal{N}(i,j)}\left[ p_{i,j}\,v_{i',j'}^{(k-1)} + \left(r_{i',j'}^{(\text{in})} - r_{i,j}^{(\text{out})}\right) \right] \right\}$$

where $\mathcal{N}(i,j)$ denotes the spatial/grid neighbors of $(i,j)$ (a minimal sketch of this recurrence appears after this list).

  • Online Adaptation: At each time step, the environment is sensed anew, $\Phi(s)$ recomputes $r,p$ for each location, and $K$ VI steps are unrolled in the forward/actor pass. Newly discovered obstacles suppress $p_{i,j}$ (near zero), blocking propagation and steering the value “wavefront” dynamically with minimal computational overhead.
  • End-to-end RL Training: Embedded in an actor–critic loop, VProp enables parameter learning by backpropagation through the $K$-step VI. This supports:
    • Training on arbitrary interactive tasks
    • Weight sharing between the value network and critic
    • Removal of supervised planner trace dependencies
  • Empirical Results: VProp achieves robust performance in both static and dynamically changing grid-worlds and pixel-level navigation tasks (94% win rate on $16\times 16$ grid-worlds, 53% at $64\times 64$, outperforming VIN, and near-optimal performance with MVProp). The gating mechanism and positive-reward propagation in MVProp demonstrate enhanced generalization and replanning robustness. The architecture scales to real time on modern GPUs (tens of milliseconds for $64\times 64$ maps).
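The recurrence above can be expressed directly over value and reward maps. Below is a minimal NumPy sketch on a 4-connected grid; the function name `vprop_recurrence` and the plain-array inputs are illustrative, not the authors' implementation (in VProp proper the maps come from the learned embedding $\Phi(s)$ and the $K$ unrolled steps are differentiated through during training).

```python
import numpy as np

def vprop_recurrence(r_in, r_out, p, K):
    """Illustrative VProp-style value propagation on a 4-connected grid.

    r_in, r_out, p: (H, W) arrays of entry rewards, exit costs, and gating
    coefficients; K: number of propagation (planning) steps.
    """
    H, W = r_in.shape
    v = np.zeros((H, W))                                 # v^{(0)} = 0
    for _ in range(K):
        # Pad neighbor maps; -inf entry rewards on the border guarantee that
        # padded cells are never the maximizing neighbor.
        vp = np.pad(v, 1, constant_values=0.0)
        rp = np.pad(r_in, 1, constant_values=-np.inf)
        candidates = []
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            v_nb = vp[1 + di:1 + di + H, 1 + dj:1 + dj + W]   # v^{(k-1)} at neighbor
            r_nb = rp[1 + di:1 + di + H, 1 + dj:1 + dj + W]   # neighbor entry reward
            candidates.append(p * v_nb + (r_nb - r_out))
        v = np.maximum(v, np.max(np.stack(candidates), axis=0))
    return v
```

Because the update is a max over local neighborhoods, suppressing $p_{i,j}$ at newly sensed obstacles immediately blocks propagation through those cells on the next forward pass, which is what makes the module suitable for online replanning.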

3. Incremental, Anytime Value Iteration: iPolicy Algorithm

The iPolicy algorithm provides a general framework for feedback motion planning using incremental graph construction and asynchronous value iteration (Zhao et al., 5 Jan 2024). Key components:

  • Graph-Based Discretization: The method incrementally samples the continuous state space, expanding a graph $G_k=(V_k,E_k)$ whose nodes $V_k$ progressively approximate the domain. Edges encode one-step dynamical feasibility ($\dot{x}=f(x,u)$ under a timestep $\epsilon_k$ and uncertainty radius).
  • Kruzhkov-Transformed Value Estimates: The minimal travel time $T^*$ is estimated through the transform $\Theta^*(x)=1-e^{-T^*(x)}$, which improves numerical properties.
  • Asynchronous Bellman Updates: At each iteration, only a “stale” subset of nodes (newly added or not updated for $P$ rounds) is revisited, via a depth-bounded recursive BackProp call (a driver sketch over stale nodes follows this list):
    def back_prop(x, depth):
        """Depth-bounded recursive Bellman backup over successors F_k(x)."""
        if x in G_goal or depth == 0:
            return theta[x]
        # Refresh successor estimates one level deeper, then back up the minimum.
        succ_values = {xp: back_prop(xp, depth - 1) for xp in F_k(x)}
        theta[x] = delta_k + beta_k * min(succ_values.values())
        return theta[x]
    This selective updating, rather than full synchronous sweeps, yields a sharply reduced amortized computational cost.
  • Anytime Replanning: Control execution proceeds in parallel with background updates; iPolicy can be halted at any time to yield a feasible feedback policy.
  • Convergence Guarantees: Under decay conditions on the sample dispersion ($d_k\to 0$), timestep ($\epsilon_k\to 0$, $\epsilon_k/d_k\to\infty$), and contraction requirements, the value function estimate converges almost surely to the true continuous-time optimum.
  • Complexity and Performance: Sampling and neighbor search are logarithmic in $|V_k|$; the per-iteration asynchronous VI cost is $O(|V_k|\,b^{m_k})$, but the amortized cost is $O(P\,b^{m_k} + P\log|V_k|)$. Empirical studies on point-mass, simple car, and Dubins vehicle models confirm real-time feasibility and monotonic improvement in the value function and closed-loop performance.
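The back_prop routine above is driven by a stale-node selection step. The following is a hypothetical driver (the function name `refresh_stale_nodes`, the bookkeeping dictionary `last_update`, and the staleness test are illustrative assumptions, not taken from the paper), showing how a subset of nodes might be refreshed between control steps while the current greedy policy induced by theta is executed in parallel:

```python
def refresh_stale_nodes(nodes, last_update, iteration, P, depth):
    """Illustrative stale-node sweep: revisit nodes that are newly added or
    have not been refreshed for P rounds, using the depth-bounded back_prop."""
    stale = [x for x in nodes if iteration - last_update.get(x, -P) >= P]
    for x in stale:
        back_prop(x, depth)          # asynchronous Bellman backup from above
        last_update[x] = iteration
    return stale
```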

4. Doubly-Asynchronous Value Iteration for Large MDP Replanning

Doubly-Asynchronous Value Iteration (DAVI) generalizes asynchronous VI to permit sampling of both state and action subsets (Tian et al., 2022). Core features:

  • Finite MDP Structure: The method targets MDPs $(S,A,r,p,\gamma)$ with vast state and/or action sets. The goal is the optimal value function $v^*$:

$$v^*(s) = \max_{a\in A}\left\{ r(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,v^*(s') \right\}$$

  • Doubly-Asynchronous Updates: At each step (see the sketch at the end of this section):
    • Sample a state $s_n$
    • Sample an action subset $A_n$ of size $m\ll |A|$
    • Compute the lookahead $L^{v_n}(s_n, a)$ for $a\in A_n \cup \{\pi_n(s_n)\}$
    • Update $v_{n+1}(s_n)$ with the maximum lookahead over the sampled actions
    • Update $\pi_{n+1}(s_n)$ only if the sampled maximizer $a_n^*$ exceeds the prior best
  • Algorithmic Scalability: Each update is $O(m)$ in arithmetic, with $O(|S|)$ memory for the value and action arrays.
  • Convergence: With appropriate sampling ($\Pr[s_n=s,\,a\in A_n]>0$ for all state-action pairs), almost sure convergence and high-probability geometric convergence are established. The total number of updates for $\epsilon$-optimality:

$$\tau = H_{\gamma,\epsilon}\,\frac{\ln(S\,H_{\gamma,\epsilon}/\delta)}{\ln\!\left(1/(1-q_{\min})\right)}$$

with $q_{\min}=m/(SA)$ for uniform sampling.

  • Empirical Findings: For very large $|A|$, smaller $m$ expedites individual updates; too small an $m$ may slow overall convergence in sparse-reward settings, while a moderate $m$ (e.g., $10$–$100$ in practice) yields the best trade-off.
  • Online Integration: In an online replanning loop, DAVI can run for a programmable, time-budgeted number of updates between control actions. State and action sampling can be biased towards high-priority regions of the state-action space (e.g., high Bellman error, or recent/likely future states).
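A minimal sketch of one DAVI-style update is given below, under the assumption of a tabular MDP with known reward and transition functions; the function name `davi_update` and its argument conventions (e.g., `p(sp, s, a)` for the transition probability) are illustrative, not from the paper.

```python
import random

def davi_update(s, v, pi, states, actions, m, r, p, gamma):
    """One doubly-asynchronous update at state s (illustrative sketch).

    Samples m actions, evaluates the one-step Bellman lookahead for each
    sampled action plus the incumbent pi[s], writes the maximum into v[s],
    and switches pi[s] only when a sampled action beats the incumbent."""
    sampled = random.sample(actions, m)

    def lookahead(a):
        # L^{v}(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) v(s')
        return r(s, a) + gamma * sum(p(sp, s, a) * v[sp] for sp in states)

    vals = {a: lookahead(a) for a in sampled}
    best_a = max(vals, key=vals.get)
    incumbent_val = lookahead(pi[s])
    v[s] = max(vals[best_a], incumbent_val)
    if vals[best_a] > incumbent_val:
        pi[s] = best_a
    return v[s]
```

In an online replanning loop, such updates would be repeated for a time-budgeted number of sampled states between control actions, optionally biasing the sampling toward high-Bellman-error or recently visited states as described above.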

5. Comparative Computational Complexity and Anytime Properties

| Method | Per-Update Cost | Memory | Convergence |
|---|---|---|---|
| Full VI | $O(AS)$ | $O(S)$ | Synchronous, batch, optimal |
| Asynchronous VI | $O(A)$ | $O(S)$ | Monotonic, per-state |
| DAVI | $O(m)$ | $O(S)$ | Monotonic, sampled |
| iPolicy (async) | $O(|V_k|\,b^{m_k})$ | $O(|V_k|)$ | Monotonic, contractive |
| VProp ($K$-step conv) | $O(K)$ convolutions | $O(\text{model})$ | Differentiable, soft plan |

In practice, doubly-asynchronous and incremental value iteration enable swift adaptation: both DAVI and iPolicy support monotonic value improvement during control, and may be interrupted at any point to produce feasible if suboptimal policies. VProp achieves replanning times in the tens of milliseconds range for substantial map sizes by exploiting GPU convolutional parallelism and local propagation.

6. Practical Guidance for Integration

Best practices across these methods include:

  • Parameterization: Choose the propagation depth ($K$ in VProp) or recursion budget ($m_k$ in iPolicy) to match the expected planning horizon or computational constraints.
  • State/Action Sampling: For DAVI, begin with $m$ equal to the available compute per cycle; focus state and action sampling on regions of high value error or near the current trajectory for accelerated adaptation.
  • Embedding and Feature Representations: Use convolutional or graph-convolutional embeddings for VProp in spatial domains; for non-grid environments, prefer attention-based mechanisms or local neighborhoods.
  • Real-Time Constraints: Limit sweep or propagation depth under hard timing requirements, or sparsify the planning graph.
  • Positive-Reward Propagation: When generalization is paramount (e.g., unseen obstacle arrangements), restrict value transfers to positive-reward paths (see MVProp) to enhance robustness.
  • Empirical Tuning: Dynamically adjust computational budgets (e.g., $m$ in DAVI) based on observed convergence speed and system latency constraints (see the budget-adaptation sketch after this list).
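As one way to operationalize the tuning advice above, the sketch below adapts an action-sample budget to a latency target by measuring each replanning cycle. The controller `adapt_action_budget`, the callback `cycle_fn`, and the halving/doubling rule are hypothetical choices for illustration, not prescribed by any of the cited papers.

```python
import time

def adapt_action_budget(m, cycle_fn, target_latency_s, m_min=8, m_max=512):
    """Illustrative budget controller: run one replanning cycle with action-sample
    budget m, then nudge m so the measured cycle time tracks the latency target."""
    start = time.perf_counter()
    cycle_fn(m)                                  # e.g., a batch of DAVI-style updates
    elapsed = time.perf_counter() - start
    if elapsed > target_latency_s:
        m = max(m_min, m // 2)                   # overran the budget: shrink samples
    elif elapsed < 0.5 * target_latency_s:
        m = min(m_max, m * 2)                    # plenty of slack: sample more actions
    return m
```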

7. Empirical Results and Impact in Dynamic Settings

Experiments demonstrate the superiority of learning-based and sample-efficient value iteration techniques for online replanning:

  • MazeBase static and dynamic grid-worlds (VProp): VProp attains ~94% win rate on $16\times16$ grids, maintaining 53% at $64\times64$, where VIN drops to ~4%. MVProp achieves ~100% across all sizes.
  • Adversarial and stochastic environments: VProp and MVProp outperform VIN in dynamic scenarios, with win rates of ~70–80% (VProp) and ~90–95% (MVProp) at $32\times32$.
  • StarCraft navigation (VProp): Learns pixel-to-action planning; achieves ~80–90% success routing around dynamic, adversarial hazards, outperforming earlier differentiable VI architectures.
  • Robotic feedback planning (iPolicy): Real-time feedback policies enable complex maneuvers (e.g., parallel parking) in tens of seconds with monotonic improvement, confirmed in point-mass, unicycle, and Dubins car models.
  • Large-action MDPs (DAVI): For $10^4$-action settings, moderate action sample sizes yield rapid convergence with significant computational savings, supporting timely policy refinement during execution.

These results collectively establish value iteration-based replanners as a practical foundation for adaptive control under dynamic, high-dimensional, or uncertain conditions, with algorithmic trade-offs tailored to problem structure, resource budgets, and real-time requirements.
