Value Iteration for Online Replanning

Updated 12 November 2025
  • Value Iteration for Online Replanning is a family of algorithms that update value functions incrementally to enable real-time adaptation in dynamic and uncertain environments.
  • Differentiable variants like VProp and asynchronous methods such as iPolicy and DAVI improve computational efficiency while scaling to large state-action spaces.
  • Empirical results in grid-world navigation, robotic feedback, and large MDPs demonstrate robust performance and rapid policy refinement under changing conditions.

Value iteration for online replanning refers to a family of algorithmic approaches that leverage value iteration (VI) or its differentiable, asynchronous, or sampled variants to achieve real-time plan adaptation in dynamic, partially observable, and/or high-dimensional environments. The necessity of online replanning arises when an environment undergoes frequent changes (e.g., moving obstacles, stochastic transitions, or adversarial agents), requiring the agent not merely to plan ahead, but to adapt its policy quickly as new information is observed. Recent work formalizes online replanning architectures both in the context of discrete Markov Decision Processes (MDPs) and continuous control, focusing on efficient computational strategies and scalable representations.

1. Core Concepts and Motivation

Classical value iteration updates the value function for all states synchronously, maximizing over the full action space at each state. This yields optimal policies in tabular settings, but it is computationally prohibitive for large-scale or continuous domains, especially when the plan must be updated in real time as the environment changes. Online replanning reframes VI as an “anytime” procedure: a value function or policy estimate is maintained and updated in the background, interleaving replanning with action execution.

Key motivating domains include:

  • Grid-world navigation with dynamic obstacles
  • Robotic feedback motion planning under changing dynamics or free space
  • Large-scale MDPs with vast state-action spaces where synchronous sweeps are impractical
  • Pixel-to-action planning from raw sensory data

In these scenarios, the planner must adjust to observation-driven state changes on-the-fly, often trading off between action quality and computational timeliness.
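To make the interleaving concrete, the following is a minimal sketch (not drawn from any of the cited papers) of an anytime replanning loop for a tabular MDP, in which a bounded batch of asynchronous Bellman backups runs between successive control actions. The environment interface (`env.reset`, `env.step`, `env.done`, `env.replanning_candidates`) and the `backups_per_step` budget are illustrative assumptions.

```python
def bellman_backup(s, v, actions, r, p, states, gamma):
    """Single asynchronous Bellman backup at state s for a tabular MDP."""
    v[s] = max(r(s, a) + gamma * sum(p(sp, s, a) * v[sp] for sp in states)
               for a in actions)

def anytime_replanning_loop(env, v, states, actions, r, p, gamma,
                            backups_per_step=50):
    """Interleave a bounded number of value backups with action execution,
    so the value estimate keeps adapting as the environment changes."""
    s = env.reset()
    while not env.done():
        # Background (re)planning: a small, interruptible batch of backups,
        # focused here on states the environment flags as worth refreshing.
        for x in env.replanning_candidates(s, backups_per_step):
            bellman_backup(x, v, actions, r, p, states, gamma)
        # Act greedily with respect to the current (possibly partial) values.
        a = max(actions, key=lambda a: r(s, a) +
                gamma * sum(p(sp, s, a) * v[sp] for sp in states))
        s = env.step(a)
```

The budget `backups_per_step` is the knob that trades action quality against computational timeliness, which is exactly the trade-off the methods below address in different ways.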

2. Differentiable Value Iteration for Replanning: Value Propagation Networks

Value Propagation (VProp) networks operationalize value iteration as a differentiable, parameter-efficient module within a reinforcement learning agent (Nardelli et al., 2018). The foundational elements are:

  • Input Embedding $\Phi(s)$: A shallow convolutional (or graph-convolutional) neural network processes the current state observation (grid maps or downsampled pixels). For each state $(i,j)$, $\Phi(s)$ outputs:
    • $r^{(\text{in})}_{i,j}$: entry reward
    • $r^{(\text{out})}_{i,j}$: exit cost
    • $p_{i,j}$: gating coefficient (state-dependent propagation/discount, $p_{i,j}\in[0,1]$)
  • Bellman-style Recurrence: For $k=1,\dots,K$ (planning steps), update value maps via

$$v_{i,j}^{(k)} = \max\left\{ v_{i,j}^{(k-1)},\; \max_{(i',j')\in\mathcal{N}(i,j)}\left[ p_{i,j}\,v_{i',j'}^{(k-1)} + \left(r_{i',j'}^{(\text{in})} - r_{i,j}^{(\text{out})}\right) \right] \right\}$$

where $\mathcal{N}(i,j)$ denotes the spatial/grid neighbors of $(i,j)$ (a minimal sketch of this recurrence appears after this list).

  • Online Adaptation: At each time step, the environment is sensed anew, $\Phi(s)$ recomputes $r,p$ for each location, and $K$ VI steps are unrolled in the forward/actor pass. Newly discovered obstacles suppress $p_{i,j}$ (near zero), blocking propagation and steering the value “wavefront” dynamically with minimal computational overhead.
  • End-to-end RL Training: Embedded in an actor–critic loop, VProp enables parameter learning by backpropagation through the $K$-step VI. This supports:
    • Training on arbitrary interactive tasks
    • Weight sharing between the value network and critic
    • Removal of supervised planner trace dependencies
  • Empirical Results: VProp achieves robust performance in both static and dynamically changing grid-worlds and pixel-level navigation tasks (94% win rate on $16\times 16$ grid-worlds, 53% at $64\times 64$, outperforming VIN, and near-optimal performance with MVProp). The gating mechanism and positive-reward propagation in MVProp demonstrate enhanced generalization and replanning robustness. The architecture scales to real time on modern GPUs (tens of milliseconds for $64\times 64$ maps).
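The recurrence above can be expressed directly over value and reward maps. Below is a minimal NumPy sketch on a 4-connected grid; the function name `vprop_recurrence` and the plain-array inputs are illustrative, not the authors' implementation (in VProp proper the maps come from the learned embedding $\Phi(s)$ and the $K$ unrolled steps are differentiated through during training).

```python
import numpy as np

def vprop_recurrence(r_in, r_out, p, K):
    """Illustrative VProp-style value propagation on a 4-connected grid.

    r_in, r_out, p: (H, W) arrays of entry rewards, exit costs, and gating
    coefficients; K: number of propagation (planning) steps.
    """
    H, W = r_in.shape
    v = np.zeros((H, W))                                 # v^{(0)} = 0
    for _ in range(K):
        # Pad neighbor maps; -inf entry rewards on the border guarantee that
        # padded cells are never the maximizing neighbor.
        vp = np.pad(v, 1, constant_values=0.0)
        rp = np.pad(r_in, 1, constant_values=-np.inf)
        candidates = []
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            v_nb = vp[1 + di:1 + di + H, 1 + dj:1 + dj + W]   # v^{(k-1)} at neighbor
            r_nb = rp[1 + di:1 + di + H, 1 + dj:1 + dj + W]   # neighbor entry reward
            candidates.append(p * v_nb + (r_nb - r_out))
        v = np.maximum(v, np.max(np.stack(candidates), axis=0))
    return v
```

Because the update is a max over local neighborhoods, suppressing $p_{i,j}$ at newly sensed obstacles immediately blocks propagation through those cells on the next forward pass, which is what makes the module suitable for online replanning.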

3. Incremental, Anytime Value Iteration: iPolicy Algorithm

The iPolicy algorithm provides a general framework for feedback motion planning using incremental graph construction and asynchronous value iteration (Zhao et al., 5 Jan 2024). Key components:

  • Graph-Based Discretization: The method incrementally samples the continuous state space, expanding a graph $G_k=(V_k,E_k)$ whose nodes $V_k$ progressively approximate the domain. Edges encode one-step dynamical feasibility ($\dot{x}=f(x,u)$ under a timestep $\epsilon_k$ and uncertainty radius).
  • Kruzhkov-Transformed Value Estimates: The minimal travel time $T^*$ is estimated through the transform $\Theta^*(x)=1-e^{-T^*(x)}$, which improves numerical properties.
  • Asynchronous Bellman Updates: At each iteration, only a “stale” subset of nodes (newly added or not updated for $P$ rounds) is revisited, via a depth-bounded recursive BackProp call (a driver sketch over stale nodes follows this list):
    def back_prop(x, depth):
        """Depth-bounded recursive Bellman backup over successors F_k(x)."""
        if x in G_goal or depth == 0:
            return theta[x]
        # Refresh successor estimates one level deeper, then back up the minimum.
        succ_values = {xp: back_prop(xp, depth - 1) for xp in F_k(x)}
        theta[x] = delta_k + beta_k * min(succ_values.values())
        return theta[x]
    This selective updating, rather than full synchronous sweeps, yields a sharply reduced amortized computational cost.
  • Anytime Replanning: Control execution proceeds in parallel with background updates; iPolicy can be halted at any time to yield a feasible feedback policy.
  • Convergence Guarantees: Under decay conditions on the sample dispersion ($d_k\to 0$), timestep ($\epsilon_k\to 0$, $\epsilon_k/d_k\to\infty$), and contraction requirements, the value function estimate converges almost surely to the true continuous-time optimum.
  • Complexity and Performance: Sampling and neighbor search are logarithmic in $|V_k|$; the per-iteration asynchronous VI cost is $O(|V_k|\,b^{m_k})$, but the amortized cost is $O(P\,b^{m_k} + P\log|V_k|)$. Empirical studies on point-mass, simple car, and Dubins vehicle models confirm real-time feasibility and monotonic improvement in the value function and closed-loop performance.
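The back_prop routine above is driven by a stale-node selection step. The following is a hypothetical driver (the function name `refresh_stale_nodes`, the bookkeeping dictionary `last_update`, and the staleness test are illustrative assumptions, not taken from the paper), showing how a subset of nodes might be refreshed between control steps while the current greedy policy induced by theta is executed in parallel:

```python
def refresh_stale_nodes(nodes, last_update, iteration, P, depth):
    """Illustrative stale-node sweep: revisit nodes that are newly added or
    have not been refreshed for P rounds, using the depth-bounded back_prop."""
    stale = [x for x in nodes if iteration - last_update.get(x, -P) >= P]
    for x in stale:
        back_prop(x, depth)          # asynchronous Bellman backup from above
        last_update[x] = iteration
    return stale
```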

4. Doubly-Asynchronous Value Iteration for Large MDP Replanning

Doubly-Asynchronous Value Iteration (DAVI) generalizes asynchronous VI to permit sampling of both state and action subsets (Tian et al., 2022). Core features:

  • Finite MDP Structure: The method targets MDPs $(S,A,r,p,\gamma)$ with vast state and/or action sets. The goal is the optimal value function $v^*$:

$$v^*(s) = \max_{a\in A}\left\{ r(s,a) + \gamma\sum_{s'}p(s'\mid s,a)\,v^*(s') \right\}$$

  • Doubly-Asynchronous Updates: At each step (see the sketch at the end of this section):
    • Sample a state $s_n$
    • Sample an action subset $A_n$ of size $m\ll |A|$
    • Compute the lookahead $L^{v_n}(s_n, a)$ for $a\in A_n \cup \{\pi_n(s_n)\}$
    • Update $v_{n+1}(s_n)$ with the maximum lookahead over the sampled actions
    • Update $\pi_{n+1}(s_n)$ only if the sampled maximizer $a_n^*$ exceeds the prior best
  • Algorithmic Scalability: Each update is $O(m)$ in arithmetic, with $O(|S|)$ memory for the value and action arrays.
  • Convergence: With appropriate sampling ($\Pr[s_n=s,\,a\in A_n]>0$ for all state-action pairs), almost sure convergence and high-probability geometric convergence are established. The total number of updates for $\epsilon$-optimality:

$$\tau = H_{\gamma,\epsilon}\,\frac{\ln(S\,H_{\gamma,\epsilon}/\delta)}{\ln\!\left(1/(1-q_{\min})\right)}$$

with $q_{\min}=m/(SA)$ for uniform sampling.

  • Empirical Findings: For very large $|A|$, smaller $m$ expedites individual updates; too small an $m$ may slow overall convergence in sparse-reward settings, while a moderate $m$ (e.g., $10$–$100$ in practice) yields the best trade-off.
  • Online Integration: In an online replanning loop, DAVI can run for a programmable, time-budgeted number of updates between control actions. State and action sampling can be biased towards high-priority regions of the state-action space (e.g., high Bellman error, or recent/likely future states).
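A minimal sketch of one DAVI-style update is given below, under the assumption of a tabular MDP with known reward and transition functions; the function name `davi_update` and its argument conventions (e.g., `p(sp, s, a)` for the transition probability) are illustrative, not from the paper.

```python
import random

def davi_update(s, v, pi, states, actions, m, r, p, gamma):
    """One doubly-asynchronous update at state s (illustrative sketch).

    Samples m actions, evaluates the one-step Bellman lookahead for each
    sampled action plus the incumbent pi[s], writes the maximum into v[s],
    and switches pi[s] only when a sampled action beats the incumbent."""
    sampled = random.sample(actions, m)

    def lookahead(a):
        # L^{v}(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) v(s')
        return r(s, a) + gamma * sum(p(sp, s, a) * v[sp] for sp in states)

    vals = {a: lookahead(a) for a in sampled}
    best_a = max(vals, key=vals.get)
    incumbent_val = lookahead(pi[s])
    v[s] = max(vals[best_a], incumbent_val)
    if vals[best_a] > incumbent_val:
        pi[s] = best_a
    return v[s]
```

In an online replanning loop, such updates would be repeated for a time-budgeted number of sampled states between control actions, optionally biasing the sampling toward high-Bellman-error or recently visited states as described above.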

5. Comparative Computational Complexity and Anytime Properties

| Method | Per-Update Cost | Memory | Convergence |
|---|---|---|---|
| Full VI | $O(AS)$ | $O(S)$ | Synchronous, batch, optimal |
| Asynchronous VI | $O(A)$ | $O(S)$ | Monotonic, per-state |
| DAVI | $O(m)$ | $O(S)$ | Monotonic, sampled |
| iPolicy (async) | $O(|V_k|\,b^{m_k})$ | $O(|V_k|)$ | Monotonic, contractive |
| VProp ($K$-step conv) | $O(K)$ convolutions | $O(\text{model})$ | Differentiable, soft plan |

In practice, doubly-asynchronous and incremental value iteration enable swift adaptation: both DAVI and iPolicy support monotonic value improvement during control, and may be interrupted at any point to produce feasible if suboptimal policies. VProp achieves replanning times in the tens of milliseconds range for substantial map sizes by exploiting GPU convolutional parallelism and local propagation.

6. Practical Guidance for Integration

Best practices across these methods include:

  • Parameterization: Choose the propagation depth ($K$ in VProp) or recursion budget ($m_k$ in iPolicy) to match the expected planning horizon or computational constraints.
  • State/Action Sampling: For DAVI, begin with $m$ equal to the available compute per cycle; focus state and action sampling on regions of high value error or near the current trajectory for accelerated adaptation.
  • Embedding and Feature Representations: Use convolutional or graph-convolutional embeddings for VProp in spatial domains; for non-grid environments, prefer attention-based mechanisms or local neighborhoods.
  • Real-Time Constraints: Limit sweep or propagation depth under hard timing requirements, or sparsify the planning graph.
  • Positive-Reward Propagation: When generalization is paramount (e.g., unseen obstacle arrangements), restrict value transfers to positive-reward paths (see MVProp) to enhance robustness.
  • Empirical Tuning: Dynamically adjust computational budgets (e.g., $m$ in DAVI) based on observed convergence speed and system latency constraints (see the budget-adaptation sketch after this list).
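As one way to operationalize the tuning advice above, the sketch below adapts an action-sample budget to a latency target by measuring each replanning cycle. The controller `adapt_action_budget`, the callback `cycle_fn`, and the halving/doubling rule are hypothetical choices for illustration, not prescribed by any of the cited papers.

```python
import time

def adapt_action_budget(m, cycle_fn, target_latency_s, m_min=8, m_max=512):
    """Illustrative budget controller: run one replanning cycle with action-sample
    budget m, then nudge m so the measured cycle time tracks the latency target."""
    start = time.perf_counter()
    cycle_fn(m)                                  # e.g., a batch of DAVI-style updates
    elapsed = time.perf_counter() - start
    if elapsed > target_latency_s:
        m = max(m_min, m // 2)                   # overran the budget: shrink samples
    elif elapsed < 0.5 * target_latency_s:
        m = min(m_max, m * 2)                    # plenty of slack: sample more actions
    return m
```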

7. Empirical Results and Impact in Dynamic Settings

Experiments demonstrate the superiority of learning-based and sample-efficient value iteration techniques for online replanning:

  • MazeBase static and dynamic grid-worlds (VProp): VProp attains ~94% win rate on $16\times16$ grids, maintaining 53% at $64\times64$, where VIN drops to ~4%. MVProp achieves ~100% across all sizes.
  • Adversarial and stochastic environments: VProp and MVProp outperform VIN in dynamic scenarios, with win rates of ~70–80% (VProp) and ~90–95% (MVProp) at $32\times32$.
  • StarCraft navigation (VProp): Learns pixel-to-action planning; achieves ~80–90% success routing around dynamic, adversarial hazards, outperforming earlier differentiable VI architectures.
  • Robotic feedback planning (iPolicy): Real-time feedback policies enable complex maneuvers (e.g., parallel parking) in tens of seconds with monotonic improvement, confirmed in point-mass, unicycle, and Dubins car models.
  • Large-action MDPs (DAVI): For $10^4$-action settings, moderate action sample sizes yield rapid convergence with significant computational savings, supporting timely policy refinement during execution.

These results collectively establish value iteration-based replanners as a practical foundation for adaptive control under dynamic, high-dimensional, or uncertain conditions, with algorithmic trade-offs tailored to problem structure, resource budgets, and real-time requirements.
