Dynamic Speculative Planning (DSP)

Updated 4 July 2026
  • Dynamic Speculative Planning (DSP) is a method that speculates with a fast approximator and verifies results with a stronger mechanism, ensuring adaptive and lossless acceleration.
  • DSP employs control variables like horizon, support, and width to optimize decoding, agent planning, and path evaluation across diverse applications.
  • DSP balances latency, cost, and safety via online prediction and verification, achieving efficiency improvements and significant speedups in various domains.

Dynamic Speculative Planning (DSP) denotes a class of methods that speculate ahead with a cheap approximation process, verify or correct those speculative results with a stronger or exact mechanism, and adapt speculation online rather than fixing it in advance. The term is used explicitly for an asynchronous online reinforcement learning framework for LLM-based agents that predicts speculation depth, drafts future actions with an approximation agent, verifies them with a target agent, and exposes a single control parameter for the latency–cost tradeoff (Guan et al., 2 Sep 2025). Across adjacent literatures, the same pattern appears in dynamic speculation length optimization for decoding, dynamic token-tree construction, context-dependent vocabulary restriction, serving-time horizon control, speculative path planning, and safety-assured planning under uncertain futures (Mamou et al., 2024, Xiong et al., 2024, Zhang et al., 11 Oct 2025, Li et al., 27 Dec 2025, Bakhshalipour et al., 2021, Liu et al., 2023).

1. Emergence and conceptual scope

In the narrowest sense, DSP names the agent framework in which speculative multi-step plans are generated by a cheaper approximation agent, verified by the original stronger agent, and scheduled by an online predictor that learns how far to speculate at each state (Guan et al., 2 Sep 2025). In a broader sense, the literature shows the same architectural pattern in several settings even when authors use different names. Dynamic speculation length optimization for decoding, dynamic-width speculative beam decoding, speculative verification for Vision-Language-Action control, and speculative planning with adaptive prediction all instantiate the same core structure: propose cheaply, validate with a stronger mechanism, and update the speculative horizon or support as the state changes (Mamou et al., 2024, Qin et al., 2024, Wang et al., 3 Apr 2026, Liu et al., 2023).

A useful synthesis is that DSP is not tied to one granularity. In token generation, the speculative object can be a lookahead length, a token tree, or a context-dependent shortlist. In agent systems, it can be a sequence of future actions. In path planning, it can be future node evaluations. In autonomous driving, it can be a distribution over behavior modes and trajectory realizations. This suggests that DSP is best characterized by its control structure rather than by any single application domain.

In the broader literature, the label often functions as an interpretive umbrella rather than as the authors' own formal taxonomy. Several papers are explicit that they do not use the exact term “Dynamic Speculative Planning,” yet their methods are naturally described that way because they perform online, state-dependent speculative allocation of compute or action budget before a stronger verifier, safety filter, or exact model resolves the decision (Bakhshalipour et al., 2021, Hua et al., 2024, Huang et al., 24 Mar 2026).

2. Core mechanism and formal structure

The agent formulation in "Dynamic Speculative Agent Planning" (Guan et al., 2 Sep 2025) is the most explicit formalization of DSP. It models speculative-step control as an MDP

$$M=(\mathcal{S}, \mathcal{A}, P, p_0, R, \gamma),$$

where states are partial trajectories or token sequences, actions are speculative steps, and the policy predicts how many future actions the approximation agent should attempt before waiting for verification. The target agent remains authoritative: if the approximation agent diverges, the target’s action is adopted, downstream speculative threads are canceled, and execution resumes from the corrected state. In this framework, “lossless acceleration” means that the final committed plan always follows the target agent’s policy (Guan et al., 2 Sep 2025).
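The commit rule described here can be illustrated with a minimal synchronous sketch. It assumes both agents are deterministic functions from a state to an action; the names `speculate_and_verify`, `run`, `approx`, and `target` are hypothetical, and the asynchronous threading of the actual framework is omitted.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]          # a state is the tuple of actions committed so far
Policy = Callable[[State], int]  # hypothetical interface: state -> next action


def speculate_and_verify(state: State, approx: Policy, target: Policy, k: int) -> Tuple[State, int]:
    """Draft k actions with the cheap agent, then commit only target-consistent steps.

    Returns the new state and the number of drafted actions that were discarded.
    """
    # Draft phase: the approximation agent rolls forward k steps on its own guesses.
    drafted: List[int] = []
    s = state
    for _ in range(k):
        a = approx(s)
        drafted.append(a)
        s = s + (a,)

    # Verify phase: the target agent is authoritative at every step.
    s = state
    for i, a in enumerate(drafted):
        t = target(s)
        s = s + (t,)              # the target's action is always what gets committed
        if t != a:                # divergence: cancel the remaining speculative steps
            return s, len(drafted) - i
    return s, 0


def run(steps: int, approx: Policy, target: Policy, k: int) -> Tuple[State, int]:
    """Repeat speculate/verify rounds until `steps` actions have been committed."""
    state: State = ()
    wasted = 0
    while len(state) < steps:
        depth = min(k, steps - len(state))
        state, w = speculate_and_verify(state, approx, target, depth)
        wasted += w
    return state, wasted
```

Whatever `approx` drafts, the trajectory returned by `run` matches what `target` would have produced alone; only the count of discarded drafts changes, which is the sense in which the acceleration is lossless.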

The predictor is trained online with TD learning and $\lambda$-returns:

$$L_{\theta} = E_{\tau \sim \pi} \left[(G_t^\lambda - V_{\theta}(s_t))^2\right],$$

with expectile bias introduced through

$$L_{\theta} = E_{\tau \sim \pi} \left[L_2^\tau(G_t^\lambda - V_\theta(s_t))\right], \qquad L_2^\tau(u)=|\tau-\mathbf{1}(u<0)|\,u^2.$$

The expectile parameter $\tau$ is the principal control knob: $\tau>0.5$ biases toward larger speculative depth and lower latency at higher cost, while $\tau<0.5$ biases toward conservative speculation and lower cost (Guan et al., 2 Sep 2025). A second, simpler control is the post-prediction offset

$$k=\max(1,\hat{k}+\beta),$$

which changes aggressiveness without retraining.
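To make the effect of the two knobs concrete, the following sketch fits a single scalar prediction to a set of observed returns under the expectile loss and then applies the offset. It is an illustration only: the scalar predictor, the gradient-descent settings, and the function names are assumptions, not the paper's training setup.

```python
def expectile_loss(u: float, tau: float) -> float:
    """Asymmetric squared loss L_2^tau(u) = |tau - 1(u < 0)| * u^2."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * u * u


def fit_expectile(targets, tau, lr=0.05, steps=2000):
    """Fit one scalar v by gradient descent on the mean expectile loss of (g - v)."""
    v = sum(targets) / len(targets)
    for _ in range(steps):
        grad = 0.0
        for g in targets:
            u = g - v
            w = abs(tau - (1.0 if u < 0 else 0.0))
            grad += -2.0 * w * u          # derivative of w * (g - v)^2 with respect to v
        v -= lr * grad / len(targets)
    return v


def offset_depth(k_hat: int, beta: int) -> int:
    """Post-prediction offset k = max(1, k_hat + beta)."""
    return max(1, k_hat + beta)
```

With the same targets, a fit at $\tau=0.9$ lands above the mean and a fit at $\tau=0.1$ lands below it, which matches the stated bias toward deeper or shallower speculation; `offset_depth` shifts the result afterwards and never lets the depth drop below one step.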

This formal structure has close analogues elsewhere. DISCO treats speculative decoding as a runtime stop/continue decision, using a shallow classifier

$$C_i=\mathrm{FFN}(\mathrm{Concat}(\mathrm{Top}_k(y^{D}_i),\mathrm{Ent}(y^{D}_i),i))$$

to decide whether to continue drafting or verify immediately (Mamou et al., 2024). Nightjar formulates serving-time control as choosing speculative length $\gamma \in \{0,1,\dots,\Gamma_{\max}\}$, including $\gamma=0$ to disable speculation entirely when it becomes counterproductive under high load (Li et al., 27 Dec 2025). DSDE uses a feedback controller over speculation length based on recent KLD behavior, rather than a learned router, showing that DSP need not be learned end to end to be effective (Yang et al., 1 Sep 2025).
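The classifier input for the DISCO-style stop/continue decision can be sketched as follows. The feature layout (top-$k$ draft probabilities, entropy, position) follows the formula; the single logistic layer standing in for the FFN, its weights, and the function names are hypothetical.

```python
import math
from typing import List


def disco_features(draft_probs: List[float], k: int, position: int) -> List[float]:
    """Concatenate the top-k draft probabilities, the draft entropy, and the token position."""
    top_k = sorted(draft_probs, reverse=True)[:k]
    entropy = -sum(p * math.log(p) for p in draft_probs if p > 0)
    return top_k + [entropy, float(position)]


def should_continue(features: List[float], weights: List[float], bias: float,
                    threshold: float = 0.5) -> bool:
    """Hypothetical one-layer stand-in for the FFN: a logistic score on the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= threshold
```

A peaked draft distribution yields high top-$k$ values and low entropy, which a trained classifier could map to "keep drafting"; a flat distribution does the opposite.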

3. Control variables: horizon, support, and width

A plausible taxonomy is that DSP methods adapt one or more of three quantities: speculative horizon, proposal support, and speculative width.

| Control variable | Mechanism | Representative papers |
|---|---|---|
| Horizon | Dynamic speculation length or step count | (Guan et al., 2 Sep 2025, Mamou et al., 2024, Li et al., 27 Dec 2025, Sridhar et al., 3 Nov 2025, Yang et al., 1 Sep 2025) |
| Support | Context-dependent action or vocabulary restriction | (Zhang et al., 11 Oct 2025) |
| Width | Dynamic beam/tree branching | (Qin et al., 2024, Xiong et al., 2024, Andronov et al., 2 Aug 2025) |

Horizon control is the most common form. DISCO shows that a fixed speculation lookahead is suboptimal because the oracle speculation length varies substantially across iterations; empirically, it reduces latency relative to the best static SL baseline while generating exactly the same text (Mamou et al., 2024). Nightjar extends the same idea to continuous-batching serving, where the optimal $\gamma$ depends on batch size and load regime rather than on prompt difficulty alone; it uses a contextual multi-armed bandit and can disable speculative decoding when no positive goodput remains (Li et al., 27 Dec 2025). TapOut makes the action one level more abstract by using a bandit to select among dynamic stopping heuristics instead of predicting depth directly, and does so without any training or hyperparameter tuning (Sridhar et al., 3 Nov 2025). DSDE replaces forward confidence with post-hoc KLD stability, deriving each sequence's speculation length from its recent KLD behavior and then imposing a batch cap equal to the mean predicted speculation length across active sequences (Yang et al., 1 Sep 2025).
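Bandit-based horizon control with a "no speculation" arm can be sketched in a few lines. This is a plain epsilon-greedy bandit keyed by a coarse context label, used here as a simplified stand-in; it is not Nightjar's algorithm, and the class name, context labels, and reward signal are assumptions.

```python
import random
from collections import defaultdict


class HorizonBandit:
    """Epsilon-greedy bandit over speculative lengths 0..gamma_max, one table per context."""

    def __init__(self, gamma_max: int, epsilon: float = 0.1, seed: int = 0):
        self.arms = list(range(gamma_max + 1))     # arm 0 means "do not speculate"
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.count = defaultdict(lambda: [0] * len(self.arms))
        self.mean = defaultdict(lambda: [0.0] * len(self.arms))

    def choose(self, context: str) -> int:
        counts = self.count[context]
        untried = [a for a in self.arms if counts[a] == 0]
        if untried:
            return untried[0]                      # try every arm once per context
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)      # explore
        return self.best(context)                  # exploit

    def update(self, context: str, arm: int, reward: float) -> None:
        self.count[context][arm] += 1
        n = self.count[context][arm]
        self.mean[context][arm] += (reward - self.mean[context][arm]) / n

    def best(self, context: str) -> int:
        means = self.mean[context]
        return max(self.arms, key=lambda a: means[a])
```

In use, the serving loop would call `choose` with the current load bucket, run one decoding step at that speculative length, and feed the measured goodput back through `update`. If speculation only hurts under heavy load, the learned best arm for that context becomes 0.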

Support control appears most clearly in DynaSpec. Instead of projecting the drafter hidden state to the full vocabulary, DynaSpec partitions the vocabulary into token clusters and uses a lightweight router to pick a context-dependent shortlist. The paper proves that an oracle dynamic shortlist yields weakly better accepted length than any static shortlist of the same size, because expected retained target mass is larger for context-dependent support than for a fixed subset (Zhang et al., 11 Oct 2025). This is a strong DSP result: adapting the proposal space to state is theoretically superior to any static restriction under the same budget.
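The retained-mass argument can be checked on a toy example. The sketch below compares the best fixed shortlist with a per-context oracle shortlist of the same size; the distributions and function names are made up for illustration, and the code measures retained target mass only, not accepted length.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def retained_mass(dist: Sequence[float], shortlist: Sequence[int]) -> float:
    """Target probability mass that survives restriction to the shortlist."""
    return sum(dist[t] for t in shortlist)


def best_static_shortlist(dists: List[Sequence[float]], size: int) -> Tuple[int, ...]:
    """Single shortlist maximizing total retained mass across all contexts (brute force)."""
    vocab = range(len(dists[0]))
    return max(combinations(vocab, size),
               key=lambda s: sum(retained_mass(d, s) for d in dists))


def oracle_dynamic_shortlist(dist: Sequence[float], size: int) -> Tuple[int, ...]:
    """Per-context shortlist: the `size` most probable tokens under this context."""
    ranked = sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)
    return tuple(ranked[:size])


def average_mass(dists, shortlists) -> float:
    return sum(retained_mass(d, s) for d, s in zip(dists, shortlists)) / len(dists)
```

Because the oracle picks the top tokens separately for each context, its retained mass in every context is at least that of any fixed subset of the same size, so the average can never be lower.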

Width control appears in dynamic speculative beam and tree methods. DSBD adapts the number of accepted beams at each speculative layer by estimating the probability that enough draft beams will be accepted, and then sets a dynamic target width accordingly (Qin et al., 2024). DySpec does something analogous for token trees: it greedily expands the frontier node with highest estimated expected acceptance contribution, using draft probability as a surrogate for acceptance, and derives a surrogate optimality result for the resulting subtree (Xiong et al., 2024). In retrosynthetic planning, speculative beam search combined with Medusa performs the same function at the level of precursor SMILES generation, accelerating the inner beam-search expansion used by AiZynthFinder (Andronov et al., 2 Aug 2025).
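Greedy frontier expansion of a token tree can be sketched with a priority queue. The sketch assumes that a node's score is the product of draft probabilities along its path, which is one simple way to use draft probability as an acceptance surrogate; the `draft` callable and `build_token_tree` are hypothetical names, and this is not DySpec's exact procedure.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def build_token_tree(draft: Callable[[Prefix], Dict[str, float]],
                     budget: int) -> List[Tuple[Prefix, float]]:
    """Grow a token tree by always adding the frontier candidate with the highest path probability."""
    tree: List[Tuple[Prefix, float]] = []
    # heapq is a min-heap, so scores are stored negated: (-path_prob, prefix).
    frontier = [(-p, (tok,)) for tok, p in draft(()).items()]
    heapq.heapify(frontier)
    while frontier and len(tree) < budget:
        neg_p, prefix = heapq.heappop(frontier)
        tree.append((prefix, -neg_p))
        # Children inherit the parent's path probability times their own draft probability.
        for tok, p in draft(prefix).items():
            heapq.heappush(frontier, (neg_p * p, prefix + (tok,)))
    return tree
```

Under this scoring the tree goes deep along confident continuations and wide where the draft model is uncertain, so the branching shape adapts to the state rather than being fixed in advance.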

4. Verification, exactness, and safety

Verification is the defining counterweight to speculation. In speculative decoding, correctness is preserved because the verifier is the target model and only target-consistent tokens are committed. DISCO retains Leviathan-style rejection sampling and therefore preserves the exact target-model output distribution; DynaSpec similarly keeps verification over the full target vocabulary even though drafting uses a dynamic shortlist (Mamou et al., 2024, Zhang et al., 11 Oct 2025). In DySpec, accepted draft tokens are still filtered through target-model probabilities and residual sampling, even though the dynamic token tree is planned using draft-model surrogates (Xiong et al., 2024).
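The exactness claim for rejection-sampling verification can be checked numerically. The sketch below implements the standard speculative-sampling rule for one token: accept a drafted token $x$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(0, p - q)$. The function names are illustrative, and both distributions are assumed to be dictionaries over the same tokens.

```python
import random
from typing import Dict

Dist = Dict[str, float]


def accept_prob(p: Dist, q: Dist, x: str) -> float:
    """Probability of accepting drafted token x (target p, draft q, with q[x] > 0)."""
    return min(1.0, p[x] / q[x])


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q), the distribution used after a rejection."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    z = sum(raw.values())
    return {t: v / z for t, v in raw.items()} if z > 0 else dict(p)


def verify_token(p: Dist, q: Dist, x: str, rng: random.Random) -> str:
    """Return the committed token given a token x drafted from q."""
    if rng.random() < accept_prob(p, q, x):
        return x
    res = residual(p, q)
    tokens = list(res)
    return rng.choices(tokens, weights=[res[t] for t in tokens])[0]


def committed_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the committed token when x is drafted from q."""
    accepted = {t: q[t] * accept_prob(p, q, t) for t in q if q[t] > 0}
    reject_mass = 1.0 - sum(accepted.values())
    res = residual(p, q)
    return {t: accepted.get(t, 0.0) + reject_mass * res[t] for t in p}
```

For any draft distribution with full support, `committed_distribution` reproduces the target distribution, which is why the choice of drafter affects speed but not the output distribution.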

The agent version of DSP generalizes this exactness principle from tokens to actions. In Dynamic Speculative Agent Planning, only target-verified actions are committed, so speculative concurrency changes execution order and cost but not the final behavior of the target agent (Guan et al., 2 Sep 2025). Interactive Speculative Planning uses the same prefix-validity rule, except that a human can also replace a step during visible delay or immediately after target output; in the current implementation, that user-provided step acts like an oracle correction (Hua et al., 2024).

Not all verifiers are exact-model checks. In SV-VLA, the heavy VLA is a low-frequency macro-planner, while a lightweight verifier predicts a closed-loop reference action, compares it with the current planned action, and triggers replanning when the discrepancy between them is judged too large (Wang et al., 3 Apr 2026). Here verification is an online consistency test under fresh observations rather than an exact symbolic accept/reject rule.

Safety-constrained DSP appears in autonomous driving and path planning. Safety-Assured Speculative Planning with Adaptive Prediction defines a probabilistic prediction interface, optimizes expected reward over predicted futures, and rejects any current action that may be unsafe in the worst case over all feasible realizations. The paper proves that safety is preserved if the realized future remains inside the conservative prediction support and a safe decision exists initially (Liu et al., 2023). Speculative Path Planning keeps A*’s open/close order unchanged and speculatively precomputes collision checks for likely future states, thereby preserving optimality for A* and bounded suboptimality for weighted A* while exploiting idle threads (Bakhshalipour et al., 2021). These cases suggest that DSP is compatible with exactness, robust safety, and planner invariants, provided speculation is confined to proposals or auxiliary computations rather than final commitment.
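The combination of expected-reward optimization with a worst-case safety filter can be written down compactly. The sketch assumes a finite set of predicted futures with probabilities; `choose_action`, `reward`, and `is_safe` are hypothetical names, and the driving scenario in the usage note is invented for illustration.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

Scenario = Tuple[float, Hashable]   # (probability, predicted future)


def choose_action(actions: Sequence[Hashable],
                  scenarios: List[Scenario],
                  reward: Callable[[Hashable, Hashable], float],
                  is_safe: Callable[[Hashable, Hashable], bool]) -> Optional[Hashable]:
    """Pick the action with the highest expected reward among actions safe in every scenario."""
    # Worst-case filter: one unsafe scenario is enough to reject an action,
    # no matter how unlikely that scenario is.
    safe = [a for a in actions if all(is_safe(a, f) for _, f in scenarios)]
    if not safe:
        return None                  # no safe action exists under the predicted support
    # Speculative part: rank the remaining actions by expected reward.
    return max(safe, key=lambda a: sum(p * reward(a, f) for p, f in scenarios))
```

For example, if accelerating has the highest expected reward but is unsafe in a low-probability "lead vehicle brakes hard" future, the filter removes it and the planner falls back to the best remaining action. The probabilities shape the ranking but never override the safety check.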

5. Domain realizations

The same control pattern recurs across domains, but the speculative object changes.

| Domain | Speculative object | Representative systems |
|---|---|---|
| LLM agents | Future action steps or plan depth | (Guan et al., 2 Sep 2025, Hua et al., 2024) |
| LLM decoding and serving | Lookahead length, shortlist, token tree, beam width | (Mamou et al., 2024, Zhang et al., 11 Oct 2025, Li et al., 27 Dec 2025, Sridhar et al., 3 Nov 2025, Yang et al., 1 Sep 2025, Qin et al., 2024, Xiong et al., 2024) |
| Embodied control | Action chunks with online replanning | (Wang et al., 3 Apr 2026) |
| Agentic multimodal reasoning | Depth-0 direct answer versus full tool loop | (Huang et al., 24 Mar 2026) |
| Search and planning | Node evaluations or uncertain future scenarios | (Bakhshalipour et al., 2021, Liu et al., 2023) |
| Retrosynthesis | Precursor-SMILES beam continuations | (Andronov et al., 2 Aug 2025) |

In LLM agents, DSP is most explicit: a predictor chooses how many future actions the approximation agent should speculate, and the target agent verifies them asynchronously (Guan et al., 2 Sep 2025). Interactive Speculative Planning adds a user interface that exposes speculative delay and allows user interruption to modify the accepted prefix online (Hua et al., 2024).

In token decoding, the literature splits according to which quantity is adapted. DISCO, Nightjar, TapOut, and DSDE adapt horizon; DynaSpec adapts support; DSBD and DySpec adapt width or tree structure (Mamou et al., 2024, Li et al., 27 Dec 2025, Sridhar et al., 3 Nov 2025, Yang et al., 1 Sep 2025, Zhang et al., 11 Oct 2025, Qin et al., 2024, Xiong et al., 2024). Together these papers show that speculative planning in decoding is not limited to “how many tokens ahead,” but also includes “which tokens are even considered” and “how many futures remain alive.”

In embodied and multimodal systems, speculation moves above the token level. SV-VLA performs open-loop planning with closed-loop verification: a heavy planner proposes an action chunk, and a lightweight verifier decides whether execution may continue or replanning is necessary (Wang et al., 3 Apr 2026). SpecEyes lifts speculation to the agentic control loop itself: a small tool-free MLLM attempts a direct answer, and a confidence gate based on answer separability decides whether the system can terminate immediately or must defer to the full tool-using agent (Huang et al., 24 Mar 2026). That is a restricted but important DSP form: adaptive selection of whether a deep reasoning trajectory is needed at all.

In classical planning domains, the same pattern becomes more explicit as search-space control. Speculative Path Planning predicts likely future A* expansions and uses idle compute to pre-evaluate collision checks (Bakhshalipour et al., 2021). Safety-Assured Speculative Planning reasons over multiple behavior modes and trajectory parameters, updates them online, and chooses the current action by maximizing expected reward subject to a worst-case safety filter (Liu et al., 2023). In retrosynthesis, speculative beam search does not speculate over full synthesis routes, but it accelerates the inner single-step expansion so effectively that the outer planner solves substantially more molecules under fixed time limits (Andronov et al., 2 Aug 2025).
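The idea of pre-evaluating collision checks without disturbing the search can be sketched on a grid. The code below runs ordinary A* with memoized collision checks and, optionally, warms the cache for the neighbors of the node at the head of the open list. In the sketch the prefetch runs inline, standing in for work that idle threads would do; the grid encoding, the prefetch heuristic, and all names are assumptions.

```python
import heapq


def astar(grid, start, goal, prefetch=False):
    """4-connected A* on a grid of 0 (free) / 1 (blocked).

    Returns (path_cost, main_loop_cache_misses); path_cost is None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])
    cache = {}          # memoized collision-check results
    misses = 0          # checks the main search loop had to compute itself

    def neighbors(cell):
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (r + dr, c + dc)
            if 0 <= n[0] < rows and 0 <= n[1] < cols:
                yield n

    def collision_free(cell):   # stands in for an expensive collision check
        return grid[cell[0]][cell[1]] == 0

    def h(cell):                # Manhattan distance, admissible on this grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    best = {start: 0}
    closed = set()
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur in closed:
            continue
        closed.add(cur)
        if cur == goal:
            return g, misses
        for n in neighbors(cur):
            if n not in cache:
                misses += 1
                cache[n] = collision_free(n)
            if cache[n] and g + 1 < best.get(n, float("inf")):
                best[n] = g + 1
                heapq.heappush(open_heap, (g + 1 + h(n), g + 1, n))
        if prefetch and open_heap:
            # Speculation: precompute checks around the node likely to be expanded next.
            for n in neighbors(open_heap[0][2]):
                if n not in cache:
                    cache[n] = collision_free(n)
    return None, misses
```

Because the cache only stores results the search would have computed anyway, the expansion order and the returned cost are the same with and without `prefetch`; only the number of checks left on the critical path changes.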

6. Empirical performance, trade-offs, and open questions

The empirical record is consistent on one point: fixed speculation policies are usually suboptimal. Dynamic Speculative Agent Planning achieves efficiency comparable to the fastest lossless acceleration method while reducing both total cost and unnecessary cost (Guan et al., 2 Sep 2025). DISCO reduces average latency relative to both the best static SL baseline and a dynamic heuristic baseline while preserving the same generated text as the target model (Mamou et al., 2024). DynaSpec raises mean accepted length over FR-Spec while using an even smaller average shortlist (Zhang et al., 11 Oct 2025). Nightjar reports higher throughput and lower latency than standard speculative decoding by adapting $\gamma$ to serving load and allowing $\gamma=0$ (Li et al., 27 Dec 2025). DySpec reports throughput improvements and latency reductions on Llama2-70B under low temperature (Xiong et al., 2024). In retrosynthesis, speculative beam search with Medusa lets AiZynthFinder solve more molecules under the same time constraints of several seconds (Andronov et al., 2 Aug 2025). In multimodal agent settings, SpecEyes achieves speedups while preserving or improving accuracy, and SV-VLA's success rate and speed are reported relative to both a closed-loop baseline and an open-loop baseline (Huang et al., 24 Mar 2026, Wang et al., 3 Apr 2026).

The trade-offs are equally consistent. Faster speculation usually raises wasted work unless the controller is state-aware. The DSP agent requires warmup and an extra predictor, even though it avoids offline pretraining (Guan et al., 2 Sep 2025). DISCO has no formal dynamic-optimality theorem, only oracle evidence and classifier-based control (Mamou et al., 2024). Nightjar’s context is mostly batch size, so different load states with similar batch sizes may still be conflated (Li et al., 27 Dec 2025). DSDE’s KLD-based signal is useful as a control diagnostic but weak as a token-level predictor, and the current implementation runs in eager mode because dynamic SL would require repeated CUDA graph recapture (Yang et al., 1 Sep 2025). SpecEyes remains restricted to depth-0 speculation and relies on threshold tuning for answer separability (Huang et al., 24 Mar 2026). Safety-Assured Speculative Planning depends on the assumption that a conservative prediction support contains the true future (Liu et al., 2023). DSBD contains a notation inconsistency in the acceptance ratio between the main text and appendix proof, even though the intended residual-accept-reject mechanism is clear (Qin et al., 2024).

A plausible implication is that DSP is less a single algorithm than a reusable systems pattern. Cheap speculative processes can control horizon, support, or width; verifiers can take the form of exact target checks, replanning triggers, safety filters, or cached auxiliary computations; and the dominant bottleneck may lie at token, action, agent, or search-node granularity. The central open problem across the literature is therefore not whether speculation helps, but how to allocate speculative budget online so that latency, cost, and correctness remain jointly favorable under changing states, domains, and system loads.
