Dynamic Sliding Block (DSB)

Updated 23 June 2026
  • Dynamic Sliding Block (DSB) is a training-free, adaptive scheduling strategy that adjusts decoding block boundaries in diffusion LLMs based on token semantic difficulty.
  • It introduces a novel DSB Cache mechanism to efficiently manage KV-state recomputation, ensuring improved throughput and preserved causal order.
  • Empirical evaluations show DSB enhances accuracy by up to 3 points and boosts tokens per second by 8–10% compared to fixed block scheduling.

Dynamic Sliding Block (DSB) is a training-free, adaptive block-scheduling strategy for diffusion LLMs (dLLMs) designed to address the limitations of fixed, predefined block schedules in the context of parallel text generation. DSB dynamically adapts the active decoding block’s position and size based on the semantic difficulty of token positions, maximizing inference efficiency and output quality while preserving global left-to-right order alignment. The method introduces algorithmic innovations for block inference and caching—most notably the DSB Cache—to efficiently handle shifting block boundaries and support rapid generation with negligible accuracy loss. Empirical results across diverse models and text benchmarks demonstrate substantial improvements in both throughput and generation quality compared to prior approaches, constituting a new paradigm for blockwise decoding in dLLMs (Luo et al., 5 Feb 2026).

1. Motivation and Background

Diffusion LLMs (dLLMs) generate text by unmasking tokens iteratively within a fully masked response of length $L$, typically selecting the position with the highest model confidence $c_i^{(t)} = \max_{v\in \mathcal{V}} p_\theta(y_i = v \mid X, y^{(t)})$ at each step. In global parallel decoding, this approach is susceptible to “order misalignment,” where tokens of high semantic confidence that naturally appear early in the output may be unmasked after their more difficult neighbors, undermining the causal left-to-right structure critical to textual coherence.
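The selection rule above can be illustrated with a minimal sketch. It assumes the model's per-position distributions are already available as plain lists (with `None` marking positions that are already unmasked); the function names are hypothetical and not taken from the source.

```python
from typing import List, Optional, Tuple

Dist = Optional[List[float]]  # None marks an already-unmasked position


def confidences(probs: List[Dist]) -> List[Optional[float]]:
    """Per-position confidence c_i = max_v p(y_i = v | X, y^(t))."""
    return [None if p is None else max(p) for p in probs]


def most_confident_masked(probs: List[Dist]) -> Optional[Tuple[int, int]]:
    """Return (position, token_id) of the masked position with the highest
    confidence, or None when nothing is left to unmask."""
    best = None
    for i, p in enumerate(probs):
        if p is None:
            continue
        c = max(p)
        if best is None or c > best[0]:
            best = (c, i, p.index(c))
    return None if best is None else (best[1], best[2])


# Toy distributions over a 3-token vocabulary (illustrative values only).
probs = [
    [0.5, 0.3, 0.2],    # position 0: uncertain
    None,               # position 1: already unmasked
    [0.1, 0.85, 0.05],  # position 2: confident
    [0.4, 0.4, 0.2],    # position 3: uncertain
]
print(most_confident_masked(probs))
```

In this toy input, position 2 is selected before position 0, so a purely global confidence rule can decode positions out of left-to-right order.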

Block inference, wherein the response is partitioned into $K = L / B$ contiguous blocks of fixed size $B$, is widely employed to reestablish causality by forcing the decoder to finish block $k$ before proceeding to block $k+1$. However, naive fixed-block schedules are agnostic to token-level semantic difficulty, causing unnecessary delays for positions at the boundary that are already highly confident and premature commitments for difficult tokens within the block. Semantic difficulty for each token can be estimated as $d_i^{(t)} = 1 - c_i^{(t)}$, but naive schedules ignore this signal entirely.
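A small sketch may make the fixed-block limitation concrete. It partitions a response into blocks of size $B$, computes $d_i = 1 - c_i$, and lists confident positions that a fixed schedule keeps waiting because they lie beyond the current block. The confidence threshold, the helper names, and the handling of a shorter final block are assumptions made for illustration.

```python
from typing import List, Tuple


def fixed_blocks(length: int, block_size: int) -> List[Tuple[int, int]]:
    """Partition positions 0..length-1 into contiguous half-open blocks.
    Assumption: if length is not a multiple of block_size, the last block is shorter."""
    return [(s, min(s + block_size, length)) for s in range(0, length, block_size)]


def difficulty(conf: List[float]) -> List[float]:
    """Semantic difficulty d_i = 1 - c_i."""
    return [1.0 - c for c in conf]


def waiting_outside_block(conf: List[float], masked: List[bool],
                          block: Tuple[int, int], threshold: float) -> List[int]:
    """Masked positions to the right of the current block whose confidence
    already meets the (hypothetical) threshold; a fixed schedule delays them."""
    _, e = block
    return [i for i in range(e, len(conf)) if masked[i] and conf[i] >= threshold]


blocks = fixed_blocks(8, 4)
conf = [0.5, 0.75, 0.25, 0.5, 0.875, 0.25, 0.5, 0.5]  # toy confidences
masked = [True] * 8
print(blocks)
print(difficulty(conf)[4])
print(waiting_outside_block(conf, masked, blocks[0], threshold=0.8))
```

Here position 4 sits just past the first block's boundary and is already easy (low difficulty), yet a fixed schedule would not touch it until the first block is finished.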

2. Formal Methodology of Dynamic Sliding Block

At each denoising step $t$, DSB maintains an active block $B^{(t)} = [s^{(t)}, e^{(t)})$ whose left ($s^{(t)}$) and right ($e^{(t)}$) boundaries are updated dynamically:

  • The left boundary $s^{(t)}$ slides forward to the earliest still-masked position in the active block, or to the block's right boundary if all tokens in the block are unmasked.
  • The right boundary $e^{(t)}$ grows to accommodate a target number of unresolved tokens past the prompt, up to a hard cap on the block size.
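The two boundary rules can be sketched as follows. This is one reading of the prose, not the source's exact update equations: positions are indexed over the response only (position 0 is the first token after the prompt), `target_masked` and `max_size` are hypothetical parameter names, and `max_size=None` stands for "no hard cap".

```python
from typing import List, Optional, Tuple


def slide_left(masked: List[bool], s: int, e: int) -> int:
    """Move the left boundary to the earliest still-masked position in [s, e),
    or to e if every position in the block is already unmasked."""
    for i in range(s, e):
        if masked[i]:
            return i
    return e


def grow_right(masked: List[bool], s: int, e: int,
               target_masked: int, max_size: Optional[int]) -> int:
    """Extend the right boundary until [s, e) holds target_masked masked
    positions, without exceeding the optional size cap or the sequence end."""
    limit = len(masked) if max_size is None else min(len(masked), s + max_size)
    count = sum(masked[s:e])
    while e < limit and count < target_masked:
        count += int(masked[e])
        e += 1
    return e


def update_block(masked: List[bool], s: int, e: int,
                 target_masked: int, max_size: Optional[int]) -> Tuple[int, int]:
    """Apply the left-boundary slide, then the right-boundary growth."""
    s = slide_left(masked, s, e)
    e = grow_right(masked, s, max(s, e), target_masked, max_size)
    return s, e


# Positions 0, 1 and 3 are already unmasked; the block was [0, 4).
masked = [False, False, True, False, True, True, True, True]
print(update_block(masked, 0, 4, target_masked=4, max_size=4))
print(update_block(masked, 0, 4, target_masked=4, max_size=None))
```

With the cap equal to the target, the block behaves like a fixed-size window that slides with the left boundary; without a cap, it keeps growing until it contains the requested number of masked positions.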

The boundary updates are expressed in terms of the number of tokens fully unmasked by step $t$ and the prompt length.

[Two display equations defining the boundary updates are not recoverable from the extracted source.]

Algorithm 1 in the source summarizes the procedure in pseudocode: token selection, unmasking, state updating, boundary sliding, and block growth.
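The loop structure described above might look like the following toy sketch. It is not the source's Algorithm 1: the model is replaced by a static list of confidences (a real dLLM recomputes them every step), and the threshold-based parallel unmasking with a single-token fallback, along with the names `block_size`, `max_size`, and `threshold`, are assumptions.

```python
from typing import List, Optional


def decode_trace(conf: List[float], block_size: int,
                 max_size: Optional[int], threshold: float = 0.9) -> List[List[int]]:
    """Return, for each step, the positions unmasked in that step."""
    n = len(conf)
    masked = [True] * n
    s, e = 0, 0
    trace: List[List[int]] = []
    while any(masked):
        # Boundary sliding: skip past everything already unmasked on the left.
        while s < n and not masked[s]:
            s += 1
        # Block growth: include up to block_size masked positions, within the cap.
        limit = n if max_size is None else min(n, s + max_size)
        e = max(e, s)
        count = sum(masked[s:e])
        while e < limit and count < block_size:
            count += int(masked[e])
            e += 1
        # Token selection: every confident masked position in the block,
        # or the single most confident one if none passes the threshold.
        active = [i for i in range(s, e) if masked[i]]
        chosen = [i for i in active if conf[i] >= threshold]
        if not chosen:
            chosen = [max(active, key=lambda i: conf[i])]
        # Unmasking.
        for i in chosen:
            masked[i] = False
        trace.append(chosen)
    return trace


toy_conf = [0.95, 0.4, 0.95, 0.95]  # position 1 is the hard token
print(decode_trace(toy_conf, block_size=2, max_size=2))     # capped window
print(decode_trace(toy_conf, block_size=2, max_size=None))  # no hard cap
```

In this toy run the uncapped setting reaches the easy position 3 one step earlier and leaves the hard position 1 for later. Because the confidences are static, the step counts here say nothing about real throughput.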

Algorithmic variants include:

  • DSB(const.): the block size is held constant, yielding a fixed-size sliding window.
  • DSB(greedy): the hard cap is lifted, yielding a maximal “greedy” block that opportunistically decodes easy tokens.

3. DSB Cache: Efficient KV-State Management

Traditional blockwise caching strategies (prefix cache or dual cache) recompute key-value (KV) states only upon block completion. This is inadequate for DSB, whose block boundaries move adaptively and may result in stale or inconsistent KV states.

DSB Cache resolves this by:

  • Maintaining a fixed-length prefix window immediately before the block.
  • Recomputing KV states for the union of the prefix window and the active block at every step.
  • Caching and reusing all other KV states once positions exit the prefix window, at constant cost per token.
  • Performing a global cache refresh each time a fixed number of new tokens has been unmasked, to avoid long-term drift.
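The bookkeeping implied by the steps above can be sketched without any model code. The sketch only decides which positions would be recomputed and when a global refresh is due; `prefix_len` and `refresh_every` are hypothetical names for the prefix-window length and the refresh interval, and positions are assumed to be indexed over the full sequence.

```python
from typing import List


def positions_to_recompute(s: int, e: int, prefix_len: int) -> List[int]:
    """Positions whose KV states are recomputed this step: the prefix window
    just before the block plus the active block [s, e)."""
    start = max(0, s - prefix_len)
    return list(range(start, e))


class RefreshCounter:
    """Signals a global cache refresh once refresh_every tokens have been
    unmasked since the previous refresh."""

    def __init__(self, refresh_every: int) -> None:
        self.refresh_every = refresh_every
        self.since_refresh = 0

    def record(self, newly_unmasked: int) -> bool:
        """Add this step's unmasked tokens; return True when a refresh is due."""
        self.since_refresh += newly_unmasked
        if self.since_refresh >= self.refresh_every:
            self.since_refresh = 0
            return True
        return False


print(positions_to_recompute(s=10, e=14, prefix_len=3))
counter = RefreshCounter(refresh_every=4)
print([counter.record(k) for k in (2, 1, 1, 3)])
```

Every position outside the returned list would be served from the cache in this reading, so the per-step work tracks the prefix window plus the block rather than the full sequence.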

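Memory complexity is unchanged by DSB Cache. Per-step computation drops because only the prefix window and the active block are recomputed, while every other position is served from the cache; the asymptotic expressions given in the source are not recoverable from the extracted text. This results in substantial runtime speedup on hardware accelerators with no significant loss in generation quality.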

4. Theoretical Foundations and Complexity

No formal convergence theorems or upper bounds on the number of DSB iterations are provided. The source gives only a worst-case count of denoising steps, with each step paying for self-attention over the active block and over the prefix window. Empirical analysis demonstrates stable computational cost and efficiency: DSB never exceeds the runtime of naive block decoding under the same configuration.

A plausible implication is that, despite the absence of formal guarantees, DSB’s design is “no-regret” with respect to realistic workload and hardware constraints.

5. Empirical Evaluations

Extensive experiments were performed across LLaDA-8B-Instruct, LLaDA-1.5, Dream-Base-7B, and Dream-Instruct-7B on GSM8K, MATH, HumanEval, MBPP, and BBH. Key performance metrics include accuracy (%) and tokens per second (TPS). The results indicate that:

  • DSB(const.) improves both accuracy (+1–3 pts) and TPS (+3–5%) over naive blocks under parallel decoding.
  • DSB(greedy) further increases TPS with a marginal trade-off in accuracy.
  • The introduction of DSB Cache leads to notable gains: e.g., on GSM8K, DSB + DSB Cache increases accuracy by up to 3 pts and throughput by 8–10% over naive block + Dual Cache.

Abridged benchmarking results:

| Decode + Cache | Block | GSM8K (Acc % / TPS) | HumanEval (Acc % / TPS) |
|---|---|---|---|
| Confidence | Naive | 77.26 / 48.7 | 40.85 / 119.1 |
| Confidence | DSB(const.) | 78.17 / 50.4 | 42.07 / 124.6 |
| Confidence + DSB Cache | DSB(const.) | 80.14 / 98.1 | 37.80 / 105.3 |
| Confidence + DSB Cache | DSB(greedy) | 80.29 / 99.6 | 39.63 / 107.7 |
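As a quick arithmetic check on the abridged table, the following sketch computes the accuracy difference (in points) and the relative TPS change of DSB(const.) over the naive block schedule, both without a cache. The numbers are copied from the table; the helper name is illustrative.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative change of new over base, in percent."""
    return 100.0 * (new - base) / base


# (accuracy %, tokens per second), copied from the abridged table.
naive = {"GSM8K": (77.26, 48.7), "HumanEval": (40.85, 119.1)}
dsb_const = {"GSM8K": (78.17, 50.4), "HumanEval": (42.07, 124.6)}

for bench in naive:
    acc_delta = dsb_const[bench][0] - naive[bench][0]
    tps_gain = relative_gain(dsb_const[bench][1], naive[bench][1])
    print(f"{bench}: accuracy {acc_delta:+.2f} pts, TPS {tps_gain:+.1f}%")
```

For these two rows the relative TPS changes come out at roughly 3.5% and 4.6%, which is consistent with reading the reported TPS improvement as a percentage.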

Ablation studies demonstrate that the prefix window is critical; omitting it (i.e., falling back to naive Dual Cache) results in a 3–4 pt drop in accuracy and a 15–20% decline in TPS. DSB is robust to variations in its hyperparameters, consistently outperforming naive block schedules.

6. Significance, Limitations, and Future Research

DSB’s primary contributions are:

  • An adaptive block schedule that dynamically tracks semantic difficulty, sliding block boundaries as soon as confidence warrants, and maximizing model parallelism while faithfully preserving word order.
  • DSB Cache, a KV-state management innovation that supports frequent boundary movement and enhances throughput with minimal impact on output quality.
  • Demonstrated generality and efficiency over a range of dLLMs and benchmarks.

Identified limitations include the lack of theoretical bounds on inference steps and convergence, and the fact that DSB is presently a training-free, inference-time method only. Potential future directions include integrating dynamic block scheduling into LLM pretraining or fine-tuning paradigms, and establishing formal theoretical guarantees for iteration complexity and convergence (Luo et al., 5 Feb 2026).
