- The paper introduces SDVG, a framework that pairs a small drafter with a large target model to accelerate autoregressive video generation.
- It employs worst-frame scoring as a quality metric, achieving up to 2.09× speedup while preserving nearly 98% of target model fidelity.
- SDVG offers a training-free, plug-and-play method that enables a flexible quality–speed tradeoff using a single threshold parameter.
Speculative Decoding for Autoregressive Video Generation: A Technical Summary
Introduction
The paper "Speculative Decoding for Autoregressive Video Generation" (2604.17397) addresses the inference cost of large-scale, blockwise autoregressive video diffusion models. While these architectures enable high-fidelity streaming synthesis, they are expensive at scale: state-of-the-art 10B+ parameter models require top-end hardware even for real-time operation. The core question examined is whether the throughput of smaller video generators can be harnessed without compromising the generative quality afforded by the largest models.
Context and Related Work
Autoregressive video generation, inspired by language modeling, produces content causally by conditioning each generated video block on the history maintained in a KV cache. This setup directly supports streaming inference and mitigates exposure bias, issues that affect non-autoregressive and teacher-forced diffusion approaches respectively.
Existing acceleration strategies in both diffusion and autoregressive contexts include:
- Step Distillation and Compositional Sampling: These trade precision for speed, targeting denoising steps but do not directly address model parameter count [yin2024onestepdiffusiondistributionmatching].
- Split Trajectory Methods (e.g., T-Stitch, SRDiffusion, HybridStitch): These allocate portions of the denoising trajectory to smaller models, but lack adaptive correction and impose content-agnostic splits [pan2024tstitchacceleratingsamplingpretrained, cheng2025srdiffusionacceleratevideodiffusion, sun2026hybridstitch].
- Request Routing (e.g., MoDM): Routes entire requests to smaller models based on cache hits, but provides no per-block adaptivity or guarantees.
In language modeling, speculative decoding [leviathan2023fastinferencetransformersspeculative, chen2023acceleratinglargelanguagemodel] has become the default mechanism for throughput improvement: a small model drafts tokens and a large model verifies them, accepting or resampling based on exact token probabilities via rejection sampling. That probabilistic acceptance rule is fundamentally inapplicable in the high-dimensional, continuous-tensor space of video generation.
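For contrast, the token-level rule that SDVG must replace can be sketched as follows. This is a minimal NumPy illustration of the standard accept/resample step from LLM speculative decoding; the function and variable names are ours, not from any particular implementation:

```python
import numpy as np

def speculative_accept(p_target, p_draft, token, rng):
    """Accept a drafted token with probability min(1, p_target/p_draft);
    on rejection, resample from the residual max(0, p_target - p_draft).
    Assumes the drafter assigns nonzero probability to its own draft."""
    ratio = p_target[token] / p_draft[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

rng = np.random.default_rng(0)
p_t = np.array([0.6, 0.3, 0.1])   # target distribution (illustrative)
p_d = np.array([0.2, 0.5, 0.3])   # drafter distribution (illustrative)
tok, accepted = speculative_accept(p_t, p_d, token=1, rng=rng)
```

This rule guarantees the output distribution matches the target exactly, a guarantee SDVG gives up because video blocks are continuous tensors without tractable per-sample likelihoods.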
SDVG: Method and Design
Framework
The proposed SDVG (Speculative Decoding for Autoregressive Video Generation) framework adapts speculative decoding to spatiotemporal video blocks by pairing a small "drafter" with a large "target" model. For each block, the drafter proposes candidate frames using a reduced number of denoising steps. These candidates are VAE-decoded and scored via ImageReward, an established scalar image-text reward model.
Routing and Acceptance Policy
The SDVG pipeline replaces probabilistic token verification with an image-quality router:
- Each block is scored, and if the minimum reward over all frames (worst-frame aggregation) exceeds a threshold τ, the candidate block is accepted into the target's KV cache.
- Blocks with scores below τ are regenerated using the large model with full resolution.
- To prevent cumulative scene composition errors, the first block is always force-rejected and generated by the target.
Key design characteristics:
- Threshold τ serves as a single, easy-to-tune quality–speed tradeoff knob.
- The criterion is block-level, does not require any step- or trajectory-level engineering, and is applicable to existing autoregressive pipelines without retraining or architectural modification.
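Putting these pieces together, the block-level routing loop can be sketched as follows. The stub model class and all interfaces here are our illustration, not the paper's code; real drafter and target models would denoise latents and manage a shared KV cache:

```python
class StubModel:
    """Placeholder standing in for an autoregressive video diffusion model."""
    def __init__(self, name):
        self.name = name

    def generate_block(self, prompt, history):
        return (self.name, len(history))  # tag each block with its producer

def generate_video(prompt, drafter, target, score_frames, tau, num_blocks):
    blocks = []
    for i in range(num_blocks):
        if i == 0:
            # force-reject the first block: the target fixes scene composition
            block = target.generate_block(prompt, history=blocks)
        else:
            candidate = drafter.generate_block(prompt, history=blocks)
            # worst-frame aggregation: one low-reward frame rejects the block
            if min(score_frames(candidate, prompt)) >= tau:
                block = candidate  # accepted into the rolling history
            else:
                block = target.generate_block(prompt, history=blocks)
        blocks.append(block)
    return blocks

# all candidate frames score well above tau, so blocks 1+ come from the drafter
out = generate_video("a red kite over dunes", StubModel("draft"),
                     StubModel("target"),
                     score_frames=lambda block, prompt: [0.9, 0.8, 0.7],
                     tau=-0.7, num_blocks=4)
```

Note that accepted and regenerated blocks alike are appended to the same history, which is what lets the target condition on drafter-produced blocks without retraining.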
Motivation for Worst-frame Scoring
Empirically, mean-frame scoring can mask single-frame degradation artifacts (e.g., temporal flicker), so the block's worst frame determines acceptance. This makes the method conservative, preventing rare but severe per-frame errors from propagating into the generation history.
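A small numeric illustration (the scores below are invented, not from the paper) shows how a single flickered frame slips past the mean but is caught by the minimum:

```python
# Per-frame rewards for a block containing one badly flickered frame.
frame_scores = [0.82, 0.79, -1.40, 0.85]

mean_score = sum(frame_scores) / len(frame_scores)  # 0.265: looks healthy
worst_score = min(frame_scores)                     # -1.40: flags the bad frame

tau = -0.7
accept_by_mean = mean_score >= tau    # True: the flicker slips through
accept_by_worst = worst_score >= tau  # False: block is routed to the target
```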
Experimental Results
Protocol
Experiments leverage:
- Drafter: Wan2.1-T2V-1.3B (small, efficient)
- Target: Krea Realtime Video 14B (high-fidelity, computationally expensive)
- Evaluation includes 1003 prompts from MovieGenVideoBench at 832×480 resolution with VisionReward [xu2026visionrewardfinegrainedmultidimensionalhuman] as the principal metric, and measures wall-clock speedup.
- Baselines: draft-only (only drafter), target-only (only target), and step-level ablations.
Main Results
SDVG consistently demonstrates strong quality–efficiency tradeoff:
- At τ=−0.7, SDVG achieves 98.1% of target-only VisionReward (0.0773 vs. 0.0788) with a 1.59× speedup.
- At τ=−2.5, speedup increases to 2.09×, retaining 95.7% of target quality—still maintaining a >17% advantage over draft-only generation.
- Sweeping τ from stricter to looser values traces a smooth Pareto frontier, enabling principled deployment configuration.
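The reported speedups are consistent with a simple cost model; the sketch below is our back-of-envelope illustration, not an analysis from the paper. Every block pays the drafter plus scoring cost, and rejected blocks additionally pay the target:

```python
def expected_speedup(accept_rate, c_draft, c_target, c_score=0.0):
    """Speedup over running the target on every block, given per-block
    costs in arbitrary time units (all parameter values are assumptions)."""
    cost_per_block = c_draft + c_score + (1.0 - accept_rate) * c_target
    return c_target / cost_per_block

# e.g. a drafter at ~10% of target cost with 70% of blocks accepted
speedup = expected_speedup(0.7, c_draft=0.1, c_target=1.0)  # 2.5x
```

Under this model, loosening τ raises the acceptance rate and hence the speedup, at the cost of admitting more drafter-quality blocks, which matches the Pareto behavior described above.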
Ablations
- Replacing ImageReward routing with random routing markedly reduces quality, confirming that a quality-aware routing signal is essential.
- Substituting worst-frame aggregation with average-frame scoring lets through artifacts that block means cannot detect, degrading output quality and underscoring the need for per-frame filtering.
Limitations
Three principal limitations are identified:
- Distributional Shift: Absence of exact-probability acceptance as in LLMs induces small shifts towards drafter characteristics, most prominently at high accept rates.
- Router Model Constraints: ImageReward, as an image-text reward model, cannot capture motion or temporal consistency artifacts. An advanced, video block-aware reward model would potentially improve quality routing.
- Wasted Compute: For rejected blocks, the drafter's computation (including decoding) is discarded, an overhead which could potentially be mitigated via batching or speculative VAE techniques.
Implications and Future Directions
SDVG provides a composable, inference-only speedup mechanism, orthogonal to trajectory-splitting schemes, for block-based autoregressive video models. This enables straightforward quality–throughput tradeoffs without system-level complexity or retraining. The method is readily deployable and can synergize with hardware-specific and step-distillation acceleration strategies for further uplift.
Theoretically, the replacement of explicit logit-based accept/reject in speculative decoding with reward-guided proxies highlights an important axis for future research: development of robust, temporally consistent, and multimodal reward models that can serve as effective acceptance oracles for structured, continuous generative tasks beyond language.
On the practical front, SDVG could drive real-time, high-fidelity video generation within constrained latency environments, including interactive media, video synthesis for AR/VR, and dynamic video editing toolchains.
Conclusion
SDVG introduces a training-free, plug-and-play speculative decoding framework for streaming autoregressive video diffusion, matching large-model output quality with up to twice the throughput. Key design choices—worst-frame aggregation and initial block force-rejection—are validated as critical for robust quality. The framework imposes no architecture changes and operates with a single, interpretable parameter for navigating the quality–speed tradeoff. This work sets a new baseline for collaborative, reward-guided, inference-time compute allocation in video generative modeling (2604.17397).