
EMA-Sink: Streaming Video Diffusion

Updated 5 December 2025
  • EMA-Sink is a mechanism that uses an exponential moving average fusion to maintain long-term context and adapt to recent video dynamics.
  • It integrates with the Reward Forcing framework to enable efficient, real-time streaming generation while preventing static frame copying and reducing temporal drift.
  • Empirical studies demonstrate significant improvements in dynamic scores, smoothness, and drift control at 23.1 FPS without additional computational cost.

EMA-Sink is a mechanism designed to enable video diffusion models to perform efficient streaming generation while maintaining long-term temporal fidelity and robust motion dynamics. It provides a method for preserving global context in models restricted to local sliding-window attention, addresses the issue of excessive dependence on initial frames, and is central to the Reward Forcing framework for video generation (Lu et al., 4 Dec 2025).

1. Motivation and Problem Formulation

In streaming video generation, autoregressive models with sliding-window attention tend to suffer from two principal issues: (a) loss of distant context as past key-value (KV) pairs are evicted from the cache, and (b) over-reliance on static initial frames, which, if used as fixed "sink" tokens, induce frame copying and undermine motion dynamics. EMA-Sink was developed to address these concerns by providing a fixed-size, constantly updated token that summarizes evicted context via an Exponential Moving Average (EMA), thereby maintaining long-range coherence and adapting to recent dynamics at no additional computational cost (Lu et al., 4 Dec 2025).

2. Architectural Design and Mathematical Definition

Let $i$ denote the current frame index, $w$ the local window size, and $(K^j, V^j)$ the key and value tensors associated with frame $j$. The design maintains:

  • Local Cache: Stores the KV pairs of the recent frames inside the window of size $w$: $\{K^{i-w+1},\dots,K^{i-1}\}$, $\{V^{i-w+1},\dots,V^{i-1}\}$.
  • Global Sink State: Maintains fixed-shape sink tokens $S_K^{i-1}$, $S_V^{i-1}$.

Each time $(K^{i-w}, V^{i-w})$ is evicted, it is fused into the sink using EMA:

$$S_K^i = \alpha S_K^{i-1} + (1-\alpha) K^{i-w}$$

$$S_V^i = \alpha S_V^{i-1} + (1-\alpha) V^{i-w}$$

where $\alpha \in (0,1)$ is the decay factor. At generation step $i$, attention layers use the concatenated sink and cached KV pairs $([S_K^i; K^{i-w+1:i-1}], [S_V^i; V^{i-w+1:i-1}])$, enabling the model to access both the most recent frames and a compressed history of all earlier frames.
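The update and the concatenated attention context can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `ema_sink_update` and `attend_with_sink`, the single-head attention, and the assumed tensor layout (tokens per frame × feature dimension, with the sink shaped like one frame's keys or values) are all assumptions made for the example.

```python
import numpy as np

def ema_sink_update(sink_k, sink_v, evicted_k, evicted_v, alpha=0.99):
    """Fuse an evicted frame's K/V into the sink: S^i = a*S^{i-1} + (1-a)*K^{i-w}."""
    new_k = alpha * sink_k + (1.0 - alpha) * evicted_k
    new_v = alpha * sink_v + (1.0 - alpha) * evicted_v
    return new_k, new_v

def attend_with_sink(query, sink_k, sink_v, cache_k, cache_v):
    """Single-head scaled dot-product attention over [sink; local cache].

    query:      (n_q, d) queries of the frame being generated
    sink_k/v:   (n_tok, d) sink tensors (assumed layout)
    cache_k/v:  lists of (n_tok, d) tensors for the cached frames
    """
    keys = np.concatenate([sink_k, *cache_k], axis=0)
    values = np.concatenate([sink_v, *cache_v], axis=0)
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Toy usage with random tensors (all sizes are arbitrary).
rng = np.random.default_rng(0)
n_tok, d = 4, 8
sink_k, sink_v = rng.normal(size=(n_tok, d)), rng.normal(size=(n_tok, d))
evicted_k, evicted_v = rng.normal(size=(n_tok, d)), rng.normal(size=(n_tok, d))
sink_k, sink_v = ema_sink_update(sink_k, sink_v, evicted_k, evicted_v)

cache_k = [rng.normal(size=(n_tok, d)) for _ in range(3)]
cache_v = [rng.normal(size=(n_tok, d)) for _ in range(3)]
out = attend_with_sink(rng.normal(size=(n_tok, d)), sink_k, sink_v, cache_k, cache_v)
print(out.shape)  # (4, 8)
```

Because the sink has the same shape as a single frame's KV tensors, the attention context grows by only one frame's worth of tokens regardless of how many frames have been evicted.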

3. Role in Reward Forcing Video Generation

EMA-Sink is a foundational component of the Reward Forcing framework (Lu et al., 4 Dec 2025), which aims to distill a slow, bidirectional teacher (diffusion) model into a fast, causal student. Within this framework, EMA-Sink contributes:

  1. Global coherence: The sink token ensures persistent access to distant context without growing memory or compute.
  2. Dynamic adaptation: The EMA continuously incorporates new evicted frames, providing a fading but dynamically updated summary.
  3. Motion preservation: Prevents static copying of initial conditions by reducing over-weighting of early frames.
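The fading-summary behaviour in the list above can be made explicit by unrolling the EMA recursion. The starting index $i_0$ and the initial state $S_K^{i_0}$ are notation introduced here for illustration; the source does not specify how the sink is initialized.

$$S_K^{i} = \alpha^{\,i-i_0}\, S_K^{i_0} + (1-\alpha) \sum_{j=i_0+1}^{i} \alpha^{\,i-j}\, K^{j-w}$$

The coefficients sum to one, since $\alpha^{\,i-i_0} + (1-\alpha)\sum_{j=i_0+1}^{i}\alpha^{\,i-j} = \alpha^{\,i-i_0} + \left(1-\alpha^{\,i-i_0}\right) = 1$. The contribution of the initial state therefore decays geometrically as $\alpha^{\,i-i_0}$, whereas a static sink would keep the initial frames at full weight indefinitely. The same expansion holds for $S_V^{i}$ with $V^{j-w}$ in place of $K^{j-w}$.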

The framework further integrates Rewarded Distribution Matching Distillation (Re-DMD), which biases the student's output distribution toward high-reward, high-motion regions.

4. Ablation Studies and Empirical Impact

Ablation experiments show the indispensability of EMA-Sink for high-fidelity streaming generation (Lu et al., 4 Dec 2025):

  • Removing EMA (using a static sink) causes dynamic scores to drop sharply, worsens frame smoothness and drift, and increases copying of the initial frames.
  • Removing the Sink (pure window attention) results in poor quality and severe drift.
  • Varying $\alpha$: High values (e.g., $0.99$) optimize smoothness and keep drift low (~2.52), while lower $\alpha$ increases drift and motion instability.
  • The combination of EMA-Sink and Re-DMD yields state-of-the-art results on standard video benchmarks, e.g., 23.1 FPS, VBench scores (Total 84.13, Dynamic 66.95), outperforming prior models on long-horizon video tasks.
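
| Variant | Dynamic Score | Drift | Smoothness |
| --- | --- | --- | --- |
| EMA-Sink + Re-DMD | 64.06 | 2.51 | 98.96 |
| Static Sink | 35.15 | ↑ | ↓ |
| Sliding Window only | ↓ | 5.08 | ↓ |

Where no value is reported, arrows mark the direction of change relative to the full model.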

Removing EMA-Sink reliably yields substantial degradation in dynamic content and temporal consistency.

5. Computational and Practical Characteristics

EMA-Sink provides the following computational benefits:

  • Fixed Memory Cost: Sink tokens do not grow as the video length increases.
  • Negligible Additional Computation: Each update requires only one weighted sum each for the key and value sinks.
  • Streaming Suitability: Enables autoregressive generation of arbitrarily long sequences in real time, as demonstrated at 23.1 FPS on H100 hardware (Lu et al., 4 Dec 2025).
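The fixed memory footprint can be checked with a small streaming loop. The sketch below is illustrative only: the class `EmaSinkCache`, its method names, and the choice to seed the sink with the first evicted frame are assumptions, since the source does not describe initialization or any concrete API.

```python
from collections import deque
import numpy as np

class EmaSinkCache:
    """Hypothetical sliding-window KV cache with an EMA sink (illustration only)."""

    def __init__(self, window, alpha=0.99):
        self.window = window
        self.alpha = alpha
        self.local = deque()  # (K, V) pairs of at most window - 1 past frames
        self.sink = None      # (S_K, S_V), same shape as one frame's K/V

    def append(self, k, v):
        """Add a finished frame; fuse whatever falls out of the window into the sink."""
        self.local.append((k, v))
        if len(self.local) > self.window - 1:
            old_k, old_v = self.local.popleft()
            if self.sink is None:
                # Assumption: seed the sink with the first evicted frame.
                self.sink = (old_k.copy(), old_v.copy())
            else:
                a = self.alpha
                s_k, s_v = self.sink
                self.sink = (a * s_k + (1 - a) * old_k, a * s_v + (1 - a) * old_v)

    def context(self):
        """Return [S_K; cached K] and [S_V; cached V]; call after at least one append."""
        ks = [k for k, _ in self.local]
        vs = [v for _, v in self.local]
        if self.sink is not None:
            ks.insert(0, self.sink[0])
            vs.insert(0, self.sink[1])
        return np.concatenate(ks, axis=0), np.concatenate(vs, axis=0)

# Stream 1000 toy frames; the context size stops growing once the window is full.
rng = np.random.default_rng(0)
cache = EmaSinkCache(window=5)
for _ in range(1000):
    cache.append(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
keys, values = cache.context()
print(keys.shape)  # (20, 8): one sink frame plus four cached frames of 4 tokens each
```

With this layout the attention context holds one sink frame plus the cached frames, so its size is independent of how many frames have been generated.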

For practical model construction, the only new hyperparameter is the EMA decay $\alpha$, which can be tuned to balance long-term memory versus reactivity to recent frames.
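To get a feel for this trade-off, the standard properties of an exponential moving average can be computed directly. The helper names below are hypothetical, and the quantities are generic EMA rules of thumb rather than figures reported in the paper: a frame fused `age` updates ago carries weight $(1-\alpha)\alpha^{\text{age}}$, the half-life is $\ln 0.5 / \ln \alpha$ updates, and $1/(1-\alpha)$ serves as a rough time constant.

```python
import math

def ema_weight(alpha, age):
    """Weight in the sink of a frame fused `age` updates ago (age 0 = most recent)."""
    return (1.0 - alpha) * alpha ** age

def half_life(alpha):
    """Number of updates after which a fused frame's weight has halved."""
    return math.log(0.5) / math.log(alpha)

def time_constant(alpha):
    """Rule-of-thumb memory length of an EMA, in updates."""
    return 1.0 / (1.0 - alpha)

for alpha in (0.9, 0.99):
    print(f"alpha={alpha}: half-life ~ {half_life(alpha):.1f} updates, "
          f"time constant ~ {time_constant(alpha):.0f} updates")
```

Under these rules of thumb, $\alpha = 0.99$ gives a half-life of roughly 69 evicted frames, while $\alpha = 0.9$ gives roughly 6.6, so a larger $\alpha$ retains older context longer at the price of slower adaptation.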

6. Generalization and Research Relevance

In principle, EMA-Sink is not specific to video: it applies to any sequence generation setting where local memory must be augmented without incurring the cost of global context. A plausible implication is that similar EMA fusion mechanisms may enhance transformer-based models operating on long sequences with local attention constraints, including in text, audio, and multimodal settings. Its empirical success within Reward Forcing suggests broader utility for context compression and temporal conditioning in high-throughput generative pipelines.

The mechanism's originality lies in its simultaneous guarantee of persistent, smoothly fading global context and strictly bounded memory usage, distinguishing it from alternatives such as full memory replay or static sink tokens. Its deployment in conjunction with reward-driven distillation highlights new directions in targeted sequence modeling, where selective preservation of dynamic events enables more realistic and coherent generation.
