Interleaved-Frame Smoother

Updated 25 November 2025

Interleaved-frame smoothing alternates processing between subsets of frames to reduce flicker and ghosting artifacts.
The approach employs iterative, local interpolation and deformable alignment techniques to enforce temporal consistency with minimal computational overhead.
Key applications include improving diffusion-based video synthesis, frame interpolation, and deinterlacing in both generative and restoration pipelines.

Interleaved-frame smoothers are a family of techniques that target temporal inconsistencies across sequences of video frames or temporally sampled features, such as structural flicker and ghosting, by applying blending, alignment, or interpolation in an interleaved, usually alternating, pattern. Such modules appear in recent video generative models, frame-interpolation pipelines, and deinterlacing networks. The canonical pattern is iterative or multi-path smoothing in which only a subset of frames or spatial tokens is processed on each pass, so that temporal context propagates across the sequence and multi-frame consistency is enforced at low computational cost (Zhang et al., 2023, Zhao et al., 2021, Zhang et al., 2 Jul 2024).

1. Motivation and Context

Structural flicker and temporal inconsistency remain persistent challenges for diffusion-based video synthesis and video enhancement, even when appearance coherence is encouraged through cross-frame self-attention or explicit motion conditioning. Purely frame-wise denoising or restoration lets stochastic predictions drift from frame to frame, producing visible jitter or ghosting. Classical remedies that rely on heavy retraining, spatio-temporal regularization, or dense multi-frame losses are often computationally prohibitive at inference time or correct at too coarse a granularity. The interleaved-frame smoother is a lightweight, training-free or plug-in alternative: it applies off-the-shelf interpolation, deformable alignment, or sequence modeling in a systematic interleaving pattern for near-real-time temporal refinement (Zhang et al., 2023, Zhao et al., 2021, Zhang et al., 2 Jul 2024).

2. Algorithmic Structure and Formalism

Concrete realizations differ by domain but share a recurring high-level structure:

Stepwise Interleaving: At each designated smoothing step or module, select a subset of frames or spatial tokens, alternating between even and odd-indexed positions across iterations, so all triplets (or pairs) are covered in interleaved succession.
Local Smoothing/Interpolation: For each selected index $i$ in the current interleaving set $P$, compute an "interpolated" frame or feature (e.g., via an explicit interpolation network, deformable alignment, or sequence model).
Partial Replacement: Replace only the frames/features indexed by $P$, leaving the others untouched; in the subsequent step, switch to the complementary set.
Iterative Update: Continue for a small number of steps, ensuring no frame escapes smoothing for more than one iteration.
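The four steps above can be sketched generically in NumPy. This is a minimal illustration, not any cited paper's implementation: the function names `interleaved_smooth` and `average_neighbors` are hypothetical, and a plain average of the two neighbouring frames stands in for a learned interpolation or alignment module.

```python
import numpy as np

def average_neighbors(prev_frame, next_frame):
    """Stand-in interpolator (assumption): plain average of the two neighbours."""
    return 0.5 * (prev_frame + next_frame)

def interleaved_smooth(frames, num_steps=2, interpolate=average_neighbors):
    """Alternately replace even- and odd-indexed interior frames."""
    frames = np.asarray(frames, dtype=np.float64).copy()
    n = len(frames)
    for step in range(num_steps):
        parity = step % 2  # even positions first, then odd positions
        selected = [i for i in range(1, n - 1) if i % 2 == parity]
        # Interpolate from the current sequence, then replace only the selected frames.
        updates = {i: interpolate(frames[i - 1], frames[i + 1]) for i in selected}
        for i, value in updates.items():
            frames[i] = value
    return frames

# A single flickering frame (value 10) in an otherwise constant sequence.
flicker = np.array([0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0])
smoothed = interleaved_smooth(flicker)
```

With the averaging stand-in, the largest frame-to-frame jump in this toy sequence drops from 10 to 2.5 after two steps, while the first and last frames are never modified.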

Example: ControlVideo (Zhang et al., 2023)

Given $N$ frames and a sampling step $t$ with noisy latent $z_t$, define the "clean" prediction:

$\hat{z}_t = \frac{z_t - \sqrt{1-\alpha_t} \epsilon_\theta(z_t, t, c, \tau)}{\sqrt{\alpha_t}}$

Decode: $x_t = D(\hat{z}_t)$, yielding frames $\{x_t^0,\ldots, x_t^{N-1}\}$.

Define the interleaving set $P$ by the parity of $t$:

If $t$ even: $P = \{i \mid i\ \text{mod}\ 2 = 0,\, 1 \leq i \leq N-2\}$
If $t$ odd: $P = \{i \mid i\ \text{mod}\ 2 = 1,\, 1 \leq i \leq N-2\}$
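The parity rule can be written as a small helper to check coverage. The name `interleave_indices` is hypothetical; the sketch only encodes the two set definitions above and confirms that two consecutive timesteps together touch every interior frame.

```python
def interleave_indices(t, num_frames):
    """Interior frame indices whose parity matches that of timestep t."""
    parity = t % 2
    return [i for i in range(1, num_frames - 1) if i % 2 == parity]

# For a 15-frame clip, timesteps 30 and 31 select complementary sets.
even_set = interleave_indices(30, 15)   # [2, 4, 6, 8, 10, 12]
odd_set = interleave_indices(31, 15)    # [1, 3, 5, 7, 9, 11, 13]
covered = sorted(even_set + odd_set)    # every interior index 1..13
```

Frames $0$ and $N-1$ never appear in $P$ because they lack one of the two neighbours the interpolator needs.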

For $i \in P$,

$\tilde{x}_t^i = F(x_t^{i-1}, x_t^{i+1})$

using a pretrained interpolation network $F$ (e.g., RIFE), then re-encode the sequence with the frames in $P$ replaced: $\tilde{z}_t = E(\tilde{x}_t)$.

Backward update: $z_{t-1} = \sqrt{\alpha_{t-1}} \tilde{z}_t + \sqrt{1-\alpha_{t-1}} \epsilon_\theta(z_t, t, c, \tau)$

Because the even and odd index sets alternate, the overlapping triplets $(i-1, i, i+1)$ cover every interior frame within two consecutive smoothing steps, so temporal information propagates in both directions along the sequence.
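The equations of this example can be traced numerically with the following sketch. It makes several assumptions: `alpha_t` and `alpha_prev` are treated as the cumulative noise-schedule coefficients of DDIM notation, the noise prediction `eps` is passed in precomputed rather than produced by a network, and `decode`, `encode`, and `interpolate` are caller-supplied stand-ins for $D$, $E$, and $F$. All function names are hypothetical.

```python
import numpy as np

def predict_clean(z_t, eps, alpha_t):
    """Clean-latent prediction: (z_t - sqrt(1 - a_t) * eps) / sqrt(a_t)."""
    return (z_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)

def ddim_backward(z_clean, eps, alpha_prev):
    """Backward update: sqrt(a_{t-1}) * z_clean + sqrt(1 - a_{t-1}) * eps."""
    return np.sqrt(alpha_prev) * z_clean + np.sqrt(1.0 - alpha_prev) * eps

def smoothed_ddim_step(z_t, eps, alpha_t, alpha_prev, t, decode, encode, interpolate):
    """One sampling step with interleaved smoothing applied to the decoded frames."""
    z_hat = predict_clean(z_t, eps, alpha_t)
    x = decode(z_hat)                       # frames x_t^0 .. x_t^{N-1}
    x_new = x.copy()
    for i in range(1, len(x) - 1):
        if i % 2 == t % 2:                  # index set P for this timestep
            x_new[i] = interpolate(x[i - 1], x[i + 1])
    z_tilde = encode(x_new)                 # re-encode the smoothed sequence
    return ddim_backward(z_tilde, eps, alpha_prev)

# Toy run: identity encoder/decoder and neighbour averaging as stand-ins.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 2, 2))
eps = rng.normal(size=(5, 2, 2))
identity = lambda a: a
average = lambda a, b: 0.5 * (a + b)
z_prev = smoothed_ddim_step(z, eps, 0.5, 0.6, 30, identity, identity, average)
```

In this toy run only frame 2 differs from the unsmoothed DDIM update, since $t=30$ is even and the interior indices of a 5-frame sequence are 1, 2, and 3.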

3. Variants and Architectures

3.1 Diffusion Model Smoothing

In training-free text-to-video generation frameworks such as ControlVideo (Zhang et al., 2023), the interleaved-frame smoother is inserted between steps of the denoising process. It decodes the current clean sequence, interpolates every other frame using pretrained RIFE (a lightweight CNN that estimates intermediate optical flow for frame synthesis), re-encodes, and continues DDIM-based sampling. Applying it at only two timesteps (e.g., $t=30,31$ out of 50) smooths the entire sequence with limited extra overhead (≈30 s per smoothing step for a 15-frame, 512×512 video).

3.2 Deinterlacing and Artifact Removal

For historical interlaced video restoration, multi-frame architectures resembling interleaved-frame smoothing combine vertical field splitting, temporal alignment through deformable convolutions, and residual refinement (Zhao et al., 2021). Here, spatial vertical interpolation reconstructs missing scan lines (undoing interlacing), temporal alignment fuses features from adjacent time points, and the final reconstruction module outputs temporally consistent, flicker-free progressive frames. The process can be viewed as a joint, interleaved spatial-temporal smoother that exploits redundancy over triplets or longer sequences.
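The field-splitting and vertical-interpolation stages can be illustrated with a small NumPy sketch. It is not the MFDIN network: averaging the lines above and below stands in for the learned vertical upsampling, the deformable temporal alignment is omitted entirely, and the names `split_fields` and `upsample_field` are hypothetical.

```python
import numpy as np

def split_fields(interlaced):
    """Split an (H, W) interlaced frame into its top (even rows) and bottom (odd rows) fields."""
    return interlaced[0::2], interlaced[1::2]

def upsample_field(field, parity, height):
    """Rebuild a full-height frame from one field.

    Known lines go to rows parity, parity + 2, ...; each missing row is the
    average of its vertical neighbours (a stand-in for a learned upsampler).
    """
    width = field.shape[1]
    frame = np.zeros((height, width), dtype=np.float64)
    frame[parity::2] = field
    for row in range(1 - parity, height, 2):
        above = frame[row - 1] if row - 1 >= 0 else frame[row + 1]
        below = frame[row + 1] if row + 1 < height else frame[row - 1]
        frame[row] = 0.5 * (above + below)
    return frame

# Vertical ramp test image: pixel value equals its row index.
ramp = np.repeat(np.arange(6, dtype=float)[:, None], 4, axis=1)
top, bottom = split_fields(ramp)
from_top = upsample_field(top, parity=0, height=6)
from_bottom = upsample_field(bottom, parity=1, height=6)
```

In real interlaced footage the two fields are captured at different instants, which is why a multi-frame method follows this spatial step with temporal alignment rather than simply weaving the fields back together.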

3.3 State-Space Interleaving in VFI

Recent video frame interpolation networks such as VFIMamba (Zhang et al., 2 Jul 2024) employ interleaved token rearrangement in feature space. For two frames $F_0, F_1 \in \mathbb{R}^{H \times W \times C}$, features are rearranged to interleave spatial tokens, creating a "super-image" of $H \times 2W$, which is then processed by multi-directional Selective State Space Models (S6). The outputs of four scan directions (→, ←, ↓, ↑), each with its own S6 recurrence, are summed to propagate information spatially. This design preserves linear complexity while retaining global temporal context, and the interleaving mirrors the principle of distributing smoothing responsibilities efficiently across the entire spatial-temporal domain.
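A minimal sketch of the rearrangement follows. The exact token layout used by VFIMamba is not specified here, so alternating columns is an assumption chosen to match the stated $H \times 2W$ shape. The function names are hypothetical, and the S6 recurrence itself is omitted: the sketch only builds the four token orderings that such a recurrence would scan.

```python
import numpy as np

def interleave_tokens(f0, f1):
    """Interleave two (H, W, C) feature maps column-wise into one (H, 2W, C) grid."""
    h, w, c = f0.shape
    out = np.empty((h, 2 * w, c), dtype=f0.dtype)
    out[:, 0::2] = f0   # even columns hold tokens of frame 0
    out[:, 1::2] = f1   # odd columns hold tokens of frame 1
    return out

def four_direction_sequences(grid):
    """Flatten an (H, W', C) grid into four token sequences, one per scan direction."""
    h, w, c = grid.shape
    row_major = grid.reshape(h * w, c)
    col_major = grid.transpose(1, 0, 2).reshape(h * w, c)
    return {
        "right": row_major,
        "left": row_major[::-1],
        "down": col_major,
        "up": col_major[::-1],
    }

# Label frame-0 tokens with 0 and frame-1 tokens with 1 to make the layout visible.
f0 = np.zeros((2, 3, 1))
f1 = np.ones((2, 3, 1))
grid = interleave_tokens(f0, f1)
sequences = four_direction_sequences(grid)
```

Under this layout, consecutive tokens in the horizontal scans alternate between the two frames, so a sequence model sees matching spatial positions of both frames next to each other.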

4. Quantitative Impact and Empirical Results

The cited works report improvements in both temporal consistency and perceptual quality metrics:

ControlVideo (Zhang et al., 2023): On 125 prompt-structure pairs at 512×512 resolution (15 frames), frame consistency improved from 95.36% (cross-frame only) to 96.83% (+1.47 pt) and sampling time increased marginally from ≈3.0 to ≈3.5 minutes. Prompt CLIP similarity remained nearly unchanged (30.76 vs. 30.79).
MFDIN for interlaced video (Zhao et al., 2021): Achieves 32.76 dB PSNR / 0.9132 SSIM on YOUKU-2K synthetic sets, outperforming baselines (e.g., DIN at 32.01 dB / 0.9071). Human Single-Stimulus Impairment Scale (SSIS) scores on real-world videos average ≈4.2, significantly above prior methods.
VFIMamba (Zhang et al., 2 Jul 2024): On the X-TEST dataset, interleaved S6-based smoothing contributes to a +2.03 dB gain at 4K over RIFE-style convolutional interpolation, with similar margins over the previous best results on 2K and hard subsets, while maintaining linear computational complexity.

5. Practical Implementation Considerations

The interleaved-frame smoother is characterized by minimal invasiveness, test-time deployability, and independence from the core generative or restoration model's training regime. Notable implementation elements include:

Plug-in use of pretrained interpolation networks (e.g., RIFE, ~6 MB).
Use of the same VAE encoder/decoder as the generative model (e.g., Stable Diffusion's VAE).
Smoother invoked only on select denoising timesteps for cost-effectiveness.
For patch-based interleaving (VFIMamba), channel- and direction-aware rearrangement is performed over token grids, with four-directional modeling realized via SSM blocks.
Memory and compute requirements scale linearly with frame count and spatial dimensions for SSM-based variants; pixel-space interpolation variants add only the extra decode, interpolate, and re-encode cycles.

Comparison Table: Core Elements Across Domains

Framework	Interleaved Units	Smoothing Operation
ControlVideo	Pixel frames	RIFE interpolation on alternated frames
MFDIN	Fields/frames	Deformable alignment, vertical upsample
VFIMamba	Feature tokens	S6 multi-directional scanning of grid

6. Broader Applications and Extensions

Interleaved-frame smoothing has expanded from its origin in flicker removal for generative diffusion models to broader video restoration, frame-rate upconversion, and artifact suppression tasks. In multi-frame deinterlacing (Zhao et al., 2021), the method not only mitigates flicker but also supports joint super-resolution, deblocking, and temporal scaling, outperforming pipelines composed of separate, single-purpose networks. In advanced VFI (VFIMamba (Zhang et al., 2 Jul 2024)), interleaved mixed modeling fuses global context and fine-scale motion while preserving computational tractability, suggesting that interleaving is compatible with both CNN and SSM-based architectures.

A plausible implication is that further development of interleaved smoothing schemes, potentially at multiple feature levels and modalities, will improve the quality and efficiency of future video generation, restoration, and understanding pipelines, especially where temporally global yet cost-efficient context propagation is required.