Overlap-and-Blend Temporal Co-Denoising
- The paper presents an overlap-and-blend approach that partitions long videos into overlapping windows and applies independent, Heun-based denoising with weighted fusion for seamless output.
- The approach is formulated in both pixel and latent spaces and uses cosine- or Hamming-window blending to ensure temporal consistency and high fidelity.
- Overlap-and-blend temporal co-denoising supports multi-text and spatial conditionings, enabling scalable video inpainting, outpainting, and robust long-range video editing.
Overlap-and-blend temporal co-denoising is a class of techniques designed to enable temporally consistent generation, editing, or inpainting of long videos with diffusion models, extending the effective length and controllability of outputs beyond the domain of monolithic short-clip models. It achieves this by partitioning the video into overlapping temporal segments ("windows"), independently denoising each window, and then fusing the partial results in overlap regions using smooth weighting schemes. The paradigm has been substantially formalized and empirically validated in both long video generation ("Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising" (Wang et al., 2023)) and long video inpainting/outpainting ("Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising" (Lyu et al., 5 Nov 2025)), where it enables seamless, scalable, and high-fidelity video synthesis over hundreds of frames, with support for multi-text or spatial conditionings.
1. Mathematical Formulation and General Principles
Overlap-and-blend temporal co-denoising operates in the latent (or pixel) space of a video modeled as a sequence of $T$ (or $F$) frames. The diffusion trajectory at time $t$ can be formalized as $X_t = (x_t^1, \dots, x_t^T)$ for the whole video, or $X_t \in \mathbb{R}^{T \times d}$, where $d$ is the per-frame latent dimension in encoded space. For inpainting/outpainting (Lyu et al., 5 Nov 2025), masked or zero-padded latent buffers are used for the target video region.
At each diffusion timestep, the system works with overlapping windows extracted from the full buffer. For window length $W$ (or clip length $L$) and overlap $O$ (stride $S = W - O$, or specified directly), the $i$-th window covers frames $[s_i,\, s_i + W - 1]$ with $s_i = 1 + (i-1)(W - O)$ and $i = 1, \dots, N_{\mathrm{win}}$. Adjacent windows necessarily overlap in $O$ (or $W - S$) frames, which is central to the recombination strategy.
Each window is independently denoised using either a pre-trained short-clip noise predictor $\epsilon_\theta$ or a fine-tuned high-capacity "score model" $f_\theta$. The reverse diffusion step may be modeled using first-order DDPM/DDIM stepping or, more effectively, with a second-order Heun (improved Euler) method for enhanced stability and quality. The generation or reconstruction for the whole video is then formulated as the solution to a least-squares blending problem or as a weighted sum in the overlap regions.
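To make the least-squares view concrete, the per-timestep blending can be written as a weighted objective whose closed-form minimizer is exactly the weighted sum derived below (a sketch in the window notation above; the operator form is illustrative rather than quoted from either paper):

$$\hat{X} = \arg\min_{X} \sum_{i} \sum_{j=1}^{W} w_j \left\| X[s_i + j - 1] - x^{(i)}[j] \right\|_2^2 \;\;\Longrightarrow\;\; \hat{X}[k] = \frac{\sum_{i,\,j:\; s_i + j - 1 = k} w_j\, x^{(i)}[j]}{\sum_{i,\,j:\; s_i + j - 1 = k} w_j}.$$

Because the objective decouples across frames and is quadratic in each $X[k]$, the per-frame weighted average is its exact minimizer.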
2. Clip Extraction, Denoising, and Reverse Diffusion
A key step is the extraction of temporally overlapping sub-clips or windows:
$$x_t^{(i)} = X_t[\,s_i : s_i + W - 1\,], \qquad s_i = 1 + (i-1)(W - O),$$
or, equivalently, clips $x_t^{(i)} = X_t[\,iS + 1 : iS + L\,]$ for $i = 0, 1, \dots$, where $S$ is the stride and $W$ (or $L$) the window length.
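A minimal sketch of this extraction in Python, assuming 0-based indexing, $W \le T$, and a full latent buffer of shape `(T, d)`; the tail-alignment convention for the last window is an assumption for illustration, not taken from either paper:

```python
def window_starts(num_frames, W, O):
    """0-based start indices of overlapping windows covering `num_frames` frames.

    Stride S = W - O; the final window is shifted back so it ends exactly at the
    last frame (an assumed convention to guarantee full coverage).
    """
    S = W - O
    starts = list(range(0, max(num_frames - W, 0) + 1, S))
    if starts[-1] + W < num_frames:      # cover any remaining tail frames
        starts.append(num_frames - W)
    return starts

def extract_windows(X, W, O):
    """Slice the full latent buffer X (shape (T, d)) into overlapping windows."""
    return [(s, X[s:s + W]) for s in window_starts(len(X), W, O)]
```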
Forward Process (Training):
$$x_t^{(i)} = \sqrt{\bar{\alpha}_t}\, x_0^{(i)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
with the standard noise schedule $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$.
Reverse Step (Inference):
Each short window is denoised independently via the reverse diffusion model:
$$x_{t-\Delta t}^{(i)} = x_t^{(i)} + \Delta t\, f_\theta\!\left(x_t^{(i)}, t\right),$$
or, for high-order solvers (Heun's method) (Lyu et al., 5 Nov 2025):
$$k_1 = f_\theta\!\left(x_t^{(i)}, t\right), \qquad x_{\mathrm{mid}} = x_t^{(i)} + \tfrac{\Delta t}{2}\, k_1, \qquad k_2 = f_\theta\!\left(x_{\mathrm{mid}},\, t - \tfrac{\Delta t}{2}\right), \qquad x_{t-\Delta t}^{(i)} = x_t^{(i)} + \Delta t\, \tfrac{k_1 + k_2}{2}.$$
Here, $f_\theta$ is the model's predicted score/noise, and $\Delta t$ the diffusion step.
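For reference, a minimal PyTorch-style sketch of one Heun window update; the drift function `f_theta` and its call signature are assumptions for illustration, not the papers' actual interfaces:

```python
import torch

def heun_window_step(x_t: torch.Tensor, t: float, dt: float, f_theta) -> torch.Tensor:
    """One second-order (Heun) reverse-diffusion step for a single window latent.

    Sketch only: `f_theta(x, t)` is assumed to return the reverse-diffusion drift
    for the window; the cited papers' model interfaces may differ.
    """
    k1 = f_theta(x_t, t)                 # slope at the current timestep
    x_mid = x_t + 0.5 * dt * k1          # Euler predictor to the midpoint
    k2 = f_theta(x_mid, t - 0.5 * dt)    # slope re-evaluated at the midpoint
    return x_t + dt * 0.5 * (k1 + k2)    # trapezoidal (Heun) corrector
```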
3. Overlap Identification and Blending Strategies
Overlap regions between adjacent windows $i$ and $i+1$ are precisely:
$$\mathcal{O}_i = \{\, s_{i+1},\; s_{i+1} + 1,\; \dots,\; s_i + W - 1 \,\},$$
containing $O = W - S$ frames.
To merge the independently denoised window outputs into a full-length frame sequence, a blending function assigns per-frame weights $w_j$, $j = 1, \dots, W$ (within the window), for aggregation:
- Linear ramp or cosine schedule over the overlap region (Wang et al., 2023)
- Hamming window of length $W$ (Lyu et al., 5 Nov 2025): $w_j = 0.54 - 0.46 \cos\!\left(\frac{2\pi (j-1)}{W-1}\right)$, $j = 1, \dots, W$
The fused latent at index $k$ is
$$X_{t-\Delta t}[k] = \frac{\sum_{i:\, s_i \le k \le s_i + W - 1} w_{k - s_i + 1}\; x_{t-\Delta t}^{(i)}[k - s_i + 1]}{\sum_{i:\, s_i \le k \le s_i + W - 1} w_{k - s_i + 1}}.$$
This construction guarantees smoothness across window boundaries and optimality in the least-squares sense.
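The fusion itself reduces to a few lines; below is a minimal NumPy sketch assuming 0-based frame indexing, per-window latents of shape `(W, d)`, and Hamming weighting (array and function names are illustrative):

```python
import numpy as np

def blend_windows(window_latents, starts, num_frames):
    """Fuse independently denoised overlapping windows into one length-T sequence.

    window_latents: list of arrays, each of shape (W, d), one per window.
    starts:         0-based first-frame index of each window.
    num_frames:     total number of frames T in the full buffer.
    """
    W, d = window_latents[0].shape
    weights = np.hamming(W)[:, None]          # smooth per-frame weights, small at the edges
    fused = np.zeros((num_frames, d))
    denom = np.zeros((num_frames, 1))
    for x_w, s in zip(window_latents, starts):
        fused[s:s + W] += weights * x_w       # weighted contribution of this window
        denom[s:s + W] += weights
    return fused / np.maximum(denom, 1e-8)    # per-frame weighted average
```

Frames covered by a single window reduce to that window's prediction, since the weights cancel in the ratio, while frames in overlap regions receive a smooth convex mix of the two neighboring windows.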
4. Algorithmic Implementation and Pseudocode
A unified pseudocode skeleton for overlap-and-blend temporal co-denoising is:
```
for n = 0 to N-1:
    t  = t_n
    Δt = t_n - t_{n+1}
    X_{t-Δt} ← 0;  denom ← 0
    for i in 1..Nwin:
        s_i = 1 + (i-1)*(W - O)
        x_t^{(i)} = X_t[s_i : s_i + W - 1]
        # Heun update:
        k1    = f(x_t^{(i)}, t)
        x_mid = x_t^{(i)} + (Δt/2)*k1
        k2    = f(x_mid, t - Δt/2)
        x_{t-Δt}^{(i)} = x_t^{(i)} + Δt*(k1 + k2)/2
        # Blend with per-frame weights:
        for j in 1..W:
            idx = s_i + j - 1
            w   = H[j]
            X_{t-Δt}[idx] += w * x_{t-Δt}^{(i)}[j]
            denom[idx]    += w
    # Normalize accumulated weights and advance to the next timestep:
    for k in 1..T:
        X_{t-Δt}[k] /= denom[k]
    X_t ← X_{t-Δt}
output_frames ← Decoder(X_0)
```
This pipeline is compatible with both pixel-space and VAE-based (latent-space) diffusion models. For multi-text conditioned generation, each window carries its own embedding, and across semantic boundaries a convex combination of embeddings is assigned per window.
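One plausible realization of the per-window convex combination weights each prompt embedding by the fraction of the window's frames that fall in that prompt's segment; the helper below is a hypothetical sketch, not an implementation from either paper:

```python
import torch

def window_embedding(prompt_embs, seg_starts, s_i, W):
    """Convex combination of prompt embeddings for the window starting at frame s_i.

    prompt_embs: list of K text embeddings (tensors of identical shape), one per prompt.
    seg_starts:  sorted 1-based start frames of prompts 2..K (prompt 1 starts at frame 1).
    Both the argument layout and the coverage-fraction weighting are illustrative assumptions.
    """
    frames = torch.arange(s_i, s_i + W, dtype=torch.float32)
    # Segment id per frame: frames at or after a start frame belong to the later prompt.
    seg_ids = torch.bucketize(frames, torch.tensor(seg_starts, dtype=torch.float32), right=True)
    weights = torch.bincount(seg_ids, minlength=len(prompt_embs)).float() / W
    return sum(w * e for w, e in zip(weights, prompt_embs))
```

A window fully inside one segment then receives that prompt's embedding exactly, while windows straddling a boundary interpolate, which is one way to obtain the smooth semantic transitions discussed in Section 6.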
5. Empirical Validations and Ablation Results
Comprehensive ablation studies demonstrate the criticality of various components:
- Heun’s Second-Order Solver vs. First-Order Euler: Second-order integration systematically improves temporal quality. For video inpainting (Lyu et al., 5 Nov 2025), replacing Euler with Heun increased PSNR from 14.78 dB to 15.74 dB (+6.5%), SSIM from 0.515 to 0.603 (+17.1%), and reduced LPIPS from 0.613 to 0.529 (−13.7%).
- Blending Schedule: Hamming window blending eliminates hard seams and "ghosting" at window boundaries, outperforming both hard (no blend) and uniform (mean) schemes.
- Window Length: With window lengths of up to roughly 100 frames and 50% overlap, the system achieves artifact-free long-range coherence without excessive memory use. Too-small windows (<30 frames) cause global drift, while too-large windows exceed GPU capacity.
- Scalability: The sliding-window/overlap design lets the system support arbitrarily long videos, as demonstrated by successful tests on hundreds of frames on a single 80 GB H100 GPU. Competing baselines (e.g., VACE, Alibaba Wan 2.1) cap out at 81–245 frames before hitting out-of-memory errors.
- Editing Consistency: The overlap strategy enables precise editing (e.g., object addition or removal over hundreds of frames) without visible seams or drift, as required for high-fidelity video inpainting and outpainting.
6. Applications, Conditioning, and Extensions
Overlap-and-blend temporal co-denoising supports:
- Multi-condition generation: Assigning unique text, semantic, or spatial conditionings per window (or per clip) enables compositional generation and fine-grained control.
- Semantic transitions: Convex linear interpolation of conditioning embeddings across windows produces perceptually smooth transitions for scene changes or prompt switching (Wang et al., 2023).
- Bidirectional temporal attention: For further temporal consistency, within-window denoising may use attention mechanisms anchored in window centers to propagate context forward and backward (ensuring matching content on both ends of overlaps).
- Practical long-range video editing: Robust spatially controllable inpainting and outpainting at scale.
A plausible implication is that, by reusing pretrained short-clip models together with a task-agnostic blending strategy, this paradigm obviates the need to retrain bespoke long-video models, significantly improving efficiency and flexibility.
7. Limitations and Future Directions
The memory and compute efficiency of overlap-and-blend depends on judicious window sizing and overlap, as excessive overlap increases computational redundancy. While the Hamming window blend is highly effective, further refinements (e.g., window-shape adaptation or cross-window residual updates) may reduce residual artifacts. Adoption of solvers of higher order than Heun remains an open area, although diminishing returns have been observed empirically (Lyu et al., 5 Nov 2025). Integration with advanced text or multimodal conditionings may drive further advances in controllability and sample diversity.
In summary, overlap-and-blend temporal co-denoising constitutes a rigorously formalized, empirically validated approach to scalable, high-fidelity, and consistent long video synthesis and editing, directly leveraging short-clip diffusion models and principled multi-window optimization (Wang et al., 2023, Lyu et al., 5 Nov 2025).