Region-dependent effective noising levels for motion-guided video diffusion

Establish whether, in image-conditioned image-to-video diffusion models guided by a user-provided warped reference video, different regions require distinct effective noising levels during sampling. Specifically, demonstrate that user-specified masked regions should be initialized at a lower noise level to enforce strong adherence to the motion signal, while unmasked regions should be initialized at a higher noise level, which enforces the guidance more weakly and lets the background adapt naturally.

Background

The Time-to-Move paper (Singer et al., 2025) adapts SDEdit to video by using a crudely warped reference video as motion guidance. With a single noising timestep, enforcing motion everywhere either overconstrains the background (if the noise level is low) or drifts from the intended motion (if the noise level is high). To address this trade-off, the authors propose dual-clock denoising and explicitly conjecture that different regions should start from different noise levels: strong alignment in masked (motion-specified) regions and weaker alignment elsewhere.

This conjecture motivates their inference-time dual-clock blending strategy, but the claim itself is not theoretically established; confirming the necessity or effectiveness of region-dependent noise levels would provide a principled foundation for the proposed approach.
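To make the conjecture concrete, the following is a minimal sketch of region-dependent noising, not the authors' implementation. It assumes a DDPM-style variance-preserving forward process with a linear beta schedule, and the names `make_alpha_bar`, `region_dependent_noising`, `t_strong`, and `t_weak` are illustrative. Masked pixels of the warped reference are noised to the lower timestep `t_strong`, and unmasked pixels to the higher timestep `t_weak`.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal retention for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def region_dependent_noising(warped, mask, t_strong, t_weak, alpha_bar, rng):
    """Noise a warped reference video with a per-pixel timestep.

    warped:   float array of shape (frames, height, width, channels)
    mask:     bool array of shape (frames, height, width); True marks
              motion-specified regions
    t_strong: lower timestep (less noise) used inside the mask
    t_weak:   higher timestep (more noise) used outside the mask
    """
    if not 0 <= t_strong <= t_weak < len(alpha_bar):
        raise ValueError("expected 0 <= t_strong <= t_weak < number of steps")

    # Per-pixel timestep map: less noise where motion was specified.
    t_map = np.where(mask, t_strong, t_weak)

    # Look up the signal retention per pixel and broadcast over channels.
    a = alpha_bar[t_map][..., None]

    # Standard variance-preserving forward noising, applied per pixel.
    eps = rng.standard_normal(warped.shape)
    noised = np.sqrt(a) * warped + np.sqrt(1.0 - a) * eps
    return noised, t_map
```

Calling `region_dependent_noising(warped, mask, 200, 700, make_alpha_bar(), np.random.default_rng(0))` returns the noised video together with the timestep map. The sketch covers only the initialization. How the two regions are blended during the subsequent denoising steps is specified by the paper's dual-clock procedure and is not reproduced here.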

References

We therefore conjecture that different regions require different effective noising levels: masked regions demand strong adherence to the motion signal, achieved with less noising ($t_{\text{strong}}$), while unmasked regions benefit from weaker enforcement, achieved with increased noising ($t_{\text{weak}}$).

— Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising (2511.08633 - Singer et al., 9 Nov 2025) in Section 3.3 (Region-Dependent Dual-Clock Denoising)
