Region-dependent effective noising levels for motion-guided video diffusion
Establish whether, in image-conditioned image-to-video diffusion models guided by a user-provided warped reference video, different regions require distinct effective noising levels during sampling. Specifically, demonstrate that user-specified masked regions should be initialized at a lower noise level to enforce strong adherence to the motion signal, while unmasked regions should be initialized at a higher noise level, which enforces the signal more weakly and lets the background adapt naturally.
References
We therefore conjecture that different regions require different effective noising levels: masked regions demand strong adherence to the motion signal, achieved with less noising ($t_{\text{strong}}$), while unmasked regions benefit from weaker enforcement, achieved with increased noising ($t_{\text{weak}}$).
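
To make the conjecture concrete, the following is a minimal sketch of one way a region-dependent initialization could look. It is not the method of the quoted work. It assumes that initialization is done by adding Gaussian noise to a latent of the warped reference video, and that the two noising levels $t_{\text{strong}}$ and $t_{\text{weak}}$ map to scalar noise scales, here given the hypothetical names `sigma_strong` and `sigma_weak`. The function name `region_dependent_init`, the additive noising form, and the array shapes are likewise illustrative assumptions.

```python
import numpy as np


def region_dependent_init(ref_latent, mask, sigma_strong, sigma_weak, rng):
    """Noise a warped-reference latent with a per-location noise scale.

    Assumptions (not taken from the source): additive Gaussian noising,
    and scalar noise scales standing in for t_strong and t_weak.

    ref_latent: array, e.g. shape (frames, height, width)
    mask:       array broadcastable to ref_latent; values > 0.5 mark the
                user-specified regions that must follow the motion signal
    """
    if sigma_strong > sigma_weak:
        raise ValueError("expected sigma_strong <= sigma_weak")
    # Less noise inside the mask (strong adherence), more noise outside
    # (weak adherence, so the background is free to adapt).
    sigma_map = np.where(mask > 0.5, sigma_strong, sigma_weak)
    noise = rng.standard_normal(ref_latent.shape)
    return ref_latent + sigma_map * noise, sigma_map


# Usage: the top half of every frame is masked, the bottom half is not.
rng = np.random.default_rng(0)
ref = np.zeros((2, 4, 4))
mask = np.zeros((2, 4, 4))
mask[:, :2, :] = 1.0
noisy, sigma_map = region_dependent_init(ref, mask, 0.3, 0.9, rng)
```

In this sketch the masked region stays closer to the reference latent because its noise scale is smaller. How such a per-region scale would be consumed by an actual sampler depends on the model's noise schedule and is left open here.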