- The paper introduces DreamMover, a novel framework that leverages pre-trained diffusion models to interpolate images with large motion, improving metrics like FID and LPIPS.
- The paper proposes a two-level fusion strategy and self-attention techniques to preserve high-frequency details and ensure semantic consistency during the interpolation process.
- The paper demonstrates significant improvements on InterpBench, paving the way for advanced applications in video generation and computer vision research.
DreamMover: Advancements in Image Interpolation with Large Motion
The paper "DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion" addresses the problem of generating intermediate images from a pair of inputs that exhibit large motion. It introduces DreamMover, a framework that capitalizes on pre-trained image diffusion models to keep the generated intermediate images semantically consistent with the inputs.
The authors first identify the limitations of existing methods in handling large object motion while maintaining semantic detail: such methods are typically restricted to small motion or to topologically similar objects, and otherwise produce artifacts and inconsistencies. To address these challenges, DreamMover incorporates several novel components:
- Natural Flow Estimator: This component leverages a pre-trained diffusion model to implicitly reason about semantic correspondence between the two images. By computing cosine distances between feature maps extracted from the up-blocks of the U-Net during the noise-adding process, DreamMover derives bidirectional flow maps, providing a robust way to determine pixel correspondences even in the absence of intermediate semantic information.
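The core idea of matching by feature similarity can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `cosine_flow` is a hypothetical toy that treats each spatial location of a feature map as a vector, finds its most cosine-similar location in the other feature map, and reads off the displacement as a crude flow estimate.

```python
import numpy as np

def cosine_flow(feat0, feat1):
    """Toy semantic-flow estimate: for each location in feat0, find the
    most cosine-similar location in feat1 and return the displacement.
    feat0, feat1: arrays of shape (C, H, W)."""
    C, H, W = feat0.shape
    f0 = feat0.reshape(C, -1)                        # (C, H*W)
    f1 = feat1.reshape(C, -1)
    f0 = f0 / (np.linalg.norm(f0, axis=0, keepdims=True) + 1e-8)
    f1 = f1 / (np.linalg.norm(f1, axis=0, keepdims=True) + 1e-8)
    sim = f0.T @ f1                                  # (H*W, H*W) cosine similarities
    match = sim.argmax(axis=1)                       # best match in feat1 per location
    ys, xs = np.divmod(np.arange(H * W), W)          # source coordinates
    my, mx = np.divmod(match, W)                     # matched coordinates
    return np.stack([(mx - xs).reshape(H, W),        # horizontal displacement
                     (my - ys).reshape(H, W)])       # vertical displacement
```

In the actual method the features come from the diffusion U-Net's up-blocks, which makes the matches semantic rather than purely photometric; the brute-force argmax here is only for clarity.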
- Two-Level Fusion Strategy: To preserve high-frequency details, the authors decompose the noisy latent code into high-level and low-level parts. The high-level part, which carries the overall spatial layout, undergoes time-weighted interpolation with softmax splatting in latent space. The low-level part, which carries high-frequency details, is fused with a Winner-Takes-All (WTA) scheme to minimize loss of detail.
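A simplified analogue of this split can be sketched in NumPy. The function names and the box-blur decomposition are assumptions for illustration; the paper operates on warped diffusion latents with softmax splatting, whereas this toy simply blends the smooth component with the time weight and fuses the residual winner-takes-all by magnitude.

```python
import numpy as np

def box_blur(x, k=3):
    """Simple low-pass filter: separable-free box blur with edge padding."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def two_level_fuse(z0, z1, t):
    """Toy two-level fusion: blend the smooth ("high-level") components
    with time weight t; fuse the high-frequency ("low-level") residuals
    winner-takes-all by picking, per pixel, the larger-magnitude one."""
    low0, low1 = box_blur(z0), box_blur(z1)          # smooth layout components
    hi0, hi1 = z0 - low0, z1 - low1                  # high-frequency residuals
    smooth = (1 - t) * low0 + t * low1               # time-weighted interpolation
    detail = np.where(np.abs(hi0) >= np.abs(hi1), hi0, hi1)  # WTA fusion
    return smooth + detail
```

The design intuition carries over: averaging is safe for low-frequency layout but blurs fine detail, so detail is taken from whichever source expresses it more strongly rather than averaged away.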
- Self-Attention Concatenation and Replacement: To enhance consistency between the generated and input images during denoising, DreamMover concatenates and replaces self-attention keys and values with those from the input images. This mechanism retains detailed textures and semantic features across the generated sequence. LoRA (Low-Rank Adaptation) fine-tuning is additionally employed to reinforce semantic consistency.
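The key/value concatenation idea can be shown with a minimal single-head attention sketch. This is a hedged illustration, not the paper's code: `kv_concat_attention` is a hypothetical name, and in practice this happens inside every self-attention layer of the denoising U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_concat_attention(q, k_mid, v_mid, k0, v0, k1, v1):
    """Toy self-attention with key/value concatenation: queries come from
    the intermediate frame, while keys/values from both input frames are
    appended so the generated frame can attend to their textures."""
    k = np.concatenate([k_mid, k0, k1], axis=0)      # (3N, d) extended keys
    v = np.concatenate([v_mid, v0, v1], axis=0)      # (3N, d) extended values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, 3N) attention weights
    return attn @ v
```

Because the softmax normalizes over the extended key set, each generated token mixes in content from both inputs, which is what pulls the intermediate frame's textures toward the originals.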
The authors also introduce InterpBench, a challenging benchmark designed to evaluate the semantic consistency of image interpolation methods. Extensive experiments on InterpBench demonstrate the effectiveness of DreamMover, with significant improvements over state-of-the-art methods in FID, LPIPS, warping error, and WEmid. The method was also rated favorably in user studies, highlighting its superior perceptual quality and realism.
The implications of this research are manifold:
- Practical Implications: DreamMover can greatly enhance applications in short-video generation, providing a tool for creating seamless and semantically consistent animations from pairs of images with large motion. This can be particularly beneficial for platforms like TikTok and YouTube Shorts, where high visual fidelity and engaging content are paramount.
- Theoretical Implications: The work paves the way for further exploration into leveraging pre-trained diffusion models for other challenging tasks in computer vision and graphics. By demonstrating the effectiveness of semantic flow estimation and two-level fusion, the authors provide a robust framework that can potentially be extended to other domains.
Looking forward, DreamMover opens up several avenues for future research in AI:
- Enhanced Flow Estimation: Further refining the flow estimation process could improve the capability to handle even more complex motions and detailed textures.
- Higher Resolution Applications: Exploring ways to apply diffusion models at higher resolutions could mitigate the texture-sticking issue noted by the authors.
- Broader Benchmarking: Expanding the dataset and benchmarks like InterpBench will be vital for more comprehensive evaluations of image interpolation methods.
In conclusion, DreamMover represents a significant advance in leveraging diffusion models for image interpolation with large motion. By ensuring semantic consistency and high-fidelity outputs, it offers a robust solution to a longstanding problem and promises richer applications in video generation and image editing. The methodology and results set a new reference point for researchers and practitioners tackling image interpolation in dynamic scenes.