- The paper introduces a self-supervised approach that decouples appearance and motion via a first order Taylor expansion using learned keypoints.
- The model combines learned keypoints with local affine transformations and an occlusion-aware generator to realistically animate complex motions.
- Experiments on benchmarks including Tai-Chi-HD and VoxCeleb show significant improvements over prior methods across key evaluation metrics.
First Order Motion Model for Image Animation
The paper "First Order Motion Model for Image Animation" by Aliaksandr Siarohin et al. addresses the challenge of generating video sequences from static images by animating objects according to motion patterns derived from driving videos. This complex task has diverse applications including movie production, photography, and e-commerce.
Methodology
The proposed framework relies on neither annotated data nor prior knowledge about the object to be animated: once trained on videos of a given object category, such as faces or human bodies, it can animate any object of that category. The authors introduce a novel self-supervised learning approach that decouples appearance and motion information by using a first-order Taylor expansion of the motion field. Motion is modeled through learned keypoints and their local affine transformations. A generator network then combines the appearance extracted from the source image with the motion derived from the driving video, while also handling the occlusions that arise during the target motion.
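At the core of the method is an approximation of the warp from the driving frame D back to the source frame S in the neighborhood of each learned keypoint. The notation below is a loose sketch that only approximates the paper's formulation: R denotes an abstract reference frame, p_k the k-th learned keypoint, and T_{X←R} the mapping from reference coordinates to the coordinates of frame X.

```latex
\[
  \mathcal{T}_{S \leftarrow D}(z) \;\approx\;
  \mathcal{T}_{S \leftarrow R}(p_k)
  + J_k \bigl( z - \mathcal{T}_{D \leftarrow R}(p_k) \bigr),
  \qquad
  J_k =
  \left( \frac{d}{dp}\, \mathcal{T}_{S \leftarrow R}(p) \Big|_{p = p_k} \right)
  \left( \frac{d}{dp}\, \mathcal{T}_{D \leftarrow R}(p) \Big|_{p = p_k} \right)^{-1}.
\]
```

In words: the keypoint positions supply the zeroth-order term of the warp, while the 2x2 Jacobians J_k supply the local affine part around each keypoint; these per-keypoint approximations are then merged into the dense motion used by the generator.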
Technical Details
The key technical contributions of the paper include:
- First Order Motion Representation: Utilizing a self-learned set of keypoints and local affine transformations to model complex motions. This approach effectively enriches the description of the object's motion, leading to higher quality animation.
- Occlusion-aware Generator: The generator network uses a predicted occlusion mask indicating which parts of the target image cannot be obtained by warping the source and must instead be inferred from context. This mechanism is essential for generating realistic animations, especially under large motions (see the sketch after this list).
- Equivariance Loss Extension: The equivariance loss commonly used for unsupervised keypoint detection is extended to also constrain the estimated local affine transformations, ensuring that the learned keypoints and Jacobians remain consistent under known geometric deformations (the constraint is also sketched after this list).
- High Resolution Dataset: The introduction of Tai-Chi-HD, a high-resolution dataset proposed as a new benchmark for image animation frameworks.
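To make the occlusion-aware generation step concrete, here is a minimal PyTorch-style sketch. The function and argument names are illustrative, and the assumption that the dense flow and occlusion mask are applied to encoder features of the source image (which are then decoded into the output frame) is ours, not a transcription of the authors' implementation.

```python
import torch.nn.functional as F

def occlusion_aware_generate(source_feats, dense_flow, occlusion_mask, decoder):
    """Illustrative occlusion-aware generation step (not the authors' exact code).

    source_feats   : (B, C, H, W) encoder features of the source image
    dense_flow     : (B, H, W, 2) backward warp grid with coordinates in [-1, 1]
    occlusion_mask : (B, 1, H, W) values in [0, 1]; ~0 where the target cannot
                     be obtained by warping the source and must be inpainted
    decoder        : network mapping masked features to the output frame
    """
    # Warp the source features according to the estimated dense motion.
    warped = F.grid_sample(source_feats, dense_flow, align_corners=True)

    # Down-weight occluded regions so the decoder inpaints them from context.
    masked = warped * occlusion_mask

    return decoder(masked)
```

Multiplying by the mask suppresses feature locations that have no valid correspondence in the source, so the decoder learns to hallucinate those regions from the surrounding context.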
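The extended equivariance constraint can likewise be summarized informally. Writing T for a known random geometric deformation (in practice a thin-plate-spline warp), T(X) for the deformed image, and p_k(·), J_k(·) for the keypoints and local Jacobians predicted on an image, the detector is encouraged to satisfy the following (a loose sketch, not the paper's exact loss terms):

```latex
\[
  p_k\bigl(T(X)\bigr) \;\approx\; T\bigl(p_k(X)\bigr),
  \qquad
  J_k\bigl(T(X)\bigr) \;\approx\;
  \left. \frac{dT}{dp} \right|_{p = p_k(X)} J_k(X).
\]
```

The first constraint is standard in unsupervised landmark discovery; constraining the Jacobians as well is the extension proposed in the paper, and both can be enforced with simple penalties during training.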
Experimental Results
The framework was evaluated on multiple benchmark datasets, including VoxCeleb, Tai-Chi-HD, Fashion-Videos, and MGif. The proposed method demonstrated significant improvements over state-of-the-art techniques across several evaluation metrics. Specifically, the authors report the following (the metrics themselves are sketched after the list):
- Tai-Chi-HD Dataset: On this dataset, the proposed method achieved an average L1 distance of 0.063, compared to 0.080 by X2Face and 0.077 by Monkey-Net. Additionally, the Average Keypoint Distance (AKD) and Missing Keypoint Rate (MKR) were significantly lower.
- VoxCeleb Dataset: The method achieved an Average Euclidean Distance (AED) of 0.140, outperforming both X2Face and Monkey-Net.
- User Study: A paired user study showed a clear preference for the proposed method, with 92.0% of participants favoring it over X2Face and 80.6% over Monkey-Net on the Tai-Chi-HD dataset.
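For reference, the quoted metrics follow standard definitions; the sketch below is generic (frames as float arrays, keypoints from an external, off-the-shelf landmark detector) and is not the authors' evaluation code.

```python
import numpy as np

def l1_distance(generated, ground_truth):
    """Mean absolute pixel difference between generated and ground-truth frames."""
    return np.abs(generated - ground_truth).mean()

def average_keypoint_distance(kp_generated, kp_ground_truth):
    """AKD: mean Euclidean distance between corresponding landmarks detected by
    an external detector on generated vs. ground-truth frames."""
    return np.linalg.norm(kp_generated - kp_ground_truth, axis=-1).mean()

def missing_keypoint_rate(found_in_generated, found_in_ground_truth):
    """MKR: fraction of landmarks detected in the ground truth but missed in
    the generated frames (boolean arrays of per-keypoint detection flags)."""
    missed = found_in_ground_truth & ~found_in_generated
    return missed.sum() / max(found_in_ground_truth.sum(), 1)
```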
Implications and Future Work
The implications of this research are substantial for both practical applications and theoretical advancements in AI. Practically, the framework can be utilized for various applications requiring realistic video animations from static images without the need for annotated data. Theoretically, this work paves the way for future research focusing on improving the expressiveness and generalization of motion models in image animation.
Future developments could involve extending the framework to handle a broader range of objects and motions, further optimizing the occlusion handling mechanism, and enhancing the training efficiency. The introduction of high-resolution datasets like Tai-Chi-HD also suggests a need for more substantial and diverse datasets to evaluate the robustness of image animation frameworks.
In conclusion, the first order motion model for image animation presents a significant step forward in generating realistic video sequences from still images, providing a robust and efficient solution to the complex problem of object animation without relying on extensive prior knowledge or annotations.