
Synthesizing Moving People with 3D Control (2401.10889v2)

Published 19 Jan 2024 in cs.CV and cs.AI

Abstract: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and to the input image in terms of visual similarity. In addition, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied, challenging, and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


Summary

  • The paper introduces the 3DHM framework that synthesizes full-body texture maps and realistic poses from a single image.
  • It employs an in-filling diffusion model to hallucinate unseen body parts and clothing, and a diffusion-based rendering pipeline that produces plausible clothing and hair motion.
  • Experimental results show that 3DHM outperforms state-of-the-art methods by achieving temporally coherent and highly detailed human animations.

Introduction

The 3DHM framework synthesizes human motion from a single static image according to a target 3D motion sequence. It works as a two-part process: it first learns to in-paint the person's texture map, and then renders the human form in novel poses with realistic detail.

Learning to Visualize the Unseen

The initial stage of 3DHM focuses on creating a complete picture, a full texture map, of an individual from a single image. A photograph typically shows only a partial view of a person, such as the front or the back, yet animating that person in 3D requires texture for the entire body. To address this, 3DHM uses an in-filling diffusion model that hallucinates the unseen portions of the individual's clothing and body from priors learned on available data. Because the model operates in texture-map (UV) space, which is invariant to human pose and viewpoint, it is sample-efficient and can learn from a comparatively limited set of examples.
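
To make this concrete, below is a minimal sketch of texture-map in-filling using an off-the-shelf inpainting pipeline from Hugging Face diffusers. Note the assumptions: the paper trains its own in-filling model on texture-map space rather than reusing a public checkpoint, and the file names `partial_texture.png` and `visibility_mask.png` are hypothetical placeholders for a partially observed UV map and its visibility mask.

```python
# Illustrative sketch only: the paper trains a dedicated in-filling
# diffusion model on texture-map (UV) space; here an off-the-shelf
# inpainting pipeline approximates the idea.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# partial_texture.png (hypothetical): UV texture map holding only the
# pixels visible in the input photo.
# visibility_mask.png (hypothetical): white where the texture is unknown
# (back of the person, occluded regions) and must be hallucinated.
partial_texture = Image.open("partial_texture.png").convert("RGB").resize((512, 512))
visibility_mask = Image.open("visibility_mask.png").convert("L").resize((512, 512))

# The diffusion model fills the masked UV regions with plausible skin and
# clothing texture. Because UV space is pose- and view-invariant, the
# completed map can be reused unchanged for every frame of the animation.
completed = pipe(
    prompt="full-body clothing and skin texture map of a person",
    image=partial_texture,
    mask_image=visibility_mask,
).images[0]
completed.save("completed_texture.png")
```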

Rendering Realism in Motion

The second phase is where 3DHM showcases its distinctive abilities: it transforms the texture-mapped figure into lifelike poses, including plausible clothing and hair motion such as the swirl of a skirt, the ripple of a shirt, or the sway of hair. This is accomplished through a diffusion-based rendering pipeline conditioned on 3D human poses. Where many existing methods struggle to capture such nuances or require large datasets, 3DHM's 3D control lets it produce faithful, temporally coherent renderings of complex human actions.
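
The sketch below shows the general conditioning pattern rather than the paper's exact pipeline: a latent diffusion model generates each frame while guided by a per-frame rendering of the posed 3D body. As assumptions, a public OpenPose ControlNet stands in for the paper's custom 3D-pose conditioning, and the `pose_frames/` directory of conditioning images is hypothetical.

```python
# General pattern (not the paper's exact method): frame-by-frame
# pose-conditioned image generation with a ControlNet-style adapter.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

num_frames = 16  # length of the target motion sequence
frames = []
for i in range(num_frames):
    # pose_frames/ (hypothetical): per-frame conditioning images, e.g.
    # renderings of the texture-mapped 3D body in each target pose.
    pose_render = Image.open(f"pose_frames/{i:04d}.png")
    frame = pipe(
        prompt="a photorealistic person, consistent clothing and lighting",
        image=pose_render,
        num_inference_steps=30,
    ).images[0]
    frames.append(frame)
```

Conditioning on renderings of the posed 3D body, rather than on 2D keypoints alone, is what enables the 3D control the abstract describes, including rendering the same person along synthetic camera trajectories.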

Experimental Validation and Results

The validation of 3DHM involved extensive testing against other state-of-the-art methods. The framework was trained and tested on a large collection of human videos and evaluated with several standard image- and video-quality metrics. On these metrics, 3DHM outperformed prior methods at generating both individual frames and entire motion sequences, capturing fine details of appearance and pose. Notably, even on previously unseen images and videos, including clips taken from YouTube, it animated the depicted person with impressive realism.
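
As an illustration of frame-level evaluation in this setting, the sketch below averages two standard image-quality metrics, PSNR and SSIM, over aligned generated and ground-truth frames; it is not the paper's full evaluation protocol, and the function name is hypothetical.

```python
# Minimal frame-level evaluation sketch: average PSNR and SSIM across a
# generated sequence and its ground-truth reference.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(generated, reference):
    """Average PSNR/SSIM over aligned uint8 RGB frames (H x W x 3 arrays)."""
    psnrs, ssims = [], []
    for gen, ref in zip(generated, reference):
        psnrs.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        ssims.append(structural_similarity(ref, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```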

Technological Implications and Future Directions

3DHM does more than demonstrate a remarkable ability to synthesize human motion from still imagery; it represents a step forward in animating photographs and generating realistic video. Limitations remain: consistency across a long video sequence is not guaranteed, and unique attributes such as logos on clothing are not perfectly reproduced, owing to limitations of the training data. Even so, 3DHM's ability to operate with limited data and its advances in realistic rendering position it as a potential game-changer in areas ranging from virtual reality to film production. Its methodological approach sidesteps the need for vast data troves, leveraging 3D rendering and machine learning in creative new ways and offering a glimpse into the future of human animation technology.
