One Shot, One Talk: Whole-body Talking Avatar from a Single Image (2412.01106v1)

Published 2 Dec 2024 in cs.CV and cs.GR

Abstract: Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

Summary

  • The paper introduces a tightly coupled 3D Gaussian Splatting-mesh representation for generating expressive, whole-body talking avatars from a single image.
  • It leverages pose-guided image-to-video diffusion models to produce imperfect synthetic frames as pseudo-labels, enabling realistic animation across diverse gestures and expressions.
  • Empirical results demonstrate superior realism and animation quality, outperforming methods that rely on much larger amounts of training data on key benchmark metrics.

Overview of "One Shot, One Talk: Whole-body Talking Avatar from a Single Image"

The paper presents a novel approach to constructing whole-body talking avatars from a single image, opening up new possibilities for augmented and virtual reality applications. The method tackles two significant challenges: modeling complex dynamics and generalizing to novel gestures and expressions. Leveraging recent pose-guided image-to-video diffusion models, the authors generate imperfect video frames that serve as pseudo-labels for avatar optimization. A tightly coupled 3DGS-mesh avatar representation, combined with several key regularizations, proves effective in mitigating the inconsistencies introduced by these noisy pseudo-videos.

Methodology

The proposed approach introduces a tightly coupled 3D Gaussian Splatting (3DGS) and mesh representation, which captures geometry and texture detail more effectively than traditional representations. This hybrid builds on the SMPL-X parametric model to integrate body and facial motion, allowing more realistic and personalized avatar synthesis.
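
To make the coupling concrete, here is a minimal, illustrative sketch (not the authors' code) of how 3D Gaussians can be rigged to a deforming triangle mesh such as an SMPL-X surface: each Gaussian is attached to its nearest face via barycentric coordinates plus a normal offset, so it follows the mesh as the body and face articulate. The function names and the nearest-centroid assignment are assumptions made for illustration.

```python
import numpy as np

def _barycentric(p, a, b, c):
    """Barycentric coordinates of points p (already in the triangle plane) w.r.t. (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00 = np.einsum('ij,ij->i', v0, v0)
    d01 = np.einsum('ij,ij->i', v0, v1)
    d11 = np.einsum('ij,ij->i', v1, v1)
    d20 = np.einsum('ij,ij->i', v2, v0)
    d21 = np.einsum('ij,ij->i', v2, v1)
    denom = d00 * d11 - d01 * d01 + 1e-12
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.stack([1.0 - v - w, v, w], axis=-1)

def bind_gaussians(centers, vertices, faces):
    """Attach each Gaussian center to its nearest face; return (face index, barycentric, normal offset)."""
    tri = vertices[faces]                                            # (F, 3, 3) triangle vertices
    face_idx = np.argmin(
        np.linalg.norm(centers[:, None] - tri.mean(axis=1)[None], axis=-1), axis=1)
    a, b, c = tri[face_idx, 0], tri[face_idx, 1], tri[face_idx, 2]
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    offset = np.einsum('ij,ij->i', centers - a, n)                   # signed distance along the normal
    bary = _barycentric(centers - offset[:, None] * n, a, b, c)      # project into the face plane
    return face_idx, bary, offset

def reposed_centers(face_idx, bary, offset, new_vertices, faces):
    """Recompute Gaussian centers after the underlying mesh (e.g. SMPL-X) is re-posed."""
    a, b, c = (new_vertices[faces][face_idx, i] for i in range(3))
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    return bary[:, :1] * a + bary[:, 1:2] * b + bary[:, 2:3] * c + offset[:, None] * n
```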

In contrast to existing methods that require dense multi-view data or prolonged video captures, this approach relies on a single image, making it practical for typical consumer applications. To compensate for the limited information in a single image, the method uses pose-guided image-to-video diffusion models to generate synthetic video frames that expand the observed pose and expression space. These imperfect frames form a pseudo dataset used to optimize the avatar so that it animates realistically across a wide range of gestures and expressions.
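
The pseudo-label strategy can be sketched as a simple data-collection loop. The interfaces below (generate_video, score_frame, the per-frame confidence weighting) are illustrative assumptions rather than the paper's API: a pose-guided image-to-video model is driven with sampled gesture and expression sequences, and the resulting imperfect frames are stored as weighted pseudo-labels alongside the single real observation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import numpy as np

@dataclass
class PseudoFrame:
    image: np.ndarray   # generated RGB frame (H, W, 3); an imperfect label
    pose: np.ndarray    # driving whole-body pose/expression parameters
    weight: float       # confidence used to down-weight noisy labels during optimization

def build_pseudo_dataset(
    reference_image: np.ndarray,
    reference_pose: np.ndarray,
    pose_sequences: Sequence[np.ndarray],                             # sampled gesture/expression clips
    generate_video: Callable[[np.ndarray, np.ndarray], np.ndarray],   # (ref image, poses) -> frames
    score_frame: Callable[[np.ndarray], float],                       # heuristic quality score in [0, 1]
) -> List[PseudoFrame]:
    # The single real observation is kept with full confidence.
    dataset: List[PseudoFrame] = [PseudoFrame(reference_image, reference_pose, 1.0)]
    for poses in pose_sequences:
        frames = generate_video(reference_image, poses)               # imperfect, possibly inconsistent
        for frame, pose in zip(frames, poses):
            dataset.append(PseudoFrame(frame, pose, score_frame(frame)))
    return dataset
```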

Numerical Results and Claims

The paper reports extensive experimental results demonstrating that the proposed method achieves superior realism and animation quality compared to existing techniques, even those relying on much larger amounts of data. The authors provide both qualitative and quantitative comparisons, showing that their method produces avatars with consistent identity and dynamic expressiveness close to the subject's real-world likeness.

The claim that the method generates expressive and photorealistic avatars is supported by evaluations on standard rendering-fidelity metrics: MSE, L1 distance, PSNR, SSIM, and LPIPS. On these metrics, the method outperforms representative techniques such as ExAvatar and MimicMotion, confirming the efficacy of the approach in synthesizing realistic dynamic avatars from minimal input.
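
As a reference point, metrics of this kind can be computed per frame with standard tooling. The helper below is illustrative, not the authors' evaluation code; it assumes the scikit-image and lpips packages and frames normalized to [0, 1], with per-frame values averaged over a test clip.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips_net = lpips.LPIPS(net='alex')   # perceptual distance; lower is better

def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float arrays in [0, 1] of shape (H, W, 3)."""
    mse = float(np.mean((pred - gt) ** 2))
    l1 = float(np.mean(np.abs(pred - gt)))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = float(_lpips_net(to_tensor(pred), to_tensor(gt)))
    return {"MSE": mse, "L1": l1, "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```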

Implications and Future Prospects

This research has significant implications for interactive platforms and digital communications, where personalized avatars can enhance the user experience through lifelike animations. The capability to generate avatars from a single image significantly reduces entry barriers for avatar creation in virtual environments, potentially democratizing access to high-fidelity avatar technology.

However, the approach relies heavily on initial tracking accuracy and is susceptible to errors in scenarios where tracking data is noisy or incomplete. Moreover, the technology can be misused, prompting an ethical discussion on how such technologies should be regulated to prevent the spread of disinformation or identity misuse.

Looking toward the future, integrating LLMs with this framework could further improve avatar expressiveness by combining verbal cues with physical gestures, leading to more engaging digital interactions. Improving viewpoint generalization, especially for virtual meetings and gaming applications, remains an important avenue for future research.

In conclusion, the paper presents a substantial step toward realistic avatar synthesis, with implications spanning domains from entertainment to remote communication and the potential to reshape how users interact in digital spaces.
