From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (2401.01885v1)

Published 3 Jan 2024 in cs.CV

Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.

Authors (7)
  1. Evonne Ng (8 papers)
  2. Javier Romero (35 papers)
  3. Timur Bagautdinov (22 papers)
  4. Shaojie Bai (21 papers)
  5. Trevor Darrell (324 papers)
  6. Angjoo Kanazawa (84 papers)
  7. Alexander Richard (33 papers)
Citations (20)

Summary

  • The paper introduces a framework that uses audio-conditioned diffusion models and vector quantization to generate lifelike, gesture-rich avatars.
  • The methodology integrates multi-view conversational data to achieve high frame-rate rendering of full-body, facial, and hand expressions.
  • The research holds implications for improving virtual interactions in telepresence and online education, while noting open challenges around privacy and long-range conversational synthesis.

Overview of Synthesizing Full-Bodied Photorealistic Avatars

The paper presents a framework for creating full-bodied, photorealistic avatars that gesture in response to the dynamics of a dyadic (two-person) conversation, given only speech audio. This technology has the potential to improve the realism and expressiveness of digital human avatars, particularly in virtual communication settings.

The Science Behind Generating Dynamic Gestures

The method combines the sample diversity obtained from vector quantization with the high-frequency detail afforded by diffusion models. This allows the avatars to exhibit a wide range of gestures and nuanced facial expressions (such as subtle sneers and smirks) that are synchronized with the spoken dialogue. The generated motion covers not only the body but also the face and hands, produced at a high frame rate to convey intricate movements.
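The coarse-to-fine idea can be illustrated with the vector-quantization building block on its own. The sketch below is a minimal, generic nearest-neighbour quantizer in PyTorch; the codebook size, feature dimensions, and names are illustrative assumptions, not the paper's released code. Discrete codes like these are what provide sample diversity, while a diffusion model conditioned on them adds the high-frequency detail.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, code_dim) continuous pose features
        # squared distance to every codebook entry -> (batch, time, num_codes)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)        # discrete guide-pose codes
        z_q = self.codebook(idx)          # quantized features
        # straight-through estimator: copy gradients from z_q back to z
        z_q = z + (z_q - z).detach()
        return z_q, idx

# toy usage: quantize two sequences of 30 pose-feature frames
vq = VectorQuantizer()
z = torch.randn(2, 30, 64)
z_q, codes = vq(z)
print(z_q.shape, codes.shape)  # torch.Size([2, 30, 64]) torch.Size([2, 30])
```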

To support this line of research, the authors introduce a unique dataset, the first to offer multi-view conversational footage that enables photorealistic reconstruction. Experimental evaluations show that the model generates varied and fitting gestures, outperforming both diffusion-only and VQ-only baselines.

The Technology and Data

At the heart of this technology are two separate models: one for the face, leveraging an audio-conditioned diffusion model, and another for the body and hands, which uses an innovative combination of an autoregressive VQ-based method and a diffusion model. The personalized avatars are visualized through a neural renderer trained with multi-view capture data.
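As a rough illustration of that two-branch split, the skeleton below pairs an audio-conditioned face denoiser with an autoregressive prior over discrete body/hand codes. All module names, dimensions, and layer choices are placeholder assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FaceDenoiser(nn.Module):
    """Audio-conditioned diffusion denoiser over per-frame face features."""
    def __init__(self, face_dim=256, audio_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(face_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, face_dim),
        )

    def forward(self, noisy_face, audio_feat, t):
        # noisy_face: (B, T, face_dim), audio_feat: (B, T, audio_dim), t: (B,)
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy_face.size(1), 1)
        return self.net(torch.cat([noisy_face, audio_feat, t_feat], dim=-1))

class BodyCodePrior(nn.Module):
    """Autoregressive prior over discrete VQ codes for body/hand guide poses."""
    def __init__(self, num_codes=512, audio_dim=128, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_codes, hidden)
        self.rnn = nn.GRU(hidden + audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, prev_codes, audio_feat):
        # prev_codes: (B, T) int64 indices, audio_feat: (B, T, audio_dim)
        x = torch.cat([self.embed(prev_codes), audio_feat], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)   # logits over the next code at each step

# shape check with random stand-in tensors
face = FaceDenoiser()(torch.randn(2, 30, 256), torch.randn(2, 30, 128),
                      torch.randint(0, 1000, (2,)))
body = BodyCodePrior()(torch.randint(0, 512, (2, 30)), torch.randn(2, 30, 128))
print(face.shape, body.shape)  # (2, 30, 256) and (2, 30, 512)
```

In the full system described by the paper, sampled guide codes would condition a body diffusion model that fills in high-frame-rate motion, and the resulting face and body parameters would then drive the neural renderer.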

The researchers also compiled a new dataset to enable these advancements. It consists of long-form dyadic interactions spanning a broad range of emotions and conversational topics. Unlike previous datasets limited to skeletal or cartoon-like visualizations, it supports photorealistic reconstruction of the participants, capturing the subtleties of real human interaction.
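For intuition about how such time-aligned capture data might be consumed during training, here is a hypothetical windowed loader; the field names, shapes, and window length are assumptions and do not reflect the released dataset format.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DyadicWindows(Dataset):
    """Slices time-aligned audio/pose/face tracks into fixed-length windows."""
    def __init__(self, audio, body_pose, face_codes, window=90):
        # audio: (T, A), body_pose: (T, J), face_codes: (T, F), all time-aligned
        assert audio.shape[0] == body_pose.shape[0] == face_codes.shape[0]
        self.audio, self.body, self.face = audio, body_pose, face_codes
        self.window = window

    def __len__(self):
        return self.audio.shape[0] - self.window + 1

    def __getitem__(self, i):
        sl = slice(i, i + self.window)
        return self.audio[sl], self.body[sl], self.face[sl]

# toy usage with random stand-in tensors (3000 frames)
ds = DyadicWindows(torch.randn(3000, 128), torch.randn(3000, 104),
                   torch.randn(3000, 256))
audio, body, face = next(iter(DataLoader(ds, batch_size=4, shuffle=True)))
print(audio.shape, body.shape, face.shape)
```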

Implications and Applications

This technology has major implications for future virtual interaction systems. The ability to generate realistic avatars that respond naturally to audio cues can greatly enhance telepresence in applications such as virtual meetings, online education, and social VR. The released dataset and code should also spur further research into gesture generation with high-fidelity avatars, paving the way for more natural and immersive virtual experiences.

Reflecting on the Current Limitations

While the method shows promising results in generating lifelike gestures for short audio segments, it is less adept at synthesizing motion that requires an understanding of long-range conversational content. In addition, the approach currently covers only a small set of consenting subjects, which addresses privacy concerns but limits the variety of avatars that can be created. Despite these limitations, the project sets a new precedent for photorealistic interactive avatars and raises important questions about how such technology should be evaluated.