- The paper presents a novel multimodal diffusion approach that generates dynamic 3D human motion and photorealistic talking avatars.
- It employs a two-stage process with stochastic human-to-3D-motion and temporal image-to-image diffusion models to ensure spatiotemporal coherence.
- Evaluation on standard talking-head benchmarks shows improved photorealism, lip-sync quality, and identity preservation over existing methods.
An Expert Overview of VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
The paper "VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis" presents a novel framework designed to synthesize photorealistic and temporally coherent videos of humans talking and moving, based on a single input image and an audio sample. This approach, which builds on the generative capabilities of diffusion models, represents a significant step forward in the domain of human synthetic content generation by integrating not just facial expressions, but also upper-body motion and hand gestures.
Methodology
The VLOGGER methodology comprises two primary components:
- A Stochastic Human-to-3D-Motion Diffusion Model: This component translates input audio into dynamic 3D human motion, capturing the inherently one-to-many mapping between speech and motion, including gaze, facial expressions, and body pose.
- A Temporal Image-to-Image Diffusion Model: Extending modern image diffusion models to the temporal domain, this architecture utilizes spatial and temporal controls, including dense representations and warped images, to generate high-quality, coherent video sequences of variable lengths.
The system leverages a 3D body model to predict per-frame expression and pose parameters, which are then rasterized into dense, semantic control maps. These controls guide the video generation process, accounting not only for the head and face but also for upper-body and hand dynamics.
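To make the flow concrete, the sketch below mirrors this two-stage design in Python. Everything here is illustrative: the function and parameter names (`audio_to_motion`, `rasterize_controls`, `controls_to_video`, `motion_dim`) and the simplified denoising updates are assumptions for exposition, not the authors' code or exact sampler.

```python
# Hypothetical sketch of a VLOGGER-style two-stage pipeline; names and the
# simplified denoising updates are illustrative, not the authors' implementation.
import numpy as np

def audio_to_motion(audio_feats, motion_model, num_frames, motion_dim=150, steps=50):
    """Stage 1: stochastic audio-conditioned diffusion over 3D motion parameters.
    Starts from Gaussian noise and iteratively refines per-frame expression/pose
    parameters (returned as a num_frames x motion_dim array)."""
    x = np.random.randn(num_frames, motion_dim)
    for t in reversed(range(steps)):
        eps_hat = motion_model(x, t, audio_feats)  # model predicts the noise
        x = x - eps_hat / steps                    # toy update, not a real DDPM sampler
    return x

def rasterize_controls(motion_params, body_model, renderer):
    """Project predicted 3D body parameters into per-frame dense/semantic
    control maps that will condition the video diffusion model."""
    return np.stack([renderer(body_model(p)) for p in motion_params])

def controls_to_video(ref_image, control_maps, video_model, steps=50):
    """Stage 2: temporal image-to-image diffusion conditioned on the single
    reference image and the rasterized control maps."""
    frames = np.random.randn(len(control_maps), *ref_image.shape)
    for t in reversed(range(steps)):
        eps_hat = video_model(frames, t, ref_image, control_maps)
        frames = frames - eps_hat / steps          # toy update, for illustration only
    return frames

if __name__ == "__main__":
    # Stub models so the sketch runs end to end; real networks would go here.
    motion_model = lambda x, t, a: np.zeros_like(x)
    body_model = lambda p: p
    renderer = lambda m: np.zeros((64, 64, 3))
    video_model = lambda f, t, r, c: np.zeros_like(f)

    audio = np.zeros((100, 128))                   # e.g. 100 frames of audio features
    motion = audio_to_motion(audio, motion_model, num_frames=100)
    controls = rasterize_controls(motion, body_model, renderer)
    video = controls_to_video(np.zeros((64, 64, 3)), controls, video_model)
    print(video.shape)                             # (100, 64, 64, 3)
```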
Dataset and Training
The authors introduce MENTOR, an extensive dataset curated from a large repository of internal videos. It boasts over 800,000 unique identities and encompasses a broad spectrum of human diversity in terms of skin tone, age, viewpoint, and body visibility. Comprising over 2,200 hours of video content, MENTOR significantly surpasses the scale of previous datasets, providing a robust foundation for training and evaluating VLOGGER.
The training procedure follows the standard diffusion recipe: noise is progressively added to ground-truth samples, and the model learns to iteratively denoise them back toward the original data. The temporal diffusion model is trained in two stages, first on single images and then on video sequences to fine-tune spatiotemporal generation.
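As a concrete reference, the following is a minimal sketch of the standard noise-prediction diffusion objective that this description corresponds to; the exact noise schedule, conditioning, and parameterization used by VLOGGER are assumptions of this sketch.

```python
# Minimal sketch of a standard denoising-diffusion training step (epsilon
# prediction); schedule and conditioning details here are assumptions.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, cond, alphas_cumprod):
    """Corrupt ground-truth samples x0 at a random timestep and train the
    model to predict the added noise, given conditioning `cond`
    (e.g. audio features or rendered control maps)."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward noising process
    eps_hat = model(x_t, t, cond)                            # predicted noise
    return F.mse_loss(eps_hat, noise)
```

Under the two-stage schedule described above, this same objective would first be applied to single frames and then to short video clips once the temporal layers are in place.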
Evaluation and Results
The performance of VLOGGER is evaluated on the HDTF and TalkingHead-1KH benchmarks against several state-of-the-art methods. The assessment covers a broad set of attributes (illustrative sketches of two of these metrics follow the list):
- Photorealism: Assessed via FID, CPBD, and NIQE scores.
- Lip Sync and Temporal Consistency: Evaluated using LME, LSE-D, and jitter metrics.
- Identity Preservation and Diversity of Expressions: Measured through head pose error, ArcFace distance, and temporal expression variance.
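For readers unfamiliar with the less common metrics, the sketch below gives illustrative implementations of two of them, temporal jitter and ArcFace identity distance, using common definitions; the paper's exact formulations may differ, and the function names and input shapes are assumptions.

```python
# Illustrative metric implementations using common definitions; the paper's
# exact formulations may differ.
import numpy as np

def temporal_jitter(landmarks):
    """Jitter as the mean magnitude of the second temporal difference of
    landmark trajectories (landmarks: T x K x 2). Lower means smoother motion."""
    accel = landmarks[2:] - 2 * landmarks[1:-1] + landmarks[:-2]
    return float(np.mean(np.linalg.norm(accel, axis=-1)))

def identity_distance(emb_generated, emb_reference):
    """ArcFace-style identity preservation: cosine distance between face
    embeddings of generated frames and the reference image. Lower is better."""
    def _norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    cos = np.sum(_norm(emb_generated) * _norm(emb_reference), axis=-1)
    return float(np.mean(1.0 - cos))
```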
VLOGGER outperforms baseline methods across these benchmarks, especially in generating temporally coherent sequences and maintaining high identity fidelity. This is attributed to its effective use of 3D motion controls and the substantial diversity within the MENTOR dataset.
Applications and Future Developments
The practical applications of VLOGGER are manifold, ranging from video editing and personalization to more advanced human-computer interaction systems such as virtual assistants and social presence agents. The system's ability to generate realistic, varied human motion has the potential to enhance user engagement in educational platforms, telemedicine, and customer service. Fine-tuning, as demonstrated through model personalization, points toward individualized content generation.
Implications and Speculation
The implications of this research stretch into both theoretical and practical domains. The integration of multimodal inputs and 3D motion models represents a valuable advancement in generative AI models, fostering further exploration in this burgeoning field. The proposed architecture could pave the way for more sophisticated and contextually aware synthetic content generation systems in the future.
We anticipate that future research will build upon the methodologies introduced by VLOGGER, expanding the diversity metrics and improving control over the generated motion. Additionally, as the ethical considerations surrounding synthetic content become more pronounced, responsible use and thorough vetting of datasets and models will be pivotal.
In summary, VLOGGER sets a new benchmark in the synthesis of audio-driven human videos by leveraging advanced multimodal diffusion techniques and extensive, diverse datasets. It stands as a significant contribution to the field of AI-driven avatars and human-computer interaction, with widespread implications for future developments and practical applications in synthetic media generation.