- The paper introduces a dual-component architecture that decouples broad gesture configuration from subtle audio-driven adjustments for precise co-speech gesture synthesis.
- It introduces objective evaluation metrics: lip-sync error as a proxy for speech-gesture synchronization, and the Fréchet Template Distance (FTD) for assessing gesture fidelity.
- Experimental results on the Speech2Gesture and Mandarin datasets demonstrate enhanced gesture realism and diversity compared to previous methods.
Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates
The paper presents a novel approach to co-speech gesture synthesis, introducing learned gesture template vectors to address the longstanding problem of synthesizing gestures synchronized with speech audio. The fundamental challenge is the inherent ambiguity of the audio-to-gesture mapping: many different gestures can plausibly accompany the same speech. While previous data-driven methods have achieved some success, they often suffer from limited gesture variety, diminished fidelity, and a lack of reliable objective evaluation metrics.
The authors propose a dual-component model architecture comprising learned template vectors and audio-driven subtle movements. The template vector determines the broad configuration of a gesture sequence, while nuanced, speech-driven adjustments ensure synchronization. By conditioning the regression on the template vector, the framework turns the ill-posed one-to-many mapping into a more tractable one-to-one problem, improving gesture variety without sacrificing synchronization fidelity.
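As a rough illustration of this two-branch design, the sketch below (PyTorch, with illustrative layer sizes, pose dimensions, and module names that are not taken from the paper) lets a template vector set the coarse, per-sequence pose configuration while an audio encoder predicts speech-synchronized residual offsets; for a fixed template, the regression from audio to pose becomes effectively one-to-one.

```python
# Minimal sketch of the dual-branch idea (illustrative, not the paper's code).
# A template vector conditions the pose output on a coarse gesture configuration,
# while an audio encoder adds speech-synchronized residual motion.
import torch
import torch.nn as nn

class TemplateGestureSketch(nn.Module):
    def __init__(self, n_joints=49, template_dim=32, audio_dim=128, hidden=256):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        # Coarse branch: template vector -> base pose, broadcast over time.
        self.template_to_pose = nn.Linear(template_dim, n_joints * 2)
        # Fine branch: audio features conditioned on the template -> residual offsets.
        self.residual_head = nn.Linear(hidden + template_dim, n_joints * 2)

    def forward(self, audio_feats, template):
        # audio_feats: (B, T, audio_dim), template: (B, template_dim)
        h, _ = self.audio_encoder(audio_feats)               # (B, T, hidden)
        T = h.shape[1]
        base = self.template_to_pose(template).unsqueeze(1)  # (B, 1, 2*n_joints)
        cond = template.unsqueeze(1).expand(-1, T, -1)       # repeat template per frame
        residual = self.residual_head(torch.cat([h, cond], dim=-1))
        return base + residual                               # (B, T, 2*n_joints) keypoints

# Usage: different templates paired with the same audio yield different but
# still speech-synchronized gesture sequences.
model = TemplateGestureSketch()
audio = torch.randn(2, 120, 128)   # e.g. 120 frames of audio features
template = torch.randn(2, 32)      # sampled or learned template vectors
poses = model(audio, template)     # (2, 120, 98)
```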
A noteworthy methodological choice is the use of lip-sync error as an objective proxy for gesture-speech synchronization. The premise is that lip movements and gestures are both tightly coupled to the speech, so accurate lip-sync serves as a measurable stand-in for gesture alignment, which is otherwise difficult to quantify. Additionally, the paper introduces the Fréchet Template Distance (FTD), analogous to the Fréchet Inception Distance (FID), which evaluates gesture fidelity by comparing the statistics of real and generated gesture distributions in a learned feature space.
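For context on the FID analogy, the snippet below shows the standard Fréchet distance between two Gaussians fitted to feature embeddings, which is the general form such a metric takes; the assumption here is only that the embeddings come from the learned template space, and the helper name and shapes are illustrative rather than the paper's exact implementation.

```python
# Illustrative Fréchet distance between feature distributions, as used by FID and,
# by analogy, FTD: fit Gaussians to real and generated embeddings and compare them.
# The embeddings are assumed to come from a learned template/encoder space.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of embeddings in a learned feature space."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the two covariances.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * sqrt(cov_r cov_g))
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```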
Experimental evaluations on the Speech2Gesture dataset and additional Mandarin-speaking datasets demonstrate the model's superior performance in generating realistic, synchronized gestures compared to established methods such as Audio to Body Dynamics, Speech2Gesture, and MoGlow. The model improves not only on standard metrics (its L2 distance to ground truth can be high, but this is expected given the intentionally non-deterministic nature of the generated sequences), but more importantly on the newly proposed synchronization and fidelity metrics, E_lip and FTD, respectively.
The implications of this research extend beyond mere gesture generation to applications in digital human creation, human-computer interaction, and affective computing. The approach sets a precedent for leveraging template learning in multi-modal synthesis tasks, providing a framework that could be adapted to other domains where similarly disparate input-output mappings need to be reconciled.
Future research directions might refine the template learning process with more expressive generative models such as VAEs, since the transition from per-sample templates to learned template distributions suggests room for richer generative capability. Extending the model to diverse environmental contexts and integrating emotion cues could further enhance the realism of synthesized gestures.
The paper contributes significantly to the field by synthesizing co-speech gestures with a method that mitigates earlier limitations in gesture diversity and synchronization fidelity, and by proposing evaluation metrics that could serve as standards for future work in this domain.