
Learning Individual Styles of Conversational Gesture (1906.04160v1)

Published 10 Jun 2019 in cs.CV, cs.LG, and eess.AS

Abstract: Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures. The project website with video, code and data can be found at http://people.eecs.berkeley.edu/~shiry/speech2gesture.

Learning Individual Styles of Conversational Gesture

The study of conversational gestures, which intersects computer vision, linguistics, and cognitive science, concerns the non-verbal communication channels employed during speech. The paper "Learning Individual Styles of Conversational Gesture" explores the prediction of hand and arm gestures accompanying speech using a cross-modal translation model. It presents a significant advance in understanding the correlation between speech and the gestures that accompany it by providing a computational approach that predicts gestures from raw audio.

Methodological Approach

The core methodology revolves around translating audio input into gesture motion, tailored to each speaker's unique style. The researchers use "in-the-wild" monologue speech videos, conducting their experiments on a dataset of speakers from diverse backgrounds, such as lecturers and television hosts. The model is trained on unlabeled video data, using 2D skeletal keypoints extracted by an automatic pose detection system as pseudo ground truth.
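
As a concrete illustration, the sketch below shows one way such pseudo ground truth could be prepared from per-frame 2D pose detections: keypoints are centered on the neck, scale-normalized, and stacked into fixed-length clips aligned with the audio. This is a minimal, hypothetical example; the joint indices, array shapes, frame rate, and clip length are assumptions, not the authors' code.

```python
import numpy as np

# Hypothetical constants: joint indices, frame rate, and clip length are
# illustrative assumptions, not values taken from the paper.
NECK = 0                               # assumed index of the neck keypoint
ARM_HAND_JOINTS = list(range(1, 50))   # assumed upper-body/hand joint indices
FPS = 15                               # assumed video frame rate
CLIP_SECONDS = 4                       # assumed training clip length

def normalize_pose(keypoints):
    """keypoints: (num_joints, 2) array of pixel coordinates for one frame."""
    centered = keypoints - keypoints[NECK]             # neck becomes the origin
    scale = np.linalg.norm(centered, axis=1).max() + 1e-8
    return centered[ARM_HAND_JOINTS] / scale           # keep arm/hand joints, unit scale

def make_training_clips(per_frame_keypoints):
    """per_frame_keypoints: (num_frames, num_joints, 2) detections for one video."""
    poses = np.stack([normalize_pose(kp) for kp in per_frame_keypoints])
    clip_len = FPS * CLIP_SECONDS
    clips = [poses[t:t + clip_len]
             for t in range(0, len(poses) - clip_len + 1, clip_len)]
    return np.stack(clips)  # (num_clips, clip_len, num_kept_joints, 2)
```
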

Their system follows a two-stage design: temporal convolutional architectures carry out the audio-to-gesture translation efficiently, and an adversarial discriminator is incorporated on top. The UNet-like architecture provides flexibility and temporal context, mitigating the challenges posed by the asynchronous and multimodal nature of gesture prediction. The adversarial component improves the realism of the produced gestures, bringing them closer to natural motion in temporal coherence and adherence to the speaker's style.
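
The sketch below illustrates, under assumptions about feature dimensions and layer widths, what such a fully convolutional UNet-style translator and motion discriminator might look like in PyTorch. It is an illustrative reconstruction, not the authors' released implementation: the translator maps audio features (e.g., log-mel spectrogram frames) to per-frame 2D keypoints, and the discriminator judges temporal differences of the predicted pose sequence.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1D convolution + normalization + nonlinearity, the basic building block."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(c_out),
            nn.LeakyReLU(0.2),
        )
    def forward(self, x):
        return self.net(x)

class Speech2GestureSketch(nn.Module):
    """UNet-style 1D translator: audio features (B, F, T) -> poses (B, 2*J, T).
    Assumes an even sequence length T so the upsampled path matches the skip."""
    def __init__(self, n_audio_feats=128, n_joints=49, width=256):
        super().__init__()
        self.enc1 = ConvBlock(n_audio_feats, width)
        self.down = nn.MaxPool1d(2)
        self.enc2 = ConvBlock(width, width)
        self.bottleneck = ConvBlock(width, width)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = ConvBlock(2 * width, width)              # after skip concat
        self.head = nn.Conv1d(width, 2 * n_joints, kernel_size=1)

    def forward(self, audio):
        e1 = self.enc1(audio)
        e2 = self.enc2(self.down(e1))
        b = self.bottleneck(e2)
        u = torch.cat([self.up(b), e1], dim=1)              # UNet skip connection
        return self.head(self.dec(u))

class MotionDiscriminator(nn.Module):
    """Judges realism of temporal pose differences rather than static poses."""
    def __init__(self, n_joints=49, width=64):
        super().__init__()
        self.net = nn.Sequential(
            ConvBlock(2 * n_joints, width),
            ConvBlock(width, width),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(width, 1),
        )
    def forward(self, poses):
        motion = poses[:, :, 1:] - poses[:, :, :-1]         # frame-to-frame differences
        return self.net(motion)
```

In training, a translator of this kind would typically be optimized with a regression loss against the pseudo ground truth keypoints plus an adversarial loss from the motion discriminator, which is the combination the adversarial component described above suggests.
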

Dataset and Experiments

The paper introduces a comprehensive 144-hour video dataset covering ten distinct speakers, curated to support person-specific modeling of gestures. The experimental setup examines several baselines, including models trained on the median pose and nearest-neighbor selection, and further evaluates against previously established RNN-based architectures. The researchers report quantitative comparisons using the L1 and PCK (percent of correct keypoints) metrics, together with a user study on perceptual realism, to assess the system's efficacy.
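
For reference, the following sketch shows how the two reported quantitative metrics could be computed on predicted versus pseudo-ground-truth keypoints. The exact tolerance used for PCK here (a fraction of the per-frame ground-truth bounding box) is an illustrative assumption.

```python
import numpy as np

def pck(pred, gt, alpha=0.2):
    """Percent of Correct Keypoints.
    pred, gt: (num_frames, num_joints, 2) arrays of 2D keypoint coordinates.
    A prediction counts as correct if it lies within alpha times the larger
    side of the per-frame ground-truth bounding box (an assumed tolerance)."""
    extent = gt.max(axis=1) - gt.min(axis=1)               # (num_frames, 2)
    threshold = alpha * extent.max(axis=1, keepdims=True)  # (num_frames, 1)
    dist = np.linalg.norm(pred - gt, axis=-1)              # (num_frames, num_joints)
    return float((dist <= threshold).mean())

def l1_error(pred, gt):
    """Mean absolute keypoint error, the regression metric reported alongside PCK."""
    return float(np.abs(pred - gt).mean())
```
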

The results show that the method outperforms existing models, especially at predicting gestures that match each speaker's individual style. The quantitative scores underline the model's ability to generate plausible gestures, and the perceptual evaluations affirm the qualitative coherence of the generated motion.

Implications and Future Developments

This research holds vital implications for advancing the development of virtual conversational agents and enhancing human-computer interaction systems, particularly in generating natural and personalized non-verbal communication. It also opens possibilities for employing gestures as a feature in video content analysis, leveraging the natural alignment of gesture and speech as a synchronized communicative cue.

Future work may further address the challenges of asynchronicity and multimodality in gesture prediction. Integrating high-level linguistic features with the proposed audio-to-gesture translation might further align predicted gestures with intended speech semantics. Additionally, improving pose detection accuracy would reduce the noise in the pseudo ground truth, thereby enhancing model precision.

In conclusion, the paper makes a noteworthy contribution to the computational understanding of speech-gesture dynamics, offering a robust framework and dataset for future exploration in machine perception of human interactions. The methodologies proposed set a foundation for more sophisticated systems capable of deciphering and reproducing human-like gestural intricacies in real-world applications.

Authors (6)
  1. Shiry Ginosar (16 papers)
  2. Amir Bar (31 papers)
  3. Gefen Kohavi (4 papers)
  4. Caroline Chan (5 papers)
  5. Andrew Owens (52 papers)
  6. Jitendra Malik (211 papers)
Citations (299)