
Gesticulator: A framework for semantically-aware speech-driven gesture generation (2001.09326v5)

Published 25 Jan 2020 in cs.HC, cs.LG, and eess.AS

Abstract: During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page https://svito-zar.github.io/gesticulator .

Semantically-Aware Speech-Driven Gesture Generation: Gesticulator Framework Overview

The paper introduces "Gesticulator," a novel model for generating semantically-aware co-speech gestures driven by speech input. This work addresses a limitation of existing end-to-end gesture generation systems, which typically rely on a single modality, either the acoustic or the textual component of speech, to inform gesture synthesis. This unimodal approach confines models to one of two primary gesture types: beat gestures tied to speech acoustics, or semantic gestures associated with the spoken content. Gesticulator instead takes both the audio and text components of speech as dual input modalities, enabling the generation of both beat and semantic gestures. The proposed system is evaluated through subjective and objective metrics, demonstrating its efficacy over previous methodologies.

Methodological Approach

Gesticulator is a deep-learning model that takes as input both acoustic features extracted from the speech signal and semantic features derived from the speech text. These are mapped to gesture output represented as sequences of joint-angle rotations in 3D space, which can be rendered on both virtual agents and humanoid robots. The model is trained on the Trinity College Gesture Dataset, enriched with manual text annotations to capture semantic nuances.
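As a rough illustration of how acoustic and semantic streams can be paired frame by frame, the sketch below assembles a per-frame feature vector from a spectrogram, per-word embeddings, and simple timing features. This is a minimal sketch under assumed data layouts; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def build_frame_features(spectrogram, word_embeddings, word_ids, frame_times, word_spans):
    """Pair acoustic and semantic features per motion frame (illustrative sketch).

    spectrogram:     (T, n_mels) acoustic frames at the motion frame rate
    word_embeddings: (W, d) one embedding per word in the transcript
    word_ids:        length-T array; word_ids[t] = index of the word spoken at frame t
    frame_times:     length-T array of frame timestamps in seconds
    word_spans:      list of (start_s, end_s) per word
    """
    features = []
    for t, time_s in enumerate(frame_times):
        w = word_ids[t]
        start_s, end_s = word_spans[w]
        duration = end_s - start_s
        progress = (time_s - start_s) / max(duration, 1e-6)  # position within the word
        frame = np.concatenate([
            spectrogram[t],          # acoustic frame
            word_embeddings[w],      # semantic embedding of the current word
            [duration, progress],    # assumed supplementary timing features
        ])
        features.append(frame)
    return np.stack(features)        # (T, n_mels + d + 2)
```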

A distinctive feature of the method is that the autoregressive model is trained on both past and future speech context. The input features include speech spectrograms, encoded word vectors, and supplementary features such as speaking rate and word durations. These feature vectors are processed by a multi-layer feedforward network that is conditioned on the most recently generated poses to maintain continuity in the motion output.
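The following is a minimal sketch of such an autoregressive generator, assuming that the previously generated poses condition the network through FiLM-style scale-and-shift modulation of the speech encoding. Layer sizes, the two-frame pose history, and all names are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Sketch: feedforward speech encoder + FiLM conditioning on previous poses."""

    def __init__(self, speech_dim, pose_dim, context_frames, hidden=256):
        super().__init__()
        self.speech_encoder = nn.Sequential(
            nn.Linear(speech_dim * context_frames, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # FiLM parameters (scale gamma, shift beta) from the last two poses (assumed)
        self.film = nn.Linear(pose_dim * 2, 2 * hidden)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),   # joint-angle rotations for one frame
        )

    def forward(self, speech_window, prev_poses):
        # speech_window: (B, context_frames, speech_dim), spanning past and future frames
        # prev_poses:    (B, 2, pose_dim), the two most recently generated poses
        h = self.speech_encoder(speech_window.flatten(1))
        gamma, beta = self.film(prev_poses.flatten(1)).chunk(2, dim=-1)
        h = gamma * h + beta               # autoregressive FiLM conditioning
        return self.decoder(h)

# Autoregressive rollout: each predicted pose feeds back as conditioning.
num_frames, context = 100, 61
speech_features = torch.randn(1, num_frames + context, 90)   # placeholder features
model = GestureGenerator(speech_dim=90, pose_dim=45, context_frames=context)
poses = [torch.zeros(1, 45), torch.zeros(1, 45)]
with torch.no_grad():
    for t in range(num_frames):
        window = speech_features[:, t:t + context]            # sliding speech context
        prev = torch.stack(poses[-2:], dim=1)
        poses.append(model(window, prev))
```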

Results and Comparative Analysis

Objective evaluations quantify the acceleration and jerk of the generated gestures relative to the ground truth as proxies for motion fluidity and realism. A standout finding is that the autoregressive components reduce jerk, maintaining motion continuity more effectively than non-autoregressive variants. The full model, alongside ablated versions (e.g., removing text input, disabling FiLM conditioning, or excluding autoregressive inputs), consistently demonstrated improvements over state-of-the-art unimodal approaches.
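A minimal sketch of this kind of kinematic metric is shown below: average acceleration and jerk magnitudes computed from finite differences of the pose sequence. The frame rate and array layout are assumptions for illustration; comparing generated motion against ground truth with such statistics is the general idea rather than the paper's exact evaluation code.

```python
import numpy as np

def average_acceleration_and_jerk(motion, fps=20):
    """motion: (T, D) array of poses over time; returns mean |accel| and mean |jerk|."""
    dt = 1.0 / fps
    velocity = np.diff(motion, n=1, axis=0) / dt        # (T-1, D)
    acceleration = np.diff(velocity, n=1, axis=0) / dt  # (T-2, D)
    jerk = np.diff(acceleration, n=1, axis=0) / dt      # (T-3, D)
    return (np.mean(np.linalg.norm(acceleration, axis=-1)),
            np.mean(np.linalg.norm(jerk, axis=-1)))

# Values close to those of the ground-truth motion indicate similarly smooth,
# natural-looking movement; excessive jerk signals discontinuous generation.
```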

Subjective assessments involved human participants judging the generated gestures for human-likeness and contextual relevance to the speech. Across the tested variants, Gesticulator, particularly the variant omitting PCA, which offered higher gesture variability, was preferred over the audio-driven baseline for its perceived alignment with the speech content and its natural articulation.

Implications and Future Directions

Gesticulator's integration of bi-modal speech attributes into gesture generation has meaningful implications for human-computer interaction, advancing the capability of social agents to provide coherent non-verbal communication alongside verbal cues. Its applicability spans both virtual platforms and physical robots, suggesting utility in domains such as automated customer service, entertainment, or therapeutic virtual assistants, where nuanced expressive interaction enhances user engagement.

Further progress in this line of research could include scaling the approach to larger, more diverse datasets, adding a stochastic component to the model output to capture the inherent variability of human gesticulation, and explicitly distinguishing between gesture categories during generation. Addressing limitations such as the labor-intensive annotation process and the coarse semantic granularity also offers pathways for improvement, potentially leveraging advances in speech recognition to streamline text alignment.

Gesticulator stands as a significant refinement in the evolution of gesture-generation systems, moving closer to the natural fluidity and communicative depth embodied by human non-verbal communication.

Authors (7)
  1. Taras Kucherenko (21 papers)
  2. Patrik Jonell (8 papers)
  3. Sanne van Waveren (4 papers)
  4. Gustav Eje Henter (51 papers)
  5. Simon Alexanderson (12 papers)
  6. Iolanda Leite (29 papers)
  7. Hedvig Kjellström (47 papers)
Citations (171)