Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
Abstract: Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content, and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
Explain it Like I'm 14
Overview
This paper is about teaching a computer to animate a 3D human body that gestures while someone is speaking—and to make those gestures clearly show the speaker’s emotion (like happy, angry, or sad). The system, called AMUSE, listens to speech and creates realistic whole-body movements that match both the rhythm of the words and the feeling behind them. It also lets you control the emotion and personal style of the gestures.
Key Objectives
Here are the main questions the researchers wanted to answer:
- Can we generate 3D body gestures from speech that look natural and match the timing of the words?
- Can we control the emotion shown by the gestures (for example, make them “angry” or “happy”) without changing the spoken content?
- Can we separate three things in speech—what is being said (content), the emotion, and the speaker’s personal style—so we can mix and match them?
Methods and Approach
To explain the technical parts, think of it like mixing music tracks:
- One track carries the lyrics (content).
- Another track carries the mood (emotion).
- A third track carries the singer’s unique style (style).
AMUSE tries to separate speech into three “secret codes” (called latent vectors): one for content, one for emotion, and one for style. Then it uses these codes to guide a motion generator that creates 3D body gestures.
Here’s how it works:
- Audio disentanglement: The system listens to the audio and produces three separate codes:
- Content code: represents the words and rhythm.
- Emotion code: represents the feeling (e.g., happy, angry).
- Style code: represents the speaker’s personal way of moving.
- This separation is called “disentanglement”: pulling apart mixed information into clean pieces (a small code sketch of this idea appears after this list).
- Motion prior (learning how bodies move smoothly): The model uses a tool called a VAE (Variational Autoencoder). You can think of a VAE as a compressor-decompressor that learns to represent smooth, realistic motion in a compact way. It helps the gestures avoid jerky or unnatural movement.
- Latent diffusion (turning noise into motion): Diffusion starts with random noise and slowly “denoises” it into a meaningful output. Imagine starting with a messy static image and gradually refining it into a clean sketch. Here, the system starts with random motion in “latent space” (the compressed motion world) and step-by-step shapes it into a realistic motion sequence, guided by the content, emotion, and style codes from the audio. A simplified version of this denoising loop is sketched after this list.
- 3D Body Model (SMPL-X): This is a standard, detailed 3D body model that covers the body, hands, and face. It lets the system produce natural-looking 3D meshes (surfaces) of people rather than simple stick figures (see the short SMPL-X example after this list).
- Training data: They used a large dataset of people speaking and gesturing (called BEAT). This dataset includes different emotions and speakers. The authors converted the original skeleton motion to SMPL-X surfaces to get more realistic, full-body animations.
- Editing gestures by mixing codes: Because content, emotion, and style are separated, you can:
- Keep the content from one speech but swap in the emotion from another (e.g., make the same words sound and look “angry”).
- Change the style to imitate another person’s way of moving.
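To make the three-code idea concrete, here is a minimal PyTorch sketch of an audio encoder that splits a speech clip into content, emotion, and style vectors. The module names, layer choices, and dimensions are illustrative assumptions, not AMUSE's actual architecture (the paper builds on a transformer-based audio model).

```python
import torch
import torch.nn as nn

class DisentangledAudioEncoder(nn.Module):
    """Toy encoder: one shared backbone, three separate heads.
    Purely illustrative; not the architecture used in AMUSE."""

    def __init__(self, n_mels=128, hidden=256, d_content=256, d_emotion=64, d_style=64):
        super().__init__()
        # Shared backbone over a log-mel spectrogram (batch, time, n_mels)
        self.backbone = nn.GRU(n_mels, hidden, batch_first=True)
        # Three heads that pull apart different aspects of the same speech
        self.content_head = nn.Linear(hidden, d_content)   # words / rhythm
        self.emotion_head = nn.Linear(hidden, d_emotion)    # happy, angry, ...
        self.style_head = nn.Linear(hidden, d_style)        # speaker-specific style

    def forward(self, mel):
        feats, _ = self.backbone(mel)              # (batch, time, hidden)
        pooled = feats.mean(dim=1)                 # summarize the whole clip
        return (self.content_head(feats),          # per-frame content code
                self.emotion_head(pooled),         # one emotion code per clip
                self.style_head(pooled))           # one style code per clip


mel = torch.randn(1, 300, 128)                     # ~3 s of fake log-mel frames
content, emotion, style = DisentangledAudioEncoder()(mel)
print(content.shape, emotion.shape, style.shape)
```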
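The latent diffusion step can be pictured like this: start from random noise in the motion prior's latent space, repeatedly denoise it with a network conditioned on the three codes, then decode the result with the VAE decoder. The sketch below is a generic DDPM-style sampling loop with a toy noise schedule and placeholder names, not the paper's exact sampler (AMUSE uses DDIM for fast inference).

```python
import torch

@torch.no_grad()
def sample_motion_latent(denoiser, content, emotion, style,
                         latent_shape, n_steps=50, device="cpu"):
    """Toy DDPM-style sampler in the motion prior's latent space.
    `denoiser(z_t, t, content, emotion, style)` is assumed to predict the
    added noise; AMUSE's actual network and schedule differ."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape, device=device)          # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(z, torch.tensor([t], device=device),
                       content, emotion, style)            # conditioned prediction
        # Standard DDPM update: remove the predicted noise, add a little back
        z = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z                                               # denoised motion latent

# Decoding with the motion prior (names are hypothetical):
# poses = motion_vae.decode(z)   # -> per-frame SMPL-X axis-angle rotations
```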
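For the body model, the snippet below shows how SMPL-X turns per-joint axis-angle rotations into a full 3D mesh, assuming the open-source `smplx` Python package and a downloaded SMPL-X model file. The random pose values stand in for one frame of generated motion.

```python
import torch
import smplx  # pip install smplx; model files from smpl-x.is.tue.mpg.de

# Load a neutral SMPL-X body model (the path is a placeholder)
model = smplx.create("models/", model_type="smplx", gender="neutral", use_pca=False)

# One frame of animation: axis-angle rotations for the body and hands.
# A real system would take these from the generated motion sequence.
body_pose = 0.1 * torch.randn(1, 21 * 3)     # 21 body joints
left_hand = 0.1 * torch.randn(1, 15 * 3)     # 15 joints per hand
right_hand = 0.1 * torch.randn(1, 15 * 3)

output = model(body_pose=body_pose,
               left_hand_pose=left_hand,
               right_hand_pose=right_hand,
               return_verts=True)
print(output.vertices.shape)  # a full 3D body mesh, not a stick figure
```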
Main Findings and Why They Matter
The researchers tested their system in several ways—by measuring timing alignment, variety of motions, and whether humans felt the motions matched the emotions. Here’s what they found:
- Better timing with speech: Gestures were well-synchronized with the speech rhythm (the movements matched the beats in the audio).
- Clearer emotional expression: The system’s gestures matched the intended emotion better than previous methods.
- Realistic and diverse motions: Even when the words stayed the same, the system could produce different, natural variations of gestures, like a real person would.
- Human preference: In a user study, people preferred AMUSE’s gestures for both “sync with speech” and “emotion appropriateness” compared to other state-of-the-art methods.
- Emotion and style control: You can take the content from one speech and fuse it with the emotion or style from another. For example, keep the words the same but make the gestures look “sad” or “angry” depending on your choice (a short code example below illustrates this swap).
These results are important because they show that you can control how a virtual character feels and moves just by using speech, without needing extra instructions or manual animation.
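To picture this “mix and match” control concretely, the snippet below continues the hypothetical names from the sketches in the Methods section: take the content code from one recording and the emotion and style codes from another, then generate gestures that perform the first clip's words with the second clip's feeling. `encoder`, `denoiser`, and `motion_vae` stand in for the trained encoder, denoising network, and motion prior from those sketches.

```python
# Continuing the hypothetical names defined in the earlier sketches.
content_a, _, _ = encoder(mel_clip_a)          # what is being said (clip A)
_, emotion_b, style_b = encoder(mel_clip_b)    # how it is said (clip B)

z = sample_motion_latent(denoiser, content_a, emotion_b, style_b,
                         latent_shape=(1, 64, 256))
poses = motion_vae.decode(z)   # clip A's words, performed with clip B's emotion/style
```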
Implications and Impact
This research can improve many areas:
- Virtual reality and video games: Characters and avatars can gesture naturally and show appropriate emotions, making interactions feel more human.
- Online communication and digital assistants: Virtual presenters, tutors, or assistants can look more engaging and empathetic by matching their gestures to the emotion in their voice.
- Movies and animation: Faster, more flexible character animation from voice recordings, with control over emotional tone and style.
The authors note future directions, like adding lower-body motion and facial expressions, and integrating text (to better capture meaning, not just rhythm). But even as it is, AMUSE shows a big step forward: it makes animated bodies that move with the voice and express emotion in a controlled, realistic way.
Glossary
- AdaIn: Adaptive Instance Normalization; a normalization layer that aligns feature statistics to enable style control or transfer. "via AdaIn~\cite{huang2017arbitrary} layers"
- AdamW: An optimization algorithm that decouples weight decay from the gradient update in Adam. "the AdamW optimizer \cite{loshchilovH19adamw}"
- axis angle representation: A rotation parameterization using an axis and an angle, commonly used for skeletal motion. "convert the skeleton motion data into the SMPL-X axis angle representation."
- BEAT: A large-scale speech-to-3D-gesture motion capture dataset with emotion annotations. "BEAT \cite{liu2022beat} is a good candidate"
- Beat align (BA): A metric measuring how well motion beat events temporally align with audio beats. "Beat align (BA):"
- CLIP: A contrastive multimodal model that learns joint text–image representations. "CLIP~\cite{radford2021learning} text features"
- conformers: Convolution-augmented Transformer architectures designed for speech sequence modeling. "combines conformers and the DiffWave~\cite{kong2021diffwave} architecture"
- cross-attention: An attention mechanism where one sequence attends to another (e.g., queries attend to memory). "every cross-attention transformer layer"
- DDIM: Denoising Diffusion Implicit Models; a deterministic sampler for diffusion models enabling fast inference. "We employ DDIM~\cite{song2020denoising}"
- DeiT: Data-efficient Image Transformers; a Vision Transformer variant adapted here to audio filterbanks. "the DeiT visual transformer \cite{TouvronCDMSJ21deit}"
- deictic: A class of gestures that point to or indicate objects, locations, or directions. "beat, deictic, iconic, and metaphoric."
- DiffWave: A diffusion-based neural vocoder architecture for speech synthesis. "the DiffWave~\cite{kong2021diffwave} architecture"
- FLAME: A parametric 3D face model for facial shape and expressions. "it lacks face mocap markers and FLAME expressions."
- Fréchet gesture distance (FGD): A distributional distance (analogous to FID) computed between generated and real gesture features. "Fréchet gesture distance (FGD):"
- Hamming window: A signal processing windowing function used when computing short-time spectra. "a 25ms Hamming frame window"
- iconic: Gestures that depict the shape, size, or movement of objects or actions. "beat, deictic, iconic, and metaphoric."
- latent diffusion model: A diffusion model that operates in a learned latent space instead of pixel/pose space. "a latent diffusion model"
- latent variable model: A probabilistic model that includes unobserved (latent) variables, enabling structured generation. "is a latent variable model \cite{rombach2021highresolution}"
- mel-frequency bins: Spectral bands on the mel scale used in audio filterbanks to approximate human pitch perception. "with 128 mel-frequency bins"
- metaphoric: Gestures that represent abstract concepts metaphorically rather than concrete actions or objects. "beat, deictic, iconic, and metaphoric."
- MoGlow: A normalizing-flow-based model for motion synthesis with controllable style. "MoGlow~\cite{henter2020moglow}"
- MoSh++: A method that fits parametric body models to sparse motion-capture markers to recover body parameters. "processed using MoSh++~\cite{Loper:SIGASIA:2014, AMASS:2019}"
- motion prior: A learned generative prior over motion sequences that regularizes and structures motion generation. "our motion prior network is a VAE transformer architecture"
- reparametrization trick: A technique in VAEs to enable backpropagation through stochastic sampling. "via the reparametrization trick"
- RQ-VAE: Residual Quantized VAE; a hierarchical vector quantization autoencoder enabling discrete latent representations. "uses an RQ-VAE to generate different gestures from speech."
- Semantic-Relevant Gesture Recall (SRGR): A metric assessing how well generated gestures match semantic categories of ground-truth gestures. "Semantic-Relevant Gesture Recall (SRGR):"
- sinusoidal positional encoding: A deterministic encoding of positions/timesteps using sinusoids, used to inject order into Transformers. " is a sinusoidal positional encoding of diffusion timestep "
- SMPL-X: A parametric 3D human body model including body, hands, and face components. "SMPL-X \cite{SMPL-X:2019} is a 3D model of the body surface."
- stop-gradient operation: An operation that blocks gradient flow through certain tensors during training. "a stop-gradient operation "
- U-Net: An encoder–decoder architecture with skip connections widely used in generative denoising networks. "U-Net-like \cite{ronneberger2015unet} structure"
- variational autoencoder (VAE): A probabilistic autoencoder that learns a latent distribution for generative modeling. "a temporal variational autoencoder (VAE)"
- VQ-VAE: Vector Quantized VAE; an autoencoder with discrete codebooks for latent representations. "uses a VQ-VAE to generate 3D human bodies"