
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents (2303.14613v4)

Published 26 Mar 2023 in cs.CV and cs.GR

Abstract: The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model and present a novel CLIP-guided mechanism that extracts efficient style representations from multiple input modalities, such as a piece of text, an example motion clip, or a video. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator via an adaptive instance normalization (AdaIN) layer. We further devise a gesture-transcript alignment mechanism that ensures a semantically correct gesture generation based on contrastive learning. Our system can also be extended to allow fine-grained style control of individual body parts. We demonstrate an extensive set of examples showing the flexibility and generalizability of our model to a variety of style descriptions. In a user study, we show that our system outperforms the state-of-the-art approaches regarding human likeness, appropriateness, and style correctness.

Authors (3)
  1. Tenglong Ao (9 papers)
  2. Zeyi Zhang (4 papers)
  3. Libin Liu (20 papers)
Citations (114)

Summary

  • The paper introduces a conditional latent diffusion model that integrates CLIP embeddings for precise and controllable co-speech gesture generation.
  • It employs a VQ-VAE for efficient latent motion representation and contrastive learning to ensure semantic alignment between gestures and speech.
  • Evaluations on BEAT and ZeroEGGS datasets reveal improved gesture realism and style accuracy, underscoring its potential for real-time applications.

GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

The paper "GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents" introduces a novel framework for co-speech gesture generation that leverages the capabilities of latent diffusion models and the CLIP architecture. Researchers have long been interested in generating stylized gestures that align semantically and rhythmically with speech. This framework attempts to broaden the flexibility and control over gesture styles by integrating multi-modal prompts, including text, motion, and video.

The proposed system, GestureDiffuCLIP, generates gestures with a conditional latent diffusion model (LDM) whose style guidance comes from Contrastive Language-Image Pre-training (CLIP) embeddings. The framework produces realistic gestures that are semantically aligned with the input speech and supports fine-grained style conditioning from multiple modalities.
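
At its core, the conditional LDM denoises Gaussian noise in the latent motion space while conditioning each step on speech and style features. The sketch below shows generic DDPM-style ancestral sampling in PyTorch; the denoiser signature eps_model(z_t, t, speech_feat, style_feat), the linear noise schedule, and the latent shape are illustrative assumptions rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def sample_gesture_latents(eps_model, speech_feat, style_feat,
                           shape=(1, 32, 256), num_steps=1000, device="cpu"):
    """DDPM-style ancestral sampling of gesture latents conditioned on speech
    and style features; the result would be decoded by the VQ-VAE decoder."""
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(shape, device=device)                          # start from pure noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(z, t_batch, speech_feat, style_feat)        # predicted noise
        # Posterior mean of z_{t-1} given z_t and the predicted noise.
        mean = (z - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise
    return z
```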

Approach and Architecture

Latent Motion Representation: The model uses a vector-quantized variational autoencoder (VQ-VAE) to compress motion data into a lower-dimensional latent representation, which helps preserve both the diversity and the quality of the generated gestures.
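
The quantization step at the heart of a VQ-VAE can be sketched as follows; the codebook size, latent dimension, and straight-through gradient trick shown here are standard VQ-VAE components with assumed values, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous motion latents to their nearest codebook entries."""
    def __init__(self, num_codes=512, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta                                   # commitment loss weight

    def forward(self, z_e):                                # z_e: (batch, time, code_dim)
        # Nearest-neighbour lookup against every codebook vector.
        codes = self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1)
        indices = torch.cdist(z_e, codes).argmin(dim=-1)   # (batch, time) discrete motion tokens
        z_q = self.codebook(indices)                       # quantized latents
        # Codebook and commitment losses, plus straight-through gradients to the encoder.
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z_e) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, loss
```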

Gesture-Transcript Joint Embeddings: A key component of the system is an embedding space shared between gesture features and speech transcripts, learned with a contrastive approach. This joint embedding lets the system measure and enforce semantic consistency between gestures and language.
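
One way to realize such a joint space is a pair of projection heads that map gesture latents and transcript features onto a shared unit sphere, as in the hypothetical sketch below; the input dimensions and layer choices are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects gesture latents and transcript features into one shared space
    so matching pairs can be compared with cosine similarity."""
    def __init__(self, gesture_dim=128, text_dim=768, embed_dim=256):
        super().__init__()
        self.gesture_proj = nn.Sequential(nn.Linear(gesture_dim, embed_dim), nn.GELU(),
                                          nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.GELU(),
                                       nn.Linear(embed_dim, embed_dim))

    def forward(self, gesture_feat, text_feat):
        g = F.normalize(self.gesture_proj(gesture_feat), dim=-1)   # unit-norm gesture embedding
        t = F.normalize(self.text_proj(text_feat), dim=-1)         # unit-norm transcript embedding
        return g, t
```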

Conditional Diffusion Model: The core generator is a latent diffusion model conditioned on the input speech and style prompts. Its denoising network uses self-attention and adaptive instance normalization (AdaIN) to integrate audio, textual information, and style cues, balancing semantic coherence with stylization.
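
AdaIN-style conditioning can be sketched as a normalization whose scale and shift are predicted from the style embedding. The snippet below uses LayerNorm, as is common in transformer denoisers, and its dimensions are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Normalizes features, then rescales them with statistics predicted
    from a style embedding."""
    def __init__(self, feat_dim=256, style_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, x, style):        # x: (batch, time, feat_dim), style: (batch, style_dim)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```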

Style Control with CLIP: Pre-trained CLIP encoders map style prompts given as text, video, or example motion into a shared embedding space. The resulting style embeddings are injected into the generator through AdaIN, so the output adheres to the desired style during generation.
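
Off-the-shelf CLIP encoders, for example those exposed by the Hugging Face transformers library, can embed text or video-frame style prompts into CLIP space. The checkpoint name and the naive averaging of frame embeddings below are assumptions for illustration, not the authors' setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_text_style(prompt: str) -> torch.Tensor:
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feat = model.get_text_features(**inputs)           # (1, 512) CLIP text embedding
    return feat / feat.norm(dim=-1, keepdim=True)

@torch.no_grad()
def encode_video_style(frames: list[Image.Image]) -> torch.Tensor:
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)         # (num_frames, 512) frame embeddings
    feat = feats.mean(dim=0, keepdim=True)             # naive frame averaging (an assumption)
    return feat / feat.norm(dim=-1, keepdim=True)

# Example: a text-described style whose embedding would drive AdaIN in the generator.
style_embedding = encode_text_style("a cheerful, energetic speaking style")
```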

Contrastive Learning: The authors use a CLIP-style contrastive loss to align speech-transcript and gesture embeddings, encouraging the generated gestures to be semantically relevant to the associated transcript.
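
The contrastive objective can be written as a symmetric cross-entropy over the gesture-transcript similarity matrix. A minimal sketch, assuming batches of unit-normalized embeddings such as those produced by the joint-embedding heads above (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(gesture_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matching gesture/transcript pairs lie on the
    diagonal of the similarity matrix and are pulled together."""
    logits = gesture_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_g2t = F.cross_entropy(logits, targets)              # gesture -> transcript
    loss_t2g = F.cross_entropy(logits.t(), targets)          # transcript -> gesture
    return 0.5 * (loss_g2t + loss_t2g)
```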

Evaluation and Results

GestureDiffuCLIP is evaluated on the BEAT and ZeroEGGS datasets for motion quality and style accuracy. Qualitatively, it improves on several state-of-the-art methods, particularly in generating contextually appropriate gestures, and it generalizes to unseen styles while maintaining high semantic fidelity. These findings are verified through user studies rating human likeness, appropriateness, and style correctness.

Quantitative metrics, including Fréchet Gesture Distance (FGD), Style Recognition Accuracy (SRA), and Semantics-Relevant Gesture Recall (SRGR), further support these results. The introduced Semantic Score (SC) quantifies the similarity between generated gestures and transcript semantics, providing an additional measure of semantic coherence and underscoring the model's ability to preserve content.
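
Fréchet Gesture Distance follows the same recipe as FID: fit Gaussians to feature embeddings of real and generated gestures and compute the Fréchet distance between them. A minimal sketch with NumPy and SciPy; the feature extractor that produces the embeddings is assumed to be a pretrained gesture autoencoder, as is common for this metric.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FGD between two sets of gesture feature vectors, each of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)      # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                # discard tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```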

Implications and Future Directions

GestureDiffuCLIP marks an advancement in co-speech gesture generation, providing a flexible framework that bridges multiple modalities within a generative context. The architecture's ability to incorporate arbitrary style control while retaining semantic alignment signals a significant improvement over previous systems dependent on rigid label sets or example-based methods lacking flexibility.

Future work could explore scaling the framework to real-time gesture synthesis through more efficient sampling techniques. Integrating large-scale multimodal datasets could also yield more robust gesture styles and smoother transitions driven by diverse prompt combinations. Reducing latency in the diffusion process and expanding the interpretive capacity of the CLIP-style semantic space would further broaden the system's versatility in interactive digital environments.
