- The paper introduces TANGO, achieving precise audio-gesture alignment with hierarchical embedding and diffusion-based interpolation.
- It constructs a pruned motion graph and employs AuMoCLIP to model temporal associations between audio and motion for accurate gesture retrieval.
- Experimental results on datasets like YouTube Business show significant improvements in visual quality and beat consistency over prior methods.
Overview of TANGO: Co-Speech Gesture Video Reenactment
The paper presents TANGO, a framework for generating high-fidelity co-speech gesture videos whose body gestures are precisely aligned with target speech audio. TANGO builds on Gesture Video Reenactment (GVR) and addresses two of its limitations, audio-motion misalignment and visual artifacts, by introducing an improved retrieval method based on a hierarchical audio-motion joint embedding and a diffusion-based interpolation model for smoother transitions.
Methodology
TANGO operates through a robust pipeline:
- Graph Construction: TANGO builds a motion graph in which video frames serve as nodes and valid transitions between frames form edges. The graph is pruned to ensure connectivity, so that long video sequences can be sampled without running into dead ends.
- Audio-Conditioned Gesture Retrieval: The core improvement over GVR is AuMoCLIP, a hierarchical audio-motion joint embedding trained with a dual-tower architecture that implicitly models temporal associations between audio and motion. Using this embedding, TANGO retrieves gesture sequences that closely match the input audio, yielding more accurate co-speech coordination.
- Diffusion-Based Interpolation: Appearance Consistent Interpolation (ACInterp) generates high-quality transition frames between retrieved clips. It leverages diffusion models to remove visual artifacts and preserve consistency with the reference video's appearance, a marked improvement over GAN-based interpolation, delivering sharper and more visually coherent outputs.
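The graph-construction step above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes frames are represented by pose feature vectors, adds a transition edge whenever two poses are similar enough to cut between, and prunes dead ends iteratively; the function names and threshold are illustrative.

```python
import math

def pose_distance(a, b):
    """Euclidean distance between two pose feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_motion_graph(frames, threshold=0.5):
    """Nodes are frame indices; an edge i -> j means playback may jump
    from frame i to frame j. Consecutive frames are always connected."""
    edges = {i: set() for i in range(len(frames))}
    for i in range(len(frames) - 1):
        edges[i].add(i + 1)  # natural playback order
    for i, fi in enumerate(frames):
        for j, fj in enumerate(frames):
            if j not in (i, i + 1) and pose_distance(fi, fj) < threshold:
                edges[i].add(j)  # transition edge between similar poses
    return edges

def prune_dead_ends(edges):
    """Iteratively drop nodes with no outgoing edges, so any sampled
    walk through the graph can always be extended."""
    edges = {u: set(vs) for u, vs in edges.items()}
    changed = True
    while changed:
        changed = False
        for u in [u for u, vs in edges.items() if not vs]:
            del edges[u]
            changed = True
        for u in edges:
            kept = {v for v in edges[u] if v in edges}
            if kept != edges[u]:
                edges[u] = kept
                changed = True
    return edges
```

After pruning, every surviving node has at least one successor, which is what allows sampling arbitrarily long sequences without hitting a dead end.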
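The retrieval step can likewise be sketched in miniature. The sketch below assumes the two encoder towers (not shown) have already mapped the query audio and each candidate gesture clip into the same embedding space, and simply ranks candidates by cosine similarity; it does not reproduce AuMoCLIP's training or its hierarchical structure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_gestures(audio_emb, motion_embs, top_k=3):
    """Return indices of the top-k motion clips whose embeddings are
    most similar to the audio embedding."""
    scored = [(cosine_similarity(audio_emb, m), idx)
              for idx, m in enumerate(motion_embs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

In a dual-tower setup the payoff is that candidate motion embeddings can be precomputed offline, so retrieval at generation time reduces to this cheap nearest-neighbor search.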
Results
TANGO is evaluated on the Show-Oliver dataset and the newly introduced YouTube Business dataset, where it outperforms existing generative and retrieval-based methods on metrics such as Fréchet Video Distance (FVD), Fréchet Gesture Distance (FGD), and Beat Consistency, indicating better visual quality and tighter audio-gesture synchrony.
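A Beat Consistency style score can be sketched as follows. This is a generic formulation common in gesture-generation work, not necessarily the paper's exact metric: each motion beat is matched to its nearest audio beat, and small offsets are rewarded via a Gaussian kernel whose width `sigma` (in seconds) is an assumed tolerance parameter.

```python
import math

def beat_consistency(motion_beats, audio_beats, sigma=0.1):
    """Average Gaussian reward over motion beats, where each motion
    beat is scored by its distance to the nearest audio beat.
    Beat times are in seconds; returns a value in [0, 1]."""
    if not motion_beats or not audio_beats:
        return 0.0
    total = 0.0
    for mb in motion_beats:
        nearest = min(abs(mb - ab) for ab in audio_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(motion_beats)
```

Perfectly aligned beats score 1.0, and the score decays smoothly as motion beats drift away from the audio beats, which is why the metric tracks audio-gesture synchrony rather than visual quality.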
These experiments highlight TANGO's ability to produce visually realistic videos that are well synchronized with the input audio, an important step for applications such as news broadcasting and digital content creation.
Implications and Future Directions
TANGO enables more effective and efficient co-speech video generation, with the potential to reduce production costs significantly. Beyond producing high-quality gesture-synchronized videos, the framework sets a useful precedent for future research in cross-modal video generation.
Future work could explore extending TANGO's methodologies to broader applications, such as general human motion reenactment in diverse settings. The integration of more detailed motion features and extended datasets could further enhance the robustness and applicability of TANGO in various fields of artificial intelligence and computer graphics.
The release of TANGO's code and models underscores a commitment to open-source research, encouraging further exploration and development by the research community.