TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models (2506.03099v1)

Published 3 Jun 2025 in cs.SD, cs.AI, and cs.GR

Abstract: In this paper, we present TalkingMachines -- an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio LLM with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, (c) elimination of redundant recomputations to maximize frame-generation throughput. Please see demo videos here - https://aaxwaz.github.io/TalkingMachines/

Summary

  • The paper introduces a novel framework that adapts an 18B parameter I2V diffusion transformer for audio-driven avatar generation.
  • It leverages a modified distillation technique with sparse causal attention to generate continuous video streams in real time.
  • System optimizations, including GPU disaggregation and non-blocking CUDA streams, achieve low latency for interactive video calls.

This paper introduces "TalkingMachines," a framework designed to convert pretrained video generation models into real-time, audio-driven character animators, suitable for applications like FaceTime-style video calls. The system aims to create natural conversational experiences by integrating an audio LLM with a novel video generation foundation model.

The core contributions of TalkingMachines are:

  1. Adaptation of a Pretrained Model: A state-of-the-art 18 billion parameter image-to-video (I2V) Diffusion Transformer (DiT) (WAN 2.1) is adapted to become an audio-driven avatar generation model.
  2. Infinite Video Streaming: An asymmetric knowledge distillation technique, modified from Distribution Matching Distillation (DMD) and CausVid, enables a sparse causal, autoregressive student model to generate continuous video streams without error accumulation.
  3. Real-Time Performance: A high-throughput, low-latency inference pipeline is achieved through several engineering optimizations, crucial for interactive applications.

Model Architecture and Adaptation

The system builds on the WAN 2.1 I2V model, which generates video from an image and a text prompt. Key modifications for audio-driven animation include:

  • Audio Cross Attention with Attention Mask: New cross-attention layers are added in which the latent frame tokens attend to audio tokens. Local attention masks that focus on facial regions improve lip-sync accuracy, similar to MagicInfinite [yi2025magicinfinite] (see the sketch after this list).
  • Audio Projection Layer: A 1.2 billion parameter module processes raw audio token embeddings before they enter the cross-attention layers. Inspired by Hallo3 [cui2024hallo3], audio tokens are temporally aligned with video frames using a local window (five latent frames centered around the current frame).
  • Speaking/Silence Mode: The system detects if a face in a training clip is speaking or silent. For silent segments, a special embedding (e.g., zero embedding) replaces the audio embedding. This is determined using a lip-sync evaluation model like SyncNet [chung2017out]. During inference, this allows characters to appear to be listening when appropriate and supports multi-character interactions by switching modes based on speech turns.
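
As a concrete illustration of how these pieces could fit together, the following PyTorch sketch shows an audio cross-attention block with a facial-region mask and a silence-mode embedding. It is a minimal approximation under assumed shapes and names (AudioCrossAttention, face_mask, the learned silence token), not the paper's implementation; in particular, the facial-region mask is approximated by gating the audio contribution to face-region tokens rather than masking attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioCrossAttention(nn.Module):
    """Illustrative audio cross-attention block; dims and names are assumptions."""
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)       # queries from latent video-frame tokens
        self.kv = nn.Linear(dim, 2 * dim)  # keys/values from audio tokens
        self.out = nn.Linear(dim, dim)
        # Special "silence" embedding used when the character is not speaking
        # (the paper mentions e.g. a zero embedding; a learned token is one option)
        self.silence_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, video_tokens, audio_tokens, face_mask=None, is_speaking=True):
        # video_tokens: (B, N_video, dim), audio_tokens: (B, N_audio, dim)
        # face_mask:    (B, N_video) bool, True for tokens in the facial region
        if not is_speaking:
            audio_tokens = self.silence_token.expand(video_tokens.size(0), -1, -1)

        B, Nv, D = video_tokens.shape
        H, Dh = self.num_heads, D // self.num_heads
        q = self.q(video_tokens).view(B, Nv, H, Dh).transpose(1, 2)
        k, v = self.kv(audio_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, H, Dh).transpose(1, 2)
        v = v.view(B, -1, H, Dh).transpose(1, 2)

        attn = F.scaled_dot_product_attention(q, k, v)   # (B, H, Nv, Dh)
        attn = attn.transpose(1, 2).reshape(B, Nv, D)
        out = self.out(attn)

        if face_mask is not None:
            # Approximate the local attention mask by letting audio influence
            # only facial-region tokens
            out = out * face_mask.unsqueeze(-1).to(out.dtype)
        return video_tokens + out


# Example with hypothetical shapes: 1560 video tokens, 10 audio tokens per window
layer = AudioCrossAttention(dim=1024, num_heads=16)
video = torch.randn(2, 1560, 1024)
audio = torch.randn(2, 10, 1024)
mask = torch.zeros(2, 1560, dtype=torch.bool)
mask[:, :200] = True                      # pretend the first 200 tokens cover the face
print(layer(video, audio, mask).shape)    # torch.Size([2, 1560, 1024])
```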

Asymmetric Distillation for Autoregressive Generation

To enable real-time streaming, the bidirectional teacher model is distilled into an autoregressive student model using a modified CausVid [yin2024slow] approach. The video (21 latent frames from 81 video frames) is divided into 7 chunks of 3 latent frames each. The student model processes these chunks sequentially.

Sparse Causal Attention Pattern:

A token in chunk $c_t$ attends to:

  • All tokens within its own chunk $c_t$ (bidirectional attention).
  • All tokens in the preceding chunk $c_{t-1}$.
  • All tokens in the initial chunk $c_0$ (which contains the ground-truth reference image).

This pattern, $K, V \in \{c_0, c_{t-1}, c_t\}$, ensures temporal continuity and access to the clean reference, preventing error drift.
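
The chunk-level pattern can be made concrete with a small mask-building sketch. The 7-chunk-by-3-latent-frame split comes from the paper; the function and its token granularity (one token per latent frame, ignoring spatial tokens) are illustrative assumptions.

```python
import torch

def sparse_causal_mask(num_chunks: int = 7, tokens_per_chunk: int = 3) -> torch.Tensor:
    """Boolean attention mask for the sparse causal pattern: a token in chunk
    c_t may attend to chunks {c_0, c_{t-1}, c_t}. One "token" here stands for
    one latent frame; in practice each latent frame expands to many spatial
    tokens, but the chunk-level pattern is the same."""
    n = num_chunks * tokens_per_chunk
    mask = torch.zeros(n, n, dtype=torch.bool)
    for t in range(num_chunks):
        rows = slice(t * tokens_per_chunk, (t + 1) * tokens_per_chunk)
        allowed = {0, max(t - 1, 0), t}               # c_0, c_{t-1}, c_t
        for a in allowed:
            cols = slice(a * tokens_per_chunk, (a + 1) * tokens_per_chunk)
            mask[rows, cols] = True                   # bidirectional within each allowed chunk
    return mask

# Example: 21 latent frames split into 7 chunks of 3, as in the paper
mask = sparse_causal_mask(7, 3)
print(mask.shape)        # torch.Size([21, 21])
print(mask[6:9].int())   # chunk c_2 attends only to c_0, c_1, and itself
```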

The model is distilled from 24 NFEs (number of function evaluations: 12 denoising steps x 2 with classifier-free guidance) down to 2 NFEs using Distribution Matching Distillation (DMD). Key modifications to DMD include:

  1. Mixed Training Data: A combination of real video clips and synthetic samples (generated by the student model using an image-audio pair with sparse causal attention) is used for training. This is done progressively, introducing synthetic data after initial convergence on real data.
  2. Sparse Causal Attention: Applied during student model training as described above.
  3. Regression Loss: Added to the DMD loss to stabilize training, applied to the student model's predictions.

The overall DMD training workflow is visualized in Figure 2 of the paper, showing how the bidirectional teacher guides the autoregressive student, incorporating synthetic data generation and the combined loss functions.
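
A hedged sketch of how such a combined objective might look is below. The score-difference approximation of the DMD gradient, the simple linear re-noising, and the weighting lambda_reg are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, real_score, fake_score, batch, lambda_reg: float = 1.0):
    """One illustrative distillation step combining a DMD-style term with the
    regression loss added for stability. All components here are simplified
    stand-ins for the paper's training setup."""
    noisy_latents, audio_cond, ref_image, clean_target = batch

    # Few-step (2 NFE) student prediction, computed with sparse causal attention
    student_pred = student(noisy_latents, audio_cond, ref_image)

    # Re-noise the student sample (simple interpolation toward Gaussian noise,
    # standing in for the model's actual noise schedule)
    t = torch.rand(student_pred.size(0), device=student_pred.device).view(-1, 1, 1, 1)
    renoised = (1 - t) * student_pred + t * torch.randn_like(student_pred)

    # Distribution-matching gradient: difference between the frozen "real" score
    # model (teacher) and the "fake" score model tracking the student's outputs,
    # injected through a detached MSE target
    with torch.no_grad():
        grad = real_score(renoised, t.flatten()) - fake_score(renoised, t.flatten())
    dmd_loss = 0.5 * F.mse_loss(student_pred, (student_pred - grad).detach())

    # Regression loss on the student prediction, added to stabilize training
    reg_loss = F.mse_loss(student_pred, clean_target)

    return dmd_loss + lambda_reg * reg_loss
```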

System Optimizations for Real-Time Performance

Several engineering optimizations are crucial for achieving low latency:

  • Score-VAE Disaggregation: The diffusion model (score model) and the VAE decoder are run on separate devices (GPUs). Traditionally, VAE decoding was considered negligible, but with optimized Transformers, its latency becomes significant, especially in real-time contexts.
    • A single GPU performing both diffusion and VAE decoding struggles with real-time performance.
    • By dedicating one GPU to diffusion (worker) and another to VAE decoding (master), the system overlaps computation and improves throughput. Scaling to multiple diffusion workers with a single decoder further enhances performance.
    • The "Time Taken Between Chunks" (TTBC) metric is used to evaluate this. Disaggregated setups (Cases 3 and 4 in Figure 3) consistently meet real-time thresholds, unlike self-contained servers.
  • Efficient Computation-Communication Overlap with CUDA Streams: To manage data transfers between the master (decoder) and worker (diffusion) GPUs in the disaggregated setup, NCCL collectives are used. Additional CUDA streams make these transfers non-blocking, allowing VAE decoding to proceed in parallel with communication and improving inference throughput (see the sketch after this list).
  • KV Embedding Caching: Key-value (KV) pairs for chunks $c_{t-1}$ and $c_0$ are cached for each timestep across all transformer blocks. Other embeddings like timestep, guidance, and context embeddings are also cached to reduce redundant computations during inference.
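
The overlap on the decoder (master) GPU could look roughly like the following PyTorch sketch. All arguments (vae, first_chunk, latent_shape, src_rank, stream_frames) are placeholders for the real components, and the actual system uses NCCL collectives with further tuning; this only illustrates issuing the receive on a side CUDA stream while decoding proceeds on the default stream.

```python
import torch
import torch.distributed as dist

def decoder_loop(vae, first_chunk, num_chunks, latent_shape, src_rank, stream_frames):
    """Illustrative master-GPU loop: overlap receiving the next latent chunk from a
    diffusion worker with VAE decoding of the current chunk. All arguments are
    placeholders for the real components (VAE decoder, WebRTC sink, etc.)."""
    copy_stream = torch.cuda.Stream()                        # side stream for transfers
    recv_buf = torch.empty(latent_shape, device="cuda", dtype=torch.bfloat16)

    current = first_chunk                                    # latents already on this GPU
    for _ in range(num_chunks - 1):
        # Issue the point-to-point receive on the side stream so it does not
        # serialize with decoding on the default stream
        with torch.cuda.stream(copy_stream):
            work = dist.irecv(recv_buf, src=src_rank)

        with torch.no_grad():
            frames = vae.decode(current)                     # overlaps with the transfer
        stream_frames(frames)                                # hand frames to the streaming layer

        work.wait()                                          # ensure the transfer completed
        torch.cuda.current_stream().wait_stream(copy_stream) # order default stream after the copy
        current = recv_buf.clone()                           # take a copy before reusing the buffer

    # Decode and emit the final chunk
    with torch.no_grad():
        stream_frames(vae.decode(current))
```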

Datasets and Training

  • Dataset: 1.5 million high-quality, human-centric video clips. Clips are filtered for aesthetics, motion, and excessive text (using OCR). Minimum duration is 4 seconds, featuring diverse speaker poses and distances. Most clips have a single speaker.
  • Target Resolution: 512x512, chosen for faster inference and training compared to the original model's 480x832, while being suitable for desktop and mobile.

Training Stages:

  1. Stage 1: WAN 2.1 Pretrained Model Warm-up (9k steps, 128 H100s for 1 day): Adapts the model to the new 512x512 resolution and human-centric data. All layers are trained.
  2. Stage 2: Audio Pretraining (30k steps, 384 H100s for 5 days): Trains the new audio layers for lip-sync. Non-audio parameters are frozen.
  3. Stage 3: Sparse Autoregressive Distillation (20k steps, 128 H100s for 10 days): Distills the model to 2 diffusion steps with sparse causal attention. All layers are trained.

Training Infrastructure and Strategies

  • At 512x512, the DiT model requires ~204GB GPU memory for parameters/gradients/optimizers (bfloat16) and terabytes for activations. Activation checkpointing and parameter sharding allow fitting within individual GPU memory, enabling data-parallel training.
  • Auxiliary models (VAE, encoders) also add memory overhead. DeepSpeed ZeRO Stage 3 is used to shard these encoders, freeing ~20GB GPU memory.
  • Stages 1 & 2: DeepSpeed ZeRO Stage 2 on H100 GPUs with GPUDirect-TCPX.
  • Stage 3 (DMD): Requires three full copies of the diffusion model (~450GB GPU memory).
    • ZeRO Stage 3 scaled poorly due to global parameter sharding communication.
    • Hybrid Sharded Data Parallel (HSDP) with PyTorch FSDP was implemented, sharding parameters/gradients/optimizers within localized GPU groups, reducing cross-node communication (a minimal sketch follows this list).
    • Training on H200 GPUs with InfiniBand showed a 2x speedup due to faster inter-node bandwidth and fitting all tensors within a single node.
  • Higher Resolution (720x720): Activation memory exceeds single GPU capacity. Sequence Parallelism (SP) is used as part of a 2D parallelism strategy (sharding model states via ZeRO/HSDP + splitting computation/activations via SP).
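
A minimal sketch of the HSDP wrapping with PyTorch FSDP is shown below, assuming a build_dit() constructor and a torchrun launch; the paper's actual wrapping policy, precision settings, and process-group layout may differ.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_with_hsdp(build_dit):
    """Wrap the DiT with hybrid sharding: shard model states within a node and
    replicate across nodes to cut cross-node communication. `build_dit` is a
    placeholder constructor for the 18B-parameter model."""
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = build_dit().cuda()
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16),
        device_id=torch.cuda.current_device(),
    )
```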

Distillation Ablations

An ablation study (Table 1) compared chunk sizes (3 vs. 7 latent frames) and diffusion steps (2 vs. 4) using FVD, Sync-C, and Sync-D metrics.

  • Lip-sync quality (Sync-C, Sync-D) was relatively consistent.
  • FVD (perceptual quality) showed slight degradation with smaller chunk size (3) and fewer steps (2).
  • Computational Cost:
    • Chunk 7, 4 steps: 4 H100s (score) + 1 H100 (VAE) - Best FVD.
    • Chunk 7, 2 steps: 2 H100s (score) + 1 H100 (VAE).
    • Chunk 3, 4 steps: 2 H100s (score) + 1 H100 (VAE).
    • Chunk 3, 2 steps: 1 H100 (score) + 1 H100 (VAE) - Most efficient.

This allows choosing a configuration based on budget versus quality. The chunk-size-3, 2-step setup offers a good balance for resource-constrained scenarios.

Applications and System Architecture

A real-time FaceTime-style application demonstrates TalkingMachines:

  • Audio LLM Integration: Mainstream audio LLMs generate spoken responses.
  • Video Generation Server: TalkingMachines on H100 GPUs (disaggregated score model and VAE).
  • WebRTC Streaming: LiveKit handles real-time video streaming to web clients (desktop/mobile).

Workflow: User audio -> Audio LLM -> Conversational response (audio) -> TalkingMachines server -> Synchronized video frames -> WebRTC -> Client.
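
A schematic version of this pipeline, with every service replaced by a stub (the real system uses an audio LLM, the TalkingMachines video server, and LiveKit WebRTC), might look like:

```python
import asyncio

# Every function below is a stub standing in for a real service.
async def audio_llm_respond(user_audio: bytes) -> bytes:
    return b"tts-audio"                          # placeholder conversational reply

async def talking_machines_generate(response_audio: bytes):
    for chunk_id in range(3):                    # placeholder: chunks of video frames
        yield [f"frame-{chunk_id}-{i}" for i in range(12)]

async def conversation_loop(user_audio_chunks):
    for user_audio in user_audio_chunks:
        response_audio = await audio_llm_respond(user_audio)      # audio LLM reply
        async for frames in talking_machines_generate(response_audio):
            for frame in frames:
                print("publish to WebRTC:", frame)                # e.g., via LiveKit

asyncio.run(conversation_loop([b"hello"]))
```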

Limitations and Future Work

  • Late Audio Integration: Audio conditioning was introduced late in training, so the pretrained model didn't learn from large-scale audio-video data initially.
  • Limited Audio Data/Iterations: The large audio projection layers were trained on a relatively small dataset with limited iterations, potentially constraining scalability and expressiveness.
  • Future Work: Explore large-scale pretraining with audio conditioning from the start. Joint modeling of video and audio on massive paired datasets could improve lip-sync, multimodal representations, and performance across diverse scenarios.

In summary, TalkingMachines presents a practical framework for transforming large bidirectional video models into efficient, real-time, audio-driven animators. It combines model adaptation, novel distillation techniques, and crucial system-level optimizations to enable interactive applications like AI-powered video calls. The paper provides detailed insights into the architectural changes, training strategies, and engineering efforts required to achieve this.
