- The paper introduces LLIA, a diffusion-based framework for low-latency, audio-driven portrait video generation.
- It employs a dynamic variable-length video generation strategy, few-step sampling, quantization, and pipeline parallelism to meet real-time performance demands.
- Experiments on high-end GPUs show real-time frame rates (up to 78 FPS at 384×384) with low initial latency, along with effective control over conversational states and facial expressions, supporting the framework's practical potential.
LLIA: Low-Latency Interactive Avatar Generation via Diffusion Models
This paper introduces LLIA (Low-Latency Interactive Avatars), a diffusion-based framework for real-time, audio-driven portrait video generation that addresses the speed, latency, and clip-duration constraints inherent in interactive avatar applications. The method focuses on minimizing response latency, achieving real-time throughput, and supporting natural conversational states and fine-grained expression control.
Key Components and Innovations
The LLIA framework combines several components to achieve these objectives. The model builds on the Stable Diffusion 1.5 (SD1.5) architecture and integrates an Appearance ReferenceNet, an Audio Layer, a Motion Module, and a Face Locator. To reduce response latency while preserving video quality, LLIA introduces a robust variable-length video generation training strategy. The system categorizes avatar states as listening, idle, or speaking, using class labels as conditional inputs for state control, and it enables explicit facial expression control by manipulating the reference image fed into the ReferenceNet.
Figure 1: Overview of the proposed method, highlighting novel modules for portrait animation, state determination via class labels, and dynamic video length adaptation.
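As a rough orientation, the sketch below shows one way the modules listed above could be composed around an SD1.5-style denoising UNet; the class, module, and argument names are illustrative placeholders, not the authors' implementation.

```python
# Illustrative composition of the conditioning modules around an SD1.5-style
# denoising UNet. All class, module, and argument names are placeholders.
import torch.nn as nn

class LLIADenoiser(nn.Module):
    def __init__(self, unet, reference_net, audio_layer, motion_module, face_locator):
        super().__init__()
        self.unet = unet                    # SD1.5 denoising backbone
        self.reference_net = reference_net  # appearance features from the reference image
        self.audio_layer = audio_layer      # cross-attention conditioning on audio embeddings
        self.motion_module = motion_module  # temporal attention for frame-to-frame coherence
        self.face_locator = face_locator    # encodes the face-region mask

    def forward(self, noisy_latents, timestep, ref_image, audio_emb, face_mask, state_label):
        conditions = {
            "reference": self.reference_net(ref_image),   # appearance guidance
            "audio": self.audio_layer(audio_emb),         # speech / lip-sync signal
            "face_region": self.face_locator(face_mask),  # spatial constraint on the face
            "state": state_label,                         # listening / idle / speaking label
        }
        hidden = self.motion_module(noisy_latents)        # temporal mixing across frames
        return self.unet(hidden, timestep, conditions)    # predict the noise residual
```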
Variable-Length Video Generation
The core innovation in LLIA is its variable-length video generation training strategy. Because response latency is critical for interactive avatars, shortening each generated video clip is essential. The paper introduces a two-phase training approach: the number of generated frames f is first held fixed so that the motion module and audio layer train effectively; once the model converges, f varies during training under the constraint f ≥ n, where n is the number of motion frames. This strategy lets the model generate clips of different lengths at inference time without accumulating errors. A minimal sketch of such a schedule follows.
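The sketch below outlines the two-phase length schedule described above; the warm-up length, candidate clip lengths, and function name are illustrative assumptions, not values from the paper.

```python
# Two-phase schedule for the number of generated frames f: fixed during
# warm-up, then sampled per iteration with f >= n (n = number of motion
# frames). All concrete values here are illustrative.
import random

def sample_clip_length(step, warmup_steps, fixed_f=16, n_motion=4, f_choices=(4, 8, 12, 16)):
    """Return the number of frames f to generate for this training step."""
    if step < warmup_steps:
        return fixed_f                  # phase 1: fixed length for stable convergence
    valid = [f for f in f_choices if f >= n_motion]
    return random.choice(valid)         # phase 2: dynamic length, always >= motion frames

# Example: lengths drawn over a toy schedule
lengths = [sample_clip_length(step, warmup_steps=1000) for step in (0, 500, 1500, 2000)]
print(lengths)  # e.g. [16, 16, 8, 12]
```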
To address the computational demands of diffusion models, the paper trains a consistency model from a DDPM-pretrained model, enabling few-step video generation. The model is further quantized to INT8, and the UNet and VAE are run concurrently via pipeline parallelism to accelerate inference.
Figure 2: Illustration of the inference pipeline parallelism, showcasing concurrent processing of UNet and VAE modules for reduced latency.
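The sketch below illustrates one way such a two-stage overlap could be scheduled: the UNet denoises clip i while a worker thread decodes clip i−1 through the VAE. The queue-based scheme and function names are assumptions for illustration, not the authors' implementation.

```python
# Two-stage pipeline sketch: UNet denoising and VAE decoding overlap instead
# of running back to back. denoise_unet, decode_vae, and emit_frames are
# placeholder callables.
import queue
import threading

def run_pipeline(audio_chunks, denoise_unet, decode_vae, emit_frames):
    latents_q = queue.Queue(maxsize=2)       # small buffer keeps latency low

    def vae_worker():
        while True:
            latents = latents_q.get()
            if latents is None:              # sentinel: no more clips
                break
            emit_frames(decode_vae(latents))  # stage 2: latents -> RGB frames

    worker = threading.Thread(target=vae_worker, daemon=True)
    worker.start()
    for chunk in audio_chunks:               # stage 1: few-step denoising per audio chunk
        latents_q.put(denoise_unet(chunk))
    latents_q.put(None)
    worker.join()
```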
Conversational States and Expression Control
LLIA incorporates class labels as conditional inputs to control avatar states (listening, idle, and speaking). The acoustic features of the audio input determine the specific state signal, allowing the model to interpret user intent and generate appropriate state transitions. Fine-grained facial expression control is achieved by applying a pre-trained portrait animation model to modify the expression in the reference image before feeding it into ReferenceNet.
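As an illustration of how a state signal might be derived from acoustic features, the snippet below maps short audio windows to a class label with a simple energy rule; the threshold and the decision logic are assumptions, not the paper's exact mechanism.

```python
# Toy rule for choosing the avatar state for the next clip from short audio
# windows. The energy threshold and the decision order are assumptions.
import numpy as np

SPEAKING, LISTENING, IDLE = 0, 1, 2

def state_from_audio(avatar_audio: np.ndarray, user_audio: np.ndarray,
                     energy_thresh: float = 1e-3) -> int:
    """Pick the avatar state label for the next generated clip."""
    if np.mean(avatar_audio ** 2) > energy_thresh:
        return SPEAKING    # TTS output is active -> the avatar speaks
    if np.mean(user_audio ** 2) > energy_thresh:
        return LISTENING   # the user is talking -> attentive listening motions
    return IDLE            # neither side is speaking
```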
Figure 3: The digital avatar performs a nodding response action in the listening state, controlled by the input class label prompt.
Figure 4: The model switches smoothly between various expressions according to different facial expressions in the input reference images.
Experimental Results and Analysis
The paper presents extensive experimental results demonstrating the effectiveness of LLIA. On an NVIDIA RTX 4090D, the model achieves a maximum of 78 FPS at a resolution of 384×384 and 45 FPS at 512×512, with initial video generation latencies of 140 ms and 215 ms, respectively. Qualitative results showcase the model's ability to generate diverse avatar styles and maintain high generation quality with few-step sampling.
Figure 5: The model demonstrates qualitative results based on different avatar styles, showing the versatility of the approach.
Figure 6: Portraits generated with 2-step, 4-step, and 25-step sampling, demonstrating comparable quality with reduced sampling steps.
Dataset
A high-quality dataset of over 100 hours of video was curated, comprising open-source data, web-collected videos, and synthetic data designed to improve performance in the listening and idle states. The dataset was filtered through a rigorous multistage pipeline combining automated processing and manual screening.
Implementation Details
The model was trained on 48 A100 GPUs, with Stable Diffusion v1.5 (SD1.5) weights used for initialization. Training employed random scaling of the input images, the AdamW optimizer, and an MSE loss. For efficient deployment, the model was quantized and served with pipeline parallelism; INT8 quantization delivered substantial speedups while preserving the original model's output quality.
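For concreteness, a minimal sketch of the optimization setup mentioned above (AdamW with an MSE objective on the predicted noise, in the standard diffusion-training form) is shown below; the scheduler interface, batch fields, and learning rate are placeholders rather than values reported in the paper.

```python
# Standard diffusion training step with an MSE loss on the predicted noise.
# `noise_scheduler` is a placeholder object exposing num_steps and add_noise.
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer, noise_scheduler):
    latents = batch["latents"]                            # VAE-encoded target frames
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.num_steps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    noise_pred = model(noisy_latents, timesteps, batch["conditions"])  # UNet noise prediction
    loss = F.mse_loss(noise_pred, noise)                  # MSE objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate is illustrative
```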
Conclusion
The LLIA framework presents a significant advancement in real-time audio-driven portrait video generation. By addressing the challenges of latency, computational cost, and interactivity, the method demonstrates the potential of diffusion models in time-sensitive, user-driven applications. The results indicate that diffusion-based digital avatars can achieve real-time interaction with low response latency, opening avenues for practical interactive applications.