
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (2502.01061v3)

Published 3 Feb 2025 in cs.CV

Abstract: End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).

Summary

  • The paper presents an omni-conditions training strategy that integrates text, audio, and pose to scale human animation models.
  • It employs the Diffusion Transformer architecture to generate high-quality videos with improved gesture generation and object interaction.
  • Experimental evaluations on 18.7K hours of data demonstrate enhanced natural motion, accurate lip-syncing, and robust performance across varied inputs.

The paper introduces OmniHuman, a novel end-to-end framework for generating realistic human animation videos conditioned on multiple modalities. Current end-to-end human animation models are limited by their inability to scale up with large and diverse datasets, which restricts their applicability. The authors address this limitation by introducing an omni-conditions training strategy that leverages mixed data with varying degrees of conditioning signals, such as text, audio, and pose. OmniHuman is based on the Diffusion Transformer (DiT) architecture and can generate high-quality human videos with improved gesture generation and object interaction.

The key contributions of this work are:

  • The OmniHuman model, a mixed-conditioned human video generation model, trained using an omni-conditions training strategy that integrates various motion-related conditions and their corresponding data.
  • A demonstration of highly realistic and vivid human motion video generation, supporting multiple modalities simultaneously, handling different portrait and input aspect ratios, and improving gesture generation.

The paper addresses the limitations of existing end-to-end human animation models, which struggle to scale up due to the need for highly filtered datasets. Raw training data often contains unrelated factors such as body poses, background motion, camera movement, and lighting changes that can negatively impact training. To overcome these challenges, the authors propose an omni-conditions training strategy that incorporates multiple conditioning signals during training, thus reducing data wastage. This approach offers two main advantages:

  • Data that would otherwise be discarded for single-condition models can be leveraged in tasks with weaker or more general conditions, such as text conditioning.
  • Different conditioning signals can complement each other. For example, stronger conditions such as pose inputs can provide additional guidance to audio data.

The omni-conditions training strategy follows two key principles:

  1. Stronger conditioned tasks can leverage weaker conditioned tasks and their corresponding data to achieve data scaling up during the model training process.
  2. The stronger the condition, the lower the training ratio that should be used.

The OmniHuman model is based on the DiT architecture and can train with three motion-related conditions (text, audio, and pose) from weak to strong. This approach addresses the data scaling up challenge in end-to-end frameworks, allowing the model to benefit from large-scale data training, learn natural motion patterns, and support various input forms.

The framework consists of two primary parts: the OmniHuman model, a multi-condition diffusion model, and the Omni-Conditions Training Strategy. The OmniHuman model begins with a pretrained Seaweed model, which uses MMDiT and is initially trained on general text-video pairs for text-to-video and text-to-image tasks. Given a reference image, the OmniHuman model aims to generate human videos using one or more driving signals including text, audio and pose. The model utilizes a causal 3DVAE to project videos at their native size into a latent space and employs flow matching as the training objective to learn the video denoising process. The method employs a three-stage mixed condition post-training approach to progressively transform the diffusion model from a general text-to-video model to a multi-condition human video generation model.
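The paper names flow matching as the training objective but does not detail its parameterization. Below is a minimal sketch of one common rectified-flow-style velocity-prediction loss under that assumption; `model` stands in for the (non-public) Seaweed/MMDiT denoiser, and 5D video latents of shape (B, C, T, H, W) are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, video_latents, cond):
    """One training step of a rectified-flow-style objective: the model predicts
    the velocity that transports Gaussian noise to the clean video latents.
    This specific parameterization is an assumption; the paper only states that
    flow matching is used to learn the denoising process."""
    b = video_latents.shape[0]
    noise = torch.randn_like(video_latents)
    # Sample a timestep t in [0, 1] per example, broadcastable over (B, C, T, H, W).
    t = torch.rand(b, device=video_latents.device).view(b, 1, 1, 1, 1)
    # Linear interpolation between noise (t = 0) and data (t = 1).
    x_t = (1 - t) * noise + t * video_latents
    target_velocity = video_latents - noise
    # cond carries the text/audio/pose/reference conditioning tokens.
    pred_velocity = model(x_t, t.flatten(), **cond)
    return F.mse_loss(pred_velocity, target_velocity)
```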

For injecting audio and pose conditions, the wav2vec model is employed to extract acoustic features, which are subsequently compressed using a multilayer perceptron (MLP) to align with the hidden size of MMDiT. The features of each frame are concatenated with the audio features from adjacent timestamps to generate audio tokens for the current frame. These audio tokens are injected into each block of MMDiT through cross-attention, enabling interaction between the audio tokens and the noisy latent representations. To incorporate pose condition, a pose guider is used to encode the driving pose heatmap sequence. The resulting pose features are concatenated with those of adjacent frames to acquire pose tokens. These pose tokens are then stacked with the noise latent along the channel dimension and fed into the unified multi-condition diffusion model for visual alignment and dynamic modeling. The text condition is retained as in the MMDiT text branch.
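A hedged sketch of the two injection paths described above: audio tokens entering each block via cross-attention, and pose tokens stacked with the noisy latents along the channel dimension. Module names, the hidden size 1152, and the layer choices (`AudioProjector`, `AudioCrossAttention`) are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Compress wav2vec acoustic features to the MMDiT hidden size with an MLP."""
    def __init__(self, wav2vec_dim=768, hidden_dim=1152):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(wav2vec_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, wav2vec_feats):        # (B, T_audio, wav2vec_dim)
        return self.mlp(wav2vec_feats)       # (B, T_audio, hidden_dim) audio tokens

class AudioCrossAttention(nn.Module):
    """Cross-attention inside a DiT block: noisy video tokens attend to audio tokens."""
    def __init__(self, hidden_dim=1152, num_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        out, _ = self.attn(query=video_tokens, key=audio_tokens, value=audio_tokens)
        return video_tokens + out            # residual injection of the audio signal

def add_pose_condition(noisy_latents, pose_tokens):
    """Stack encoded pose tokens with the noisy latents along the channel dimension."""
    return torch.cat([noisy_latents, pose_tokens], dim=1)   # (B, C_latent + C_pose, T, H, W)
```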

To preserve both the subject’s identity and the background details from a reference image, OmniHuman introduces a simple yet effective strategy for reference conditioning. Instead of constructing additional network modules, the method reuses the original denoising DiT backbone to encode the reference image. Specifically, the reference image is first encoded into a latent representation using a VAE, and both the reference and noisy video latents are flattened into token sequences. These sequences are then packed together and simultaneously fed into the DiT, enabling the reference and video tokens to interact via self-attention across the entire network. To help the network distinguish between reference and video tokens, the 3D Rotational Position Embeddings (RoPE) in the DiT are modified by zeroing the temporal component for reference tokens, while leaving the RoPE for video tokens unchanged. This approach effectively incorporates appearance conditioning without adding extra parameters.
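A sketch of how the reference and video latents might be packed into one token sequence with modified 3D RoPE indices, assuming one (t, h, w) position triple per token; the grid layout and helper names here are hypothetical, chosen only to illustrate the zeroed temporal component for reference tokens.

```python
import torch

def build_token_sequence(ref_tokens, video_tokens, ref_grid, video_grid):
    """Pack reference and video tokens into a single sequence and build 3D RoPE
    position indices (t, h, w). The temporal index of the reference tokens is
    zeroed, as described above, so self-attention can distinguish them from the
    video tokens; video positions are left unchanged."""
    def grid_positions(t, h, w):
        tt, hh, ww = torch.meshgrid(
            torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij")
        return torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)  # (T*H*W, 3)

    ref_pos = grid_positions(*ref_grid)
    ref_pos[:, 0] = 0                          # zero the temporal component
    vid_pos = grid_positions(*video_grid)      # video RoPE unchanged

    tokens = torch.cat([ref_tokens, video_tokens], dim=1)    # (B, N_ref + N_vid, D)
    positions = torch.cat([ref_pos, vid_pos], dim=0)         # (N_ref + N_vid, 3)
    return tokens, positions    # fed jointly through the DiT so both interact via self-attention
```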

The model training is divided into multiple tasks: image-and-text-to-video; image, text, and audio to video; and image, text, audio, and pose to video. After the conventional text-to-video pretraining phase, the method applies the two training principles to scale up the conditioned human video generation task. First, stronger-conditioned tasks leverage weaker-conditioned tasks and their corresponding data to achieve data scaling during training: data excluded from the audio- and pose-conditioned tasks by filtering criteria such as lip-sync accuracy, pose visibility, and stability can still be used in the text- and image-conditioned tasks, since it meets the standards for those weaker conditions. Second, stronger motion-related conditions such as pose generally train more easily than weaker conditions like audio because they are less ambiguous; when both are present, the model tends to rely on the stronger condition for motion generation, preventing the weaker condition from learning effectively. The method therefore assigns weaker conditions a higher training ratio than stronger conditions.
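One way this ratio principle could be realized is by stochastically activating conditions per training sample, with stronger conditions activated less often; the ratios below are illustrative placeholders, not the paper's actual values.

```python
import random

# Hypothetical activation ratios following the principle that stronger conditions
# (text < audio < pose in strength) receive lower training ratios. The actual
# values used in the paper are not reproduced here.
CONDITION_RATIOS = {"text": 0.9, "audio": 0.5, "pose": 0.25}

def sample_active_conditions(available):
    """Per training sample, independently keep or drop each available condition so
    that stronger conditions are seen less often, forcing the model to keep
    learning motion generation from the weaker signals."""
    active = {}
    for name in ("text", "audio", "pose"):
        if name in available and random.random() < CONDITION_RATIOS[name]:
            active[name] = available[name]
    return active  # dropped conditions would be replaced by null/empty embeddings downstream
```

Dropping a condition here plays the same role as training a weaker-conditioned task on that sample: the data point is never wasted, it simply contributes through whichever conditions remain active.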

During inference, to balance expressiveness and computational efficiency, classifier-free guidance (CFG) is applied specifically to audio and text across multiple conditions. The method proposes a CFG annealing strategy that progressively reduces the CFG magnitude throughout the inference process to mitigate issues such as pronounced wrinkles on the characters and compromised lip synchronization and motion expressiveness.
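The paper states only that the CFG magnitude is progressively reduced during inference; the linear schedule and scale endpoints below are assumptions used to illustrate the annealing idea.

```python
def annealed_cfg_scale(step, num_steps, scale_start=7.5, scale_end=1.5):
    """Linearly anneal the classifier-free guidance scale over the sampling
    trajectory. The schedule shape and endpoint values are assumptions; the
    paper only states that the CFG magnitude decreases as inference proceeds."""
    frac = step / max(num_steps - 1, 1)
    return scale_start + frac * (scale_end - scale_start)

def guided_velocity(model, x_t, t, cond, uncond, step, num_steps):
    """Apply CFG to the audio/text conditions: combine the conditional and
    unconditional predictions with the annealed guidance scale."""
    scale = annealed_cfg_scale(step, num_steps)
    v_cond = model(x_t, t, **cond)
    v_uncond = model(x_t, t, **uncond)
    return v_uncond + scale * (v_cond - v_uncond)
```

Starting with strong guidance preserves lip-sync accuracy and motion expressiveness early in sampling, while the lower scale at later steps reduces artifacts such as the pronounced wrinkles mentioned above.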

For evaluation, the authors collected 18.7K hours of human-related data for training, of which 13% was selected using lip-sync and pose visibility criteria and was therefore usable for the audio and pose modalities. The data composition was adjusted to fit the omni-conditions training strategy. Testing follows the evaluation protocols of the portrait animation method Loopy and the half-body animation method CyberHost. The results demonstrate that OmniHuman, as a single model, outperforms leading specialized models in both portrait and body animation tasks. For audio-driven animation, OmniHuman achieves the best results across all evaluated metrics, reflecting its overall effectiveness, and it also leads on nearly all metrics of the individual test sets.

Ablation studies were conducted to analyze the effects of the omni-condition training strategy. The results demonstrate that the ratio of audio condition-specific data training significantly affects the final performance. A high proportion of audio condition-specific data training reduces dynamic range and can cause failures with complex input images. Including weaker condition data at a 50% ratio yields satisfactory results (e.g., accurate lip-syncing and natural motion). However, excessive weaker condition data can hinder training, resulting in poorer correlation with the audio. When the model is trained with a low pose condition ratio and tested with only audio conditions, the model tends to generate intense, frequent co-speech gestures. On the other hand, if the model is trained with a high pose ratio, the model tends to rely on the pose condition to determine the human poses in the generated video. Consequently, given the input audio as the only driving signal, the generated results typically maintain a similar pose. In addition, a lower reference ratio leads to more pronounced error accumulation, characterized by increased noise and color shifts in the background, degrading performance. In contrast, a higher reference ratio ensures better alignment of the generated output with the quality and details of the original image.

In summary, OmniHuman leverages a mixed-data training strategy with multimodal motion conditioning, overcoming the scarcity of high-quality data faced by previous methods. It significantly outperforms existing approaches, producing highly realistic human videos from weak signals, especially audio. OmniHuman supports input images of any aspect ratio, whether portrait, half-body, or full-body, delivering lifelike, high-quality results across various scenarios.
