Human-Centric Video Generation
- Human-Centric Video Generation (HCVG) is the synthesis of videos focused on human appearance, identity, and dynamic motion using multimodal cues.
- The HuMo framework leverages a Diffusion Transformer with minimal-invasive image injection and cross-attention modules to ensure subject consistency and audio-visual sync.
- Progressive multitask training with time-adaptive classifier-free guidance and curated dataset pipelines enables precise control and high fidelity in video synthesis.
Human-Centric Video Generation (HCVG) refers to the synthesis of videos where human appearance, identity, and motion are central, often driven by multimodal control signals such as text, reference images, and audio. The goal is to generate temporally coherent, visually realistic human videos in which both the subject’s visual attributes and dynamic behaviors are accurately controlled and synchronized across modalities. HCVG poses significant challenges due to the complexity and variability of human motion, the subtleties of identity preservation, the need for flexible multimodal conditional control, and the difficulty of acquiring paired datasets for joint training.
1. Problem Definition and Framework Architecture
HCVG addresses the task of generating human-centric videos from various control modalities: text descriptions (semantic control over appearance and motion), reference images (subject identity and appearance), and audio (driving speech and lip synchronization). The core technical problem is coordinating these heterogeneous modalities for fine-grained, subject-consistent, and synchronously animated human video synthesis.
HuMo introduces a unified, collaborative multimodal conditioning framework that simultaneously ingests and fuses text, reference image, and audio controls. Built on a Diffusion Transformer (DiT) backbone, the HuMo architecture is modular, permitting operation with any subset of the three modalities (text–image, text–audio, or all three jointly).
The principal pipeline is:
- Input Triplet: (1) a free-form text prompt $c_{\text{txt}}$, (2) a reference image $c_{\text{img}}$, and (3) the corresponding speech audio $c_{\text{aud}}$
- Encoder Stack: Each modality is encoded independently—text with a Transformer, image with a VAE, and audio with a pretrained audio model.
- Latent Concatenation: The VAE-encoded reference-image latent $z_{\text{img}}$ is concatenated (via the minimal-invasive injection strategy) with the noisy video latent $z_t$.
- Cross-Attention Modules: Audio features are injected into the DiT blocks through dedicated cross-attention layers; a mask-predictor module focuses this attention on facial/lip regions for synchronization (a minimal fusion sketch in code follows this list).
- Progressive Task-Weighted Training: Training is staged, learning subject preservation first (text–image) and then adding audio-visual synchronization (text–image–audio), with a curriculum that smoothly ramps up the weight of the audio-driven task.
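A minimal PyTorch-style sketch of this fusion is shown below, assuming illustrative module names (`HCVGBlock`, `fuse_conditions`) and toy tensor shapes; it is not HuMo's implementation, only an instance of the concatenate-then-cross-attend design described above.

```python
import torch
import torch.nn as nn

class HCVGBlock(nn.Module):
    """One transformer block: self-attention over video tokens (plus the injected
    reference-image tokens) and cross-attention to text and audio features.
    Layout and dimensions are illustrative, not HuMo's exact design."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # added for audio sync
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x, text_tokens, audio_tokens):
        # Self-attention over the concatenated [reference-image + video] tokens.
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention to text tokens (semantic control).
        h = self.norms[1](x)
        x = x + self.text_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        # Cross-attention to audio tokens (lip/facial synchronization).
        h = self.norms[2](x)
        x = x + self.audio_attn(h, audio_tokens, audio_tokens, need_weights=False)[0]
        return x + self.mlp(self.norms[3](x))


def fuse_conditions(video_latent, image_latent, text_tokens, audio_tokens, blocks):
    """Minimal-invasive injection: the reference-image latent is simply concatenated
    to the video latent along the token axis, so no new parameters touch the
    pretrained text-to-video pathway."""
    x = torch.cat([image_latent, video_latent], dim=1)   # (B, T_img + T_vid, D)
    for blk in blocks:
        x = blk(x, text_tokens, audio_tokens)
    return x[:, image_latent.shape[1]:]                  # keep only the video tokens


# Toy usage with random tensors standing in for encoder outputs.
B, D = 2, 512
video_latent = torch.randn(B, 64, D)    # VAE-encoded noisy video tokens
image_latent = torch.randn(B, 4, D)     # VAE-encoded reference-image tokens
text_tokens  = torch.randn(B, 32, D)    # text-encoder output
audio_tokens = torch.randn(B, 48, D)    # pretrained audio-model features
blocks = nn.ModuleList([HCVGBlock(D) for _ in range(2)])
out = fuse_conditions(video_latent, image_latent, text_tokens, audio_tokens, blocks)
print(out.shape)  # torch.Size([2, 64, 512])
```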
2. Multimodal Dataset Construction
A critical challenge in HCVG is the lack of large-scale, paired triplet datasets (text, image, audio). HuMo addresses this via a two-stage pipeline:
- Stage 1 (Text/Image): Starting from large video corpora (e.g., Koala-36M, OpenHumanVid), detailed, contextual text prompts are generated with vision–language models. Reference frames are selected by sparse frame sampling and then identity-matched against a massive image corpus, ensuring variation in pose, clothing, or background while preserving subject identity (a toy sketch of this matching step follows this subsection).
- Stage 2 (Adding Audio): Speech segment detection and enhancement algorithms extract high-quality, lip-synchronized audio segments with precisely aligned facial regions.
This results in a dataset in which each sample provides (i) a detailed text prompt, (ii) reference image(s), and (iii) temporally aligned speech audio—enabling robust training and evaluation of collaborative control.
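The identity-matching step in Stage 1 can be illustrated with a toy sketch: sparse frame sampling plus cosine-similarity matching of face embeddings. The thresholds (`same_id_thresh`, `max_scene_sim`) and the embedding model are assumptions for illustration, not values reported for HuMo.

```python
import numpy as np

def sparse_sample_frames(num_frames: int, k: int = 8) -> list[int]:
    """Pick k evenly spaced frame indices from a clip (sparse frame sampling)."""
    return np.linspace(0, num_frames - 1, k).round().astype(int).tolist()

def identity_match(ref_embed: np.ndarray,
                   corpus_embeds: np.ndarray,
                   same_id_thresh: float = 0.6,
                   max_scene_sim: float = 0.9) -> list[int]:
    """Return indices of corpus images judged to show the same identity as the
    reference face embedding, while discarding near-duplicates so the kept
    candidates vary in pose, clothing, or background."""
    # Cosine similarity between the reference embedding and every candidate.
    ref = ref_embed / np.linalg.norm(ref_embed)
    cand = corpus_embeds / np.linalg.norm(corpus_embeds, axis=1, keepdims=True)
    sims = cand @ ref
    # Same person (high face similarity) but not an exact duplicate frame.
    return [i for i, s in enumerate(sims) if same_id_thresh <= s <= max_scene_sim]

# Toy usage with random embeddings standing in for a face-recognition model.
rng = np.random.default_rng(0)
ref = rng.normal(size=512)
corpus = rng.normal(size=(1000, 512))
print(sparse_sample_frames(240, k=6))
print(len(identity_match(ref, corpus)))
```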
3. Progressive Multitask Training and Modal Collaboration
HuMo is trained in two progressive stages with explicit task separation and curriculum weighting:
- Stage 1: Subject Preservation (Text–Image)
- The model learns to generate temporally coherent, high-fidelity videos conditioned on text and reference images.
- The minimal-invasive image injection strategy concatenates the reference-image latent $z_{\text{img}}$ to the video latent $z_t$, and only the self-attention modules are fine-tuned. This prevents disruption of the pretrained model's text–visual alignment capacity.
- Stage 2: Audio-Visual Synchronization (Joint Modal)
- Audio conditioning is introduced via dedicated cross-attention layers and a focus-by-predicting strategy, which guides the network to align audio-driven changes with predicted facial (especially lip) regions.
- A progressive task-weighting curriculum (shifting from roughly [80% subject preservation, 20% audio–visual sync] to [50%/50%]) is used to avoid catastrophic forgetting of subject preservation during the introduction of audio fusion.
- The focus-by-predicting mechanism attaches a mask predictor to the final DiT block. The predicted face-region mask $\hat{M}$ is supervised by the ground-truth mask $M$ with a size-aware binary cross-entropy of the form
  $$\mathcal{L}_{\text{mask}} = -\frac{1}{hw}\sum_{i=1}^{h}\sum_{j=1}^{w} \lambda_{ij}\Big[M_{ij}\log \hat{M}_{ij} + (1 - M_{ij})\log\big(1 - \hat{M}_{ij}\big)\Big],$$
  where $h$ and $w$ are the latent spatial dimensions and $\lambda_{ij}$ up-weights face pixels in inverse proportion to the face-region area.
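A sketch of both components is given below: a size-aware BCE in which face pixels are up-weighted inversely to the face area (one common reading of "size-aware"; HuMo's exact weighting is not reproduced here), and the linear 80/20-to-50/50 curriculum described above.

```python
import torch

def size_aware_bce(pred_mask: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Size-aware binary cross-entropy over (B, h, w) latent-resolution masks.

    Face pixels are up-weighted in inverse proportion to the face area so that
    small faces are not drowned out by the background term."""
    b = gt_mask.shape[0]
    face_frac = gt_mask.flatten(1).mean(dim=1).clamp_min(1e-6).view(b, 1, 1)
    pos_weight = (1.0 - face_frac) / face_frac            # inverse-frequency weight
    lam = torch.where(gt_mask > 0.5, pos_weight.expand_as(gt_mask),
                      torch.ones_like(gt_mask))
    p = pred_mask.clamp(1e-6, 1.0 - 1e-6)
    bce = -(lam * (gt_mask * p.log() + (1.0 - gt_mask) * (1.0 - p).log()))
    return bce.mean()

def task_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Linear curriculum from (0.8 subject preservation, 0.2 audio sync)
    to (0.5, 0.5) over Stage 2, as described above."""
    t = min(step / max(total_steps, 1), 1.0)
    w_audio = 0.2 + 0.3 * t
    return 1.0 - w_audio, w_audio

# Toy check with a random predicted mask and a sparse ground-truth mask.
pred = torch.rand(2, 16, 16)
gt = (torch.rand(2, 16, 16) > 0.9).float()
print(size_aware_bce(pred, gt).item(), task_weights(500, 1000))
```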
4. Flexible and Fine-Grained Control at Inference
For inference-time multimodal adaptability, HuMo implements a time-adaptive Classifier-Free Guidance (CFG) mechanism:
- Modality-specific guidance strengths $w_{\text{txt}}(t)$, $w_{\text{img}}(t)$, and $w_{\text{aud}}(t)$ are introduced, one per conditioning modality.
- At each denoising step, conditional and unconditional velocity predictions are combined, with each modality's guidance term scaled by its weight:
  $$\hat{v}_\theta(z_t) = v_\theta(z_t, \varnothing) + \sum_{m \in \{\text{txt},\,\text{img},\,\text{aud}\}} w_m(t)\,\big[v_\theta(z_t, c_m) - v_\theta(z_t, \varnothing)\big]$$
- Early in denoising, structure is dominated by text/image; later steps increase audio’s relative influence, tuning fine-grained synchronization properties without degrading subject fidelity.
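The sketch below implements the per-modality combination written above through a hypothetical `model(z_t, t, txt=..., img=..., aud=...)` interface; the decomposition is one standard multi-condition CFG form, and the schedule in `audio_scale` is an illustrative assumption rather than HuMo's reported settings.

```python
import torch

def guided_velocity(model, z_t, t, c_txt, c_img, c_aud, w_txt, w_img, w_aud):
    """Multi-condition classifier-free guidance for one denoising step.

    Passing None (i.e., omitting a keyword) drops that condition, giving the
    "null" prediction. Each modality's conditional-minus-unconditional
    difference is scaled by its own guidance weight."""
    v_none = model(z_t, t)                     # fully unconditional prediction
    v_txt  = model(z_t, t, txt=c_txt)          # text-conditioned
    v_img  = model(z_t, t, img=c_img)          # reference-image-conditioned
    v_aud  = model(z_t, t, aud=c_aud)          # audio-conditioned
    return (v_none
            + w_txt * (v_txt - v_none)
            + w_img * (v_img - v_none)
            + w_aud * (v_aud - v_none))

def audio_scale(t: float, w_early: float = 1.0, w_late: float = 5.0) -> float:
    """Time-adaptive audio guidance: weak while coarse structure forms
    (early denoising, t near 1) and stronger near the end (t near 0).
    Schedule shape and values are illustrative assumptions."""
    return w_early + (w_late - w_early) * (1.0 - t)

# Toy usage with a dummy "model" that just mixes its inputs.
def dummy_model(z, t, txt=None, img=None, aud=None):
    out = (1.0 - t) * z
    for c in (txt, img, aud):
        if c is not None:
            out = out + 0.1 * c
    return out

z = torch.randn(1, 16, 512)
c_txt, c_img, c_aud = (torch.randn(1, 16, 512) for _ in range(3))
v = guided_velocity(dummy_model, z, 0.3, c_txt, c_img, c_aud,
                    w_txt=5.0, w_img=2.0, w_aud=audio_scale(0.3))
print(v.shape)  # torch.Size([1, 16, 512])
```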
In parallel, flow matching is used as the training objective:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\Big[\big\|\, v_\theta(z_t, t, c) - (\epsilon - z_0) \,\big\|_2^2\Big],$$
where $z_t = (1 - t)\,z_0 + t\,\epsilon$, $z_0$ is the clean video latent, $\epsilon \sim \mathcal{N}(0, I)$, and $c$ denotes the active set of conditions (text, image, and/or audio).
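A minimal training-step sketch of this objective, with a placeholder `model` interface and toy latents standing in for the DiT and the VAE-encoded video:

```python
import torch

def flow_matching_loss(model, z0: torch.Tensor, cond: dict) -> torch.Tensor:
    """One flow-matching training step on clean video latents z0.

    The noisy latent is z_t = (1 - t) * z0 + t * eps, and the network is trained
    to predict the velocity (eps - z0). `model(z_t, t, **cond)` is a placeholder
    interface, not HuMo's actual signature."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device).view(b, *([1] * (z0.dim() - 1)))  # per-sample timestep
    eps = torch.randn_like(z0)
    z_t = (1.0 - t) * z0 + t * eps
    target = eps - z0
    pred = model(z_t, t, **cond)
    return (pred - target).pow(2).mean()

# Toy usage with a dummy velocity model and random latents.
dummy = lambda z, t, **cond: torch.zeros_like(z)
z0 = torch.randn(2, 16, 64)                  # (batch, tokens, channels) stand-in
print(flow_matching_loss(dummy, z0, cond={}).item())
```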
5. Experimental Evaluation and Comparative Results
HuMo outperforms specialized state-of-the-art models on both subject preservation and audio-visual sync sub-tasks across varying scales (1.7B and 17B parameters):
- Subject Preservation Metrics: Higher Aesthetics (AES), Image Quality Assessment (IQA), Human Structure Plausibility (HSP), and Text–Video Alignment (TVA) scores, stronger identity metrics (ID-Cur, ID-Glink, CLIP-I, DINO-I), and more consistent frame-wise appearance (a sketch of a CLIP-I-style computation follows this list).
- Audio–Visual Sync Metrics: Better Sync-C (higher is better) and Sync-D (lower is better) scores, reflecting more accurate lip synchronization and facial articulation with the driving audio.
- Ablation Studies: Validate the necessity of progressive training, minimal-invasive injection (which avoids destructive updates or information bottlenecks), and the focus-by-predicting module for localized audio control.
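As an example of how a CLIP-I-style identity score can be computed (mean cosine similarity between CLIP image embeddings of the reference image and the generated frames), here is a sketch using the Hugging Face `transformers` CLIP API; the checkpoint name and the frame-averaging scheme are standard choices, not necessarily those used in HuMo's evaluation. DINO-I is computed analogously with a DINO image backbone.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_i(reference: Image.Image, frames: list[Image.Image],
           model_name: str = "openai/clip-vit-base-patch32") -> float:
    """CLIP-I style score: mean cosine similarity between the CLIP image
    embedding of the reference image and those of the generated frames."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(images=[reference] + frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize embeddings
    ref, gen = feats[0:1], feats[1:]
    return (gen @ ref.T).mean().item()

# Usage (paths are placeholders):
# score = clip_i(Image.open("reference.png"),
#                [Image.open(f"frame_{i}.png") for i in range(16)])
```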
Empirical results confirm that explicit collaboration between subject identity (reference image), semantic structure (text), and temporally localized cues (audio and facial mask) is essential for robust multimodal control in HCVG.
6. Theoretical Implications and Future Directions
HuMo’s methodology demonstrates that:
- Progressive, curriculum-based multitask training is instrumental in avoiding modal interference and catastrophic forgetting during joint conditional learning.
- Minimal-invasive feature injection combined with targeted, localized supervision (via mask prediction) retains pretrained backbone capabilities while allowing precise, scalable subject and motion control.
- Time-adaptive CFG enables practitioners to vary conditional strength dynamically, which is critical for tuning realism, prompt faithfulness, and cross-modal consistency in deployed systems.
Potential future directions include:
- Extension to additional modalities (e.g., gesture, depth, environment context) for richer interaction scenarios.
- Larger and more diverse multimodal datasets, especially with dense, event-aligned triplet supervision for real-world generalizability.
- Integration with human feedback or preference optimization for subjective perceptual improvement at scale (as in DPO paradigms).
7. Applications and Broader Impact
Advances embodied by HuMo directly facilitate:
- Digital Human Creation: Film, advertising, and social media applications requiring synthesis of new scenes, expressions, or performances from arbitrary prompts, visuals, and audio.
- Virtual Avatars and Telepresence: Realistic, responsive digital humans for conferencing, virtual assistants, or metaverse environments, with personalized appearance and expressive synchronization.
- Accessible Content and Creative Tools: Platforms for automatic dubbing, real-time animation, and content localization, empowering users to generate controllable, personalized human video content from minimal input.
The unified framework established by HuMo sets a new standard for collaborative, high-fidelity human video synthesis, with technical insights and practical strategies generalizable to next-generation multimodal and human-centric media generation systems (Chen et al., 10 Sep 2025).