OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

Published 20 Apr 2026 in cs.CV | (2604.18326v1)

Abstract: Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.


Authors (9)

Summary

The paper introduces OmniHuman, a dataset with 1M videos, 1800 hours, and 80K identities that addresses gaps in scene diversity, interaction, and attribute alignment.
It details a fully automated hierarchical annotation pipeline using advanced detectors and pose estimation to ensure high spatial-temporal quality.
The study evaluates multiple generative models on OHBench, showing significant performance gains in audio-video synchronization and interaction realism through fine-tuning.

OmniHuman: Dataset and Benchmarking for Human-Centric Video Generation

Motivation and Dataset Deficiencies in Human-Centric Video Generation

Recent progress in audio-video joint generation models has advanced content creation, yet human-centric video generation in complex physical scenes remains difficult. Existing datasets are structurally limited along three axes: (i) global scene and camera diversity, (ii) interaction modeling (person-person and person-object), and (iii) individual attribute alignment. These limits hold back generalization and perceptual fidelity in systems that target real-world scenarios, and they lead to semantic mismatches and unstable generation artifacts.

Figure 1: OmniHuman: a 1M-video, 1800 hours, 80K-identity dataset with hierarchical annotations covering diverse natural scenes and social interactions.

OmniHuman: Hierarchical Dataset Construction and Annotation Pipeline

OmniHuman addresses these structural gaps by introducing a large-scale, richly annotated dataset with hierarchical labeling capable of supporting fine-grained human modeling at video, frame, and individual levels. The dataset spans 1 million videos, 1,800 hours, and 80,000 distinct identities, emphasizing multi-scene coverage and high-definition content. The fully automated data curation pipeline involves:

Robust filtering for spatial-temporal consistency, aesthetics, and motion quality.
High-fidelity subject tracking using advanced detectors (YOLOv11, MOTRv2) with track-loss mitigation and identity linking via ArcFace embeddings.
Comprehensive pose estimation (134 keypoints per instance) for full-body tracking, facial clarity scoring, and identity assignment through embedding similarity.
Audio processing using Demucs for multi-source separation, 3DSpeaker for speaker diarization, and SyncNet for cross-modal alignment to achieve high-precision audio-visual synchronization.
Hierarchical multi-modal caption generation with Qwen3-Omni, using two-stage inference with placeholder anchors, referential insertion, and strict validation checks to minimize hallucination and ensure attribute consistency.

Figure 2: OmniHuman employs a fully automated pipeline for high-quality data collection and fine-grained annotation, with each module applying progressive filtering to ensure both video quality and annotation accuracy.
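
The identity-linking step in the pipeline above can be illustrated with a small sketch. It assumes each tracked subject already has one face embedding (for example from an ArcFace-style model) and groups tracks greedily by cosine similarity. The function name, the greedy strategy, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def link_identities(track_embeddings, threshold=0.5):
    """Greedily assign each track to an identity by cosine similarity.

    track_embeddings: array-like of shape (n_tracks, dim), one embedding per track.
    threshold: hypothetical similarity cut-off (not a value from the paper).
    Returns a list with one integer identity id per track.
    """
    emb = np.asarray(track_embeddings, dtype=np.float64)
    # L2-normalise so that a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    centroids = []  # running sum of member embeddings, one entry per identity
    labels = []
    for vec in emb:
        if centroids:
            cents = np.stack(centroids)
            cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
            sims = cents @ vec
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Close enough to a known identity: join it and update its centroid.
                centroids[best] = centroids[best] + vec
                labels.append(best)
                continue
        # No sufficiently similar identity yet: open a new one.
        centroids.append(vec.copy())
        labels.append(len(centroids) - 1)
    return labels


# Toy 2-D embeddings: two tracks point roughly along x, two roughly along y.
tracks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
print(link_identities(tracks))
```

A production pipeline would more likely use a proper clustering method and a threshold tuned on held-out data; the greedy pass is only meant to make the embedding-similarity idea concrete.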

Statistical analysis of the dataset covers its scenario, content-type, resolution, and duration distributions, which the authors present as evidence of broad domain coverage.

Figure 3: Statistical analysis of the OmniHuman dataset composition.
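
To make the three annotation levels (video-level scene, frame-level interaction, individual-level attributes) more concrete, the sketch below models one annotated clip as a nested record. All class and field names are hypothetical and the released format may differ. The small consistency check is in the spirit of the validation step described for caption generation, not a reproduction of it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndividualAnnotation:
    identity_id: int   # hypothetical per-clip identity index
    attributes: str    # free-text appearance / attribute caption


@dataclass
class FrameInteraction:
    start_frame: int
    end_frame: int
    kind: str                # "person-person" or "person-object"
    participants: List[int]  # identity ids taking part in the interaction
    description: str


@dataclass
class VideoAnnotation:
    video_id: str
    scene: str  # video-level scene and camera description
    interactions: List[FrameInteraction] = field(default_factory=list)
    individuals: List[IndividualAnnotation] = field(default_factory=list)

    def dangling_participants(self) -> List[int]:
        """Return ids referenced by interactions but missing from `individuals`."""
        known = {person.identity_id for person in self.individuals}
        referenced = {pid for it in self.interactions for pid in it.participants}
        return sorted(referenced - known)


record = VideoAnnotation(
    video_id="clip-0001",
    scene="street market, handheld camera",
    individuals=[
        IndividualAnnotation(0, "adult in a red jacket"),
        IndividualAnnotation(1, "vendor wearing an apron"),
    ],
    interactions=[
        FrameInteraction(10, 80, "person-person", [0, 1], "customer talks to vendor"),
        FrameInteraction(90, 120, "person-object", [2], "someone picks up a basket"),
    ],
)
print(record.dangling_participants())  # ids with no individual-level entry
```

Keeping interactions as references to identity ids, rather than repeating attribute text, is one way to keep the levels consistent with each other.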

OHBench: Perception-Aligned Benchmarking for Audio-Video Generation

OmniHuman is complemented by OHBench, a benchmark suite designed for scientific diagnosis of human-centric audio-video synthesis. The benchmark is stratified into three evaluation levels—global, interactional, and individual—across seven diagnostic dimensions:

Global Level: Video quality (imaging quality, motion intensity, background plausibility), audio quality (Audiobox-derived aesthetics, distributional metrics), and multi-modal synchronization.
Interaction Level: Evaluation of person-person and person-object interactions, including social naturalness, audio-visual assignment accuracy, identity drift, object consistency, and contact realism.
Individual Level: Assessment of subject-video attributes (ID fidelity, attribute consistency, lip sync) and subject-audio attributes (pronunciation accuracy via WER, perceptual speech quality via DNSMOS OVRL).
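
Of the metrics listed above, word error rate (WER) has a standard definition: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below implements that textbook definition. The text normalization and the speech recognizer OHBench applies before scoring are not described in this summary, so none is assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is missing from the hypothesis.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.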

OHBench samples from OmniHuman while maintaining domain gaps, ensuring robust evaluation for high-level audio-visual tasks, speech-to-video generation, controllable editing, and downstream speech synthesis.

Figure 4: Distribution of subject categories, scene types, and shot types in OHBench.

Experimental Evaluation: Performance Landscape and Impact of OmniHuman

Systematic evaluation on OHBench covers both open-source (Universe-1, UniAVGen, Ovi, LTX-2, MOVA) and closed-source (Veo3.1, Wan2.5, Sora2, Kling2.6, SeedDance1.5-pro) models. Results demonstrate:

Closed-source models dominate in video quality, interaction rationality, and individual-level attributes, attributable to massive training datasets and post-training alignment paradigms.
Ovi exhibits strong cross-modal consistency and lip sync but lacks robust interaction modeling owing to limited training samples.
LTX-2 surpasses other open-source models in dyadic interaction and listener realism. Fine-tuning it on OmniHuman data yields substantial gains in audio quality (+25.9% KL, +12.3% FD, +11.9% AbS), multi-modal alignment (+25.0% T-A, +11.1% V-A), dynamic degree (+10.7%), and identity consistency (+6.1% IC, +4.8% IC*).
Open-source models show more balanced performance distributions and are less prone to domain artifacts, especially in global quality metrics.

Fine-tuning LTX-2 on only 20% of the OmniHuman data yields marked improvements across all evaluation axes, which supports the dataset’s value for strengthening open-source models in complex human-centric audio-video generation.
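
The percentage gains quoted in this section are easiest to read as relative changes against the pre-fine-tuning baseline, with the sign flipped for metrics where lower is better (typically the case for divergence and distance measures). This summary does not state the exact formula, so the helper below is an assumption about the convention, and the numbers in the usage lines are made-up illustrations rather than results from the paper.

```python
def relative_gain(baseline: float, tuned: float, higher_is_better: bool = True) -> float:
    """Relative improvement in percent; positive always means 'better'.

    This sign convention is an assumption, not one confirmed by the source.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    change = (tuned - baseline) / abs(baseline)
    if not higher_is_better:
        change = -change  # a drop in a lower-is-better metric counts as a gain
    return 100.0 * change


# Illustrative values only: a lower-is-better score falling from 2.0 to 1.5.
print(relative_gain(2.0, 1.5, higher_is_better=False))
# Illustrative values only: a higher-is-better score rising from 0.50 to 0.55.
print(round(relative_gain(0.50, 0.55), 1))
```

Normalizing the sign this way lets gains on different metrics be listed side by side, as the results above do, without the reader needing to recall each metric's direction.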

Figure 5: Performance distribution of 10 models across seven dimensions on OHBench for the audio-video joint generation task.

Practical and Theoretical Implications

OmniHuman and OHBench collectively offer:

A scalable blueprint for human-centric video generation, bridging structural data gaps and providing robust perceptual alignment for benchmarking and evaluation.
Empirical evidence supporting transfer learning and fine-tuning of generative models on high-quality, domain-diverse annotated datasets.
The potential for advancing multimodal generation systems beyond single-subject, controlled settings to complex social and physical scenarios, with implications for downstream applications in interactive media, synthetic film, and embodied AI.
Perceptual metrics and a hierarchical evaluation framework suited to rigorous ablation studies, generalization tests, and systematic model comparison.

Ongoing development is anticipated in expanding scenario diversity, improving modeling of distant views and multi-agent interactions, and enhancing perceptual alignment via crowd-sourcing or expert evaluation.

Conclusion

OmniHuman, paired with OHBench, marks a significant advance in the systematic modeling and evaluation of human-centric video generation. Its hierarchical annotation, automated curation, and perception-aligned metrics facilitate comprehensive diagnosis and reliably boost model performance with minimal fine-tuning. The methodologies and findings set a foundation for future multimodal generative research, promising robust generalization in real-world, interaction-rich scenarios (2604.18326).
