- The paper introduces a dataset with over 52.3 million video clips and 70.6K hours of high-resolution, human-centric content.
- It pairs clips with structured captions and additional motion conditions, drawing on DWpose for skeleton extraction, SyncNet for speech-audio alignment, and LoRA for fine-tuning diffusion transformer baselines.
- Pretraining on the dataset significantly improves video synthesis performance on metrics such as human motion alignment and facial consistency.
Overview of OpenHumanVid: Enhancing Human-Centric Video Generation
The paper "OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation" presents an innovative approach to addressing the scarcity of high-quality human-focused video datasets, which are pivotal for advancing video generation models. The authors introduce OpenHumanVid, a comprehensive dataset of substantial scale and quality, aimed at mitigating the inherent limitations faced by existing models in terms of texture and motion fidelity concerning human figures.
Dataset Composition and Characteristics
OpenHumanVid comprises over 52.3 million video clips, totaling 70.6 thousand hours of human-centric content sourced from films, television series, and documentaries. The dataset consists of high-resolution videos (720P or 1080P) paired with carefully constructed captions that describe human appearance and motion. It also includes additional motion conditions, namely human skeleton sequences extracted with DWpose and speech audio aligned using SyncNet.
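To make this structure concrete, the sketch below models one sample as a simple Python record. The field names and types are illustrative assumptions, not the dataset's released annotation schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HumanVidClip:
    """One human-centric clip with its paired annotations (illustrative fields only)."""
    video_path: str              # 720P or 1080P clip cut from film/TV/documentary footage
    caption: str                 # structured caption describing human appearance and motion
    skeleton: List[List[float]]  # per-frame 2D keypoints, e.g. produced by DWpose
    audio_path: Optional[str]    # aligned speech audio, checked with a SyncNet-style model
    duration_s: float            # clip length in seconds

def total_hours(clips: List[HumanVidClip]) -> float:
    """Aggregate clip durations into hours (the full dataset totals roughly 70.6K hours)."""
    return sum(c.duration_s for c in clips) / 3600.0
```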
Curating the dataset involves several stages of processing that filter for video and audio quality and enforce semantic alignment between text and video, with large language models such as LLaMA used to structure and align the captions.
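A minimal sketch of such a filtering cascade is shown below. The score names and thresholds are placeholders rather than the paper's actual criteria, but they illustrate how clips failing any quality or alignment check would be discarded.

```python
from typing import Dict, List

def filter_clips(
    scored_clips: List[Dict],
    min_video_quality: float = 0.5,
    min_sync_conf: float = 0.5,
    min_text_sim: float = 0.3,
) -> List[Dict]:
    """Cascade filtering sketch: each dict holds precomputed scores for one clip.
    Keys and thresholds are illustrative, not the paper's exact pipeline."""
    kept = []
    for clip in scored_clips:
        if clip["video_quality"] < min_video_quality:    # technical/aesthetic video quality
            continue
        if clip.get("sync_conf", 1.0) < min_sync_conf:   # SyncNet-style audio-visual sync confidence
            continue
        if clip["text_sim"] < min_text_sim:              # caption-video semantic similarity
            continue
        kept.append(clip)
    return kept

# Example: only the first clip passes all three checks.
clips = [
    {"video_quality": 0.8, "sync_conf": 0.9, "text_sim": 0.6},
    {"video_quality": 0.8, "sync_conf": 0.2, "text_sim": 0.6},
]
print(len(filter_clips(clips)))  # 1
```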
Methodology and Analysis
The authors extend standard diffusion transformer architectures with Low-Rank Adaptation (LoRA) to improve transfer and fine-tuning when training on OpenHumanVid. By further pretraining these models on the dataset, the researchers establish two principal findings: first, that high-quality data contributes significantly to producing more convincing and coherent human video outputs, and second, that close alignment between textual prompts and human appearance and motion is crucial for high-grade video synthesis. The empirical evaluation shows notable improvements in metrics such as human motion alignment and facial consistency, reinforcing the dataset's efficacy.
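As a rough illustration of this adaptation strategy, the snippet below shows a generic LoRA wrapper around a linear projection, as it might be applied to the attention layers of a diffusion transformer. The rank, scaling, and layer choice are assumptions for illustration and do not reproduce the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: frozen base projection plus a low-rank update
    scaled by alpha / rank. Dimensions and rank here are placeholders."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so training begins from the base model
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one attention projection of a transformer block.
proj = nn.Linear(1024, 1024)
lora_proj = LoRALinear(proj, rank=8)
out = lora_proj(torch.randn(2, 77, 1024))
print(out.shape)  # torch.Size([2, 77, 1024])
```

Only the two small low-rank matrices receive gradients, which keeps fine-tuning on a large corpus like OpenHumanVid relatively cheap while leaving the pretrained weights untouched.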
Implications and Future Directions
OpenHumanVid has the potential to reshape how human video generation models are trained by offering an extensive corpus enriched with diverse human identities and scenarios. Practically, the dataset could benefit fields such as virtual reality, gaming, and human-computer interaction by enabling more nuanced and accurately generated human avatars. Theoretically, it provides fertile ground for exploring more complex generative tasks involving nuanced motion dynamics and subtle human expressions.
Future work may concentrate on resolving current limitations, such as the reliance on existing multimodal models for caption generation, and on increasing the diversity of human expressions and actions. Further research should also address ethical concerns, ensuring that the dataset supports fair representation and contributes positively to social welfare without facilitating misuse such as deepfakes.
In conclusion, OpenHumanVid marks a significant stride in human-centric video generation research, offering a texture-rich, contextually relevant dataset that bridges notable gaps in the training of cutting-edge video generative models. As models increasingly demand vast, high-quality datasets to achieve nuanced synthesis, OpenHumanVid stands as a crucial resource for advancing video generation technology.