Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer (2405.17405v2)

Published 27 May 2024 in cs.CV

Abstract: We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks or vanilla diffusion models, which struggle with complex motions, viewpoint changes, and generalization. Through extensive experiments, we demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.

References (50)
  1. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
  2. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5968–5976, 2023.
  3. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.
  4. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19982–19993, 2023.
  5. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  6. Cameractrl: Enabling camera control for text-to-video generation, 2024.
  7. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
  8. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.
  9. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023.
  10. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22680–22690, 2023.
  11. Same: Skeleton-agnostic motion embedding for character animation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  12. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.
  13. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023a.
  14. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  15. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  16. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
  17. Vdt: General-purpose video diffusion transformers via mask modeling. In The Twelfth International Conference on Learning Representations, 2023.
  18. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  19. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  20. OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. Accessed: 2024-05-19.
  21. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  22. Improving language understanding by generative pre-training. 2018.
  23. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  24. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  25. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  26. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  27. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  28. Deformable gans for pose-based human image generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416, 2018.
  29. Appearance and pose-conditioned human image generation using deformable gans. IEEE transactions on pattern analysis and machine intelligence, 43(4):1156–1171, 2019a.
  30. First order motion model for image animation. Advances in neural information processing systems, 32, 2019b.
  31. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  32. A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069, 2021.
  33. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  34. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, 2019.
  35. Twindom. Twindom 3d avatar dataset, 2022.
  36. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  37. Disco: Disentangled control for referring human dance generation in real world. arXiv e-prints, pages arXiv–2307, 2023.
  38. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
  39. G3an: Disentangling appearance and motion for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5264–5273, 2020.
  40. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  41. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023.
  42. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024.
  43. Generating holistic 3d human motion from speech. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 469–480, 2023.
  44. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), 2021.
  45. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pages 558–567, 2021.
  46. Closet: Modeling clothed humans on continuous surface with explicit template decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 501–511, 2023a.
  47. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  48. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  49. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  50. Champ: Controllable and consistent human image animation with 3d parametric guidance. arXiv preprint arXiv:2403.14781, 2024.

Summary

  • The paper introduces a novel 4D diffusion transformer that effectively models spatial, temporal, and viewpoint dimensions for coherent human video generation.
  • It integrates control signals such as SMPL, identity, time, and camera parameters to enable precise manipulation of human motion.
  • The method outperforms state-of-the-art approaches with significant gains in PSNR, SSIM, LPIPS, and FVD, paving the way for advanced multimedia applications.

Overview of Human4DiT: A Novel Approach to Generating Spatio-Temporally Coherent Human Videos

The paper introduces Human4DiT, an innovative framework aimed at generating high-quality, spatio-temporally coherent human videos from a single reference image. The core of this framework is a cascaded 4D diffusion transformer architecture, which efficiently models correlations across spatial, temporal, and viewpoint dimensions.

Key Contributions

The authors highlight several notable advancements in their work:

  • Novel 4D Diffusion Transformer Architecture: The proposed architecture factorizes attention mechanisms across 2D spatial dimensions, temporal sequences, and different viewpoints. This factorization allows efficient modeling of complex human motions in a 4D space while reducing computational overhead.
  • Integration of Control Signals: The model incorporates various control signals—including SMPL (Skinned Multi-Person Linear) representations, human identity, time, and camera parameters—into respective network modules for precise control.
  • Multi-Dimensional Dataset and Training: A comprehensive dataset spanning images, videos, multi-view videos, and 3D/4D scans is curated for training, and a multi-dimensional training strategy is employed to fully leverage these data modalities.
  • Efficient Sampling Strategy: For inference, a spatio-temporally consistent diffusion sampling strategy is proposed, enabling the generation of long, coherent videos across varying viewpoints (a generic sketch of the idea follows this list).
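
To make the sampling idea concrete, below is a minimal, generic sketch of overlapping-window denoising, a common recipe for keeping long sequences consistent. The function name, tensor shapes, and window/overlap scheme are illustrative assumptions rather than the paper's exact sampler, and the view dimension is omitted for brevity.

```python
import torch

def windowed_denoise(latents, denoise_fn, window=16, overlap=4):
    """Overlapping-window denoising over a long sequence of frame latents.

    latents: (time, tokens, dim); denoise_fn denoises one window at a time.
    Overlapping predictions are averaged so adjacent windows agree; this is a
    common long-video recipe, not necessarily the paper's exact sampler.
    """
    out = torch.zeros_like(latents)
    count = torch.zeros(latents.shape[0], 1, 1)
    step = window - overlap
    for start in range(0, latents.shape[0], step):
        end = min(start + window, latents.shape[0])
        out[start:end] += denoise_fn(latents[start:end])
        count[start:end] += 1
        if end == latents.shape[0]:
            break
    return out / count

# Toy usage: an identity "denoiser" over 40 frames of 64 tokens with width 768.
frames = windowed_denoise(torch.randn(40, 64, 768), lambda x: x)
```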

Architectural Framework

The core of Human4DiT is its 4D diffusion transformer, which is decomposed into three interconnected transformer blocks:

  • 2D Image Transformer Block: Captures spatial self-attention within each frame.
  • View Transformer Block: Models correlations across different viewpoints by considering variations in camera parameters.
  • Temporal Transformer Block: Captures temporal correlations across time steps.

These blocks are cascaded to form a complete 4D transformer block, enhancing the model's capacity to generate coherent outputs by maintaining consistency across spatial, temporal, and viewpoint dimensions.
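
For intuition, here is a minimal PyTorch sketch of a factorized block of this kind. The tensor layout (batch, views, time, tokens, channels), the `SelfAttention` helper, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Plain multi-head self-attention with a residual connection."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, sequence, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


class Factorized4DBlock(nn.Module):
    """Cascade of spatial, view, and temporal attention, as described above."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = SelfAttention(dim)      # attends over patch tokens in a frame
        self.view = SelfAttention(dim)         # attends across camera viewpoints
        self.temporal = SelfAttention(dim)     # attends across time steps

    def forward(self, x):
        # x: (batch, views, time, tokens, dim) latent patch tokens
        b, v, t, n, d = x.shape

        # 1) 2D image attention: each (view, time) frame independently
        x = self.spatial(x.reshape(b * v * t, n, d)).reshape(b, v, t, n, d)

        # 2) view attention: each (time, token) position across viewpoints
        x = x.permute(0, 2, 3, 1, 4).reshape(b * t * n, v, d)
        x = self.view(x).reshape(b, t, n, v, d).permute(0, 3, 1, 2, 4)

        # 3) temporal attention: each (view, token) position across time
        x = x.permute(0, 1, 3, 2, 4).reshape(b * v * n, t, d)
        x = self.temporal(x).reshape(b, v, n, t, d).permute(0, 1, 3, 2, 4)
        return x


# Example: 2 views, 4 frames, 64 patch tokens, channel width 256.
block = Factorized4DBlock(256)
y = block(torch.randn(1, 2, 4, 64, 256))
```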

Control Condition Injection Modules

The framework injects the control signals through several dedicated modules (a small embedding sketch follows the list):

  • Camera Control Module: Injects camera viewpoint control by encoding camera parameters and mapping them to the latent space.
  • Temporal Embedding Module: Applies positional encoding to time steps, ensuring temporal consistency.
  • SMPL Control Module: Uses SMPL-derived normal maps to provide detailed human body information.
  • Human Identity Reference Module: Maintains identity consistency by using UNet and CLIP embeddings.
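
A rough sketch of the camera and temporal embeddings described above is given below. The 16-dimensional flattened camera vector, the MLP widths, and the sinusoidal frame encoding are assumptions chosen for illustration, not the paper's exact encoders.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(indices, dim):
    """Standard sinusoidal positional encoding for frame indices."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = indices[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (len, dim)


class CameraEmbedding(nn.Module):
    """Maps flattened camera parameters (e.g. extrinsics + intrinsics) to the latent width."""
    def __init__(self, cam_dim=16, latent_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cam_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, cams):            # cams: (num_views, cam_dim)
        return self.mlp(cams)           # (num_views, latent_dim), added to view tokens


# Example: embeddings for 8 viewpoints and 24 frames.
cam_tokens = CameraEmbedding()(torch.randn(8, 16))            # (8, 768)
time_tokens = sinusoidal_embedding(torch.arange(24), 768)     # (24, 768)
```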

Training and Dataset

The collected dataset for training is comprehensive, featuring diverse modalities:

  • Images: Enhance the model's ability to capture human identity from static references.
  • Videos: Provide temporal dynamics for the model.
  • Multi-View Data: Enable the model to learn correlations across different viewpoints.
  • 3D and 4D Scans: Offer detailed spatial and temporal information for robust training.

A mixed training strategy leverages these modalities differently, ensuring that each dimension contributes effectively to the model's learning process.
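
As a schematic illustration of such a strategy, a training step might alternate between data sources and exercise only the attention blocks each modality can supervise; this block-toggling scheme is an assumption for illustration, and `set_active_blocks` and `diffusion_loss` are hypothetical helpers, not the authors' API.

```python
import random

# Hypothetical per-modality settings: which attention blocks a batch exercises.
MODALITY_BLOCKS = {
    "image":      {"spatial": True,  "view": False, "temporal": False},
    "video":      {"spatial": True,  "view": False, "temporal": True},
    "multi_view": {"spatial": True,  "view": True,  "temporal": False},
    "4d":         {"spatial": True,  "view": True,  "temporal": True},
}

def training_step(model, loaders, optimizer):
    """One mixed-modality step: pick a data source, enable the matching blocks, update."""
    modality = random.choice(list(loaders))
    batch = next(loaders[modality])                      # latents + SMPL/camera/identity conditions
    model.set_active_blocks(MODALITY_BLOCKS[modality])   # hypothetical toggle on the 4D blocks

    loss = model.diffusion_loss(batch)                   # standard denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return modality, loss.item()
```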

Performance and Evaluation

The proposed Human4DiT method is rigorously evaluated against state-of-the-art approaches, including Disco, MagicAnimate, AnimateAnyone, and Champ. The evaluations are conducted across different video generation scenarios: monocular video, multi-view video, 3D static video, and free-view video.

Quantitative results demonstrate significant improvements in metrics such as PSNR, SSIM, LPIPS, and FVD across all scenarios, underlining the superior performance of Human4DiT. Qualitative results show that the model produces more natural and coherent videos and handles complex motions and viewpoint changes more gracefully.
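
For reference, PSNR, the simplest of these metrics, follows directly from the mean squared error between generated and ground-truth frames; the small helper below is the standard definition, not code from the paper.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two tensors with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example on random frames of shape (frames, channels, height, width).
print(psnr(torch.rand(16, 3, 256, 256), torch.rand(16, 3, 256, 256)))
```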

Implications and Future Prospects

The advancements presented in Human4DiT have profound implications for multimedia applications, virtual reality, animation, gaming, and human-computer interaction. By tackling the challenges of generating spatio-temporally consistent human videos, this approach sets a new benchmark in generative modeling.

Conclusion

Human4DiT represents a substantial step forward in human video generation. The 4D transformer architecture, combined with a holistic training strategy and an efficient sampling method, enables the synthesis of high-quality, coherent videos from a single reference image. Further exploration of explicit 4D representations and enhanced detail generation (e.g., fingers and accessories) holds promise for even more sophisticated applications.
