Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer (2405.17405v2)

Published 27 May 2024 in cs.CV

Abstract: We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks or vanilla diffusion models, which struggle with complex motions, viewpoint changes, and generalization. Through extensive experiments, we demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.

References (50)
  1. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
  2. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5968–5976, 2023.
  3. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.
  4. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19982–19993, 2023.
  5. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  6. Cameractrl: Enabling camera control for text-to-video generation, 2024.
  7. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
  8. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.
  9. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023.
  10. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22680–22690, 2023.
  11. Same: Skeleton-agnostic motion embedding for character animation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  12. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.
  13. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023a.
  14. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  15. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  16. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
  17. Vdt: General-purpose video diffusion transformers via mask modeling. In The Twelfth International Conference on Learning Representations, 2023.
  18. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  19. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  20. OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. Accessed: 2024-05-19.
  21. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  22. Improving language understanding by generative pre-training. 2018.
  23. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  24. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  25. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  26. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  27. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  28. Deformable gans for pose-based human image generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416, 2018.
  29. Appearance and pose-conditioned human image generation using deformable gans. IEEE transactions on pattern analysis and machine intelligence, 43(4):1156–1171, 2019a.
  30. First order motion model for image animation. Advances in neural information processing systems, 32, 2019b.
  31. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  32. A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069, 2021.
  33. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  34. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, 2019.
  35. Twindom. Twindom 3d avatar dataset, 2022.
  36. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  37. Disco: Disentangled control for referring human dance generation in real world. arXiv e-prints, pages arXiv–2307, 2023.
  38. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
  39. G3an: Disentangling appearance and motion for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5264–5273, 2020.
  40. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  41. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023.
  42. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024.
  43. Generating holistic 3d human motion from speech. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 469–480, 2023.
  44. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), 2021.
  45. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pages 558–567, 2021.
  46. Closet: Modeling clothed humans on continuous surface with explicit template decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 501–511, 2023a.
  47. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  48. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  49. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  50. Champ: Controllable and consistent human image animation with 3d parametric guidance. arXiv preprint arXiv:2403.14781, 2024.

Summary

  • The paper introduces a novel 4D diffusion transformer that effectively models spatial, temporal, and viewpoint dimensions for coherent human video generation.
  • It integrates control signals such as SMPL, identity, time, and camera parameters to enable precise manipulation of human motion.
  • The method outperforms state-of-the-art approaches with significant gains in PSNR, SSIM, LPIPS, and FVD, paving the way for advanced multimedia applications.

Overview of Human4DiT: A Novel Approach to Generating Spatio-Temporally Coherent Human Videos

The paper introduces Human4DiT, an innovative framework aimed at generating high-quality, spatio-temporally coherent human videos from a single reference image. The core of this framework is a cascaded 4D diffusion transformer architecture, which efficiently models correlations across spatial, temporal, and viewpoint dimensions.

Key Contributions

The authors highlight several notable advancements in their work:

  • Novel 4D Diffusion Transformer Architecture: The proposed architecture factorizes attention mechanisms across 2D spatial dimensions, temporal sequences, and different viewpoints. This factorization allows efficient modeling of complex human motions in a 4D space while reducing computational overhead.
  • Integration of Control Signals: The model incorporates various control signals—including SMPL (Skinned Multi-Person Linear) representations, human identity, time, and camera parameters—into respective network modules for precise control.
  • Multi-Dimensional Dataset and Training: A comprehensive dataset spanning images, videos, multi-view videos, and 3D/4D scans is curated for training, and a multi-dimensional training strategy is employed to fully leverage these data modalities.
  • Efficient Sampling Strategy: For inference, a spatio-temporally consistent diffusion sampling strategy is proposed, enabling the generation of long, coherent videos across varying viewpoints (a generic sketch of the idea follows this list).
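
To make the sampling idea concrete, below is a minimal, generic sketch of overlapping-window denoising, a common recipe for keeping long sequences consistent. The function name, tensor shapes, and window/overlap scheme are illustrative assumptions rather than the paper's exact sampler, and the view dimension is omitted for brevity.

```python
import torch

def windowed_denoise(latents, denoise_fn, window=16, overlap=4):
    """Overlapping-window denoising over a long sequence of frame latents.

    latents: (time, tokens, dim); denoise_fn denoises one window at a time.
    Overlapping predictions are averaged so adjacent windows agree; this is a
    common long-video recipe, not necessarily the paper's exact sampler.
    """
    out = torch.zeros_like(latents)
    count = torch.zeros(latents.shape[0], 1, 1)
    step = window - overlap
    for start in range(0, latents.shape[0], step):
        end = min(start + window, latents.shape[0])
        out[start:end] += denoise_fn(latents[start:end])
        count[start:end] += 1
        if end == latents.shape[0]:
            break
    return out / count

# Toy usage: an identity "denoiser" over 40 frames of 64 tokens with width 768.
frames = windowed_denoise(torch.randn(40, 64, 768), lambda x: x)
```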

Architectural Framework

The core of Human4DiT is its 4D diffusion transformer, which is decomposed into three interconnected transformer blocks:

  • 2D Image Transformer Block: Captures spatial self-attention within each frame.
  • View Transformer Block: Models correlations across different viewpoints by considering variations in camera parameters.
  • Temporal Transformer Block: Captures temporal correlations across time steps.

These blocks are cascaded to form a complete 4D transformer block, enhancing the model's capacity to generate coherent outputs by maintaining consistency across spatial, temporal, and viewpoint dimensions.
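
For intuition, here is a minimal PyTorch sketch of a factorized block of this kind. The tensor layout (batch, views, time, tokens, channels), the `SelfAttention` helper, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Plain multi-head self-attention with a residual connection."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, sequence, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out


class Factorized4DBlock(nn.Module):
    """Cascade of spatial, view, and temporal attention, as described above."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = SelfAttention(dim)      # attends over patch tokens in a frame
        self.view = SelfAttention(dim)         # attends across camera viewpoints
        self.temporal = SelfAttention(dim)     # attends across time steps

    def forward(self, x):
        # x: (batch, views, time, tokens, dim) latent patch tokens
        b, v, t, n, d = x.shape

        # 1) 2D image attention: each (view, time) frame independently
        x = self.spatial(x.reshape(b * v * t, n, d)).reshape(b, v, t, n, d)

        # 2) view attention: each (time, token) position across viewpoints
        x = x.permute(0, 2, 3, 1, 4).reshape(b * t * n, v, d)
        x = self.view(x).reshape(b, t, n, v, d).permute(0, 3, 1, 2, 4)

        # 3) temporal attention: each (view, token) position across time
        x = x.permute(0, 1, 3, 2, 4).reshape(b * v * n, t, d)
        x = self.temporal(x).reshape(b, v, n, t, d).permute(0, 1, 3, 2, 4)
        return x


# Example: 2 views, 4 frames, 64 patch tokens, channel width 256.
block = Factorized4DBlock(256)
y = block(torch.randn(1, 2, 4, 64, 256))
```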

Control Condition Injection Modules

The framework injects the control signals through several dedicated modules (a small embedding sketch follows the list):

  • Camera Control Module: Injects camera viewpoint control by encoding camera parameters and mapping them to the latent space.
  • Temporal Embedding Module: Applies positional encoding to time steps, ensuring temporal consistency.
  • SMPL Control Module: Uses SMPL-derived normal maps to provide detailed human body information.
  • Human Identity Reference Module: Maintains identity consistency by using UNet and CLIP embeddings.
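
A rough sketch of the camera and temporal embeddings described above is given below. The 16-dimensional flattened camera vector, the MLP widths, and the sinusoidal frame encoding are assumptions chosen for illustration, not the paper's exact encoders.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(indices, dim):
    """Standard sinusoidal positional encoding for frame indices."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = indices[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (len, dim)


class CameraEmbedding(nn.Module):
    """Maps flattened camera parameters (e.g. extrinsics + intrinsics) to the latent width."""
    def __init__(self, cam_dim=16, latent_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cam_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, cams):            # cams: (num_views, cam_dim)
        return self.mlp(cams)           # (num_views, latent_dim), added to view tokens


# Example: embeddings for 8 viewpoints and 24 frames.
cam_tokens = CameraEmbedding()(torch.randn(8, 16))            # (8, 768)
time_tokens = sinusoidal_embedding(torch.arange(24), 768)     # (24, 768)
```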

Training and Dataset

The collected dataset for training is comprehensive, featuring diverse modalities:

  • Images: Enhance the model's ability to capture human identity from static references.
  • Videos: Provide temporal dynamics for the model.
  • Multi-View Data: Enable the model to learn correlations across different viewpoints.
  • 3D and 4D Scans: Offer detailed spatial and temporal information for robust training.

A mixed training strategy leverages these modalities differently, ensuring that each dimension contributes effectively to the model's learning process.
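
As a schematic illustration of such a strategy, a training step might alternate between data sources and exercise only the attention blocks each modality can supervise; this block-toggling scheme is an assumption for illustration, and `set_active_blocks` and `diffusion_loss` are hypothetical helpers, not the authors' API.

```python
import random

# Hypothetical per-modality settings: which attention blocks a batch exercises.
MODALITY_BLOCKS = {
    "image":      {"spatial": True,  "view": False, "temporal": False},
    "video":      {"spatial": True,  "view": False, "temporal": True},
    "multi_view": {"spatial": True,  "view": True,  "temporal": False},
    "4d":         {"spatial": True,  "view": True,  "temporal": True},
}

def training_step(model, loaders, optimizer):
    """One mixed-modality step: pick a data source, enable the matching blocks, update."""
    modality = random.choice(list(loaders))
    batch = next(loaders[modality])                      # latents + SMPL/camera/identity conditions
    model.set_active_blocks(MODALITY_BLOCKS[modality])   # hypothetical toggle on the 4D blocks

    loss = model.diffusion_loss(batch)                   # standard denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return modality, loss.item()
```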

Performance and Evaluation

The proposed Human4DiT method is rigorously evaluated against state-of-the-art approaches, including Disco, MagicAnimate, AnimateAnyone, and Champ. The evaluations are conducted across different video generation scenarios: monocular video, multi-view video, 3D static video, and free-view video.

Quantitative results demonstrate significant improvements in metrics such as PSNR, SSIM, LPIPS, and FVD across all scenarios, underlining the superior performance of Human4DiT. Qualitative results show that the model produces more natural and coherent videos and handles complex motions and viewpoint changes more gracefully.
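
For reference, PSNR, the simplest of these metrics, follows directly from the mean squared error between generated and ground-truth frames; the small helper below is the standard definition, not code from the paper.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two tensors with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example on random frames of shape (frames, channels, height, width).
print(psnr(torch.rand(16, 3, 256, 256), torch.rand(16, 3, 256, 256)))
```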

Implications and Future Prospects

The advancements presented in Human4DiT have profound implications for multimedia applications, virtual reality, animation, gaming, and human-computer interaction. By tackling the challenges of generating spatio-temporally consistent human videos, this approach sets a new benchmark in generative modeling.

Conclusion

Human4DiT represents a substantial step forward in human video generation. The 4D transformer architecture, combined with a holistic training strategy and an efficient sampling method, enables the synthesis of high-quality, coherent videos from a single reference image. Further exploration of explicit 4D representations and enhanced detail generation (e.g., fingers and accessories) holds promise for even more sophisticated applications.
