SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering (2404.01225v2)

Published 1 Apr 2024 in cs.CV

Abstract: Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However, existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/


Summary

  • The paper introduces a surface-based 4D motion model that jointly captures spatial and temporal dynamics for dynamic human rendering.
  • It employs a surface-based triplane representation and physical motion decoding to efficiently synthesize high-fidelity novel views from sparse multi-view data.
  • Empirical evaluations demonstrate SurMo's superiority over state-of-the-art methods in rendering fast motions and motion-dependent shadows.

SurMo: A New Paradigm for Dynamic Human Rendering Leveraging Surface-based 4D Motion Modeling

Introduction

In the paper titled "SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering," Tao Hu, Fangzhou Hong, and Ziwei Liu introduce a framework for synthesizing dynamic human figures from sparse multi-view video data. Existing methods primarily condition rendering on static poses and lack an effective mechanism for capturing and leveraging temporal dynamics; SurMo addresses these limitations with a structured approach that integrates both the spatial and temporal aspects of motion for enhanced human image synthesis.

Key Contributions

  • Innovative Paradigm: SurMo jointly models temporal dynamics and human appearance within a unified framework, using a surface-based motion representation that departs from traditional volumetric or pose-guided representations.
  • Surface-based Triplane Representation: At the core of SurMo lies a surface-based triplane that efficiently encodes both spatial and temporal motion on the dense surface manifold of a statistical body template. This compact formulation, which inherits body topology priors, enables generalizable novel view synthesis from sparse training observations (a minimal feature-sampling sketch follows this list).
  • Physical Motion Decoding: The framework introduces a physical motion decoding strategy that encourages physical motion learning by predicting both spatial and temporal derivatives at the next timestep during training. This supervision helps the model capture temporal clothing offsets and secondary motion dynamics, which are critical for realistic rendering (see the training-loss sketch after this list).
  • 4D Appearance Decoding and Optimization: An efficient volumetric surface-conditioned renderer, coupled with a geometry-aware super-resolution mechanism, renders high-quality images conditioned on dynamic motion inputs. Optimization integrates multiple losses, including adversarial, perceptual, and face identity terms, to ensure high fidelity in the final output (the loss composition is summarized after this list).
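To make the triplane design concrete, here is a minimal sketch of how per-point features could be sampled from a surface-based triplane. It assumes queries are parameterized by body-surface UV coordinates plus a signed offset h along the surface normal, all normalized to [-1, 1]; the plane layout, channel count, and the `sample_triplane` name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of surface-based triplane feature sampling (PyTorch).
# Assumption: features live on three planes spanning (u, v), (u, h), and
# (v, h), where (u, v) are body-surface UV coordinates and h is a signed
# offset along the surface normal. This mirrors generic triplane sampling
# and is not the authors' code.
import torch
import torch.nn.functional as F

def sample_triplane(planes, uvh):
    """planes: list of 3 feature maps, each (B, C, H, W);
    uvh: (B, N, 3) query points in [-1, 1]. Returns (B, N, C)."""
    u, v, h = uvh.unbind(-1)
    grids = [torch.stack(pair, dim=-1) for pair in ((u, v), (u, h), (v, h))]
    feats = 0.0
    for plane, grid in zip(planes, grids):
        # grid_sample expects a (B, N, 1, 2) grid; bilinear interpolation
        # over each plane yields (B, C, N, 1) features.
        sampled = F.grid_sample(plane, grid.unsqueeze(2),
                                mode="bilinear", align_corners=False)
        feats = feats + sampled.squeeze(-1).transpose(1, 2)  # (B, N, C)
    return feats

# Usage with toy shapes: three 32-channel 64x64 planes, 1024 query points.
planes = [torch.randn(1, 32, 64, 64) for _ in range(3)]
queries = torch.rand(1, 1024, 3) * 2 - 1
features = sample_triplane(planes, queries)  # (1, 1024, 32)
```

Summing the three plane features follows the common triplane convention; concatenation is an equally plausible choice the paper may use instead.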
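The physical motion decoding objective can likewise be sketched as a small supervised head. The sketch below assumes an MLP that maps motion features at timestep t to surface normals (spatial derivatives) and vertex velocities (temporal derivatives) at t+1, trained with mean-squared error; the class name, head sizes, and loss weights are hypothetical.

```python
# Hedged sketch of physical motion decoding: predict the next timestep's
# spatial derivatives (normals) and temporal derivatives (velocities)
# from motion features at time t. Names and weights are assumptions.
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # 3 channels for normal, 3 for velocity
        )

    def forward(self, feats):                 # feats: (B, N, feat_dim)
        normal_pred, vel_pred = self.mlp(feats).split(3, dim=-1)
        return normal_pred, vel_pred

def motion_loss(decoder, feats_t, normals_t1, vels_t1, w_n=1.0, w_v=1.0):
    """Supervise features at t against derivatives at t+1 to encourage
    physically plausible dynamics (clothing offsets, secondary motion)."""
    n_pred, v_pred = decoder(feats_t)
    return (w_n * (n_pred - normals_t1).pow(2).mean()
            + w_v * (v_pred - vels_t1).pow(2).mean())
```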
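Finally, the appearance decoding stage can be summarized mathematically. The surface-conditioned renderer accumulates radiance along camera rays with the standard volume rendering quadrature, and the overall training objective combines reconstruction with the auxiliary losses named above; the weighting symbols below are placeholders, not values reported in the paper.

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big)$$

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{face}} \mathcal{L}_{\text{face}}$$

Here $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at the $i$-th sample along ray $\mathbf{r}$, and $\delta_i$ is the distance between adjacent samples.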

Empirical Evaluation

SurMo's performance was rigorously evaluated against several state-of-the-art methods across three datasets of varying dynamics and complexity. The quantitative assessments demonstrate SurMo's superiority in novel-view synthesis, fast motion sequences, and motion-dependent shadowing effects, establishing new benchmarks in dynamic human rendering.

  • Quantitative Analysis: Across all evaluated datasets and metrics, SurMo consistently outperformed existing approaches such as Neural Body, HumanNeRF, and Instant-NVR by notable margins, highlighting its effectiveness in synthesizing time-varying appearances with high fidelity.
  • Qualitative Observations: Besides the numerical improvements, qualitative inspections reveal SurMo's adeptness at capturing and rendering fine-grained details such as clothing wrinkles and motion-dependent shadows under various lighting conditions and actions, aspects where other methods falter.

Future Directions and Implications

The SurMo framework introduces a paradigm shift in dynamic human rendering, emphasizing the critical role of surface-based motion modeling. Its capability to precisely capture and render dynamic human figures from sparse observations holds significant promise for applications across virtual reality, digital entertainment, and beyond.

Future work may explore further advancements in physical motion decoding techniques and the adaptation of the surface-based triplane representation to encompass a wider range of motion dynamics. Additionally, the potential for real-time rendering and the adaptation to varying textures and clothing types present exciting avenues for research and application development.

Conclusion

By effectively synthesizing dynamic human figures from limited observational data, SurMo represents a significant step forward in the field of human rendering. Its innovative use of a surface-based triplane model for motion representation, coupled with a holistic modeling of temporal dynamics, sets a new standard for realism and efficiency in the field.
