SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering (2404.01225v2)
Abstract: Dynamic human rendering from video sequences has achieved remarkable progress by formulating rendering as a mapping from static poses to human images. However, existing methods focus on reconstructing human appearance in each individual frame, leaving the temporal motion relations largely unexplored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models temporal dynamics and human appearance in a unified framework with three key designs: 1) Surface-based motion encoding, which models 4D human motions with an efficient, compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis from sparse training observations. 2) Physical motion decoding, which encourages physical motion learning by decoding the motion triplane features at timestep t to predict both spatial and temporal derivatives at the next timestep t+1 during training. 3) 4D appearance decoding, which renders the motion triplanes into images with an efficient volumetric surface-conditioned renderer that focuses on rendering body surfaces, conditioned on the learned motion. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity, view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/
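To make the "spatial and temporal derivatives" in design 2) more concrete, the following is a minimal sketch of how such supervision targets could be computed on a body template mesh. The abstract does not define these quantities, so the sketch assumes that the temporal derivative is a finite-difference velocity of the template vertices and that the spatial derivative is the surface normal. The function names `temporal_derivative` and `vertex_normals`, the toy single-triangle mesh, and the time step `dt` are all hypothetical and are not taken from the SurMo implementation.

```python
import numpy as np

def temporal_derivative(verts_t, verts_next, dt=1.0):
    """Forward finite difference of vertex positions (assumed 'temporal derivative')."""
    return (verts_next - verts_t) / dt

def vertex_normals(verts, faces):
    """Area-weighted unit normals per vertex (assumed 'spatial derivative')."""
    v0 = verts[faces[:, 0]]
    v1 = verts[faces[:, 1]]
    v2 = verts[faces[:, 2]]
    # Cross product of two edges: direction is the face normal,
    # length is twice the triangle area, which gives area weighting.
    face_n = np.cross(v1 - v0, v2 - v0)
    normals = np.zeros_like(verts)
    for k in range(3):
        # Accumulate each face normal onto its three corner vertices.
        np.add.at(normals, faces[:, k], face_n)
    norm = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.clip(norm, 1e-12, None)

# Toy example (hypothetical data): one triangle in the z=0 plane moving along +z.
verts_t = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
dt = 0.5
verts_next = verts_t + np.array([0.0, 0.0, 0.5])

# Candidate targets at timestep t+1 under the assumptions stated above.
velocity = temporal_derivative(verts_t, verts_next, dt=dt)
normals_next = vertex_normals(verts_next, faces)
```

In a training loop, targets of this kind would be compared against the decoder's predictions for timestep t+1; how SurMo actually parameterizes and supervises these quantities is described in the paper, not in the abstract.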