GenLayNeRF: Generalizable Layered Representations with 3D Model Alignment for Multi-Human View Synthesis (2309.11627v1)
Abstract: Novel view synthesis (NVS) of multi-human scenes is challenging because of complex inter-human occlusions. Layered representations handle this complexity by dividing the scene into multi-layered radiance fields, but they are mainly constrained to per-scene optimization, which makes them inefficient. Generalizable human view synthesis methods combine pre-fitted 3D human meshes with image features to achieve generalization, yet they are mainly designed for single-human scenes. Another drawback is their reliance on multi-step optimization techniques for parametric pre-fitting of the 3D body models; in sparse-view settings the fitted models are misaligned with the images, causing hallucinations in the synthesized views. In this work, we propose GenLayNeRF, a generalizable layered scene representation for free-viewpoint rendering of multiple human subjects that requires no per-scene optimization and takes only very sparse views as input. We divide the scene into multi-human layers anchored by the 3D body meshes. We then ensure pixel-level alignment of the body models with the input views through a novel end-to-end trainable module that carries out iterative parametric correction coupled with multi-view feature fusion to produce aligned 3D models. For NVS, we extract point-wise image-aligned and human-anchored features, which are correlated and fused using self-attention and cross-attention modules. We augment the features with low-level RGB values through an attention-based RGB fusion module. To evaluate our approach, we construct two multi-human view synthesis datasets: DeepMultiSyn and ZJU-MultiHuman. The results indicate that our approach outperforms generalizable and non-human per-scene NeRF methods while performing on par with layered per-scene methods, without test-time optimization.
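To make the fusion step in the abstract more concrete, here is a minimal sketch of single-head cross-attention in which each 3D query point's human-anchored feature attends over its image-aligned features from the input views. It illustrates the general technique only and is not the paper's architecture: the function name `cross_attention_fuse`, the tensor shapes, the single head, and the random projection matrices (standing in for learned weights) are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(anchor_feats, view_feats, w_q, w_k, w_v):
    """Fuse per-view features for each point (hypothetical helper).

    anchor_feats: (N, D) human-anchored feature per query point (queries).
    view_feats:   (N, V, D) image-aligned feature per point and view (keys/values).
    Returns the fused features (N, D) and the per-view weights (N, V).
    """
    q = anchor_feats @ w_q                       # (N, D)
    k = view_feats @ w_k                         # (N, V, D)
    v = view_feats @ w_v                         # (N, V, D)
    # Scaled dot-product score between each point's query and its V view keys.
    scores = np.einsum("nd,nvd->nv", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)           # (N, V), each row sums to 1
    fused = np.einsum("nv,nvd->nd", weights, v)  # weighted sum over views
    return fused, weights

# Toy example: 5 points, 3 input views, 8-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
N, V, D = 5, 3, 8
anchor = rng.normal(size=(N, D))
views = rng.normal(size=(N, V, D))
# Random projections stand in for weights that would be learned in practice.
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
fused, weights = cross_attention_fuse(anchor, views, w_q, w_k, w_v)
```

The attention weights give each view a per-point contribution, which is one plausible way for a model to down-weight views in which a point is occluded by another person.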