GenLayNeRF: Generalizable Layered Representations with 3D Model Alignment for Multi-Human View Synthesis (2309.11627v1)

Published 20 Sep 2023 in cs.CV

Abstract: Novel view synthesis (NVS) of multi-human scenes poses challenges due to complex inter-human occlusions. Layered representations handle these complexities by dividing the scene into multi-layered radiance fields; however, they are mainly constrained to per-scene optimization, which makes them inefficient. Generalizable human view synthesis methods combine pre-fitted 3D human meshes with image features to reach generalization, yet they are mainly designed to operate on single-human scenes. Another drawback is their reliance on multi-step optimization techniques for parametric pre-fitting of the 3D body models, which suffer from misalignment with the images in sparse-view settings, causing hallucinations in synthesized views. In this work, we propose GenLayNeRF, a generalizable layered scene representation for free-viewpoint rendering of multiple human subjects that requires no per-scene optimization and only very sparse views as input. We divide the scene into multi-human layers anchored by the 3D body meshes. We then ensure pixel-level alignment of the body models with the input views through a novel end-to-end trainable module that carries out iterative parametric correction coupled with multi-view feature fusion to produce aligned 3D models. For NVS, we extract point-wise image-aligned and human-anchored features, which are correlated and fused using self-attention and cross-attention modules. We augment the features with low-level RGB values through an attention-based RGB fusion module. To evaluate our approach, we construct two multi-human view synthesis datasets: DeepMultiSyn and ZJU-MultiHuman. The results indicate that our proposed approach outperforms generalizable and non-human per-scene NeRF methods while performing on par with layered per-scene methods, without test-time optimization.
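
The abstract mentions an attention-based RGB fusion module but gives no details of it. As a rough illustration only, the sketch below shows one generic way attention can blend per-view RGB samples for each 3D query point: a point feature acts as the query, per-view image features act as keys, and the RGB values observed in each view act as values. This is plain scaled dot-product attention with no learned projections, not the paper's actual module; the function name `fuse_rgb_across_views`, the argument names, and the tensor shapes are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_rgb_across_views(point_feat, view_feats, view_rgb):
    """Blend per-view RGB samples with attention weights (hypothetical helper).

    point_feat: (P, C) one query feature per 3D point
    view_feats: (P, V, C) one key feature per point and input view
    view_rgb:   (P, V, 3) RGB observed for each point in each view
    Returns fused RGB of shape (P, 3) and weights of shape (P, V).
    """
    scale = np.sqrt(point_feat.shape[-1])
    # Similarity between each point's query and its per-view keys.
    scores = np.einsum("pc,pvc->pv", point_feat, view_feats) / scale
    # Normalize over views so the weights of each point sum to 1.
    weights = softmax(scores, axis=-1)
    # Weighted average of the observed colors.
    fused = np.einsum("pv,pvk->pk", weights, view_rgb)
    return fused, weights


# Toy inputs: 4 points, 3 views, 8-dimensional features (all assumed sizes).
rng = np.random.default_rng(0)
P, V, C = 4, 3, 8
point_feat = rng.normal(size=(P, C))
view_feats = rng.normal(size=(P, V, C))
view_rgb = rng.uniform(size=(P, V, 3))
fused_rgb, attn = fuse_rgb_across_views(point_feat, view_feats, view_rgb)
```

Because the weights form a convex combination over views, each fused color stays within the range of the colors observed for that point. A trained model would normally add learned query, key, and value projections, which are left out here.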
