OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering (2404.08449v3)
Abstract: Rendering dynamic 3D humans from monocular videos is crucial for applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, whereas in real-life scenarios various objects may occlude body parts. A previous method uses NeRF for surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS from occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space and performs an occlusion feature query at occluded regions, extracting aggregated pixel-aligned features to compensate for the missing information. We then use a Gaussian Feature MLP to further process these features, together with occlusion-aware loss functions, to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves comparable or even superior performance compared to the state-of-the-art method. We also improve training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes.
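As a quick consistency check on the figures quoted in the abstract, the short sketch below back-computes the baseline cost implied by the stated speedups. It assumes the 250x and 800x factors are measured against the quoted 6-minute training time and the peak rate of 160 FPS, which the abstract does not state explicitly; the function and variable names are illustrative only.

```python
# Figures quoted in the abstract.
TRAIN_MINUTES = 6.0       # OccGaussian training time
RENDER_FPS = 160.0        # OccGaussian peak rendering rate
TRAIN_SPEEDUP = 250.0     # claimed training speedup
RENDER_SPEEDUP = 800.0    # claimed inference speedup


def implied_baseline(train_minutes, render_fps, train_speedup, render_speedup):
    """Return (baseline training hours, baseline seconds per frame).

    Assumption: each speedup is the plain ratio of baseline cost to
    OccGaussian cost at the quoted operating point.
    """
    baseline_train_hours = train_minutes * train_speedup / 60.0
    # Per-frame time is 1 / fps, so the baseline per-frame time is
    # render_speedup * (1 / render_fps).
    baseline_seconds_per_frame = render_speedup / render_fps
    return baseline_train_hours, baseline_seconds_per_frame


hours, seconds = implied_baseline(
    TRAIN_MINUTES, RENDER_FPS, TRAIN_SPEEDUP, RENDER_SPEEDUP
)
print(f"implied baseline training time: {hours:.1f} hours")
print(f"implied baseline render time:   {seconds:.1f} seconds per frame")
```

Under these assumptions the implied baseline is about 25 hours of training and 5 seconds per frame, which matches the abstract's description of the earlier NeRF-based method ("more than one day to train and several seconds to render").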