
BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis (2311.05521v2)

Published 9 Nov 2023 in cs.GR and cs.CV

Abstract: Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.


Summary

  • The paper introduces a three-stage pipeline that extracts deformable polygon meshes and bakes appearance into static textures for efficient real-time rendering.
  • The method leverages the FLAME model to drive realistic expression and pose animations, achieving interactive frame rates on commodity hardware.
  • Experimental results show a significant reduction in inference time compared to NeRF approaches, enhancing applications in AR/VR, telepresence, and gaming.

BakedAvatar: A New Approach for Real-Time Head Avatar Synthesis

The paper "BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis" addresses the computational cost that keeps current neural radiance field (NeRF)-based high-fidelity head avatar synthesis out of real-time applications. It proposes a novel representation, termed BakedAvatar, that is explicitly designed for standard polygon rasterization pipelines, enabling real-time rendering even on commodity hardware such as mobile phones and tablets, a significant advance over current state-of-the-art methods.

Core Contributions

The central contribution of the paper is a three-stage pipeline. First, it extracts deformable, multi-layer polygon meshes from learned isosurfaces of the head. Second, it computes expression-, pose-, and view-dependent appearances that are baked into static texture maps, enabling a standard polygon rasterization approach. Third, through a fine-tuning procedure based on differential rasterization, the method sharpens the texture details and overall fidelity of the synthesized avatars.

  1. Mesh and Texture Extraction: The authors detail a process for learning continuous deformation, manifold, and radiance fields. These fields enable the extraction of mesh structures from which appearance is baked into textures, at a significant computational advantage over volumetric ray-marching.
  2. Expression and Pose Driven Animation: BakedAvatar exploits the FLAME model to enable realistic deformations, which allows for interactive frame rates in expression and pose editing applications.
  3. Rendering Performance: Experimentation on real-time rendering showcases the method's efficacy across various applications, including view synthesis, face reenactment, and expression/pose editing. The numerical results indicate a significant reduction in inference time relative to comparable state-of-the-art systems, with the potential to run at interactive frame rates on a variety of devices, including laptops and mobile phones.
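The paper's exact texture parameterization is not reproduced in this summary, but the runtime side of the "baking" idea can be illustrated with a common pattern for expression-dependent appearance: blend a small set of statically baked basis textures with per-frame coefficients derived from expression and pose, then hand the result to an ordinary rasterizer. The function and shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def blend_baked_textures(basis_textures, weights):
    """Blend statically baked basis textures with per-frame coefficients.

    basis_textures: (K, H, W, C) array of textures baked once, offline.
    weights: length-K coefficients predicted per frame from expression/pose.
    Returns an (H, W, C) texture that a standard rasterization pipeline can
    sample directly, with no per-pixel neural network evaluation.
    """
    weights = np.asarray(weights, dtype=np.float32)
    # Weighted sum over the basis axis: sum_k weights[k] * basis_textures[k]
    return np.tensordot(weights, basis_textures, axes=1)

# Toy example: two 2x2 RGB basis textures (all-zeros and all-ones)
# blended with weights 0.25 and 0.75.
basis = np.stack([
    np.zeros((2, 2, 3), dtype=np.float32),
    np.ones((2, 2, 3), dtype=np.float32),
])
frame_texture = blend_baked_textures(basis, [0.25, 0.75])
```

Because the expensive neural fields are evaluated only offline during baking, the per-frame cost reduces to this lightweight blend plus a mesh rasterization pass, which is what makes interactive frame rates on laptops and mobile devices plausible.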

Implications and Future Directions

The implications of this work are far-reaching. From a practical perspective, its capacity to synthesize lifelike avatars efficiently could transform fields like AR/VR, telepresence, and game development by reducing latency and computational overhead, thereby broadening access to high-fidelity avatar synthesis.

Theoretically, the proposed system opens new directions in real-time rendering research, hinting at broader applicability of "baking" strategies for neural rendering across diverse media applications. It challenges the dominance of fully implicit scene representations by leveraging efficient mesh-based approximations.

Avenues for future work might explore further optimizing the spatial and temporal resolution of extracted textures, enhancing the realism of dynamic lighting and shadows through advanced lighting models, and improving the adaptability of the method for full-body avatar synthesis.

Conclusion

"BakedAvatar" exemplifies a tangible step forward in real-time neural avatar rendering, marrying high-fidelity results with computational tractability in ways that current NeRF-based approaches have struggled to achieve. By transitioning from dense ray sampling techniques to efficient mesh-based polygonal rendering using layered neural fields, the authors not only affirm the viability of rasterization pipelines in producing realistic dynamic avatars but also broaden the horizon for real-time applications in various computational landscapes. This work stands as a compelling chapter in the ongoing evolution of avatar synthesis technologies.
