VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment (2312.04651v1)
Abstract: We present a 3D-aware one-shot head reenactment method based on a fully volumetric neural disentanglement framework for source appearance and driver expressions. Our method runs in real time and produces high-fidelity, view-consistent output, suitable for 3D teleconferencing systems based on holographic displays. Existing cutting-edge 3D-aware reenactment methods often use neural radiance fields or 3D meshes to produce a view-consistent appearance encoding, but they rely on linear face models, such as 3DMM, to disentangle that appearance from facial expressions. As a result, their reenactment results often exhibit identity leakage from the driver or unnatural expressions. To address these problems, we propose a neural self-supervised disentanglement approach that lifts both the source image and the driver video frame into a shared 3D volumetric representation based on tri-planes. This representation can then be freely manipulated with expression tri-planes extracted from the driving images and rendered from an arbitrary view using neural radiance fields. We achieve this disentanglement via self-supervised learning on a large in-the-wild video dataset. We further introduce a highly effective fine-tuning approach that improves the generalizability of the 3D lifting using the same real-world data. We demonstrate state-of-the-art performance on a wide range of datasets, and also showcase high-quality 3D-aware head reenactment on highly challenging and diverse subjects, including non-frontal head poses and complex expressions for both source and driver.
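To make the tri-plane representation mentioned in the abstract more concrete, the following is a minimal NumPy sketch of how a 3D point can be looked up in three axis-aligned feature planes. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `query_triplane`, `combine_with_expression`), the summation of the three plane features, and in particular the additive combination of source and expression tri-planes are all hypothetical choices made for this example. The abstract only states that the source representation is "manipulated with expression tri-planes".

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at coordinates u, v in [-1, 1]."""
    _, height, width = plane.shape
    # Map normalized coordinates to continuous pixel positions.
    x = (u + 1.0) * 0.5 * (width - 1)
    y = (v + 1.0) * 0.5 * (height - 1)
    # Index of the top-left texel, clamped so that x0 + 1 and y0 + 1 stay in range.
    x0 = int(np.clip(np.floor(x), 0, width - 2))
    y0 = int(np.clip(np.floor(y), 0, height - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[:, y0, x0] + tx * plane[:, y0, x0 + 1]
    bottom = (1 - tx) * plane[:, y0 + 1, x0] + tx * plane[:, y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom


def query_triplane(planes, point):
    """Return a feature vector for a 3D point in [-1, 1]^3.

    planes has shape (3, C, H, W), holding the XY, XZ, and YZ feature planes.
    The three sampled features are summed (an assumed aggregation choice).
    """
    x, y, z = point
    f_xy = bilinear_sample(planes[0], x, y)
    f_xz = bilinear_sample(planes[1], x, z)
    f_yz = bilinear_sample(planes[2], y, z)
    return f_xy + f_xz + f_yz


def combine_with_expression(source_planes, expression_planes):
    """ASSUMPTION: combine source and expression tri-planes by simple addition."""
    return source_planes + expression_planes


# Toy usage: random source planes plus a small, hypothetical expression residual.
rng = np.random.default_rng(0)
source = rng.normal(size=(3, 4, 8, 8))
expression = 0.1 * rng.normal(size=(3, 4, 8, 8))
feature = query_triplane(combine_with_expression(source, expression), (0.2, -0.5, 0.7))
```

In a full pipeline, the returned feature vector would be decoded into density and color by a small network and accumulated along camera rays with volume rendering; that step is omitted here.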