
Fast Registration of Photorealistic Avatars for VR Facial Animation (2401.11002v2)

Published 19 Jan 2024 in cs.CV and cs.AI

Abstract: Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of labels for headset-mounted camera (HMC) images needs to be efficient and accurate while wearing a VR headset. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-ground-truth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.

Summary

  • The paper presents a two-module system that decouples registration into iterative refinement and avatar-guided style transfer for fast, accurate VR facial animation (a minimal sketch of the resulting loop follows this list).
  • It employs a transformer-based architecture to refine expression and head-pose estimates, significantly outperforming direct regression methods on a dataset of 208 identities.
  • The approach bridges the modality gap between headset-mounted camera images and avatar renderings, enabling real-time, immersive VR social interaction without costly per-identity offline optimization.
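
As a rough illustration of how these pieces fit together, here is a minimal sketch of the online registration loop implied above: the avatar is rendered at the current estimate, the HMC images are mapped into the avatar's domain conditioned on that render, and the refinement module updates the expression and head pose. All names (FaceState, render_avatar, transfer_to_avatar_domain, refine_estimate) and shapes are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of the two-module online registration loop described above.
# Every name and shape here is an illustrative placeholder.

from dataclasses import dataclass

import numpy as np


@dataclass
class FaceState:
    expression: np.ndarray  # latent expression code of the personalized avatar
    head_pose: np.ndarray   # headset-to-head rigid transform parameters


def render_avatar(state: FaceState) -> np.ndarray:
    """Render the photorealistic avatar from the HMC viewpoints (stub)."""
    return np.zeros((2, 192, 192), dtype=np.float32)  # e.g. two camera views


def transfer_to_avatar_domain(hmc_images: np.ndarray, render: np.ndarray) -> np.ndarray:
    """Avatar-guided image-to-image domain transfer, conditioned on the current render (stub)."""
    return render  # placeholder: the real module maps monochrome HMC images into the avatar domain


def refine_estimate(state: FaceState, in_domain_images: np.ndarray) -> FaceState:
    """One refinement step on in-domain inputs (stub)."""
    return state  # placeholder: the real module predicts expression and head-pose updates


def register(hmc_images: np.ndarray, init: FaceState, num_iters: int = 3) -> FaceState:
    """Alternate domain transfer and refinement so the two modules reinforce each other."""
    state = init
    for _ in range(num_iters):
        rendered = render_avatar(state)                       # avatar rendered at the current estimate
        in_domain = transfer_to_avatar_domain(hmc_images, rendered)
        state = refine_estimate(state, in_domain)             # update expression and head pose
    return state


if __name__ == "__main__":
    init = FaceState(expression=np.zeros(256), head_pose=np.zeros(6))
    register(np.zeros((2, 192, 192), dtype=np.float32), init)
```

Iterating a small, fixed number of such steps is what lets the two modules reinforce each other online, without any per-identity offline optimization.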

Introduction

The paper addresses the challenge of animating photorealistic avatars in Virtual Reality (VR) in real time from headset-mounted camera (HMC) images. While offline methods achieve high-quality results, they are unsuitable for real-time applications that require efficient and identity-general solutions. The primary obstacle is the domain gap between the avatar's renderings and the oblique, monochrome HMC images. This work introduces a system design that efficiently bridges this gap, enabling fast and accurate registration of avatars for VR facial animation.

Related Work

In the domain of VR face tracking, the restrictive camera placements and the occlusive nature of VR headsets present unique challenges. Previous efforts have experimented with various hardware configurations and sensor technologies to capture facial data. However, methods like adding protruding camera mounts or employing RGBD sensors to register geometry have limitations. Recent strides exploit differentiable rendering and style transfer while training person-specific models, yet these approaches demand elaborate setups and long training times, hindering their real-time application.

Methodology

The presented system decouples the registration task into two modules: an iterative refinement module and an avatar-guided image-to-image style transfer module. The refinement module, driven by in-domain inputs, updates the expression and head-pose estimates. In parallel, the style transfer module maps the monochromatic HMC images into the avatar's domain, conditioned on the current expression and head-pose estimate. The two modules reinforce each other: domain transfer becomes easier as the estimate approaches the ground truth, and better domain-gap removal in turn improves registration. A transformer-based architecture drives the refinement module and generalizes across unseen identities; a sketch of one such refinement step is given below.
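
The following is a hedged sketch of what one transformer-based refinement step could look like, assuming a small ViT-style encoder over the current avatar render and the domain-transferred HMC views, with two heads predicting an expression-code update and a head-pose update (a 6D rotation parameterization plus translation). All dimensions, layer counts, and names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the iterative refinement module: a ViT-style encoder
# consumes the current avatar render and the domain-transferred HMC views, then
# predicts an expression-code update and a head-pose update. Shapes and layer
# choices are illustrative, not the paper's.

import torch
import torch.nn as nn


class RefinementStep(nn.Module):
    def __init__(self, num_views: int = 2, patch: int = 16, dim: int = 256, expr_dim: int = 256):
        super().__init__()
        # Patch-embed the stacked (render, observation) pair across all camera views.
        self.patch_embed = nn.Conv2d(2 * num_views, dim, kernel_size=patch, stride=patch)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.expr_head = nn.Linear(dim, expr_dim)  # additive update to the expression code
        self.pose_head = nn.Linear(dim, 6 + 3)     # 6D rotation representation + translation

    def forward(self, renders: torch.Tensor, observations: torch.Tensor):
        # renders, observations: (B, num_views, H, W) images in the avatar domain
        x = torch.cat([renders, observations], dim=1)             # (B, 2*num_views, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        feats = self.encoder(tokens).mean(dim=1)                  # pooled token features
        return self.expr_head(feats), self.pose_head(feats)


# Usage sketch: one refinement iteration on dummy inputs.
step = RefinementStep()
renders = torch.zeros(1, 2, 192, 192)       # avatar rendered at the current estimate
observations = torch.zeros(1, 2, 192, 192)  # HMC images after avatar-guided domain transfer
d_expr, d_pose = step(renders, observations)
```

In practice such a step would be applied a few times per frame, re-rendering the avatar between iterations as in the loop sketched earlier.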

Validation and Results

Empirical evaluation on a commodity VR headset shows substantial improvements over direct regression methods. The system was validated on a dataset of 208 identities, each captured with both an enhanced capture setup and a standard VR headset to provide ground-truth correspondences. The iterative approach significantly outperforms regression baselines, particularly in robustness to novel appearance variation on unseen identities. The resulting registrations are of consistently high quality, obviating the need for costly offline optimization to personalize labels.

Conclusion

This paper contributes a framework for efficient, accurate, and identity-generic VR avatar face registration. By pairing a transformer-based architecture that iteratively refines expression and head-pose estimates with a domain-transfer module conditioned on photorealistic avatar renderings, the system generalizes to unseen identities. The result is a high-quality expression estimation system that circumvents costly per-identity offline optimization and opens the door to real-time, immersive VR social interaction. Future work will explore further speed optimization and applications in live settings.