
Fast Registration of Photorealistic Avatars for VR Facial Animation (2401.11002v2)

Published 19 Jan 2024 in cs.CV and cs.AI

Abstract: Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of labels for headset-mounted camera (HMC) images needs to be efficient and accurate while wearing a VR headset. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-ground-truth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.

Summary

  • The paper presents a two-module system that decouples registration into iterative refinement and avatar-guided style transfer for fast, accurate VR facial animation (a minimal sketch of the resulting loop follows this list).
  • It employs a transformer-based architecture to refine expression and head-pose estimates, significantly outperforming direct regression methods on a dataset of 208 identities.
  • The approach bridges the modality gap between headset-mounted camera images and avatar renderings, enabling real-time, immersive VR social interaction without costly per-identity offline optimization.
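
As a rough illustration of how these pieces fit together, here is a minimal sketch of the online registration loop implied above: the avatar is rendered at the current estimate, the HMC images are mapped into the avatar's domain conditioned on that render, and the refinement module updates the expression and head pose. All names (FaceState, render_avatar, transfer_to_avatar_domain, refine_estimate) and shapes are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of the two-module online registration loop described above.
# Every name and shape here is an illustrative placeholder.

from dataclasses import dataclass

import numpy as np


@dataclass
class FaceState:
    expression: np.ndarray  # latent expression code of the personalized avatar
    head_pose: np.ndarray   # headset-to-head rigid transform parameters


def render_avatar(state: FaceState) -> np.ndarray:
    """Render the photorealistic avatar from the HMC viewpoints (stub)."""
    return np.zeros((2, 192, 192), dtype=np.float32)  # e.g. two camera views


def transfer_to_avatar_domain(hmc_images: np.ndarray, render: np.ndarray) -> np.ndarray:
    """Avatar-guided image-to-image domain transfer, conditioned on the current render (stub)."""
    return render  # placeholder: the real module maps monochrome HMC images into the avatar domain


def refine_estimate(state: FaceState, in_domain_images: np.ndarray) -> FaceState:
    """One refinement step on in-domain inputs (stub)."""
    return state  # placeholder: the real module predicts expression and head-pose updates


def register(hmc_images: np.ndarray, init: FaceState, num_iters: int = 3) -> FaceState:
    """Alternate domain transfer and refinement so the two modules reinforce each other."""
    state = init
    for _ in range(num_iters):
        rendered = render_avatar(state)                       # avatar rendered at the current estimate
        in_domain = transfer_to_avatar_domain(hmc_images, rendered)
        state = refine_estimate(state, in_domain)             # update expression and head pose
    return state


if __name__ == "__main__":
    init = FaceState(expression=np.zeros(256), head_pose=np.zeros(6))
    register(np.zeros((2, 192, 192), dtype=np.float32), init)
```

Iterating a small, fixed number of such steps is what lets the two modules reinforce each other online, without any per-identity offline optimization.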

Introduction

The paper addresses the challenge of animating photorealistic avatars in Virtual Reality (VR) in real time from headset-mounted camera (HMC) images. While offline methods achieve high-quality results, they are unsuitable for real-time applications that require efficient and identity-general solutions. The primary obstacle is the domain gap between the avatar's renderings and the oblique, monochrome HMC images. This work introduces a system design that efficiently bridges this gap, enabling fast and accurate registration of avatars for VR facial animation.

Related Work

In the domain of VR face tracking, the restrictive camera placements and the occlusive nature of VR headsets present unique challenges. Previous efforts have experimented with various hardware configurations and sensor technologies to capture facial data. However, methods like adding protruding camera mounts or employing RGBD sensors to register geometry have limitations. Recent strides exploit differentiable rendering and style transfer while training person-specific models, yet these approaches demand elaborate setups and long training times, hindering their real-time application.

Methodology

The presented system decouples the registration task into two modules: an iterative refinement module and an avatar-guided image-to-image style transfer module. The refinement module, driven by in-domain inputs, updates the expression and head-pose estimates. In parallel, the style transfer module maps the monochromatic HMC images into the avatar's domain, conditioned on the current expression and head-pose estimate. The two modules reinforce each other: domain transfer becomes easier as the estimate approaches the ground truth, and better domain-gap removal in turn improves registration. A transformer-based architecture drives the refinement module and generalizes across unseen identities; a sketch of one such refinement step is given below.
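
The following is a hedged sketch of what one transformer-based refinement step could look like, assuming a small ViT-style encoder over the current avatar render and the domain-transferred HMC views, with two heads predicting an expression-code update and a head-pose update (a 6D rotation parameterization plus translation). All dimensions, layer counts, and names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the iterative refinement module: a ViT-style encoder
# consumes the current avatar render and the domain-transferred HMC views, then
# predicts an expression-code update and a head-pose update. Shapes and layer
# choices are illustrative, not the paper's.

import torch
import torch.nn as nn


class RefinementStep(nn.Module):
    def __init__(self, num_views: int = 2, patch: int = 16, dim: int = 256, expr_dim: int = 256):
        super().__init__()
        # Patch-embed the stacked (render, observation) pair across all camera views.
        self.patch_embed = nn.Conv2d(2 * num_views, dim, kernel_size=patch, stride=patch)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.expr_head = nn.Linear(dim, expr_dim)  # additive update to the expression code
        self.pose_head = nn.Linear(dim, 6 + 3)     # 6D rotation representation + translation

    def forward(self, renders: torch.Tensor, observations: torch.Tensor):
        # renders, observations: (B, num_views, H, W) images in the avatar domain
        x = torch.cat([renders, observations], dim=1)             # (B, 2*num_views, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        feats = self.encoder(tokens).mean(dim=1)                  # pooled token features
        return self.expr_head(feats), self.pose_head(feats)


# Usage sketch: one refinement iteration on dummy inputs.
step = RefinementStep()
renders = torch.zeros(1, 2, 192, 192)       # avatar rendered at the current estimate
observations = torch.zeros(1, 2, 192, 192)  # HMC images after avatar-guided domain transfer
d_expr, d_pose = step(renders, observations)
```

In practice such a step would be applied a few times per frame, re-rendering the avatar between iterations as in the loop sketched earlier.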

Validation and Results

Empirical evaluation on a commodity VR headset shows substantial improvements over direct regression methods. The system was validated on a dataset of 208 identities, each captured with both an enhanced capture setup and a standard VR headset to provide ground-truth correspondences. The iterative approach significantly outperforms regression baselines, particularly in robustness to novel appearance variation on unseen identities. The resulting registrations are of consistently high quality, obviating the need for costly offline optimization to personalize labels.

Conclusion

This paper contributes a framework for efficient, accurate, and identity-generic VR avatar face registration. By pairing a transformer-based architecture that iteratively refines expression and head-pose estimates with a domain-transfer module conditioned on photorealistic avatar renderings, the system generalizes to unseen identities. The result is a high-quality expression estimation system that circumvents costly per-identity offline optimization and opens the door to real-time, immersive VR social interaction. Future work will explore further speed optimization and applications in live settings.