- The paper introduces a novel StyleGAN-based framework that generates real-time, high-fidelity portrait avatars using a compositional representation and sliding window data augmentation.
- It leverages integrated techniques including UNet, time coding, and Neural Textures to achieve 20ms rendering times and enhanced temporal stability.
- Empirical evaluations demonstrate superior SSIM, PSNR, and FID metrics compared to existing methods, underscoring its efficiency for interactive graphics applications.
StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video
The paper "StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video" presents a computational methodology designed to generate high-fidelity portrait avatars in real time using a novel application of StyleGAN-based networks. The authors propose an efficient framework capable of not only achieving high-quality image generation but also allowing fine-grained control over facial attributes, which addresses existing trade-offs in the domain of facial reenactment methodologies.
Technical Innovation and Methodology
The core innovation of the StyleAvatar framework lies in its utilization of StyleGAN, integrated with a compositional representation and a sliding window data augmentation technique. The compositional representation divides the video portrait into three distinct segments: the facial region, non-facial foreground region, and the background. This division allows for adaptive adjustments tailored to each region's characteristics, facilitating improved image quality and stability.
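For concreteness, the Python sketch below illustrates how a three-way region composition and a sliding-window sampler over video frames might look. The function names, soft-mask blending, and window parameters are illustrative assumptions and not the paper's actual implementation.

```python
import numpy as np

def compose_regions(face_rgb, foreground_rgb, background_rgb,
                    face_mask, foreground_mask):
    """Illustrative composition of the three portrait regions.

    face_rgb / foreground_rgb / background_rgb: HxWx3 float arrays produced by
    hypothetical per-region branches; face_mask / foreground_mask are HxWx1
    soft masks in [0, 1]. StyleAvatar's actual branches and masking may differ.
    """
    background_mask = (1.0 - face_mask) * (1.0 - foreground_mask)
    return (face_mask * face_rgb
            + (1.0 - face_mask) * foreground_mask * foreground_rgb
            + background_mask * background_rgb)

def sliding_windows(frames, window=5, stride=1):
    """Toy sliding-window sampler over a frame sequence, shown only to
    illustrate window-based data augmentation for video training."""
    for start in range(0, len(frames) - window + 1, stride):
        yield frames[start:start + window]
```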
StyleAvatar combines the strengths of a UNet, StyleGAN, and time coding in a design tailored to video learning, enabling the method to produce detailed and temporally consistent portrait reconstructions. Notably, the method incorporates Neural Textures to expedite convergence and enhance rendering fidelity, which is crucial for achieving its 20ms rendering time.
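To illustrate one of these ingredients, the snippet below sketches a sinusoidal time-coding scheme and a hypothetical way of conditioning a StyleGAN-like latent on it. Both the encoding and the conditioning path are assumptions for illustration; the paper's actual time coding may differ.

```python
import math
import torch

def time_code(t, dim=64):
    """Sinusoidal time embedding: one plausible form of a 'time coding' for
    temporal stability. t is a 1-D tensor of frame indices; returns (len(t), dim)."""
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / dim))
    angles = t.float().unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

# Hypothetical conditioning: concatenate the time code with a style latent
# before it enters a StyleGAN-like synthesis network.
style = torch.randn(4, 512)                      # batch of latent codes
codes = time_code(torch.arange(4))               # (4, 64) time embeddings
conditioned = torch.cat([style, codes], dim=1)   # (4, 576) generator input
```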
Numerical Results and Implications
Empirical evaluations demonstrate that StyleAvatar significantly surpasses existing methods in image fidelity, as evidenced by SSIM, PSNR, and FID scores that are consistently better than those of competing techniques such as DaGAN and Next3D. In addition to high-quality video generation, the framework's training efficiency is noteworthy, with convergence achieved within two hours, a considerable improvement over baselines requiring substantially longer training.
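For readers who want to reproduce the image-fidelity comparison, SSIM and PSNR can be computed with standard tooling such as scikit-image. The snippet below uses random placeholder frames and is not the authors' evaluation code; real evaluation would load corresponding ground-truth and rendered video frames.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Dummy frames standing in for a ground-truth frame and a rendered avatar frame.
reference = np.random.rand(256, 256, 3).astype(np.float32)
rendered = np.clip(reference + 0.01 * np.random.randn(256, 256, 3), 0, 1).astype(np.float32)

ssim = structural_similarity(reference, rendered, channel_axis=-1, data_range=1.0)
psnr = peak_signal_noise_ratio(reference, rendered, data_range=1.0)
print(f"SSIM: {ssim:.4f}, PSNR: {psnr:.2f} dB")
```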
The proposed framework's potential applications span various domains, most notably real-time avatar reanimation systems, where the ability to render a digital portrait within roughly 20 milliseconds enables interactive graphics applications.
Challenges and Future Directions
Despite its advantages, the system exhibits limitations, particularly in modeling expressions and poses that extend beyond the variance of the original video. Future research could explore the integration of more sophisticated 3D modeling techniques to offer comprehensive control over exaggerated expressions and rotations. Moreover, accurately capturing fine-grained mouth movements during reenactment remains an open challenge.
The pre-training strategy, augmented by a compact video dataset, proved effective in accelerating training times. However, future studies may consider larger and more diverse datasets to maximize the generalization capabilities of the model.
Conclusion
The proposed StyleAvatar framework represents an advancement in the field of real-time video portrait generation, with superior performance metrics and computational efficiency. While limitations remain in expression control and rotational modeling, the model lays the groundwork for future exploration and application in AI-driven facial reenactment technologies. This work could inspire new research directions at the intersection of graphics and interactive systems.