- The paper presents a bi-layer neural rendering framework that generates one-shot photorealistic head avatars with 42 ms inference on a smartphone GPU.
- It decomposes synthesis into a coarse, pose-dependent layer and a high-frequency texture layer that is generated and refined once, offline, so fine details are preserved at no per-frame cost.
- Meta-learning on a diverse dataset ensures robust generalization, maintaining accurate identity preservation and pose alignment across varied inputs.
Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars
The paper presents a novel approach to synthesizing realistic head avatars from a single image using a bi-layer neural rendering framework. The method generates photorealistic avatars efficiently by decomposing the person's visual appearance into two layers: a pose-dependent coarse layer and a pose-independent detailed texture layer. This decomposition lets the neural network run significantly faster than existing frameworks, which typically require far more computation and time to reach a similar level of realism.
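As a rough illustration of this decomposition, the sketch below (PyTorch-style, with arbitrary tensor shapes; the paper's actual architecture and compositing details differ) shows the inference-time step: a pose-driven network would output a coarse RGB layer and a warp field, the precomputed texture is warped by that field, and the two layers are summed.

```python
# Minimal sketch of the bi-layer composition step (not the paper's exact architecture):
# a pose-driven network outputs a coarse RGB image plus a warp field, the precomputed
# high-frequency texture is warped by that field, and the two layers are summed.
import torch
import torch.nn.functional as F

B, H, W = 1, 256, 256

# Outputs that the pose-dependent (per-frame) network would predict.
coarse_rgb = torch.rand(B, 3, H, W)           # low-frequency image layer
warp_field = torch.rand(B, H, W, 2) * 2 - 1   # sampling grid in [-1, 1]

# High-frequency texture, generated once per avatar (offline) from the source image.
texture = torch.rand(B, 3, H, W)

# Warp the static texture into the target pose and composite with the coarse layer.
warped_texture = F.grid_sample(texture, warp_field, align_corners=False)
final_frame = torch.clamp(coarse_rgb + warped_texture, 0.0, 1.0)

print(final_frame.shape)  # torch.Size([1, 3, 256, 256])
```

Because the texture is fixed per avatar, only the coarse network and the warp need to run for each new frame, which is what makes mobile-rate inference feasible.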
Methodology Overview
In the proposed approach, a small neural network synthesizes a coarse image conditioned on the target pose, while a separate high-frequency texture image is generated offline. The bi-layer synthesis process has three components:
- Coarse Layer Synthesis: A lightweight neural network predicts a coarse image that captures the essential facial geometry and pose; its reduced complexity keeps per-frame inference fast.
- Texture Layer Integration: A static, high-resolution texture image, precomputed by a texture network trained across many individuals, is warped at inference time to align with the predicted coarse layer, preserving fine details in the final frame.
- Meta-Learning for Generalization: During training, meta-learning over a broad multi-identity dataset lets the network generalize appearance features across varied inputs, so it can produce convincing avatars of unseen individuals from a single input image (a schematic training sketch follows this list).
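The sketch below illustrates the episodic, multi-identity training regime described above. `TinyAvatarNet`, `sample_episode`, and the single L1 loss are placeholder stand-ins rather than the paper's components; the actual system uses much larger pose and texture networks together with perceptual and adversarial losses.

```python
# Schematic sketch of episodic, multi-identity training: each step takes a source
# frame of one person, drives it with another frame of the same person, and updates
# shared weights from a reconstruction loss (self-reenactment).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAvatarNet(nn.Module):
    """Toy stand-in for the shared generator; the real model is far larger."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Conv2d(3, 8, 3, padding=1)       # source appearance -> features
        self.decode = nn.Conv2d(8 + 3, 3, 3, padding=1)   # features + driver frame -> output

    def forward(self, source_img, driver_img):
        feats = F.relu(self.encode(source_img))
        return torch.sigmoid(self.decode(torch.cat([feats, driver_img], dim=1)))

def sample_episode():
    """Hypothetical loader: returns (source, driver, target) frames of one identity."""
    source = torch.rand(1, 3, 64, 64)
    driver = torch.rand(1, 3, 64, 64)
    return source, driver, driver  # in self-reenactment the driver frame is the target

net = TinyAvatarNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

for step in range(100):                       # one identity per episode
    source, driver, target = sample_episode()
    pred = net(source, driver)
    loss = F.l1_loss(pred, target)            # placeholder for perceptual/adversarial terms
    opt.zero_grad()
    loss.backward()
    opt.step()
```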
Results and Evaluation
Compared to existing methods, the proposed system demonstrates a substantial improvement in inference speed, achieving a rendering time of 42 milliseconds on a smartphone GPU (Adreno 640, Snapdragon 855), making real-time mobile deployment viable. In terms of visual fidelity, the method competes favorably against state-of-the-art systems, producing convincing results with minimal identity and pose discrepancies.
Key metrics such as learned perceptual image patch similarity (LPIPS), cosine similarity of identity embeddings (CSIM), and normalized mean error (NME) of pose alignment indicate that the system effectively balances speed with quality. The authors also report user studies in which their approach often matches or surpasses alternatives in perceived visual quality.
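For orientation, the sketch below shows how two of these metrics are commonly computed: CSIM as the cosine similarity between identity embeddings of real and generated faces, and NME as the mean landmark error divided by a face-size normalizer. The embedding and landmark tensors here are random stand-ins for the outputs of pretrained face-recognition and landmark models, and LPIPS would typically be computed with a released perceptual-similarity package rather than re-implemented.

```python
# Hedged sketch of CSIM and NME computation; the inputs below are random stand-ins
# for the outputs of pretrained face-recognition and facial-landmark models.
import torch
import torch.nn.functional as F

def csim(embed_real: torch.Tensor, embed_fake: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between identity embeddings of real and generated faces."""
    return F.cosine_similarity(embed_real, embed_fake, dim=-1).mean()

def nme(landmarks_pred: torch.Tensor, landmarks_gt: torch.Tensor,
        norm_factor: float) -> torch.Tensor:
    """Normalized mean error: mean landmark distance divided by a face-size
    normalizer (e.g., inter-ocular distance or bounding-box diagonal)."""
    per_point = torch.linalg.norm(landmarks_pred - landmarks_gt, dim=-1)  # (N, K)
    return per_point.mean() / norm_factor

# Toy usage with random stand-ins for model outputs.
emb_real, emb_fake = torch.randn(4, 512), torch.randn(4, 512)
lm_pred, lm_gt = torch.rand(4, 68, 2) * 256, torch.rand(4, 68, 2) * 256
print(csim(emb_real, emb_fake).item(), nme(lm_pred, lm_gt, norm_factor=256.0).item())
```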
Implications and Future Directions
Practically, this research streamlines the creation of user-specific avatars with minimal input, enhancing applications in telepresence, gaming, augmented reality, and content creation. The bi-layer design, which compartmentalizes the synthesis into distinct phases for speed and detail, could serve as a foundation for further innovation in neural rendering.
Future explorations might investigate enhancing the robustness of texture warping, leveraging more sophisticated meta-learning protocols, or integrating additional modalities (e.g., audio) for more dynamic avatar interactions. As deep learning techniques evolve, combining them with efficient rendering strategies like the one proposed here could redefine interactive virtual environments across less computationally capable platforms. Such technical evolution holds great promise not only for individual-focused applications but also for broader advancements in AI-driven image synthesis.