- The paper introduces Avat3r, a novel method that reconstructs high-fidelity, animatable 3D head avatars from a few images using Gaussian predictions.
- The paper leverages the DUSt3R and Sapiens foundation models to supply per-pixel position and feature maps, which a Vision Transformer backbone fuses with the input images for robust 3D modeling.
- The paper demonstrates competitive performance in both few-input and single-input scenarios, producing smoother video renderings and enabling cost-effective avatar creation on a single consumer-grade GPU.
This paper introduces Avat3r, a new method for creating high-quality, animatable 3D head avatars from just a few images of a person's head. This is important because traditional methods require expensive studio setups and a lot of computation, making it difficult for regular users to create their own digital avatars. Avat3r simplifies this process, allowing users to create a 3D avatar from a few smartphone photos in minutes using a single, consumer-grade graphics processing unit (GPU).
Here’s a breakdown of how Avat3r works:
1. Background and Challenges
Creating realistic 3D head avatars from a small number of images is difficult for a few reasons:
- Sparse 3D Reconstruction: When you only have a few images, parts of the head (like the inside of the mouth or the sides) might not be visible in any of the pictures, making it hard to create a complete 3D model.
- Face Animation: Animating the face to show different expressions requires the system to understand how the face should move, even if it has never seen that particular person making those expressions.
- Robust Reconstruction: The input images might not be perfect. For example, the person might have moved slightly while the pictures were being taken, leading to inconsistencies.
Avat3r is designed to address all of these challenges simultaneously.
2. Method Overview
Avat3r uses a Vision Transformer architecture to predict 3D Gaussians for each pixel in the input images. Instead of starting with a basic 3D head shape and adjusting it, Avat3r predicts a set of 3D Gaussians directly from the images. These Gaussians are small 3D shapes that, when combined, create the final 3D head avatar.
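The summary describes the predicted "Gaussian attribute maps" only abstractly. As a point of reference, the sketch below shows one plausible way to store such per-pixel attributes, following the standard 3D Gaussian splatting parameterization (position, rotation, scale, opacity, color); the tensor layout, attribute set, and the class name `GaussianMaps` are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class GaussianMaps:
    """Per-pixel Gaussian attributes predicted for V input views at HxW resolution.

    Attribute set follows standard 3D Gaussian splatting; the paper's exact
    parameterization may differ.
    """
    positions: torch.Tensor   # (V, H, W, 3)  3D center of each Gaussian
    rotations: torch.Tensor   # (V, H, W, 4)  unit quaternion
    scales: torch.Tensor      # (V, H, W, 3)  per-axis extent
    opacities: torch.Tensor   # (V, H, W, 1)  transparency
    colors: torch.Tensor      # (V, H, W, 3)  RGB (or SH coefficients)

    def flatten(self) -> torch.Tensor:
        """Concatenate all attributes into a single dense map M of shape (V, H, W, 14)."""
        return torch.cat(
            [self.positions, self.rotations, self.scales, self.opacities, self.colors],
            dim=-1,
        )
```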
Here's a step-by-step breakdown:
- Input: The system takes a few images of a person's head, their corresponding camera parameters, and a code describing the desired facial expression:
$\mathcal{I} = \{I_1, \dots, I_V\}$ are the $V$ input images.
$\pi$ are their corresponding camera parameters; they tell the system where the camera was when each picture was taken.
$z_\text{exp}$ is a code describing the desired facial expression.
- Foundation Models: Avat3r uses two pre-trained models to help with the reconstruction:
DUSt3R: This model creates "position maps" ($I_\text{pos}$) for each input image. These maps provide a rough estimate of the 3D position of each pixel, giving the 3D Gaussians a starting point. DUSt3R also produces a confidence map ($I_\text{conf}$) indicating how reliable the position estimate is for each pixel.
$$I_\text{pos}, I_\text{conf} = \text{DUSt3R}(\mathcal{I}, \pi)$$
Sapiens: This model creates feature maps ($I_\text{feat}$) for each image. These feature maps capture important information about the image, such as the shape and texture of the face, which helps the system match features across different views.
$$I_\text{feat} = \text{Sapiens}(\mathcal{I})$$
- Vision Transformer Backbone: The core of Avat3r is a Vision Transformer. The input images, position maps, and feature maps are converted into a series of "tokens," which are then processed by the Transformer. The Transformer uses a self-attention mechanism to compare tokens within the same image and across different views, allowing it to infer 3D structure from the input images.
- Animation: To animate the 3D head, Avat3r uses cross-attention layers. These layers allow the image tokens to "attend" to the expression code $z_\text{exp}$, which is first processed by a multilayer perceptron (MLP) into a sequence of expression tokens $f_\text{exp}$. This tells the system how to modify the 3D Gaussians to create the desired facial expression (see the attention sketch after this list).
$$f_\text{exp} = \text{MLP}(z_\text{exp})$$
- Upsampling and Skip Connections: The image tokens are then upsampled to the original input resolution, and skip connections are added from the DUSt3R position maps and the original input images. The skip connections help to preserve fine details and ensure that the 3D Gaussians are placed in the correct locations.
- Gaussian Selection: The confidence maps from DUSt3R are used to decide which pixels in the predicted Gaussian attribute maps $M$ should actually spawn a 3D Gaussian. Pixels with low confidence (below a threshold $\tau$) are discarded. This removes artifacts and adapts the number of Gaussians to the person being modeled (see the filtering sketch after this list).
$$\mathcal{G} = \{\, M[x, y] : I_\text{conf}[x, y] > \tau \,\}$$
- Rendering: The final set of 3D Gaussians $\mathcal{G}$ can then be rendered from any desired viewpoint $\pi_\text{nv}$ using a differentiable rasterizer $\mathcal{R}$. This means that the system can create images of the 3D head avatar from any angle.
$$I_\text{nv} = \mathcal{R}(\mathcal{G}, \pi_\text{nv})$$
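The attention sketch referenced in the Animation step: a minimal PyTorch block that performs self-attention over tokens from all views, followed by cross-attention to expression tokens derived from $z_\text{exp}$. The layer sizes, the 64-dimensional expression code, the number of expression tokens, and the use of `nn.MultiheadAttention` are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class ExpressionConditionedBlock(nn.Module):
    """One transformer block: multi-view self-attention, then cross-attention
    to expression tokens. Dimensions and structure are illustrative only."""

    def __init__(self, dim: int = 512, heads: int = 8,
                 exp_dim: int = 64, n_exp_tokens: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Maps the expression code z_exp to a sequence of expression tokens f_exp.
        self.exp_mlp = nn.Sequential(
            nn.Linear(exp_dim, dim), nn.GELU(), nn.Linear(dim, dim * n_exp_tokens)
        )
        self.n_exp_tokens, self.dim = n_exp_tokens, dim

    def forward(self, tokens: torch.Tensor, z_exp: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V * N, dim) -- patch tokens from all V views concatenated,
        # so self-attention can compare patches both within and across views.
        h = self.norm1(tokens)
        x = tokens + self.self_attn(h, h, h)[0]
        # f_exp: (B, n_exp_tokens, dim) expression tokens produced by the MLP.
        f_exp = self.exp_mlp(z_exp).view(-1, self.n_exp_tokens, self.dim)
        x = x + self.cross_attn(self.norm2(x), f_exp, f_exp)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```

One appealing property of this design is that the same image tokens can be combined with any expression code, so a single reconstruction can be animated to arbitrary expressions.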
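The filtering sketch referenced in the Gaussian Selection step: the confidence test amounts to a boolean mask over the predicted attribute map. The function name, tensor shapes, and threshold value are illustrative, and the rasterizer call in the usage comment is a placeholder rather than a real API.

```python
import torch

def select_gaussians(attr_map: torch.Tensor,
                     conf_map: torch.Tensor,
                     tau: float = 1.5) -> torch.Tensor:
    """Keep only Gaussians whose DUSt3R confidence exceeds the threshold tau.

    attr_map: (V, H, W, C) per-pixel Gaussian attribute map M
    conf_map: (V, H, W)    DUSt3R confidence map I_conf
    returns:  (K, C)       attributes of the K surviving Gaussians
    """
    mask = conf_map > tau   # boolean mask, one entry per pixel and view
    return attr_map[mask]   # G = {M[x, y] : I_conf[x, y] > tau}

# Hypothetical usage: the surviving Gaussians are handed to a differentiable
# Gaussian splatting rasterizer to produce a novel view.
# gaussians = select_gaussians(M, I_conf, tau=1.5)       # tau is illustrative
# image_nv = rasterize(gaussians, camera_nv)             # placeholder, not a real API
```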
3. Training
Avat3r is trained using a combination of loss functions that encourage the generated avatars to be both realistic and accurate:
- Photometric Losses: These losses measure the difference between the rendered images and the ground-truth images. Avat3r uses both an L1 loss ($\mathcal{L}_\text{L1}$) and a Structural Similarity Index (SSIM) loss ($\mathcal{L}_\text{SSIM}$). The L1 loss measures the absolute difference between pixel values, while the SSIM loss measures the structural similarity between the images.
$$\mathcal{L}_\text{L1} = \lVert I_\text{nv} - I_\text{gt} \rVert_1$$
$$\mathcal{L}_\text{SSIM} = \text{SSIM}(I_\text{nv}, I_\text{gt})$$
- Perceptual Loss: This loss encourages more high-frequency detail in the generated avatars. Avat3r uses the Learned Perceptual Image Patch Similarity (LPIPS) loss ($\mathcal{L}_\text{LPIPS}$).
$$\mathcal{L}_\text{LPIPS} = \text{LPIPS}(I_\text{nv}, I_\text{gt})$$
The final loss function is a weighted sum of these individual losses:
$$\mathcal{L} = \lambda_\text{L1}\,\mathcal{L}_\text{L1} + \lambda_\text{SSIM}\,\mathcal{L}_\text{SSIM} + \lambda_\text{LPIPS}\,\mathcal{L}_\text{LPIPS}$$
- $\lambda_\text{L1}$ is the weight for the L1 loss.
- $\lambda_\text{SSIM}$ is the weight for the SSIM loss.
- $\lambda_\text{LPIPS}$ is the weight for the LPIPS loss.
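To make the weighted loss above concrete, here is a minimal sketch that combines the three terms using the off-the-shelf lpips and pytorch_msssim packages. The weight values and the use of 1 − SSIM as the SSIM loss term are assumptions for illustration; the paper's exact settings are not given in this summary.

```python
import torch
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="vgg")  # LPIPS network; expects inputs in [-1, 1]

def avatar_loss(img_nv: torch.Tensor,
                img_gt: torch.Tensor,
                w_l1: float = 1.0,
                w_ssim: float = 0.2,
                w_lpips: float = 0.1) -> torch.Tensor:
    """Weighted sum of L1, SSIM, and LPIPS losses between a rendered novel
    view and the ground-truth image. Images: (B, 3, H, W) in [0, 1].
    The weights here are placeholders, not the paper's values."""
    l1 = torch.abs(img_nv - img_gt).mean()
    # SSIM is a similarity (1 = identical), so 1 - SSIM is used as the loss term.
    l_ssim = 1.0 - ssim(img_nv, img_gt, data_range=1.0)
    # LPIPS expects inputs scaled to [-1, 1].
    l_lpips = lpips_fn(img_nv * 2 - 1, img_gt * 2 - 1).mean()
    return w_l1 * l1 + w_ssim * l_ssim + w_lpips * l_lpips
```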
4. Results
The paper compares Avat3r to other state-of-the-art methods for 3D head avatar creation. The results show that Avat3r performs competitively in both few-input and single-input scenarios; in particular, it produces 3D head avatars that better resemble the person in the input images and yields smoother video renderings.
5. Ablation Studies
The paper includes a number of ablation studies that analyze the importance of different components of the Avat3r pipeline. These studies show that:
- DUSt3R position maps are important for geometric fidelity.
- Sapiens feature maps help to produce sharper predictions.
- Training with inconsistent input images improves the robustness of the model.
6. Applications
The paper demonstrates a number of applications of Avat3r, showing that it can be used in a variety of casual capture situations, even with out-of-domain input examples.
7. Limitations
The paper also discusses some limitations of Avat3r. For example, the current pipeline for inferring avatars from a single image requires a 3D Generative Adversarial Network (GAN) for 3D lifting, which can introduce errors. Additionally, the system requires camera poses during inference, and the current pipeline does not provide control over the lighting.
In summary, Avat3r is a novel method for creating high-quality, animatable 3D head avatars from just a few images. The method combines recent advancements in large reconstruction models with powerful foundation models and demonstrates strong performance in a variety of scenarios.