- The paper introduces Avat3r, a novel method that reconstructs high-fidelity, animatable 3D head avatars from a few images using Gaussian predictions.
- The paper leverages the DUSt3R and Sapiens foundation models to supply per-pixel position and feature maps, which a Vision Transformer backbone fuses with the input images for robust 3D modeling.
- The paper demonstrates competitive performance in both few-input and single-input scenarios, producing smoother video renderings and enabling cost-effective avatar creation on a single consumer-grade GPU.
This paper introduces Avat3r, a new method for creating high-quality, animatable 3D head avatars from just a few images of a person's head. This is important because traditional methods require expensive studio setups and a lot of computation, making it difficult for regular users to create their own digital avatars. Avat3r simplifies this process, allowing users to create a 3D avatar from a few smartphone photos in minutes using a single, consumer-grade graphics processing unit (GPU).
Here’s a breakdown of how Avat3r works:
1. Background and Challenges
Creating realistic 3D head avatars from a small number of images is difficult for a few reasons:
- Sparse 3D Reconstruction: When you only have a few images, parts of the head (like the inside of the mouth or the sides) might not be visible in any of the pictures, making it hard to create a complete 3D model.
- Face Animation: Animating the face to show different expressions requires the system to understand how the face should move, even if it has never seen that particular person making those expressions.
- Robust Reconstruction: The input images might not be perfect. For example, the person might have moved slightly while the pictures were being taken, leading to inconsistencies.
Avat3r is designed to address all of these challenges simultaneously.
2. Method Overview
Avat3r uses a Vision Transformer architecture to predict 3D Gaussians for each pixel in the input images. Instead of starting with a basic 3D head shape and adjusting it, Avat3r predicts a set of 3D Gaussians directly from the images. These Gaussians are small 3D shapes that, when combined, create the final 3D head avatar.
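The summary describes the predicted "Gaussian attribute maps" only abstractly. As a point of reference, the sketch below shows one plausible way to store such per-pixel attributes, following the standard 3D Gaussian splatting parameterization (position, rotation, scale, opacity, color); the tensor layout, attribute set, and the class name `GaussianMaps` are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class GaussianMaps:
    """Per-pixel Gaussian attributes predicted for V input views at HxW resolution.

    Attribute set follows standard 3D Gaussian splatting; the paper's exact
    parameterization may differ.
    """
    positions: torch.Tensor   # (V, H, W, 3)  3D center of each Gaussian
    rotations: torch.Tensor   # (V, H, W, 4)  unit quaternion
    scales: torch.Tensor      # (V, H, W, 3)  per-axis extent
    opacities: torch.Tensor   # (V, H, W, 1)  transparency
    colors: torch.Tensor      # (V, H, W, 3)  RGB (or SH coefficients)

    def flatten(self) -> torch.Tensor:
        """Concatenate all attributes into a single dense map M of shape (V, H, W, 14)."""
        return torch.cat(
            [self.positions, self.rotations, self.scales, self.opacities, self.colors],
            dim=-1,
        )
```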
Here's a step-by-step breakdown:
- Input: The system takes a few images of a person's head, their corresponding camera parameters, and a code describing the desired facial expression:
$\mathcal{I} = \{I_1, \dots, I_V\}$ are the $V$ input images.
$\pi$ are their corresponding camera parameters; they tell the system where the camera was when each picture was taken.
$z_\text{exp}$ is a code describing the desired facial expression.
- Foundation Models: Avat3r uses two pre-trained models to help with the reconstruction:
DUSt3R: This model creates "position maps" ($I_\text{pos}$) for each input image. These maps provide a rough estimate of the 3D position of each pixel, giving the 3D Gaussians a starting point. DUSt3R also produces a confidence map ($I_\text{conf}$) indicating how reliable the position estimate is for each pixel.
$$I_\text{pos}, I_\text{conf} = \text{DUSt3R}(\mathcal{I}, \pi)$$
Sapiens: This model creates feature maps ($I_\text{feat}$) for each image. These feature maps capture important information about the image, such as the shape and texture of the face, which helps the system match features across different views.
$$I_\text{feat} = \text{Sapiens}(\mathcal{I})$$
- Vision Transformer Backbone: The core of Avat3r is a Vision Transformer. The input images, position maps, and feature maps are converted into a series of "tokens," which are then processed by the Transformer. The Transformer uses a self-attention mechanism to compare tokens within the same image and across different views, allowing it to infer 3D structure from the input images.
- Animation: To animate the 3D head, Avat3r uses cross-attention layers. These layers allow the image tokens to "attend" to the expression code $z_\text{exp}$, which is first processed by a multilayer perceptron (MLP) into a sequence of expression tokens $f_\text{exp}$. This tells the system how to modify the 3D Gaussians to create the desired facial expression (see the attention sketch after this list).
$$f_\text{exp} = \text{MLP}(z_\text{exp})$$
- Upsampling and Skip Connections: The image tokens are then upsampled to the original input resolution, and skip connections are added from the DUSt3R position maps and the original input images. The skip connections help to preserve fine details and ensure that the 3D Gaussians are placed in the correct locations.
- Gaussian Selection: The confidence maps from DUSt3R are used to decide which pixels in the predicted Gaussian attribute maps $M$ should actually spawn a 3D Gaussian. Pixels with low confidence (below a threshold $\tau$) are discarded. This removes artifacts and adapts the number of Gaussians to the person being modeled (see the filtering sketch after this list).
$$\mathcal{G} = \{\, M[x, y] : I_\text{conf}[x, y] > \tau \,\}$$
- Rendering: The final set of 3D Gaussians $\mathcal{G}$ can then be rendered from any desired viewpoint $\pi_\text{nv}$ using a differentiable rasterizer $\mathcal{R}$. This means that the system can create images of the 3D head avatar from any angle.
$$I_\text{nv} = \mathcal{R}(\mathcal{G}, \pi_\text{nv})$$
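The attention sketch referenced in the Animation step: a minimal PyTorch block that performs self-attention over tokens from all views, followed by cross-attention to expression tokens derived from $z_\text{exp}$. The layer sizes, the 64-dimensional expression code, the number of expression tokens, and the use of `nn.MultiheadAttention` are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class ExpressionConditionedBlock(nn.Module):
    """One transformer block: multi-view self-attention, then cross-attention
    to expression tokens. Dimensions and structure are illustrative only."""

    def __init__(self, dim: int = 512, heads: int = 8,
                 exp_dim: int = 64, n_exp_tokens: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Maps the expression code z_exp to a sequence of expression tokens f_exp.
        self.exp_mlp = nn.Sequential(
            nn.Linear(exp_dim, dim), nn.GELU(), nn.Linear(dim, dim * n_exp_tokens)
        )
        self.n_exp_tokens, self.dim = n_exp_tokens, dim

    def forward(self, tokens: torch.Tensor, z_exp: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V * N, dim) -- patch tokens from all V views concatenated,
        # so self-attention can compare patches both within and across views.
        h = self.norm1(tokens)
        x = tokens + self.self_attn(h, h, h)[0]
        # f_exp: (B, n_exp_tokens, dim) expression tokens produced by the MLP.
        f_exp = self.exp_mlp(z_exp).view(-1, self.n_exp_tokens, self.dim)
        x = x + self.cross_attn(self.norm2(x), f_exp, f_exp)[0]
        x = x + self.mlp(self.norm3(x))
        return x
```

One appealing property of this design is that the same image tokens can be combined with any expression code, so a single reconstruction can be animated to arbitrary expressions.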
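The filtering sketch referenced in the Gaussian Selection step: the confidence test amounts to a boolean mask over the predicted attribute map. The function name, tensor shapes, and threshold value are illustrative, and the rasterizer call in the usage comment is a placeholder rather than a real API.

```python
import torch

def select_gaussians(attr_map: torch.Tensor,
                     conf_map: torch.Tensor,
                     tau: float = 1.5) -> torch.Tensor:
    """Keep only Gaussians whose DUSt3R confidence exceeds the threshold tau.

    attr_map: (V, H, W, C) per-pixel Gaussian attribute map M
    conf_map: (V, H, W)    DUSt3R confidence map I_conf
    returns:  (K, C)       attributes of the K surviving Gaussians
    """
    mask = conf_map > tau   # boolean mask, one entry per pixel and view
    return attr_map[mask]   # G = {M[x, y] : I_conf[x, y] > tau}

# Hypothetical usage: the surviving Gaussians are handed to a differentiable
# Gaussian splatting rasterizer to produce a novel view.
# gaussians = select_gaussians(M, I_conf, tau=1.5)       # tau is illustrative
# image_nv = rasterize(gaussians, camera_nv)             # placeholder, not a real API
```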
3. Training
Avat3r is trained using a combination of loss functions that encourage the generated avatars to be both realistic and accurate:
- Photometric Losses: These losses measure the difference between the rendered images and the ground-truth images. Avat3r uses both an L1 loss ($\mathcal{L}_\text{L1}$) and a Structural Similarity Index (SSIM) loss ($\mathcal{L}_\text{SSIM}$). The L1 loss measures the absolute difference between pixel values, while the SSIM loss measures the structural similarity between the images.
$$\mathcal{L}_\text{L1} = \lVert I_\text{nv} - I_\text{gt} \rVert_1$$
$$\mathcal{L}_\text{SSIM} = \text{SSIM}(I_\text{nv}, I_\text{gt})$$
- Perceptual Loss: This loss encourages more high-frequency detail in the generated avatars. Avat3r uses the Learned Perceptual Image Patch Similarity (LPIPS) loss ($\mathcal{L}_\text{LPIPS}$).
$$\mathcal{L}_\text{LPIPS} = \text{LPIPS}(I_\text{nv}, I_\text{gt})$$
The final loss function is a weighted sum of these individual losses:
$$\mathcal{L} = \lambda_\text{L1}\,\mathcal{L}_\text{L1} + \lambda_\text{SSIM}\,\mathcal{L}_\text{SSIM} + \lambda_\text{LPIPS}\,\mathcal{L}_\text{LPIPS}$$
- $\lambda_\text{L1}$ is the weight for the L1 loss.
- $\lambda_\text{SSIM}$ is the weight for the SSIM loss.
- $\lambda_\text{LPIPS}$ is the weight for the LPIPS loss.
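To make the weighted loss above concrete, here is a minimal sketch that combines the three terms using the off-the-shelf lpips and pytorch_msssim packages. The weight values and the use of 1 − SSIM as the SSIM loss term are assumptions for illustration; the paper's exact settings are not given in this summary.

```python
import torch
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="vgg")  # LPIPS network; expects inputs in [-1, 1]

def avatar_loss(img_nv: torch.Tensor,
                img_gt: torch.Tensor,
                w_l1: float = 1.0,
                w_ssim: float = 0.2,
                w_lpips: float = 0.1) -> torch.Tensor:
    """Weighted sum of L1, SSIM, and LPIPS losses between a rendered novel
    view and the ground-truth image. Images: (B, 3, H, W) in [0, 1].
    The weights here are placeholders, not the paper's values."""
    l1 = torch.abs(img_nv - img_gt).mean()
    # SSIM is a similarity (1 = identical), so 1 - SSIM is used as the loss term.
    l_ssim = 1.0 - ssim(img_nv, img_gt, data_range=1.0)
    # LPIPS expects inputs scaled to [-1, 1].
    l_lpips = lpips_fn(img_nv * 2 - 1, img_gt * 2 - 1).mean()
    return w_l1 * l1 + w_ssim * l_ssim + w_lpips * l_lpips
```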
4. Results
The paper compares Avat3r to other state-of-the-art methods for 3D head avatar creation. The results show that Avat3r performs competitively in both few-input and single-input scenarios; in particular, it produces 3D head avatars that better resemble the person in the input images and yields smoother video renderings.
5. Ablation Studies
The paper includes a number of ablation studies that analyze the importance of different components of the Avat3r pipeline. These studies show that:
- DUSt3R position maps are important for geometric fidelity.
- Sapiens feature maps help to produce sharper predictions.
- Training with inconsistent input images improves the robustness of the model.
6. Applications
The paper demonstrates a number of applications of Avat3r, showing that it can be used in a variety of casual capture situations, even with out-of-domain input examples.
7. Limitations
The paper also discusses some limitations of Avat3r. For example, the current pipeline for inferring avatars from a single image requires a 3D Generative Adversarial Network (GAN) for 3D lifting, which can introduce errors. Additionally, the system requires camera poses during inference, and the current pipeline does not provide control over the lighting.
In summary, Avat3r is a novel method for creating high-quality, animatable 3D head avatars from just a few images. The method combines recent advancements in large reconstruction models with powerful foundation models and demonstrates strong performance in a variety of scenarios.