
Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images (2510.14081v2)

Published 15 Oct 2025 in cs.CV and cs.GR

Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

Summary

  • The paper introduces a novel zero-shot pipeline that captures, canonicalizes, and splats phone images into 3D Gaussian avatars, addressing geometric inconsistency and identity loss.
  • The method employs a transformer-based reconstruction model trained on a high-fidelity dataset of 3.2K avatars and 5M renders to preserve fine details and authentic human features.
  • Quantitative ablation studies demonstrate improved PSNR (33.5) and enhanced realism, setting a new benchmark for photorealistic 3D human digitization from minimal input.

Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images: The "Capture, Canonicalize, Splat" Pipeline

Introduction and Motivation

The paper introduces a zero-shot pipeline for generating hyperrealistic, identity-preserving 3D avatars from a sparse set of unstructured phone images. The approach addresses two persistent challenges in 3D human digitization: (1) geometric inconsistency and identity loss in single-view or poorly calibrated multi-view reconstructions, and (2) lack of high-frequency, person-specific details in models trained on synthetic datasets. The proposed solution leverages a generative canonicalization module and a transformer-based reconstruction model, both trained on a novel dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real individuals.

Figure 1: The "Capture, Canonicalize, Splat" pipeline: unstructured phone images are canonicalized and lifted into a hyperrealistic 3D Gaussian splatting avatar.

High-Fidelity Human Avatar Dataset

A central contribution is the construction of a large-scale dataset of person-specific Gaussian splatting avatars. Unlike synthetic datasets (e.g., RenderPeople, Objaverse), this dataset is derived from calibrated multi-view dome captures, enabling the retention of fine details such as skin microgeometry and complex hair structures. The dataset comprises 3.2K avatars and 5M rendered images, supporting both canonical multi-view and simulated unstructured captures. This data generation strategy provides strong priors for realistic human appearance and precise 3D supervision.

Figure 2: Workflow for creating the high-fidelity dataset: dome capture, Gaussian avatar optimization, and large-scale rendering for training.

Generative Canonicalization Module

The first stage of the pipeline is a generative canonicalization module that processes N unstructured phone images (typically N = 4: front, back, left, right) and synthesizes M 3D-consistent, canonicalized views with fixed camera parameters. The module aggregates identity information, enforces 3D consistency, and synthesizes novel views to fill in missing information. This step is critical for robust identity preservation and geometric coherence, as single-view conditioning leads to hallucinations and identity drift, especially for occluded regions.
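The canonical targets above are defined by fixed camera parameters. As an illustration of what a fixed canonical camera rig could look like, here is a minimal numpy sketch; the summary does not specify the actual rig, so the ring layout, radius, and view count below are assumptions, not the paper's values:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """World-to-camera extrinsics for a camera at `eye` looking at `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])  # rows: camera axes in world coords
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -R @ eye  # translation maps the eye position to the origin
    return E

def canonical_cameras(num_views=8, radius=2.5, height=0.0):
    """Fixed ring of cameras: one extrinsic matrix per canonical view slot."""
    cams = []
    for k in range(num_views):
        theta = 2 * np.pi * k / num_views
        eye = np.array([radius * np.sin(theta), height, radius * np.cos(theta)])
        cams.append(look_at(eye))
    return cams
```

Because every avatar is synthesized against the same camera slots, the downstream reconstruction model never has to infer camera pose from the images themselves.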

Figure 3: Multi-view conditioning is essential for identity preservation; single-view reconstructions fail for unseen areas.

Multi-View to 3D Gaussian Splatting Reconstruction

The canonicalized views are then processed by a transformer-based Large Reconstruction Model, which predicts the parameters of K 3D Gaussians (position, covariance, color, opacity). The model is trained end-to-end on the high-fidelity dataset, using a composite loss: L1 photometric, LPIPS perceptual, alpha mask, and scale regularization. This enables the model to capture intricate details and avoid degenerate Gaussian distributions. The pipeline demonstrates that both high-quality training data and multi-view inputs are essential for high PSNR and realistic reconstructions.
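A minimal sketch of the two ingredients named above: the per-Gaussian covariance parameterization and the composite loss. The covariance construction (Sigma = R S S^T R^T from per-axis scales and a rotation quaternion) follows the standard 3D Gaussian splatting convention; the loss weights, the pluggable LPIPS term, and the max-scale threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def covariance_from_scale_rotation(scale, quat):
    """3x3 Gaussian covariance from per-axis scales and a unit quaternion (w, x, y, z)."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def composite_loss(pred_rgb, gt_rgb, pred_alpha, gt_alpha, scales,
                   lpips_fn=None, w_lpips=0.1,
                   w_alpha=1.0, w_scale=0.01, max_scale=0.05):
    """Composite loss with the terms listed in the text; all weights are illustrative."""
    loss = np.abs(pred_rgb - gt_rgb).mean()                        # L1 photometric
    if lpips_fn is not None:                                       # perceptual term,
        loss += w_lpips * lpips_fn(pred_rgb, gt_rgb)               # e.g. an LPIPS network
    loss += w_alpha * np.abs(pred_alpha - gt_alpha).mean()         # alpha-mask supervision
    loss += w_scale * np.maximum(scales - max_scale, 0.0).mean()   # penalize oversized Gaussians
    return loss
```

The scale-regularization term is what discourages the "degenerate Gaussian distributions" mentioned above: without it, a few very large splats can dominate the render.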

Figure 4: Synthetic training data (e.g., RenderPeople) fails to capture realism, resulting in identity shift and over-smoothed appearance.

Quantitative and Qualitative Results

Ablation studies show that the full pipeline (multi-view input, high-fidelity dataset) achieves a PSNR of 33.5, outperforming models trained on synthetic data or using single-view inputs (PSNR ≤ 27.5). Qualitative results confirm superior identity preservation and realism, with avatars exhibiting fine skin and hair details absent in prior work.
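For reference, PSNR as quoted above is a log-scaled mean squared error between a rendered image and ground truth; a gain from 27.5 to 33.5 dB corresponds to roughly a 4x reduction in MSE:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```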

Implications and Future Directions

The pipeline enables casual users to generate high-quality 3D avatars from a handful of phone images, removing the need for calibrated camera setups or specialized hardware. The use of Gaussian splatting allows for efficient, high-fidelity rendering. The approach sets a new standard for realism and identity fidelity in 3D avatar generation, with potential applications in telepresence, virtual reality, and digital fashion.

Future research may extend the pipeline to dynamic avatars, full-body reconstructions, and relightable models. Further improvements could involve self-supervised learning from in-the-wild data, domain adaptation for diverse demographics, and integration with generative text-to-3D models.

Conclusion

The "Capture, Canonicalize, Splat" pipeline represents a significant advance in zero-shot 3D avatar generation from unstructured phone images. By leveraging a high-fidelity dataset and multi-view canonicalization, the method achieves robust identity preservation and hyperrealism, outperforming prior approaches reliant on synthetic data or single-view inputs. The work provides a scalable solution for photorealistic digital human creation and opens new avenues for research in 3D generative modeling and avatar personalization.
