- The paper proposes a one-shot, implicit framework that combines NeRF, SMPL, and CLIP to accurately reconstruct and animate 3D human avatars from a single image.
- It employs a segmentation-based sampling strategy and diverse regularization techniques to enhance detail recovery in occluded regions and critical body parts.
- Experimental results on datasets like Human3.6M demonstrate superior PSNR, SSIM, and LPIPS scores compared to state-of-the-art methods, highlighting its practical applications.
Review of ELICIT: Free-viewpoint Human Motion Videos from a Single Image
The paper presents ELICIT, a method for creating animatable 3D human avatars from a single image, advancing neural rendering by addressing key challenges in data efficiency and input sparsity. The work introduces a novel framework leveraging neural radiance fields (NeRF) tailored specifically for human rendering, focusing on achieving high-quality outputs from minimal input data. ELICIT distinguishes itself by combining a skinned vertex-based template model (SMPL) with a vision-language model (CLIP) to infer body geometry and visual semantics, enabling the synthesis of realistic viewpoints and poses from a single constrained input.
Technical Contributions
- NeRF-Based Representation: ELICIT constructs an animatable NeRF, diverging from traditional NeRF applications that rely on dense and well-controlled multi-view inputs. This approach optimizes the neural representation using a single image, emphasizing the joint handling of geometry reconstruction and texture detail recovery in occluded regions.
- Utilization of SMPL Model: The use of the SMPL model provides a geometric prior, imposing constraints that guide the implicit model's understanding of human body shapes. This integration keeps pose synthesis accurate while enabling plausible geometry completion for regions not visible in the input image.
- CLIP-Based Semantic Priors: The framework leverages pre-trained CLIP models to encode and guide the semantic learning required for visually plausible texture synthesis, particularly in unseen regions. This underscores the framework's ability to use high-level, latent-space regularization to fill in the gaps inherent to single-image inputs.
- Segmentation-Based Sampling Strategy: The approach employs a novel segmentation-based sampling strategy, enhancing the detail recovery of segmented human parts through patch-based optimization. This is aimed at improving visual fidelity in critical areas such as faces and hands, which are often prone to degradation under sparse input conditions.
- Combination of Regularization Techniques: ELICIT incorporates a variety of loss components, such as CLIP-based similarity measures and soft geometric constraints, which work collaboratively to prevent typical degeneration problems encountered in single-image reconstructions.
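The review does not reproduce the paper's equations, but the NeRF rendering step underlying any such animatable-avatar method is the standard volume-rendering quadrature: per-sample densities and colors along each camera ray are composited via exponential transmittance. A minimal NumPy sketch of that generic step (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Standard NeRF volume rendering along one camera ray.

    densities: (N,) non-negative volume densities at N ray samples
    colors:    (N, 3) RGB color predicted at each sample
    deltas:    (N,) distances between adjacent samples
    Returns (rgb, weights): the composited ray color and the
    per-sample contribution weights.
    """
    # Opacity of each segment under the exponential transmittance model.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]
    weights = trans * alphas  # per-sample contribution to the pixel
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

The weights sum to at most 1; any remainder corresponds to rays that pass through empty space, which is where a background model or mask loss would apply.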
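At its core, CLIP-based supervision of the kind described above compares a rendered view against a reference image in CLIP's embedding space, typically by maximizing cosine similarity. A hedged sketch of that core computation in plain NumPy (in actual training the vectors would come from a frozen CLIP image encoder; the function name is hypothetical):

```python
import numpy as np

def clip_similarity_loss(rendered_emb, reference_emb):
    """Cosine-similarity loss between two embedding vectors.

    rendered_emb would be the CLIP embedding of a rendered view and
    reference_emb the embedding of the single input image; here both
    are treated as plain vectors. Returns 0 when they align perfectly.
    """
    a = rendered_emb / np.linalg.norm(rendered_emb)
    b = reference_emb / np.linalg.norm(reference_emb)
    return 1.0 - float(a @ b)
```

In a full objective this term would be weighted and summed with the photometric and geometric losses the review lists.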
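One plausible reading of the segmentation-based sampling strategy is that patch centers are drawn from pixels belonging to a chosen body part (e.g. the face), so that patch-based perceptual losses see those critical regions often. A hypothetical NumPy sketch of such a sampler (not the paper's actual implementation):

```python
import numpy as np

def sample_patch_center(seg_mask, part_id, patch, rng):
    """Pick a random patch location biased toward one body part.

    seg_mask: (H, W) integer part labels per pixel
    part_id:  label of the part to target (e.g. face or hands)
    patch:    patch side length in pixels
    rng:      np.random.Generator
    Returns (row, col) of the patch's top-left corner, clipped so the
    patch stays inside the image.
    """
    rows, cols = np.nonzero(seg_mask == part_id)
    if rows.size == 0:
        raise ValueError("part not present in the segmentation mask")
    i = rng.integers(rows.size)  # uniform over the part's pixels
    h, w = seg_mask.shape
    r = int(np.clip(rows[i] - patch // 2, 0, h - patch))
    c = int(np.clip(cols[i] - patch // 2, 0, w - patch))
    return r, c
```

Sampling uniformly over a part's pixels makes small parts like hands receive far more patches per unit area than uniform image sampling would give them, which matches the review's stated goal.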
Experimental Analysis
The paper reports extensive evaluations on notable datasets including ZJU-MoCap, Human3.6M, and DeepFashion, demonstrating significant improvements over current state-of-the-art methods such as Neural Body, Animatable NeRF, and Neural Human Performer across several metrics, including PSNR, SSIM, and LPIPS. The quantitative and qualitative results show ELICIT's effectiveness in producing perceptually realistic renderings, especially in recovering detailed and coherent geometric structure from sparse input.
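For reference, PSNR, the first of the reported metrics, is computed directly from the mean squared error between rendered and ground-truth images. A minimal sketch for images with values in [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images.

    pred, target: arrays with values in [0, max_val].
    Higher is better; identical images give infinity.
    """
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

SSIM and LPIPS are more involved (windowed structural statistics and deep-feature distances, respectively) and are typically taken from libraries such as scikit-image and the official LPIPS package rather than reimplemented.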
Implications and Future Outlook
The methodologies proposed by ELICIT have substantial implications for AR/VR applications and other computing sectors where 3D human renderings are crucial but data constraints are prevalent. Practically, this paves the way for broader applications in customized avatar creation, gaming engines, and virtual collaboration platforms without the need for extensive capture setups.
Theoretically, the synergistic use of geometric and semantic priors heralds a promising direction for future neural rendering research. There is potential in further extending this work through integration with more complex template models like SMPL-X or exploring generative models that could unify text, audio, and image inputs for richer, more context-aware avatar generation.
In conclusion, ELICIT represents a significant stride in neural avatar rendering: it optimizes performance in data-scarce settings while delivering a high degree of detail and animation fidelity, opening opportunities for widespread, accessible 3D content creation. The proposed extensions to semantic and geometric priors, along with improvements to the implicit representation, offer fertile ground for further advances in this domain.