- The paper introduces SHERF, which reconstructs and animates 3D human models from a single image using a hierarchical feature extraction approach.
- It fuses global, point-level, and pixel-aligned features with a transformer to capture detailed human shapes and textures.
- Experimental evaluations show SHERF outperforms existing Human NeRF methods, achieving higher PSNR and SSIM scores with lower LPIPS values.
SHERF: Generalizable Human NeRF from a Single Image
The paper "SHERF: Generalizable Human NeRF from a Single Image" introduces SHERF, a neural radiance field (NeRF) method that reconstructs and animates 3D human models from a single 2D image. This capability is significant because existing Human NeRF methods typically require multi-view images or monocular videos to produce high-quality 3D models. By reconstructing animatable 3D humans from just one input image, SHERF extends NeRF models to more practical and diverse real-world scenarios.
Methodology
SHERF is distinguished by its hierarchical feature extraction approach, incorporating global, point-level, and pixel-aligned features. These features are designed to capture both the overall human shape and fine texture details, which are crucial for accurate, high-fidelity reconstructions. The global features are derived from the input image's overall appearance and help fill in information missing from unobserved regions. Point-level features are extracted at SMPL vertices to provide a granular understanding of 3D human structure. Pixel-aligned features directly connect 3D points to the input 2D image, ensuring that fine-grained details are not lost during reconstruction.
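The three feature levels can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature map, SMPL vertex features, and the toy orthographic projection below are all hypothetical stand-ins for SHERF's learned encoders and camera model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for SHERF's inputs: a 2D feature map from an
# image encoder, SMPL vertex positions/features, and 3D query points.
H, W, C = 32, 32, 16                                 # encoder feature map size
feat_map = rng.standard_normal((H, W, C))
smpl_verts = rng.uniform(-1, 1, size=(100, 3))       # SMPL vertex positions
smpl_feats = rng.standard_normal((100, C))           # per-vertex features
query_pts = rng.uniform(-1, 1, size=(8, 3))          # 3D points to shade

# 1) Global feature: pool the whole feature map; shared by all query points.
global_feat = feat_map.mean(axis=(0, 1))             # (C,)

# 2) Point-level feature: borrow the feature of the nearest SMPL vertex.
d = np.linalg.norm(query_pts[:, None, :] - smpl_verts[None, :, :], axis=-1)
point_feat = smpl_feats[d.argmin(axis=1)]            # (8, C)

# 3) Pixel-aligned feature: project each 3D point into the image (toy
#    orthographic projection here) and sample the feature map there.
u = ((query_pts[:, 0] + 1) / 2 * (W - 1)).round().astype(int)
v = ((query_pts[:, 1] + 1) / 2 * (H - 1)).round().astype(int)
pixel_feat = feat_map[v, u]                          # (8, C)
```

Each 3D query point thus carries one feature per level; the global feature is coarse but always available, while the pixel-aligned feature is precise but only informative where the point projects onto an observed pixel.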
To integrate these diverse features effectively, SHERF utilizes a feature fusion transformer. This component manages the complex relationships between different feature types, leveraging their respective strengths for a comprehensive representation. The resulting fused features are decoded via NeRF to predict pixel colors and densities in novel views and poses, ensuring that synthesized images maintain coherence with the human form's intrinsic structure.
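The fusion-then-decode step can also be sketched with NumPy. Again this is a hedged toy, assuming random weights and a single attention head: SHERF's actual transformer and NeRF decoder are learned, multi-layer, and also condition on view direction.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 8, 16  # number of query points, feature channels

# Hypothetical per-point feature tokens from the three hierarchical levels.
tokens = rng.standard_normal((N, 3, C))  # [global, point-level, pixel-aligned]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head self-attention over the three tokens — a toy stand-in for
# SHERF's feature fusion transformer (weights here are random, not trained).
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))   # (N, 3, 3)
fused = (attn @ v).mean(axis=1)                         # (N, C) pooled tokens

# Tiny MLP head decoding each fused feature to RGB color and density,
# in the spirit of a NeRF decoder.
W1, W2 = rng.standard_normal((C, C)), rng.standard_normal((C, 4))
out = np.maximum(fused @ W1, 0) @ W2                    # ReLU hidden layer
rgb = 1 / (1 + np.exp(-out[:, :3]))                     # colors in [0, 1]
sigma = np.maximum(out[:, 3], 0)                        # non-negative density
```

The attention step is what lets the model weigh, per point, how much to trust each feature level: a point visible in the input image can lean on its pixel-aligned token, while an occluded point falls back on the global and point-level tokens.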
Experimental Results
Extensive evaluations across multiple datasets—THuman, RenderPeople, ZJU-MoCap, and HuMMan—demonstrate SHERF's superior performance in novel view and pose synthesis compared to state-of-the-art generalizable Human NeRF models such as NHP and MPS-NeRF. These results show that SHERF achieves higher PSNR and SSIM scores while maintaining lower LPIPS values, indicating improved accuracy and perceptual quality in reconstructions. This performance is consistent across varying input viewpoints, showcasing the method's robustness to how the subject is observed.
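Of the three metrics, PSNR is the simplest to state exactly: it is the log-scaled ratio of the maximum pixel value to the mean squared error between the rendered and ground-truth images (SSIM compares local structure, and LPIPS measures distance in a deep network's feature space, so neither is sketched here).

```python
import numpy as np

def psnr(img_pred, img_true, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = np.mean((img_pred - img_true) ** 2)
    return float(10 * np.log10(max_val ** 2 / mse))

# Toy check: a constant error of 0.1 on every pixel gives MSE = 0.01,
# so PSNR = 10 * log10(1 / 0.01) = 20 dB.
gt = np.zeros((4, 4, 3))
pred = gt + 0.1
print(round(psnr(pred, gt), 2))  # → 20.0
```

In practice, a difference of even 1 dB in PSNR between methods reflects a substantial reduction in per-pixel error, which is why the reported gains over NHP and MPS-NeRF are meaningful.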
Implications and Future Directions
The SHERF model significantly advances the potential for integrating NeRF technology into practical applications like virtual and augmented reality, where efficient and accurate 3D human models are critical. Its ability to generalize from single image inputs could also reduce the resources and time needed for such tasks, marking it as a valuable tool for industries reliant on digital human representations.
Theoretically, SHERF's development underscores the importance of feature diversity and integration in neural representation learning. Future work could explore enhancing the feature fusion transformer to support even more complex feature interactions or expanding the hierarchical feature set to include additional detailed inputs, such as environmental or motion cues. Moreover, real-world adoption of SHERF would benefit from further refinement in handling occlusions and in developing methods to predict unseen parts of the subject with even greater accuracy.
Overall, SHERF represents a notable step forward in human-centric NeRF applications, providing a foundation for more adaptable and efficient human model reconstruction in both research and commercial settings.