AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Published 30 May 2025 in cs.CV (arXiv:2505.24877v1)

Abstract: Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.

Summary

The paper presents AdaHuman, a framework for high-fidelity 3D avatar generation from a single image. It addresses several limitations of existing image-to-3D avatar generation approaches through two pivotal innovations: a pose-conditioned 3D joint diffusion model and a compositional 3D Gaussian Splats (3DGS) refinement module. Together, these components substantially improve the quality, pose variability, and animation-readiness of generated avatars.

The framework employs a pose-conditioned joint 3D diffusion model that performs multi-view image synthesis and 3D reconstruction simultaneously for arbitrary poses. Conditioning the diffusion process on a desired pose enables generation of avatars in a standardized A-pose, which minimizes self-occlusion and leaves the avatar ready for rigging and animation with any motion input.
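The interplay between denoising and reconstruction at each diffusion step can be sketched roughly as below. Every function here is a hypothetical stand-in for the paper's actual components (the multi-view denoiser, the 3DGS reconstructor, and the splat renderer), and the blending weight `alpha` is an illustrative assumption, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(views, pose, t):
    """Hypothetical stand-in for the pose-conditioned multi-view denoiser."""
    return views * 0.9  # placeholder: a real model predicts less-noisy views

def reconstruct_3dgs(views):
    """Hypothetical stand-in for per-step 3D Gaussian Splat reconstruction."""
    return {"means": rng.standard_normal((views.shape[0], 3))}

def render_views(splats, n_views, hw):
    """Hypothetical stand-in for re-rendering the splats from each camera."""
    return np.zeros((n_views, *hw, 3))

def joint_diffusion(pose, n_views=4, hw=(64, 64), steps=10, alpha=0.5):
    # Start from Gaussian noise in all views.
    views = rng.standard_normal((n_views, *hw, 3))
    for t in reversed(range(steps)):
        # 1) Denoise each view, conditioned on the target pose.
        views = denoise_step(views, pose, t)
        # 2) Reconstruct a 3DGS avatar from the current view estimates.
        splats = reconstruct_3dgs(views)
        # 3) Re-render the splats and blend back into the views, so the
        #    shared 3D representation enforces cross-view consistency.
        views = (1 - alpha) * views + alpha * render_views(splats, n_views, hw)
    return views, splats
```

The key structural point is step (3): because every denoising step routes through a single 3D reconstruction, the generated views cannot drift apart geometrically.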

A vital aspect of AdaHuman is the compositional 3DGS refinement module, which delivers detailed avatars through a two-phase process. First, it enhances local body parts via image-to-image refinement of views generated from the initial 3DGS avatar. Second, it integrates the refined parts using a novel crop-aware camera ray map that keeps each crop geometrically consistent with the full-body camera, while a visibility-aware composition scheme merges the reconstructed partial 3DGS, sharpening detail and eliminating floating artifacts.
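The intuition behind a crop-aware ray map can be illustrated with standard pinhole geometry: cropping a region and resizing it is equivalent to shifting the principal point and rescaling the focal length, so per-pixel ray directions computed for the crop stay consistent with the full-frame camera. The sketch below is a minimal illustration under that pinhole assumption; the paper's exact parameterization may differ:

```python
import numpy as np

def crop_aware_ray_map(fx, fy, cx, cy, crop, out_size):
    """Per-pixel unit ray directions for a crop of the full image.

    Cropping [x0, y0, w, h] and resizing to (W, H) shifts the principal
    point by the crop offset and scales the focal length by the resize
    factor, so a local crop's rays agree with the full-frame camera.
    """
    x0, y0, w, h = crop          # crop rectangle in full-image pixels
    W, H = out_size              # resolution of the resized crop
    sx, sy = W / w, H / h        # resize factors

    # Intrinsics adjusted for the crop.
    fx_c, fy_c = fx * sx, fy * sy
    cx_c, cy_c = (cx - x0) * sx, (cy - y0) * sy

    # Pixel-center grid.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)

    # Back-project each pixel to a unit ray direction in camera space.
    dirs = np.stack(
        [(u - cx_c) / fx_c, (v - cy_c) / fy_c, np.ones_like(u)], axis=-1
    )
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return dirs                  # shape (H, W, 3)
```

Feeding such a ray map to the reconstruction network tells it exactly which part of the camera frustum a refined crop occupies, which is what allows locally refined body parts to be fused without seams.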

The evaluation of AdaHuman underscores its superiority over state-of-the-art methods in avatar reconstruction and reposing tasks. On public benchmarks, it outperforms competitors in both fidelity and adaptability to novel poses. Particularly noteworthy is its ability to generate multi-view avatars with consistent high-quality detailing, enabled by the integration of 3D reconstruction at each step of the diffusion process.

The implications of AdaHuman extend beyond current practical applications like gaming, animation, and virtual reality, holding potential to influence future AI systems in domains such as virtual try-ons and telepresence. The framework's ability to synthesize animatable avatars from a single in-the-wild image demonstrates the robustness necessary for broader real-world deployment. Additionally, its use of Gaussian Splats for refined avatar detailing could prompt research into more efficient and detail-preserving 3D representations.

Though AdaHuman achieves remarkable results, there remains room for improvement, particularly in computational efficiency and in recovering fine detail in complex scenarios. Future research could refine the compositional strategy to handle varied occlusions dynamically, and could integrate simulation-based methods to model facial expressions and non-rigid motion more faithfully.

Overall, the AdaHuman framework introduces substantial improvements to the domain of 3D avatar generation, offering a significant step towards highly detailed and animatable avatar creation from limited visual inputs. It lays foundational work that future explorations into 3D synthesis and animation can effectively build upon.