InfiniHumanData: 3D Avatar Dataset
- InfiniHumanData is a large-scale, fully automated, multi-modal dataset for generating realistic 3D human avatars with detailed text descriptions, multi-view images, and SMPL body parameters.
- Its pipeline integrates advanced vision-language models and diffusion techniques to create over 111,000 identities with precise control over diverse attributes such as ethnicity, age, and clothing styles.
- The framework supports applications in gaming, VR, digital fashion, and animation while demonstrating high fidelity through rigorous user studies and objective metrics.
InfiniHumanData is a large-scale, fully automatic, multi-modal dataset designed to address the challenge of generating realistic, richly annotated 3D human avatars with precise controllability across broad attribute ranges, including ethnicity, age, body shapes, clothing styles, and more. Developed as part of the InfiniHuman framework, InfiniHumanData leverages foundation vision-language and image generation models to produce 111,000+ identities annotated with multi-granularity text descriptions, multi-view RGB imagery, dedicated clothing assets, and SMPL body-shape parameters. This scalable approach overcomes the limitations of manually captured human scan datasets, enabling theoretically unbounded 3D human data creation (Xue et al., 13 Oct 2025).
1. Dataset Characteristics and Multi-Modal Annotations
InfiniHumanData comprises 111,000 individual identities carefully constructed to ensure unprecedented diversity in terms of facial features, skin tones, hair types, age groups, and wardrobe styles. Each identity is accompanied by:
- Multi-granularity text descriptions: Ten granularity levels per identity, ranging from detailed ~40-word, character-driven narratives down to concise 5-word summaries, supporting conditioning tasks at varying specificity.
- Multi-view orthographic RGB images: Full-body and head images under uniform lighting, generated from fine-tuned FLUX text-to-image models with LoRA adapters for scan-like rendering. Four body views and four head views are created per identity.
- Clothing asset images: Specific garment images extracted via a dedicated Virtual-TryOff procedure employing a reverse try-on protocol, allowing for fine control over avatar attire.
- SMPL body-shape parameters and 2D keypoints: Precise geometric representations to support animation and re-targeting, including explicit face and hand articulation.
A summary of core data modalities is provided below:
| Modality | Content / granularity | Generation & control |
|---|---|---|
| Text descriptions | 10 granularity levels | Detail (~40 words) → summary (5 words) |
| RGB images (body/head) | Orthographic, multi-view | FLUX + LoRA, consistent lighting/camera |
| Clothing assets | Clean garment regions | Virtual-TryOff, GPT-4o selection |
| SMPL params / keypoints | 3D pose and shape | NLF regression + orthographic 2D refinement |
This annotation strategy provides rigorous supervision for avatar generation, conditional synthesis, and benchmarking; a schematic per-identity record is sketched below.
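To make the multi-modal record concrete, the following is a minimal sketch of how one InfiniHumanData identity could be organized in code. The schema is an illustrative assumption of ours (field names, counts, and shapes), not the released format:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class IdentityRecord:
    """One InfiniHumanData identity (hypothetical schema for illustration)."""
    identity_id: str
    # Ten caption granularity levels, ordered from ~40-word detail to 5-word summary.
    captions: list[str] = field(default_factory=list)
    body_views: list[np.ndarray] = field(default_factory=list)      # 4 orthographic RGB views
    head_views: list[np.ndarray] = field(default_factory=list)      # 4 orthographic RGB views
    garment_assets: list[np.ndarray] = field(default_factory=list)  # clean garment images
    smpl_betas: np.ndarray = field(default_factory=lambda: np.zeros(10))        # body shape
    keypoints_2d: np.ndarray = field(default_factory=lambda: np.zeros((25, 3))) # x, y, confidence
```

Ordered caption levels let downstream conditioning pick whichever granularity matches its token budget.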
2. Fully Automatic Data Generation Pipeline
InfiniHumanData is constructed via a multi-stage, automated pipeline integrating foundation models and fine-tuned generative components:
- Text description synthesis: Captioning leverages protocols akin to Trellis, followed by GPT-4o generation and summarization. In-context prompting yields diverse, high-fidelity character narratives spanning cultural and physical attributes.
- Text-to-image generation: FLUX, fine-tuned with LoRA adapters on thousands of real scan renderings, produces orthographic, photorealistic outputs suitable for SMPL pose inference and multi-view consistency. Uniform lighting and parallel cameras eliminate perspective distortion.
- Virtual-TryOff: A reverse try-on method built on OminiControl, termed Instruct-Virtual-TryOff, selectively extracts and cleans garment images. It is trained with a conditional flow-matching objective of the standard form $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, x_0, x_1}\big[\| v_\theta(x_t, t, c) - (x_1 - x_0) \|^2\big]$, with $x_t = (1 - t)\,x_0 + t\,x_1$, which preserves garment extraction fidelity for later avatar synthesis.
- Garment selection and rejection: Multiple garment candidates are generated per identity, and GPT-4o rejects suboptimal ones via explicit prompts describing the expected garment features.
- Monocular body fitting: SMPL parameters regressed via NLF are refined by aligning orthographic projections of the SMPL joints with OpenPose 2D keypoints, using a confidence-weighted reprojection loss of the form $\mathcal{L}_{\mathrm{reproj}} = \sum_j w_j \, \| \Pi_{\mathrm{ortho}}(J_j(\beta, \theta)) - \hat{p}_j \|_2^2$, where $\hat{p}_j$ are detected 2D keypoints and $w_j$ their detection confidences (a minimal optimization sketch follows this list).
- Multi-view diffusion: A multi-view diffusion model introduces row-wise attention under orthographic constraints, so that tokens in the same pixel row attend across views, reconstructing consistent body and head images (a sketch of this attention pattern appears after the pipeline summary below).
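The refinement step above lends itself to a short gradient-based sketch. Here `smpl_joints` is a hypothetical SMPL joint regressor, and the optimizer settings are illustrative assumptions; a full implementation would also fit an orthographic scale and 2D translation:

```python
import torch

def refine_smpl(betas, pose, keypoints_2d, conf, smpl_joints, steps=200, lr=1e-2):
    """Confidence-weighted orthographic reprojection refinement (illustrative).

    keypoints_2d: (J, 2) OpenPose detections; conf: (J,) detection confidences.
    smpl_joints(betas, pose) -> (J, 3) joints; hypothetical regressor interface.
    """
    betas = betas.clone().requires_grad_(True)
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([betas, pose], lr=lr)
    for _ in range(steps):
        joints_3d = smpl_joints(betas, pose)             # (J, 3)
        proj = joints_3d[:, :2]                          # orthographic projection: drop depth
        residual = (proj - keypoints_2d).pow(2).sum(-1)  # per-joint squared error
        loss = (conf * residual).sum()                   # weighted reprojection loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return betas.detach(), pose.detach()
```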
This rigorously modular pipeline enables scalable, personalized, and accurately annotated human data synthesis at minimal cost.
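The row-wise attention in the multi-view diffusion step exploits a property of the orthographic setup: with parallel cameras sharing a vertical axis, a 3D point projects to the same image row in every view, so cross-view attention can be restricted to rows. A minimal sketch with a standard PyTorch attention module (shapes and module choice are our assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class RowWiseAttention(nn.Module):
    """Self-attention across views, restricted to tokens sharing an image row."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, H, W, C) feature maps for V orthographic views.
        B, V, H, W, C = x.shape
        # Group all tokens of one pixel row, across all views, into one sequence.
        rows = x.permute(0, 2, 1, 3, 4).reshape(B * H, V * W, C)
        out, _ = self.attn(rows, rows, rows)
        return out.reshape(B, H, V, W, C).permute(0, 2, 1, 3, 4)
```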
3. InfiniHumanGen: Diffusion-Based Avatar Synthesis
Building on InfiniHumanData, the InfiniHumanGen generative framework models the joint conditional distribution $p(\mathbf{x} \mid c_{\mathrm{text}}, c_{\mathrm{SMPL}}, c_{\mathrm{garment}})$ of avatar appearance given text, SMPL shape, and clothing-asset conditions, enabling photorealistic and controllable 3D avatar generation. It includes two principal model variants:
- Gen-Schnell: A fast, multi-view 2D diffusion generator producing Gaussian-splat-based 3D avatars. Conditioning signals (text, SMPL normal maps, clothing asset codes) are fused via VAE encoding. A single denoising step recovers clean images, e.g. one Euler step $\hat{x}_1 = x_0 + v_\theta(x_0, 0, c)$ under the flow-matching parameterization, followed by 3D-GS rendering for geometric consistency (see the sampling sketch after this list).
- Gen-HRes: A high-resolution pipeline based on OminiControl2, employing a multi-image-to-image translation architecture. Its flow-matching objective extends to text, SMPL, and garment conditions (a training-step sketch closes this section). By fixing the initial Gaussian noise, fine-grained text-based edits (e.g., accessory or color changes) preserve the underlying identity.
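For intuition, the single-step recovery in Gen-Schnell corresponds to one Euler step of a rectified flow under the convention that $x_0$ is Gaussian noise and $x_1$ clean data; `velocity_model` and its conditioning argument are hypothetical stand-ins for the actual network:

```python
import torch

@torch.no_grad()
def one_step_sample(velocity_model, cond, shape, device="cuda"):
    """One Euler step of a rectified flow: x1_hat = x0 + v_theta(x0, t=0, cond)."""
    x0 = torch.randn(shape, device=device)    # Gaussian noise endpoint
    t = torch.zeros(shape[0], device=device)  # start of the flow
    v = velocity_model(x0, t, cond)           # predicted velocity field
    return x0 + v                             # single-step clean-image estimate
```

Additional Euler steps along the same velocity field trade speed for fidelity.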
Gen-Schnell achieves generation in ~12 seconds per identity, whereas Gen-HRes delivers high-fidelity avatars in ~4 minutes, suitable for applications demanding photorealistic detail.
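Both Instruct-Virtual-TryOff and Gen-HRes train with flow-matching objectives of the form given in Section 2. A minimal sketch of one training-loss evaluation under the linear-interpolation convention (model and data interfaces are assumptions, not the paper's code):

```python
import torch

def flow_matching_loss(velocity_model, x1, cond):
    """Conditional flow matching: || v_theta(x_t, t, c) - (x1 - x0) ||^2."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over image dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target = x1 - x0                               # ground-truth constant velocity
    return (velocity_model(xt, t, cond) - target).pow(2).mean()
```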
4. Quantitative and Qualitative Validation
Validation comprises both a user study and objective metrics:
- User study: In a paired discrimination test, InfiniHumanData images received 765 votes versus 746 for authentic scan renderings, a near-even split indicating that synthetic identities are perceptually indistinguishable from real scans (a quick significance check follows this list).
- Objective metrics: FID and CLIP scores consistently show superior performance compared to existing methods in both appearance quality and text description alignment.
- Editing capability: Identity preservation during text-based edits was validated via Gen-HRes, with controlled variation in accessories and garments.
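As a quick check on the user-study split, a two-sided binomial test against an even 50/50 vote (our analysis choice, not the paper's reported procedure) confirms that the 765-vs-746 outcome shows no significant preference:

```python
from scipy.stats import binomtest

# 765 of 1511 paired votes favored the synthetic InfiniHumanData images.
result = binomtest(k=765, n=765 + 746, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2f}")  # ~0.64: no detectable preference either way
```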
This empirical evaluation underlines the practical fidelity and reliability of the dataset and generative outputs.
5. Applications and Significance
InfiniHumanData and InfiniHumanGen provide direct utility in:
- Gaming & VR: Rapid, customizable avatar creation for immersive environments.
- Digital Fashion: Realistic virtual try-on with garment asset control.
- Animation & Telepresence: SMPL-rigged avatars supporting motion capture re-application.
- Figurine fabrication: Watertight mesh generation suitable for 3D printing.
The practicality and scalability afforded by foundation model distillation enable affordable and inclusive synthesis for creative, research, and industrial domains.
6. Limitations and Future Directions
Gen-HRes currently incurs slower avatar generation than Gen-Schnell; ongoing work aims to accelerate high-resolution synthesis. Famous identities are absent because GPT-4o declines to describe real public figures for privacy reasons. Improved mesh reconstruction for self-occluded regions is identified as a direction for further research. Comprehensive public release of code, pipeline, and all data modalities is planned at https://yuxuan-xue.com/infini-human.
A plausible implication is that the InfiniHumanData paradigm, integrating vision-language and diffusion-based synthesis with foundation model distillation, marks a significant advance in the scalable, precise, and diverse generation of 3D human data, supporting both academic and industrial progress in controllable avatar technologies (Xue et al., 13 Oct 2025).