InfiniHumanData: 3D Avatars Dataset
- InfiniHumanData is a large-scale, multi-modal dataset comprising 111,000 automatically generated 3D human identities annotated with multi-granularity text, RGB images, clothing assets, and SMPL parameters.
- The pipeline integrates foundation models like GPT-4o, FLUX, and OminiControl to generate consistent, richly labeled data that supports precise 3D reconstruction and avatar control.
- The dataset enables practical applications in VR, gaming, digital fashion, and re-animation by allowing fine-grained manipulation of body shape, pose, and garment style.
InfiniHumanData is a fully automatic, large-scale multi-modal dataset designed to enable the generation of realistic, diverse, and precisely controllable 3D human avatars. Constructed via a synergistic distillation of foundation models spanning vision-language, image generation, and body fitting, InfiniHumanData comprises 111,000 identities with unprecedented diversity, each annotated across multiple modalities. The dataset was specifically developed to overcome the expense and limitations of traditional human scan datasets, enabling scalable, richly annotated data at minimal cost. The underlying pipeline integrates multi-granularity text descriptions, orthographic multi-view RGB imagery, detailed clothing images derived from image-to-image translation, and SMPL body shape parameters. These assets support the creation and controllable manipulation of avatars using advanced diffusion-based generative models.
1. Dataset Composition and Annotation Modalities
InfiniHumanData contains 111,000 automatically generated 3D human identities, each annotated across four principal modalities:
- Multi-granularity text descriptions: Each identity is labeled with textual descriptions ranging in length from approximately 40 words (highly detailed) down to as few as 5 words (minimal). This multi-resolution annotation enables both fine-grained and coarse semantic conditioning.
- Orthographic multi-view RGB images: All identities include multiple orthographic full-body and head RGB images rendered under uniform lighting to facilitate compatibility with 3D reconstruction tasks.
- Clothing modality: Clothing images are generated using a process termed “Virtual-TryOff,” which reverses the virtual try-on paradigm to extract clean garment assets from full-body imagery, utilizing modified image-to-image translation architectures.
- SMPL body shape parameters and keypoints: Detailed body shape, pose, and facial/hand landmarks are derived via two-stage monocular body fitting, enabling precise 3D parametric body modeling.
The dataset exhibits broad coverage with respect to ethnicity, age (including children), clothing style, hair type, and body shape. Diversity is achieved by employing GPT-4o to produce varied text descriptions from human scan datasets, which are then summarized at ten granularity levels.
| Modality | Annotation Type | Generation/Extraction Method |
|---|---|---|
| Text Descriptions | Multi-granularity | GPT-4o protocol + summarization |
| RGB Images | Multi-view, orthographic | FLUX text-to-image with LoRA fine-tuning |
| Clothing Images | Garment-only, detailed | OminiControl + flow-matching loss |
| Body Shape (SMPL) | Parameters + keypoints | Monocular fitting, then reprojection alignment |
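To make the per-identity annotation layout concrete, the following is a minimal sketch of how one record might be organized; the field names and file layout are illustrative assumptions rather than the released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IdentityRecord:
    """Hypothetical per-identity record mirroring the four annotation modalities."""
    identity_id: str
    captions: List[str]              # ten granularity levels, ~40 words down to ~5
    rgb_views: List[str]             # orthographic full-body and head renders
    clothing_images: List[str]       # garment-only images from Virtual-TryOff
    smpl_betas: List[float]          # SMPL shape parameters
    smpl_pose: List[float]           # SMPL pose parameters (axis-angle)
    keypoints_2d: List[List[float]]  # body/face/hand landmarks per view

record = IdentityRecord(
    identity_id="id_000001",
    captions=["An elderly woman wearing a long red wool coat and black boots ...",
              "elderly woman, red coat"],
    rgb_views=["id_000001/front.png", "id_000001/back.png"],
    clothing_images=["id_000001/garment_top.png"],
    smpl_betas=[0.0] * 10,
    smpl_pose=[0.0] * 72,
    keypoints_2d=[],
)
```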
2. Data Generation Pipeline
The pipeline underpinning InfiniHumanData coordinates the distillation of multiple foundation models via the following steps (illustrative sketches of several of these steps appear after the list):
- Text annotation: Existing scan datasets (THuman2.1, CustomHuman, 2K2K) receive captions generated by protocols from Trellis and expanded with GPT-4o, which introduces variation and granularity in attributes such as ethnicity and clothing.
- Text-to-image generation: The FLUX model, fine-tuned with a LoRA adapter, synthesizes “scan-like” RGB images with orthographic viewpoints and consistent lighting, optimizing them for downstream 3D reconstruction.
- Clothing extraction: Garment images are isolated by a modified image-to-image translation module built on OminiControl, utilizing a flow-matching objective of the form
  $$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert v_\theta(x_t, t, c) - (\epsilon - x_0)\big\rVert^2\Big], \qquad x_t = (1-t)\,x_0 + t\,\epsilon,$$
  where $x_t$ is a noisy garment image at step $t$, $x_0$ is the clean garment image, $\epsilon$ is Gaussian noise, and $c$ provides conditional instructions.
- Sample selection: GPT-4o evaluates multiple garment candidates for each identity and selects the best match based on specified attributes.
- SMPL parameter fitting: Initial regression is performed using models such as NLF, followed by refinement that minimizes the reprojection error between projected SMPL joints and OpenPose-detected 2D joints:
  $$E(\beta, \theta) = \sum_{j} w_j \,\big\lVert \Pi\big(J_j(\beta, \theta)\big) - \hat{\mathbf{x}}_j \big\rVert^2,$$
  where $J_j(\beta, \theta)$ are the 3D SMPL joints, $\Pi$ is the camera projection, $\hat{\mathbf{x}}_j$ the detected 2D joints, and $w_j$ the detection confidences.
- Multi-view diffusion: A specifically trained diffusion model generates high-resolution multi-view images. Specialized attention mechanisms, including row-wise attention across fixed epipolar geometries, enforce inter-view consistency.
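Step by step, these stages can be pictured with short sketches. For the text-annotation stage, a generic multi-granularity caption expansion with the OpenAI Python client might look as follows; the prompt wording is a made-up stand-in for the actual Trellis-derived protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
scan_caption = "A man in a blue denim jacket, gray trousers, and white sneakers."
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Rewrite the following human-scan caption at ten granularity levels, "
            "from roughly 40 words down to 5 words, varying plausible attributes "
            "such as ethnicity, age, and clothing style:\n" + scan_caption
        ),
    }],
)
print(response.choices[0].message.content)
```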
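For the text-to-image stage, a diffusers-style invocation of FLUX with a LoRA adapter is sketched below; the adapter path and prompt are hypothetical, and the released pipeline may differ.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/orthographic-scanlike-lora")  # hypothetical adapter

image = pipe(
    prompt="orthographic full-body front view, uniform lighting, neutral background, "
           "an elderly woman wearing a long red wool coat and black boots",
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("front_view.png")
```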
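The flow-matching objective for garment extraction corresponds to a standard rectified-flow training loss; the PyTorch sketch below assumes a velocity-predicting network and is not the exact OminiControl training code.

```python
import torch

def flow_matching_loss(model, x0, cond):
    """Rectified-flow loss: the network predicts the velocity (noise minus clean
    garment image) at a randomly sampled interpolation step t in [0, 1]."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # interpolation step
    eps = torch.randn_like(x0)                            # Gaussian noise
    xt = (1.0 - t) * x0 + t * eps                         # noisy garment image at step t
    v_pred = model(xt, t.view(b), cond)                   # conditional velocity prediction
    v_target = eps - x0                                   # straight-line velocity target
    return torch.mean((v_pred - v_target) ** 2)
```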
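The second-stage SMPL refinement is a gradient-based minimization of the reprojection error; the sketch assumes an smplx-style body layer exposing 3D joints and a weak-perspective camera, which may differ from the authors' exact setup.

```python
import torch

def refine_smpl(smpl_layer, betas, pose, cam, joints_2d, conf, iters=200, lr=0.01):
    """Refine SMPL shape/pose so projected joints match OpenPose 2D detections.

    betas: (B, 10) shape, pose: axis-angle body pose, cam: (B, 3) weak-perspective
    [scale, tx, ty], joints_2d: (B, J, 2) detections, conf: (B, J) confidences."""
    betas = betas.clone().requires_grad_(True)
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([betas, pose], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        joints_3d = smpl_layer(betas=betas, body_pose=pose).joints          # (B, J, 3)
        proj = cam[:, None, 0:1] * joints_3d[..., :2] + cam[:, None, 1:3]   # (B, J, 2)
        loss = (conf[..., None] * (proj - joints_2d) ** 2).sum()            # weighted reprojection error
        loss.backward()
        opt.step()
    return betas.detach(), pose.detach()
```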
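Finally, the row-wise attention used for inter-view consistency can be pictured as follows: with a fixed orthographic camera rig, corresponding epipolar lines fall on the same image row, so it suffices to let tokens within a row attend across all views. The tensor layout below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def row_wise_cross_view_attention(q, k, v):
    """q, k, v: (B, V, H, W, C) per-view feature maps.
    Each image row attends jointly over the same row in every view,
    enforcing inter-view consistency along the shared epipolar rows."""
    B, V, H, W, C = q.shape
    # group tokens by (batch, row): every row sees V * W tokens across all views
    q = q.permute(0, 2, 1, 3, 4).reshape(B * H, V * W, C)
    k = k.permute(0, 2, 1, 3, 4).reshape(B * H, V * W, C)
    v = v.permute(0, 2, 1, 3, 4).reshape(B * H, V * W, C)
    out = F.scaled_dot_product_attention(q, k, v)              # (B*H, V*W, C)
    return out.reshape(B, H, V, W, C).permute(0, 2, 1, 3, 4)   # back to (B, V, H, W, C)
```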
3. Technological Innovations
Several technological advancements distinguish InfiniHumanData:
- Integration of foundation models: The pipeline coalesces capabilities from GPT-4o (structured annotation), FLUX (orthographic image synthesis), OminiControl (clothing extraction), and multi-view diffusion models.
- Generative pipelines: Two novel avatar creation pipelines—Gen-Schnell and Gen-HRes—enable rapid and photorealistic mesh generation with fine-grained controllability:
- Gen-Schnell: Utilizes multi-view 2D diffusion and Gaussian splatting for 3D avatar synthesis in approximately 12 seconds; inter-view consistency is enforced through row-wise attention across the fixed epipolar geometry of the orthographic views, and the consistent multi-view outputs are subsequently incorporated into a 3D Gaussian splatting reconstruction module.
- Gen-HRes: Leverages a multi-modal image-to-image translation framework (OminiControl2 + Flux-Dev) for high-resolution mesh generation (~4 minutes), supporting nuanced attribute adjustments.
- Joint conditioning: InfiniHumanGen is conditioned on a composite signal $c = (c_{\text{text}}, c_{\text{body}}, c_{\text{cloth}})$, allowing simultaneous control of text, body shape/pose, and garment style (a minimal conditioning sketch follows this list).
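A minimal way to picture the composite conditioning signal is concatenation of per-modality token embeddings before they enter the generative backbone; the function and shapes below are illustrative assumptions, not the actual InfiniHumanGen architecture.

```python
import torch

def build_composite_condition(text_emb, body_emb, cloth_emb):
    """Concatenate per-modality token sequences into one conditioning signal
    c = (c_text, c_body, c_cloth) consumed by the diffusion model.

    text_emb:  (B, N_t, D) text-encoder tokens
    body_emb:  (B, N_b, D) embedded SMPL shape/pose parameters
    cloth_emb: (B, N_c, D) garment-image tokens"""
    return torch.cat([text_emb, body_emb, cloth_emb], dim=1)  # (B, N_t+N_b+N_c, D)

# usage sketch with dummy embeddings
c = build_composite_condition(
    torch.randn(1, 77, 1024), torch.randn(1, 1, 1024), torch.randn(1, 256, 1024)
)
```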
4. Evaluation, Performance, and Impact
Extensive evaluation demonstrates that Gen-Schnell and Gen-HRes both improve upon prior techniques with respect to visual quality, generation speed, and alignment of output with conditioning signals. Quantitative metrics include leading scores on multi-view consistency, CLIP Score, and FID, while user studies establish that generated identities are indistinguishable from scan renderings. These models set a new standard for controllable 3D human avatar synthesis.
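For reference, the reported CLIP Score and FID can be reproduced in spirit with off-the-shelf metric implementations; the sketch below assumes the torchmetrics versions and dummy uint8 image tensors, and is not the authors' evaluation code.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.fid import FrechetInceptionDistance

# CLIP Score: agreement between generated images and their text conditions
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
generated = torch.randint(0, 255, (4, 3, 224, 224), dtype=torch.uint8)
captions = ["an elderly woman in a long red wool coat"] * 4
print("CLIP Score:", clip_score(generated, captions))

# FID: distributional distance between scan renderings and generated images
# (a few dummy images here; in practice many samples are needed for a stable estimate)
fid = FrechetInceptionDistance(feature=2048)
scan_renders = torch.randint(0, 255, (4, 3, 299, 299), dtype=torch.uint8)
fake_renders = torch.randint(0, 255, (4, 3, 299, 299), dtype=torch.uint8)
fid.update(scan_renders, real=True)
fid.update(fake_renders, real=False)
print("FID:", fid.compute())
```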
Applications span digital avatar creation for VR, gaming, digital fashion (virtual try-on systems), social telepresence, re-animation using SMPL motion data, and physical 3D printing (figurine fabrication). The ability to alter attributes such as clothing or ethnicity while preserving identity supports new paradigms in personalized content creation.
5. Future Directions and Public Availability
InfiniHumanData, alongside the data generation pipeline and generative models (InfiniHumanGen: Gen-Schnell and Gen-HRes), is scheduled for public release via the project website (https://yuxuan-xue.com/infini-human). Proposed future research directions include:
- Advancing end-to-end high-resolution 3D generation (enhanced Gen-Schnell) to better resolve facial detail without increased inference time.
- Enriching the dataset with additional identity categories (e.g., celebrity identities), annotated in a privacy-compliant manner using alternative LLMs.
- Improving 3D mesh reconstruction to mitigate occlusion artifacts through data-driven mesh refinement, moving beyond SMPL-driven volumetric carving.
InfiniHumanData constitutes a scalable, richly annotated resource that fundamentally addresses key limitations of previous human avatar datasets by exploiting automatic distillation of multiple foundation models. The resulting framework yields unprecedented speed, fidelity, diversity, and controllability in 3D human synthesis, supporting a wide range of downstream applications.