
InfiniHumanData: 3D Avatar Dataset

Updated 24 October 2025
  • InfiniHumanData is a large-scale, fully automated, multi-modal dataset for generating realistic 3D human avatars with detailed text descriptions, multi-view images, and SMPL body parameters.
  • Its pipeline integrates advanced vision-language models and diffusion techniques to create over 111,000 identities with precise control over diverse attributes such as ethnicity, age, and clothing styles.
  • The framework supports applications in gaming, VR, digital fashion, and animation while demonstrating high fidelity through rigorous user studies and objective metrics.

InfiniHumanData is a large-scale, fully automatic, multi-modal dataset designed to address the challenge of generating realistic, richly annotated 3D human avatars with precise controllability across broad attribute ranges, including ethnicity, age, body shape, clothing style, and more. Developed as part of the InfiniHuman framework, InfiniHumanData leverages foundation vision-language and image generation models to produce over 111,000 identities annotated with multi-granularity text descriptions, multi-view RGB imagery, dedicated clothing assets, and SMPL body-shape parameters. This scalable approach overcomes the limitations of manually captured human scan datasets, enabling theoretically unbounded 3D human data creation (Xue et al., 13 Oct 2025).

1. Dataset Characteristics and Multi-Modal Annotations

InfiniHumanData comprises over 111,000 individual identities constructed to ensure broad diversity in facial features, skin tones, hair types, age groups, and wardrobe styles. Each identity is accompanied by:

  • Multi-granularity text descriptions: Descriptions span ten granularity levels, from detailed, character-driven narratives of up to 40 words down to concise 5-word summaries, supporting a range of conditioning tasks.
  • Multi-view orthographic RGB images: Full-body and head images under uniform lighting, generated from fine-tuned FLUX text-to-image models with LoRA adapters for scan-like rendering. Four body views and four head views are created per identity.
  • Clothing asset images: Specific garment images extracted via a dedicated Virtual-TryOff procedure employing a reverse try-on protocol, allowing for fine control over avatar attire.
  • SMPL body-shape parameters and 2D keypoints: Precise geometric representations to support animation and re-targeting, including explicit face and hand articulation.

A summary of core data modalities is provided below:

| Modality | Description | Control parameters |
|---|---|---|
| Text description | 10 granularity levels | Detail (40w) → Summary (5w) |
| RGB images (body/head) | Orthographic, multi-view | FLUX + LoRA, consistent conditions |
| Clothing assets | Clean garment regions | Virtual-TryOff, GPT-4o selection |
| SMPL params/keypoints | 3D pose and shape | NLF regression + 2D orthographic refinement |

This annotation strategy provides rigorous supervision for avatar generation, conditional synthesis, and benchmarking.
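
For orientation, a single identity's record bundles all of these modalities. The Python sketch below shows one plausible organization; every field name and type is an illustrative assumption, not the released schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IdentityRecord:
    """One InfiniHumanData identity (hypothetical layout, for illustration only)."""
    identity_id: str
    # Ten text descriptions ordered coarse -> fine (5-word summary up to ~40-word detail).
    captions: List[str]
    # Paths to four orthographic full-body views and four head views.
    body_views: List[str]
    head_views: List[str]
    # Garment asset images extracted by the Virtual-TryOff step.
    clothing_assets: List[str]
    # SMPL shape (betas) and pose parameters, plus 2D keypoints used for refinement.
    smpl_betas: List[float]
    smpl_pose: List[float]
    keypoints_2d: List[List[float]]
```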

2. Fully Automatic Data Generation Pipeline

InfiniHumanData is constructed via a multi-stage, automated pipeline integrating foundation models and fine-tuned generative components:

  • Text description synthesis: Captioning leverages protocols akin to Trellis, followed by GPT-4o generation and summarization. In-context prompting yields diverse, high-fidelity character narratives spanning cultural and physical attributes.
  • Text-to-image generation: FLUX, adapted by LoRA adapters using thousands of real scan renderings, ensures orthographic and photorealistic outputs suitable for SMPL pose inference and multi-view consistency. Uniform lighting and camera parallelism eliminate perspective distortions.
  • Virtual-TryOff: A reverse try-on method with OminiControl, termed Instruct-Virtual-TryOff, selectively extracts and cleans garment images. A flow-matching objective

$$L_\text{VToFF}(\theta) = \mathbb{E}_{t,\epsilon}\,\big\|\, v_\theta\big(x_t, I^{(\text{vton})}, e^{(\text{text})}, t\big) - \big(\epsilon - I^{(\text{cloth})}\big) \,\big\|^2$$

enforces faithful garment extraction for later avatar synthesis.
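
A minimal PyTorch-style sketch of this flow-matching step, assuming a rectified-flow interpolation between the clean garment image and Gaussian noise and a velocity network with the signature used in the loss above (both are assumptions, not the paper's verified implementation):

```python
import torch

def vtoff_flow_matching_loss(v_theta, x_cloth, x_vton, text_emb):
    """One Virtual-TryOff flow-matching training step (sketch).

    v_theta: velocity network v_theta(x_t, x_vton, text_emb, t) (signature assumed).
    x_cloth: clean garment image batch (the extraction target).
    x_vton:  corresponding dressed-person ("try-on") image batch.
    """
    b = x_cloth.shape[0]
    t = torch.rand(b, device=x_cloth.device)          # uniform time in [0, 1]
    eps = torch.randn_like(x_cloth)                   # Gaussian noise endpoint
    t_ = t.view(b, 1, 1, 1)
    x_t = (1.0 - t_) * x_cloth + t_ * eps             # linear interpolation path (assumed)
    target = eps - x_cloth                            # constant velocity along that path
    pred = v_theta(x_t, x_vton, text_emb, t)
    return torch.mean((pred - target) ** 2)
```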

  • Garment selection and rejection: Multiple garment candidates are generated per identity, and GPT-4o rejects suboptimal ones using explicit prompts describing the expected garment features.
  • Monocular body fitting: SMPL parameters regressed via NLF are refined by aligning the orthographic SMPL joint projections with OpenPose 2D keypoints, using a weighted reprojection loss:

$$L_\text{reproj}(\theta) = \sum_k w_k \,\big\|\, \pi_\text{ortho}\big(J_k(\text{SMPL}(\theta, \beta))\big) - J_k^\text{OpenPose} \,\big\|^2$$
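
This refinement is a weighted least-squares problem over joints. The sketch below assumes an orthographic projection that drops the depth axis and applies a global scale and 2D translation, a common convention rather than the paper's exact parameterization:

```python
import numpy as np

def reprojection_loss(joints_3d, keypoints_2d, weights, scale, trans):
    """Weighted orthographic reprojection loss (sketch).

    joints_3d:    (K, 3) joint positions from SMPL(theta, beta).
    keypoints_2d: (K, 2) OpenPose detections in image coordinates.
    weights:      (K,) per-joint confidences w_k.
    scale, trans: orthographic scale (scalar) and 2D translation, shape (2,).
    """
    projected = scale * joints_3d[:, :2] + trans   # pi_ortho: drop z, then scale and shift
    residuals = projected - keypoints_2d           # (K, 2) per-joint reprojection error
    return np.sum(weights * np.sum(residuals ** 2, axis=1))
```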

  • Multi-view diffusion: A multi-view diffusion model introduces row-wise attention under orthographic constraints, reconstructing consistent body and head images:

$$L_\text{MVD}(\theta) = \mathbb{E}_{t,\epsilon} \sum_{p \in \{\text{body}, \text{head}\}} \big\|\, \epsilon_\theta\big(x_t^p, I^{(\text{in})}, I^{(\text{SMPL})}, t\big) - \epsilon \,\big\|^2$$
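
A minimal sketch of this summed epsilon-prediction objective, assuming a standard DDPM forward noising schedule and a denoiser with the signature shown in the loss (both assumptions):

```python
import torch

def mvd_loss(eps_theta, x0_views, cond_img, cond_smpl, alphas_bar):
    """Epsilon-prediction loss summed over body and head view stacks (sketch).

    eps_theta:  denoiser eps_theta(x_t, cond_img, cond_smpl, t) (signature assumed).
    x0_views:   {"body": tensor, "head": tensor} of clean multi-view image stacks.
    alphas_bar: cumulative noise schedule tensor, indexable by integer timestep.
    """
    total = 0.0
    for part in ("body", "head"):
        x0 = x0_views[part]
        b = x0.shape[0]
        t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)
        a = alphas_bar[t].view(b, 1, 1, 1)
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps   # standard DDPM forward noising
        total = total + torch.mean((eps_theta(x_t, cond_img, cond_smpl, t) - eps) ** 2)
    return total
```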

This modular pipeline enables scalable, personalized, and accurately annotated human data synthesis at minimal cost.

3. InfiniHumanGen: Diffusion-Based Avatar Synthesis

Building on InfiniHumanData, the InfiniHumanGen generative framework models the joint conditional probability

$$P\big(y \mid c^{\text{text}}, c^{\text{SMPL}}, c^{\text{cloth}}\big)$$

for photorealistic and controllable 3D avatar generation. It includes two principal model variants:

  • Gen-Schnell: A fast, multi-view 2D diffusion generator producing Gaussian splat-based 3D avatars. Conditioning signals (text, SMPL normal maps, clothing asset codes) are fused via VAE encoding. A single denoising step recovers clean images as

$$\tilde{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\; \epsilon_\theta\big(x_t, c^{\text{text}}, c^{\text{SMPL}}, c^{\text{cloth}}, t\big) \right)$$

followed by 3D-GS rendering for geometric consistency.

  • Gen-HRes: A high-resolution pipeline based on OminiControl2, employing a multi-image-to-image translation architecture. Flow-matching objectives extend to text, SMPL, and garment conditions. By fixing Gaussian noise, fine-grained text-based edits (e.g., accessory or color changes) preserve identity invariants.

Gen-Schnell achieves generation in ~12 seconds per identity, whereas Gen-HRes delivers high-fidelity avatars in ~4 minutes, suitable for applications demanding photorealistic detail.
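
The single-step recovery used by Gen-Schnell follows in closed form from the DDPM forward process. A minimal sketch, with all conditioning signals bundled into one argument for brevity (the denoiser signature is an assumption):

```python
import torch

def recover_x0(eps_theta, x_t, cond, t, alphas_bar):
    """Single-step clean-image estimate from a noisy sample (sketch).

    Implements x0_tilde = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t),
    where cond bundles the text, SMPL normal-map, and clothing conditions.
    """
    a = alphas_bar[t].view(-1, 1, 1, 1)
    eps_hat = eps_theta(x_t, cond, t)              # predicted noise (signature assumed)
    return (x_t - (1.0 - a).sqrt() * eps_hat) / a.sqrt()
```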

4. Quantitative and Qualitative Validation

Validation comprises both a user study and objective metrics:

  • User study: In a paired discrimination test, InfiniHumanData images received 765 votes versus 746 for authentic scan renderings, indicating that the synthetic identities are effectively indistinguishable from real scans.
  • Objective metrics: FID and CLIP scores consistently show superior performance compared to existing methods in both appearance quality and text description alignment.
  • Editing capability: Identity preservation during text-based edits was validated via Gen-HRes, with controlled variation in accessories and garments.

This empirical evaluation underlines the practical fidelity and reliability of the dataset and generative outputs.
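
As one concrete way to reproduce the text-alignment side of this evaluation, CLIP-based image-text similarity can be computed with the Hugging Face transformers CLIP implementation; the checkpoint below is an illustrative choice, not necessarily the one used in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_alignment_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (sketch)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```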

5. Applications and Significance

InfiniHumanData and InfiniHumanGen provide direct utility in:

  • Gaming & VR: Rapid, customizable avatar creation for immersive environments.
  • Digital Fashion: Realistic virtual try-on with garment asset control.
  • Animation & Telepresence: SMPL-rigged avatars supporting motion capture re-application.
  • Figurine fabrication: Watertight mesh generation suitable for 3D printing.

The practicality and scalability afforded by foundation model distillation enable affordable and inclusive synthesis for creative, research, and industrial domains.

6. Limitations and Future Directions

Gen-HRes currently generates avatars more slowly than Gen-Schnell; ongoing work aims to accelerate high-resolution synthesis. GPT-4o declines to describe famous identities due to privacy constraints, so such identities are absent from the dataset. Improving mesh reconstruction in self-occluded regions is identified as a direction for further research. A comprehensive public release of the code, pipeline, and all data modalities is planned at https://yuxuan-xue.com/infini-human.

A plausible implication is that the InfiniHumanData paradigm, integrating vision-language and diffusion-based synthesis with foundation model distillation, marks a significant advance in the scalable, precise, and diverse generation of 3D human data, supporting both academic and industrial progress in controllable avatar technologies (Xue et al., 13 Oct 2025).
