InfiniHumanData: 3D Avatars Dataset

Updated 15 October 2025
  • InfiniHumanData is a large-scale, multi-modal dataset comprising 111,000 automatically generated 3D human identities annotated with multi-granularity text, RGB images, clothing assets, and SMPL parameters.
  • The pipeline integrates foundation models like GPT-4o, FLUX, and OminiControl to generate consistent, richly labeled data that supports precise 3D reconstruction and avatar control.
  • The dataset enables practical applications in VR, gaming, digital fashion, and re-animation by allowing fine-grained manipulation of body shape, pose, and garment style.

InfiniHumanData is a fully automatic, large-scale multi-modal dataset designed to enable the generation of realistic, diverse, and precisely controllable 3D human avatars. Constructed via a synergistic distillation of foundation models spanning vision-language, image generation, and body fitting, InfiniHumanData comprises 111,000 identities with unprecedented diversity, each annotated across multiple modalities. The dataset was specifically developed to overcome the expense and limitations of traditional human scan datasets, enabling scalable, richly annotated data at minimal cost. The underlying pipeline integrates multi-granularity text descriptions, orthographic multi-view RGB imagery, detailed clothing images derived from image-to-image translation, and SMPL body shape parameters. These assets support the creation and controllable manipulation of avatars using advanced diffusion-based generative models.

1. Dataset Composition and Annotation Modalities

InfiniHumanData contains 111,000 automatically generated 3D human identities, each annotated across four principal modalities:

  • Multi-granularity text descriptions: Each identity is labeled with textual descriptions ranging in length from approximately 40 words (highly detailed) down to as few as 5 words (minimal). This multi-resolution annotation enables both fine-grained and coarse semantic conditioning.
  • Orthographic multi-view RGB images: All identities include multiple orthographic full-body and head RGB images rendered under uniform lighting to facilitate compatibility with 3D reconstruction tasks.
  • Clothing modality: Clothing images are generated using a process termed “Virtual-TryOff,” which reverses the virtual try-on paradigm to extract clean garment assets from full-body imagery, utilizing modified image-to-image translation architectures.
  • SMPL body shape parameters and keypoints: Detailed body shape, pose, and facial/hand landmarks are derived via two-stage monocular body fitting, enabling precise 3D parametric body modeling.

The dataset exhibits broad coverage with respect to ethnicity, age (including children), clothing style, hair type, and body shape. Diversity is achieved by employing GPT-4o to produce varied text descriptions from human scan datasets, which are then summarized at ten granularity levels.
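As an illustration, the summarization step could be scripted as follows with the OpenAI Python client; the prompt wording and the ten word-count targets are assumptions for this sketch, not the paper's exact protocol.

```python
# Hypothetical sketch: condensing a detailed caption into coarser granularity
# levels with GPT-4o. Prompt text and level definitions are illustrative
# assumptions, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_caption(detailed_caption: str, target_words: int) -> str:
    """Ask GPT-4o to compress a ~40-word description to `target_words` words."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You summarize descriptions of a person's appearance, "
                        "preserving ethnicity, age, clothing, and hair attributes."},
            {"role": "user",
             "content": f"Summarize in at most {target_words} words:\n{detailed_caption}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Ten granularity levels, from ~40 words down to ~4 (the levels are assumed).
detailed = "A middle-aged East Asian man with short gray hair wearing a navy parka..."
captions = {w: summarize_caption(detailed, w) for w in range(40, 0, -4)}
```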

| Modality | Annotation Type | Generation/Extraction Method |
|---|---|---|
| Text descriptions | Multi-granularity | GPT-4o protocol + summarization |
| RGB images | Multi-view, orthographic | FLUX text-to-image with LoRA fine-tuning |
| Clothing images | Garment-only, detailed | OminiControl + flow-matching loss |
| Body shape (SMPL) | Parameters + keypoints | Monocular fitting, then reprojection alignment |
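In code, a single annotated identity might be represented as follows; the field names and array shapes are illustrative assumptions rather than the released schema.

```python
# Hypothetical per-identity record mirroring the four annotation modalities.
# Field names and shapes are assumptions; the released format may differ.
from dataclasses import dataclass
import numpy as np

@dataclass
class InfiniHumanSample:
    identity_id: str
    captions: dict[int, str]            # granularity level -> text description
    rgb_views: list[np.ndarray]         # orthographic full-body/head renders, (H, W, 3)
    clothing_images: list[np.ndarray]   # garment-only images from Virtual-TryOff
    smpl_betas: np.ndarray              # (10,) body shape parameters
    smpl_pose: np.ndarray               # (72,) axis-angle pose parameters
    keypoints_2d: np.ndarray            # (K, 3) body/face/hand landmarks (x, y, conf)
```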

2. Data Generation Pipeline

The pipeline underpinning InfiniHumanData coordinates the distillation of multiple foundation models via the following steps:

  • Text annotation: Existing scan datasets (THuman2.1, CustomHuman, 2K2K) receive captions generated by protocols from Trellis and expanded with GPT-4o, which introduces variation and granularity in attributes such as ethnicity and clothing.
  • Text-to-image generation: The FLUX model, fine-tuned with a LoRA adapter, synthesizes “scan-like” RGB images with orthographic viewpoints and consistent lighting, optimizing them for downstream 3D reconstruction.
  • Clothing extraction: Garment images are isolated by a modified image-to-image translation module built on OminiControl, utilizing a flow-matching objective defined as:

$$L_{\text{VToFF}}(\theta) = \mathbb{E}_{t,\epsilon} \left\| v_{\theta}(x_t, I^{vton}, e^{text}, t) - (\epsilon - I^{cloth}) \right\|^2$$

where $x_t$ is the noisy garment image at step $t$, $\epsilon$ is Gaussian noise, $I^{vton}$ is the dressed full-body image, $I^{cloth}$ is the target garment image, and $e^{text}$ provides conditional instructions.
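In code, this objective amounts to regressing a velocity field along a linear noise path; the following minimal sketch assumes a stand-in `velocity_model` for the OminiControl-based network.

```python
# Minimal sketch of the Virtual-TryOff flow-matching objective above.
# `velocity_model` stands in for the OminiControl-based v_theta; its
# signature and the linear noising path are assumptions.
import torch

def vtoff_loss(velocity_model, i_cloth, i_vton, text_emb):
    """L_VToFF: regress the velocity (eps - I_cloth) along a linear noise path."""
    b = i_cloth.shape[0]
    t = torch.rand(b, device=i_cloth.device)            # timestep ~ U(0, 1)
    eps = torch.randn_like(i_cloth)                     # Gaussian noise
    t_ = t.view(b, 1, 1, 1)
    x_t = (1.0 - t_) * i_cloth + t_ * eps               # noisy garment image
    target = eps - i_cloth                              # flow-matching target
    v_pred = velocity_model(x_t, i_vton, text_emb, t)   # v_theta(x_t, I_vton, e_text, t)
    return ((v_pred - target) ** 2).mean()
```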

  • Sample selection: GPT-4o evaluates multiple garment candidates for each identity and selects the best match based on specified attributes.
  • SMPL parameter fitting: Initial regression is performed using models such as NLF, followed by refinement to minimize reprojection error between SMPL and OpenPose-detected 2D joints:

$$L_{reproj}(\theta) = \sum_{k=1}^{K} w_k \left\| \pi_{ortho}\left(J_k(\mathrm{SMPL}(\theta, \beta))\right) - J_k^{OpenPose} \right\|^2$$

where $\pi_{ortho}$ denotes orthographic projection, $J_k$ the $k$-th model joint, $J_k^{OpenPose}$ the corresponding detected 2D keypoint, and $w_k$ a per-keypoint weight.
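A corresponding optimization step might look as follows; the SMPL joint regressor and the orthographic camera parameters (`scale`, `trans`) are stand-ins, not the paper's exact fitting code.

```python
# Sketch of the reprojection refinement above. `smpl_joints` is a stand-in
# for an SMPL layer mapping pose/shape to 3D joints; the orthographic camera
# parameters are assumed fixed or jointly optimized.
import torch

def reproj_loss(pose, betas, joints2d_openpose, weights, smpl_joints, scale, trans):
    """Weighted L2 between orthographically projected SMPL joints and OpenPose."""
    j3d = smpl_joints(pose, betas)                  # (K, 3) model joints
    j2d = scale * j3d[:, :2] + trans                # orthographic projection pi_ortho
    residual = j2d - joints2d_openpose              # (K, 2)
    return (weights * (residual ** 2).sum(dim=-1)).sum()
```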

  • Multi-view diffusion: A specifically trained diffusion model generates high-resolution multi-view images. Specialized attention mechanisms, including row-wise attention across fixed epipolar geometries, enforce inter-view consistency.
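A minimal sketch of such row-wise attention follows, assuming view features of shape (B, V, H, W, C) and that the fixed orthographic cameras, rotated about the vertical axis, align epipolar lines with image rows; the layer is illustrative, not the paper's exact architecture.

```python
# Sketch of row-wise cross-view attention: with fixed orthographic cameras
# rotated about the vertical axis, epipolar lines coincide with image rows,
# so attending within each row across all views suffices for consistency.
import torch
import torch.nn as nn

class RowWiseCrossViewAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, H, W, C) -- V views, feature maps of size H x W
        b, v, h, w, c = feats.shape
        rows = feats.permute(0, 2, 1, 3, 4).reshape(b * h, v * w, c)
        out, _ = self.attn(rows, rows, rows)        # tokens mix across views per row
        return out.reshape(b, h, v, w, c).permute(0, 2, 1, 3, 4)
```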

3. Technological Innovations

Several technological advancements distinguish InfiniHumanData:

  • Integration of foundation models: The pipeline coalesces capabilities from GPT-4o (structured annotation), FLUX (orthographic image synthesis), OminiControl (clothing extraction), and multi-view diffusion models.
  • Generative pipelines: Two novel avatar creation pipelines—Gen-Schnell and Gen-HRes—enable rapid and photorealistic mesh generation with fine-grained controllability:

    • Gen-Schnell: Utilizes multi-view 2D diffusion and Gaussian splatting for 3D avatar synthesis in approximately 12 seconds, with consistency enforced through:

$$\tilde{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_{\theta}(x_t, c^{text}, c^{SMPL}, c^{cloth}, t) \right)$$

    where the estimate $\tilde{x}_0$ is subsequently incorporated into a 3D Gaussian splatting module $g(\phi)$.

    • Gen-HRes: Leverages a multi-modal image-to-image translation framework (OminiControl2 + Flux-Dev) for high-resolution mesh generation in approximately 4 minutes, supporting nuanced attribute adjustments.
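The denoised estimate in the Gen-Schnell step above is the standard DDPM x0-prediction; a minimal sketch, assuming an epsilon-prediction network and a precomputed cumulative-alpha schedule:

```python
# Sketch of the denoised-image estimate used by Gen-Schnell: the standard
# DDPM x0-prediction from the noise estimate, conditioned on text, SMPL, and
# clothing signals. `eps_model` and its conditioning signature are assumptions.
import torch

def predict_x0(eps_model, x_t, t, alpha_bar, c_text, c_smpl, c_cloth):
    """x~_0 = (x_t - sqrt(1 - abar_t) * eps_theta) / sqrt(abar_t)."""
    a_t = alpha_bar[t].view(-1, 1, 1, 1)                 # cumulative alpha at integer step t
    eps = eps_model(x_t, c_text, c_smpl, c_cloth, t)     # predicted noise
    return (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
```

The resulting per-view estimates then supervise the 3D Gaussian splatting module $g(\phi)$.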

  • Joint conditioning: InfiniHumanGen is conditioned on a composite signal $P(y \mid c^{text}, c^{SMPL}, c^{cloth})$, allowing simultaneous control of text, body shape/pose, and garment style.
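How the three signals combine at sampling time is not spelled out here; one plausible reading is classifier-free guidance over the joint condition, sketched below with an assumed single null condition for all three signals.

```python
# Sketch of sampling under the composite condition P(y | c_text, c_smpl, c_cloth)
# via classifier-free guidance; guiding against one joint null condition is an
# assumption about how the signals combine, not the paper's stated method.
import torch

def guided_eps(eps_model, x_t, t, cond, null_cond, scale: float = 5.0):
    """Combine conditional and unconditional noise estimates."""
    eps_c = eps_model(x_t, cond["text"], cond["smpl"], cond["cloth"], t)
    eps_u = eps_model(x_t, null_cond["text"], null_cond["smpl"], null_cond["cloth"], t)
    return eps_u + scale * (eps_c - eps_u)
```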

4. Evaluation, Performance, and Impact

Extensive evaluation demonstrates that Gen-Schnell and Gen-HRes both improve upon prior techniques with respect to visual quality, generation speed, and alignment of output with conditioning signals. Quantitative metrics include leading scores on multi-view consistency, CLIP Score, and FID, while user studies establish that generated identities are indistinguishable from scan renderings. These models set a new standard for controllable 3D human avatar synthesis.
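For reference, the CLIP Score component can be computed with an off-the-shelf CLIP model; the checkpoint and the plain cosine-similarity convention below (sometimes scaled by 100 in the literature) are assumptions, since the paper's exact evaluation setup is not reproduced here.

```python
# Sketch of a CLIP Score computation for a generated image and its caption,
# using a standard public checkpoint (an assumption, not the paper's setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()
```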

Applications span digital avatar creation for VR, gaming, digital fashion (virtual try-on systems), social telepresence, re-animation using SMPL motion data, and physical 3D printing (figurine fabrication). The ability to alter attributes such as clothing or ethnicity while preserving identity supports new paradigms in personalized content creation.

5. Future Directions and Public Availability

InfiniHumanData, alongside the data generation pipeline and generative models (InfiniHumanGen: Gen-Schnell and Gen-HRes), is scheduled for public release via the project website (https://yuxuan-xue.com/infini-human). Proposed future research directions include:

  • Advancing end-to-end high-resolution 3D generation (enhanced Gen-Schnell) to better resolve facial detail without increased inference time.
  • Enriching the dataset with additional modalities (e.g., celebrity identities) annotated in a privacy-compliant manner using alternative LLMs.
  • Improving 3D mesh reconstruction to mitigate occlusion artifacts through data-driven mesh refinement, moving beyond SMPL-driven volumetric carving.

InfiniHumanData constitutes a scalable, richly annotated resource that fundamentally addresses key limitations of previous human avatar datasets by exploiting automatic distillation of multiple foundation models. The resulting framework yields unprecedented speed, fidelity, diversity, and controllability in 3D human synthesis, supporting a wide range of downstream applications.
