
uHumans Collection: Tera-Scale Human Dataset

Updated 30 December 2025
  • uHumans Collection is a comprehensive multiview dataset capturing detailed human expressions across face, hand, gaze, body, and garment modalities from 772 subjects.
  • It employs a high-precision capture infrastructure with 107 synchronized cameras and advanced 3D reconstruction pipelines for accurate, view-specific modeling.
  • The dataset supports both model-based and learning-based approaches, enabling robust research in pose estimation, appearance synthesis, and social telepresence.

The uHumans Collection refers to the tera-scale multiview human body expression resource released in the HUMBI project. It provides synchronized high-definition imagery and 3D model reconstructions of 772 diverse subjects, comprehensively capturing facial, hand, gaze, full-body, and garment information. By enabling accurate view-specific modeling of human geometry and appearance, HUMBI’s uHumans Collection supports both model-based and learning-based approaches for tasks such as pose estimation, appearance synthesis, and social telepresence. This resource is designed to address the limitations of existing datasets regarding subject diversity, multiview density, and simultaneous coverage of multiple “body signal” modalities (Yoon et al., 2021).

1. Scope and Demographics

The uHumans Collection comprises 772 volunteer participants, recruited to ensure substantial diversity in gender (50.7 % female, 49.3 % male), age (26 % teenagers, 29 % in their 20s, 11 % in their 30s, and the remainder across other age groups), and ethnicity/skin tone (black, dark-brown, light-brown, white). Participants wear natural garments in a wide variety of styles, including dresses, T-shirts (short/long sleeve), jackets, hats, shorts, long pants, and various combinations. Each subject completes four guided sessions, generating:

  • ~93,000 gaze images,
  • 17.3 million face images,
  • 24 million hand images,
  • 26 million images each for body and garment (Yoon et al., 2021).

2. Capture Infrastructure and Synchronization

The imaging apparatus consists of a modular dodecagon frame (4.2 m diameter, 2.5 m high), with 107 synchronized GoPro HD cameras (38 HERO 5, 69 HERO 3+) placed at ~10° intervals on two arcs (heights 0.8 m and 1.6 m) plus a 38-camera frontal hemisphere for face/gaze capture. Subjects interact with LED video screens guiding each performance, with synchronization flashes provided by small LED panels, ensuring frame-precise alignment (≤ 15 ms skew).
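Ignoring the frontal hemisphere, the two-arc layout can be sketched as evenly spaced camera positions on horizontal rings. This is an idealized illustration (the real rig mounts cameras on a dodecagon frame and covers arcs rather than full rings); the function name and spacing here are assumptions:

```python
import math

def ring_camera_positions(radius, height, step_deg=10.0):
    """Place cameras at fixed angular intervals on a horizontal ring.

    Returns (x, y, z) positions with y vertical and the origin at the
    stage center, matching the coordinate convention described above.
    """
    n = int(round(360.0 / step_deg))
    positions = []
    for i in range(n):
        a = math.radians(i * step_deg)
        positions.append((radius * math.cos(a), height, radius * math.sin(a)))
    return positions

# Two arcs of the dodecagon frame (4.2 m diameter -> 2.1 m radius).
lower = ring_camera_positions(2.1, 0.8)
upper = ring_camera_positions(2.1, 1.6)
```

At 10° spacing a full ring holds 36 positions per arc; the actual camera count differs because the physical rig uses arcs and an extra face/gaze hemisphere.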

Calibration is performed using COLMAP Structure-from-Motion, with the scale defined by fixed physical baselines and the coordinate system origin at the stage center, y-axis vertical. After fitting parametric or template meshes, per-view HD imagery is back-projected onto UV atlases to reconstruct dense, view-specific appearance maps for all modalities. Eye image patches are normalized (36 × 60 px) and mapped to 256 × 256 UV atlases for gaze (Yoon et al., 2021).
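The eye-patch normalization step can be illustrated with a minimal, dependency-light sketch. The crop-and-resample logic and the `normalize_eye_patch` name are assumptions; the actual pipeline warps patches from detected eye landmarks rather than taking an axis-aligned crop:

```python
import numpy as np

def normalize_eye_patch(image, corners, out_hw=(36, 60)):
    """Crop an axis-aligned eye region and resample to a fixed-size patch.

    `image` is an H x W x 3 array; `corners` = (top, left, bottom, right)
    in pixel coordinates. Nearest-neighbor resampling keeps the sketch
    dependency-free; a real pipeline would use bilinear warping.
    """
    t, l, b, r = corners
    crop = image[t:b, l:r]
    h, w = out_hw
    rows = np.linspace(0, crop.shape[0] - 1, h).round().astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, w).round().astype(int)
    return crop[np.ix_(rows, cols)]
```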

3. Modalities and Data Representations

The dataset includes five primary “body signals”:

| Signal | Images | Mesh Model / Geometry | Appearance Representation |
|---|---|---|---|
| Gaze | ~93k | Unit vector $g \in S^2$ | 36×60 normalized patch, 256×256 UV |
| Face | 17.3M | Surrey 3DMM; $V(\alpha^s, \alpha^e)$ | 1024×1024 UV atlas, RGBA |
| Hand | 24M | MANO; $V(\alpha^{th}, \alpha^b)$ | 512×512 UV atlas |
| Body | 26M | SMPL; $V(\beta, \theta)$ | 1024×1024 UV atlas, occupancy |
| Garment | 26M | Custom template deformations | Per-template UV atlas |

Additionally, semantic 3D point clouds are obtained via multiview stereo and labeled using a 20-class CNN (e.g., hair, skin, shirt) (Yoon et al., 2021).

  • Gaze: Geometry encoded as unit vectors in a head-centered frame, derived from eye centers and mouth landmarks.
  • Face: Meshes parameterized using 63D shape and 6D expression vectors; appearance encoding via view-specific UV texture mapping and spherical harmonics lighting.
  • Hand: MANO-driven meshes with 45D pose, 10D shape, and dense per-view textures.
  • Body: SMPL meshes fit using 10D shape, 72D (24×3) pose, and detailed occupancy volumes via silhouette space carving.
  • Garment: In-house mesh templates warped via per-vertex SE(3) transforms and separated appearance atlases.
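The gaze representation above, a unit vector in a head-centered frame built from eye centers and a mouth landmark, can be sketched as follows. The specific axis convention is an assumption for illustration, not HUMBI's published one:

```python
import numpy as np

def head_frame(left_eye, right_eye, mouth):
    """Build an orthonormal head-centered frame from 3D landmarks.

    x: right-eye -> left-eye axis; z: normal of the eye-mouth plane
    (roughly 'forward'); y: completes the right-handed frame.
    """
    x = left_eye - right_eye
    x /= np.linalg.norm(x)
    down = mouth - 0.5 * (left_eye + right_eye)
    z = np.cross(x, down)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return np.stack([x, y, z])   # rows are the frame axes

def gaze_unit_vector(eye_center, target, frame):
    """Express the gaze direction g in S^2, in head coordinates."""
    g = frame @ (target - eye_center)
    return g / np.linalg.norm(g)
```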

4. 3D Reconstruction Pipeline

A two-stage 3D reconstruction process applies to all signals:

  1. Keypoint/Silhouette Detection: 2D keypoints or silhouette boundaries are estimated per view.
  2. Multiview Fusion and Model Fitting: Nonlinear optimization aligns parametric meshes to multiview constraints using energy formulations tailored for each modality:

Face Fitting:

E = E^k + \lambda^a E^a, \quad \lambda^a = 10^{-5}

with

E^k(Q,\alpha^s,\alpha^e) = \sum_i \|K_i - Q(V_i)\|^2, \quad E^a(\alpha^s,\alpha^e,\alpha^t,\alpha^h) = \sum_j \|c_j - g(V,T,\alpha^h,P_j)\|^2

where $K_i$ are 3D landmarks, $Q$ is a rigid transformation, and $g(\cdot)$ is a Lambertian rendering function.

Hand Fitting:

E = E^k + \lambda^{th} E^{th} + \lambda^b E^b

with regularization on pose and shape parameters.

Body Fitting:

E(\theta,\beta,t,s) = E^p + \lambda^s E^s + \lambda^r E^r

combining keypoint, silhouette-hull, and temporal consistency.

Garment Fitting:

E(R,t)=Ec+λoEo+λgEgE(R, t) = E^c + \lambda^o E^o + \lambda^g E^g

which aligns the garment mesh to the fitted body and the observed outer hull, with regularization.

All optimizations are performed with Ceres or alternating Gauss–Newton routines (Yoon et al., 2021).
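The weighted multi-term energies above are standard regularized nonlinear least-squares problems. The sketch below shows one Gauss–Newton iteration scheme on a toy synthetic model standing in for the real SMPL/MANO forward functions; all names, the forward model, and the weights here are illustrative assumptions:

```python
import numpy as np

# Toy stand-in for the fitting energies: a data term ||f(p) - k||^2 plus
# an L2 regularizer lam * ||p||^2 (playing the role of the weighted prior
# terms). Real fitting would use SMPL/MANO forward kinematics as f; here
# f is a mildly nonlinear synthetic model so the iteration is nontrivial.
rng = np.random.default_rng(0)
J = rng.normal(size=(30, 5))
p_true = np.array([0.5, -1.0, 0.2, 0.0, 1.5])

def f(p):
    return J @ p + 0.01 * (J @ p) ** 2       # hypothetical forward model

def jacobian(p):
    return J + 0.02 * (J @ p)[:, None] * J   # analytic d f / d p

k = f(p_true)                                # synthetic "observed keypoints"
lam = 1e-6                                   # regularization weight

def gauss_newton(p0, iters=20):
    p = p0.copy()
    for _ in range(iters):
        r = f(p) - k                         # data residual
        Jp = jacobian(p)
        # Normal equations of the regularized, linearized problem.
        H = Jp.T @ Jp + lam * np.eye(len(p))
        g = Jp.T @ r + lam * p
        p = p - np.linalg.solve(H, g)
    return p

p_fit = gauss_newton(np.zeros(5))
```

Solvers such as Ceres automate exactly this linearize-and-solve loop, with trust-region damping added for robustness.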

5. Pose-Guided Appearance Rendering Benchmark

The collection serves as the foundation for a benchmark in pose-guided appearance rendering:

  • Task: Given a single HD view of subject $A$ in pose $p_s$ and camera $c_s$, generate $A$ in a new pose $p_t$ and/or camera $c_t$.
  • Target: Photorealistic 256×256 (resp. 256×176 for PPA) RGB output.
  • Dataset Split: 100 training subjects (101,000 source/target pairs); 40 testing subjects (15,923 pairs).
  • Baselines: PG-1/2, C2GAN, PPA, SGAN, GFLA, NHRR—all trained from scratch on the HUMBI split.
  • Metrics: LPIPS, FID, Mask-LPIPS, Mask-FID (human mask from YOLACT).
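The masked metric variants restrict evaluation to the detected human region. LPIPS and FID require their own pretrained models, so the sketch below illustrates only the masking step with a plain pixel metric; `masked_mse` is a hypothetical helper, not part of the benchmark code:

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error restricted to mask == 1 pixels.

    pred/target: H x W x 3 float arrays; mask: H x W binary array (e.g.
    a human mask from a detector such as YOLACT). Mask-LPIPS / Mask-FID
    apply the same idea with perceptual features instead of raw pixels.
    """
    m = mask.astype(bool)
    diff = (pred - target)[m]
    return float((diff ** 2).mean())
```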

Test set results:

  • Best FID: GFLA (14.25, Mask-FID 13.49)
  • Best Mask-LPIPS: NHRR (0.115)
  • PPA shows improved background modeling with masking (Mask-FID: 19.95 vs 114.08 unmasked).

Key observations: view-dependent lighting and backgrounds are learnable; clothing style transfer is challenging; facial detail fidelity and hair synthesis are ongoing research frontiers (Yoon et al., 2021).

The dataset is publicly available at http://humbi-data.net for academic research. File structure per subject/session includes:

  • /images/{gaze,face,hand,body,garment}/camera_##.jpg
  • /keypoints/3D_landmarks_{face,hand,body}.npz
  • /meshes/{face.obj,hand.obj,body.obj,garment.obj} plus UV maps (.png)
  • /occupancy/body_hull.ply, semantic_point_clouds.ply
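Given the layout above, a minimal loader might look like this. The `.npz` key contents and the vertex-only OBJ parse are assumptions about the released files, not documented behavior:

```python
import numpy as np

def load_subject(session_dir):
    """Load precomputed body keypoints and the body mesh for one session.

    Paths follow the per-subject/session layout listed above.
    """
    kp = np.load(f"{session_dir}/keypoints/3D_landmarks_body.npz")
    body_landmarks = kp[kp.files[0]]       # first stored array (assumed key)

    # Minimal Wavefront OBJ parse: vertex records only.
    verts = []
    with open(f"{session_dir}/meshes/body.obj") as fh:
        for line in fh:
            if line.startswith("v "):
                verts.append([float(v) for v in line.split()[1:4]])
    return body_landmarks, np.asarray(verts)
```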

Recommended usage:

  • Combine HUMBI with single-view datasets to enhance single-image 3D reconstruction.
  • Utilize pre-computed meshes/keypoints to avoid computationally intensive optimization.
  • Leverage dense multiview appearance maps to condition neural rendering models.

As the first tera-scale dataset combining 772 subjects, 107 views, and five fundamental body signals, the uHumans Collection is positioned as a central resource for advancing high-fidelity, view-specific digitization and rendering of human expressions, complementing prior sparser datasets such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio (Yoon et al., 2021).
