MannequinVideos: M2H Generation Benchmark
- MannequinVideos is a benchmark dataset built from controlled studio recordings that enables cross-domain synthesis of photorealistic human videos from mannequin footage.
- It features fixed rotations, prescribed poses, and multi-view captures to rigorously evaluate clothing fidelity, temporal coherence, and identity preservation using metrics like PSNR, SSIM, and FVD.
- The dataset minimizes environmental confounds, supporting reproducible ablation studies in M2H video generation and advancing applications in online fashion retail and augmented reality try-on.
The MannequinVideos dataset is a purpose-built video benchmark designed to facilitate research in mannequin-to-human (M2H) video generation, a task that synthesizes identity-controllable, photorealistic human videos from mannequin-based clothing footage. Developed within a controlled studio environment, MannequinVideos enables quantitative evaluation of frameworks like M2HVideo, focusing specifically on the challenges intrinsic to cross-domain transfer from inanimate mannequin displays to lifelike human representations. Its structure and methodology are devised to offer rigorous benchmarks for claims regarding clothing consistency, identity preservation, and spatiotemporal video fidelity.
1. Data Acquisition and Structural Characteristics
The MannequinVideos dataset was constructed through studio-based recordings featuring real mannequins as display objects. The acquisition setup incorporated a fixed rotating platform with the mannequin centrally positioned. Each mannequin was rotated smoothly from right to left, capturing comprehensive multi-view perspectives of the clothing and form across the full rotation arc. This ensures that viewpoints with substantial pose variance and occlusion edge cases are available for both training and assessment.
Each video in the dataset comprises 60 consecutive frames at a fixed spatial resolution, encoding both the fine-grained detail necessary for evaluating clothing consistency and the temporal information critical for assessing output coherence. Mannequins are arranged in four distinct prescribed poses, and each is dressed in one of the following outfit types:
- T-shirts (4 styles)
- Long-sleeved shirts (3 styles)
- One-piece dresses (3 styles)
For additional variation and realism, T-shirts and long-sleeved shirts are randomly paired with either shorts or long pants, diversifying visual appearance within a controlled set of outfit combinations.
| Attribute | Specification | Purpose |
|---|---|---|
| Capture environment | Studio, rotating platform, controlled lighting | Artifact attribution, reproducibility |
| Video resolution | Fixed, uniform across clips | Maintains detail for clothing, faces |
| Frames per video | 60 | Sufficient window for short-form translation |
| View coverage | Full right-to-left rotation | Multi-view generalization |
| Clothing combinations | 3 types, multiple styles, controlled pairing | Clothing fidelity, appearance diversity |
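To make the structure above concrete, the following is a minimal sketch of how a clip-level metadata index for such a dataset could be assembled. The directory layout, file-naming scheme, and field names (`pose_id`, `outfit_type`, `style_id`) are illustrative assumptions, not the dataset's published format.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Illustrative assumptions: a flat directory of clips named
# "<pose_id>_<outfit_type>_<style_id>.mp4"; the real release layout may differ.
OUTFIT_TYPES = {"tshirt", "longsleeve", "dress"}
FRAMES_PER_CLIP = 60  # each video contains 60 consecutive frames

@dataclass
class ClipRecord:
    path: Path
    pose_id: int        # one of four prescribed mannequin poses
    outfit_type: str    # tshirt / longsleeve / dress
    style_id: int       # style index within the outfit type

def index_clips(root: str) -> List[ClipRecord]:
    """Build a metadata index over all mannequin clips under `root`."""
    records = []
    for video in sorted(Path(root).glob("*.mp4")):
        pose, outfit, style = video.stem.split("_")
        if outfit not in OUTFIT_TYPES:
            raise ValueError(f"Unexpected outfit type in {video.name}")
        records.append(ClipRecord(video, int(pose), outfit, int(style)))
    return records

if __name__ == "__main__":
    for rec in index_clips("MannequinVideos/clips")[:3]:
        print(rec)
```

Keeping pose, outfit type, and style explicit in the index makes it straightforward to slice evaluations by pose or garment category, which is the kind of controlled comparison the benchmark is designed to support.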
2. Dataset Role in M2HVideo Framework Evaluation
Within the mannequin-to-human video generation task, the framework is provided a mannequin video $V_m$ and a target identity image $I_{id}$, and must synthesize a photorealistic human video $V_h$, i.e., $\mathcal{G}(V_m, I_{id}) = V_h$, where $\mathcal{G}$ denotes the generative model.
This cross-domain mapping sets the task apart from those defined on traditional datasets (e.g., UBC Fashion, ASOS), which contain only human models. The core challenge is twofold:
- Preserve the source clothing’s visual and structural fidelity under viewpoint and pose transformation.
- Synthesize and align human-specific identity information (primarily facial) onto the mannequin video across all frames.
MannequinVideos serves as a robust testbed, imposing the constraint that all clothing appearance is sampled from non-human displays, thereby intensifying the difficulty of both identity injection and cross-domain translation.
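Abstractly, a model evaluated on this benchmark only needs to expose the mapping $\mathcal{G}(V_m, I_{id}) = V_h$ described above. The sketch below is an illustrative interface, not the published M2HVideo API; names such as `M2HGenerator` and `generate`, and the assumed array layouts, are placeholders.

```python
from typing import Protocol
import numpy as np

class M2HGenerator(Protocol):
    """Abstract interface for mannequin-to-human video generation.

    Illustrative only: the actual M2HVideo implementation details
    (tensor layouts, conditioning inputs) are not specified here.
    """

    def generate(
        self,
        mannequin_video: np.ndarray,   # (T, H, W, 3) mannequin clip, T = 60 frames
        identity_image: np.ndarray,    # (H, W, 3) reference identity image
    ) -> np.ndarray:                   # (T, H, W, 3) synthesized human video
        ...

def run_m2h(model: M2HGenerator, mannequin_video: np.ndarray,
            identity_image: np.ndarray) -> np.ndarray:
    """Run the cross-domain mapping G(V_m, I_id) -> V_h for one clip."""
    human_video = model.generate(mannequin_video, identity_image)
    assert human_video.shape[0] == mannequin_video.shape[0], \
        "output must preserve the clip's 60-frame temporal structure"
    return human_video
```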
3. Methodological Contributions and Benchmark Utility
The MannequinVideos dataset introduces empirically grounded, cross-domain benchmarks for mannequin-to-human video generation, a setting previously underrepresented in experimental evaluation. The studio environment, uniform viewpoint sampling, and controlled pose sets make observed output differences easily attributable to model strengths and weaknesses, minimizing experimental confounds due to environmental or subject variability.
Such control is vital for rigorous analysis of:
- Clothing consistency (accurate translation and preservation of fine details across frames)
- Facial identity preservation (consistent retention of the reference identity's facial cues across frames)
- Artifact localization (tracing misalignments or texture drift to their source)
This facilitates clearer ablation studies and quantitative attribution of observed phenomena, strengthening reproducibility and interpretability in model comparison settings.
4. Distinguishing Features Compared to Human Video Datasets
Unlike benchmarks such as the UBC Fashion and ASOS datasets, which feature human model videos and thus demand primarily intra-domain pose/identity synthesis, MannequinVideos is distinct in that:
- All footage is of static mannequins, devoid of intrinsic facial or identity cues.
- Clothing and head-body articulation are mannequin-constrained, enforcing the need for explicit "humanization" within the generative framework.
- The dataset comprises multi-view and multi-pose conditions, especially testing the model’s ability to maintain temporal coherence and spatial alignment between the synthesized human head (reference-driven) and the clothing/body region as rotations and poses vary.
This multi-dimensional challenge is posed directly by MannequinVideos, making it indispensable for developing robust cross-domain video generation methods.
5. Metrics, Evaluation Protocols, and Empirical Support
MannequinVideos' controlled experimental protocol enables isolated, structured evaluation. In reported experiments, the following metrics are employed:
- Clothing consistency: PSNR, SSIM, LPIPS
- Identity preservation: Cosine similarity (CSIM) between generated and reference faces
- Video fidelity: Fréchet Video Distance (FVD), capturing temporal quality and realism
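As a concrete reference for these metrics, the sketch below shows how frame-averaged PSNR/SSIM and a cosine identity similarity (CSIM) could be computed with standard libraries. It assumes aligned ground-truth frames and an external face-embedding function (`embed_face` is a placeholder for any pretrained face recognizer), and omits LPIPS and FVD, which require pretrained perceptual and video networks.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def clothing_consistency(gen: np.ndarray, ref: np.ndarray):
    """Frame-averaged PSNR/SSIM between generated and reference clips.

    gen, ref: (T, H, W, 3) uint8 arrays with identical shapes.
    """
    psnr = np.mean([peak_signal_noise_ratio(r, g) for r, g in zip(ref, gen)])
    ssim = np.mean([
        structural_similarity(r, g, channel_axis=-1) for r, g in zip(ref, gen)
    ])
    return psnr, ssim

def identity_csim(gen_faces: np.ndarray, ref_face: np.ndarray, embed_face):
    """Mean cosine similarity between generated-face and reference-face embeddings.

    embed_face is a placeholder for any pretrained face encoder
    returning a 1-D embedding vector.
    """
    ref_emb = embed_face(ref_face)
    ref_emb = ref_emb / np.linalg.norm(ref_emb)
    sims = []
    for face in gen_faces:
        emb = embed_face(face)
        emb = emb / np.linalg.norm(emb)
        sims.append(float(np.dot(emb, ref_emb)))
    return float(np.mean(sims))
```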
Results show that M2HVideo achieves lower FVD values (indicating superior temporal coherence) and higher CSIM (indicating stronger identity preservation) than competing baselines when evaluated on MannequinVideos. The dynamic pose-aware head encoder enables effective facial identity transfer, while the mirror loss, applied in pixel space on all generated frames via a denoising diffusion implicit model (DDIM), consistently improves recovery of high-frequency facial details. The controlled capture setup mitigates confounds, so these gains are attributable directly to methodological advances rather than dataset noise or sample diversity.
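The exact formulation of the mirror loss is not reproduced here; the sketch below only illustrates the general mechanism the paragraph describes, namely estimating clean frames from noisy latents with a one-step DDIM prediction, decoding them to pixel space, and applying a reconstruction loss there. All function and variable names (`vae_decode`, `target_frames`) are assumptions, not the published implementation.

```python
import torch
import torch.nn.functional as F

def pixel_space_face_loss(noisy_latents, noise_pred, alphas_cumprod, t,
                          vae_decode, target_frames):
    """Illustrative pixel-space supervision after a one-step DDIM estimate.

    This mirrors only the general mechanism described in the text
    (supervising generated frames in pixel space via DDIM); it is not
    the M2HVideo mirror-loss formulation itself.
    """
    # One-step x0 prediction used by DDIM:
    # x0 = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    latents_x0 = (noisy_latents - torch.sqrt(1.0 - a_t) * noise_pred) / torch.sqrt(a_t)

    # Decode the estimated clean latents back to pixel space
    # (vae_decode is a placeholder for the model's latent-to-pixel decoder).
    frames_pred = vae_decode(latents_x0)

    # Supervise high-frequency detail directly in pixel space.
    return F.l1_loss(frames_pred, target_frames)
```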
6. Impact and Significance for the Field
The MannequinVideos dataset fulfills several roles within mannequin-to-human video synthesis research:
- Provides a rigorously controlled environment that decouples method performance from extrinsic variance.
- Establishes a cross-domain benchmark essential for evaluating real-world deployments where non-human clothing displays are prevalent (e.g., online fashion retailers, augmented try-on systems).
- Enables comprehensive quantitative and qualitative study of fidelity, consistency, and reconstruction under challenging translation tasks.
- Supports the reproducibility and generalizability of experimental findings within the M2HVideo framework and related future methodologies.
A plausible implication is that emerging methods building on mannequin-to-human transfer will increasingly rely on MannequinVideos, both as a deliberate generalization challenge and as a canonical controlled-comparison benchmark.