OpenVTON-Bench: High-Res VTON Benchmark
- OpenVTON-Bench is a large-scale, high-resolution benchmark that defines a new standard for evaluating controllable Virtual Try-On systems.
- It features a comprehensive multi-modal evaluation protocol using VLM-based scores and multi-scale representation metrics to assess garment fidelity and texture detail.
- The benchmark leverages hybrid human–AI annotations across 20 balanced garment categories and reproducible data splits to drive robust VTON research.
OpenVTON-Bench is a large-scale, high-resolution benchmark designed for the evaluation of controllable Virtual Try-On (VTON) systems, addressing persistent limitations in existing datasets and metrics by emphasizing semantic rigor, fine-grained detail, and methodological reproducibility. It comprises approximately 100,000 paired samples of garment and person images, annotated via hybrid human–AI protocols and semantically balanced across 20 garment categories, alongside a multi-modal evaluation suite that quantifies VTON quality on interpretable dimensions using both structural and semantic measures (Li et al., 30 Jan 2026).
1. Dataset Construction and Structure
OpenVTON-Bench contains approximately 100,000 image pairs, each consisting of a high-quality garment image and its corresponding person image. All images are constrained to high-fidelity resolutions, enabling evaluation of the fine-grained pattern and texture detail that is critical for commercial VTON applications.
Hybrid Annotation and Captioning
Data collection began from ≈3 million web-scale and open-source images. Human annotators performed pair verification, discarding no-match, occluded, or low-quality samples, reducing the candidate set to ≈300,000 images post-filtering. Gemini-2.0-Flash was then applied for deterministic, dense captioning:
- Coarse garment classification (upper vs. lower body) via prompt engineering
- Category-aware structured prompts for extracting garment structure (e.g., sleeve length), texture (fabric/pattern), and design details (logos, pockets, embroidery)
- Resulting in over 3 million words of dense, unambiguous garment descriptions (see the captioning sketch below)
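The benchmark's exact prompts are not reproduced in this summary; as a rough illustration, a category-aware structured captioning call to Gemini-2.0-Flash might look like the sketch below. The `google-generativeai` client usage is standard, but the prompt wording and output fields are assumptions.

```python
# Hypothetical sketch of the dense-captioning step described above.
# Assumes the google-generativeai package and a valid API key; the prompt
# text and JSON fields are illustrative, not the benchmark's actual prompts.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

CAPTION_PROMPT = (
    "You are annotating a garment image for a virtual try-on dataset.\n"
    "1. Classify the garment as upper-body or lower-body.\n"
    "2. Describe its structure (e.g., sleeve length, neckline, closure).\n"
    "3. Describe its texture (fabric, pattern).\n"
    "4. List design details (logos, pockets, embroidery).\n"
    "Answer as compact JSON with keys: category, structure, texture, details."
)

def caption_garment(image_path: str) -> str:
    """Return a dense, structured caption for one garment image."""
    image = Image.open(image_path)
    response = model.generate_content(
        [CAPTION_PROMPT, image],
        # Temperature 0 for (near-)deterministic captions, as described above.
        generation_config=genai.GenerationConfig(temperature=0.0),
    )
    return response.text
```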
Semantic-Aware Balancing
DINOv3-generated image embeddings underpin hierarchical clustering of the garments. The resulting 20 clusters correspond to fine-grained garment types (e.g., Cropped Knit Tops, Pleated Skirts). Stratified sampling over these clusters ensures balanced per-category representation, ameliorating common-class bias and under-sampling of rare patterns.
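A minimal sketch of this balancing step is given below, assuming garment embeddings have already been extracted with a DINOv3 image encoder; the per-category sample budget is an illustrative placeholder rather than the benchmark's published value.

```python
# Sketch of semantic-aware balancing: hierarchical clustering of garment
# embeddings into 20 fine-grained categories, then stratified sub-sampling.
# `embeddings` (N x D) are assumed precomputed with a DINOv3 image encoder.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def balance_by_cluster(embeddings: np.ndarray,
                       n_categories: int = 20,
                       per_category: int = 5000,  # illustrative budget
                       seed: int = 0) -> np.ndarray:
    """Return indices of a category-balanced subset of the candidate pool."""
    clusterer = AgglomerativeClustering(n_clusters=n_categories, linkage="ward")
    labels = clusterer.fit_predict(embeddings)

    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_categories):
        members = np.flatnonzero(labels == c)
        take = min(per_category, members.size)  # guard against small clusters
        keep.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(keep)
```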
Data Splitting
Train, validation, and test sets are split 50%/25%/25% with near-identical category distributions. An overview of the test-set category counts:
| Category | Count (Test) |
|---|---|
| Crew Neck T-shirts | 5,008 |
| Button-Front Coats | 4,983 |
| Wide-Leg Pants | 5,012 |
| Pleated Skirts | 5,005 |
| ... | ... |
| A-Line Dresses | 4,998 |
| Total | 49,962 |
Test set ≈50K pairs; proportions are preserved across all splits.
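As a rough illustration of how such category-preserving splits can be produced, the sketch below applies scikit-learn's stratified splitting to the cluster labels; the helper function and its arguments are assumptions, not the released pipeline.

```python
# Sketch of a category-stratified 50/25/25 split, assuming each pair carries a
# garment-category label from the clustering step, so that proportions of all
# 20 categories are preserved in every split.
import numpy as np
from sklearn.model_selection import train_test_split

def split_dataset(indices: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Return (train_idx, val_idx, test_idx) with stratified 50/25/25 ratios."""
    train_idx, rest_idx, _, rest_labels = train_test_split(
        indices, labels, test_size=0.5, stratify=labels, random_state=seed
    )
    val_idx, test_idx = train_test_split(
        rest_idx, test_size=0.5, stratify=rest_labels, random_state=seed
    )
    return train_idx, val_idx, test_idx
```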
2. Multi-Modal Evaluation Protocol
For each sample:
- $I_{\mathrm{agn}}$: Cloth-agnostic (masked) person input
- $I_{g}$: Clean garment image
- $I_{\mathrm{gt}}$: Ground-truth try-on
- $I_{\mathrm{out}}$: Model output
Five Quality Dimensions (VLM-Based)
Using Qwen-VL-Plus as a semantic judge (VLM-as-judge), every sample receives five scores $s_{\mathrm{bg}}, s_{\mathrm{id}}, s_{\mathrm{tex}}, s_{\mathrm{shape}}, s_{\mathrm{real}}$, each in $[1, 5]$ (a judging-call sketch follows this list):
- $s_{\mathrm{bg}}$: Background consistency (unaltered non-edited regions)
- $s_{\mathrm{id}}$: Identity fidelity (face, skin tone, body structure)
- $s_{\mathrm{tex}}$: Texture fidelity (pattern, logo, fabric transfer)
- $s_{\mathrm{shape}}$: Shape plausibility (garment geometry, fit)
- $s_{\mathrm{real}}$: Overall realism (natural appearance, lighting, shadows)
Formally, the aggregate VLM score is the mean of the five dimension scores:
$$S_{\mathrm{VLM}} = \tfrac{1}{5}\left(s_{\mathrm{bg}} + s_{\mathrm{id}} + s_{\mathrm{tex}} + s_{\mathrm{shape}} + s_{\mathrm{real}}\right)$$
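The benchmark's judging prompts are not reproduced here. As a sketch, scoring a single dimension through an OpenAI-compatible chat endpoint serving Qwen-VL-Plus might look like the following; the endpoint URL, prompt wording, and score parsing are assumptions, not the released evaluation code.

```python
# Hypothetical sketch of VLM-as-judge scoring for a single quality dimension.
# Assumes an OpenAI-compatible endpoint serving Qwen-VL-Plus; base_url, prompt,
# and parsing are illustrative assumptions.
import base64
import re
from openai import OpenAI

client = OpenAI(base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="YOUR_API_KEY")

def _data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def judge_dimension(output_path: str, gt_path: str, dimension: str) -> int:
    """Ask the VLM judge for a 1-5 score on one quality dimension."""
    prompt = (f"Compare the generated try-on (first image) to the ground truth "
              f"(second image). Rate {dimension} on an integer scale from 1 (worst) "
              f"to 5 (best). Reply with the number only.")
    response = client.chat.completions.create(
        model="qwen-vl-plus",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": _data_url(output_path)}},
            {"type": "image_url", "image_url": {"url": _data_url(gt_path)}},
        ]}],
        temperature=0.0,
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 0
```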
Multi-Scale Representation Metric
To decouple boundary alignment from texture artifacts, a multi-scale approach combines SAM3 segmentation with progressive morphological erosion (a computational sketch follows the list):
- Binary garment masks $M_{\mathrm{gt}}$ and $M_{\mathrm{out}}$ are extracted with SAM3 from the ground truth $I_{\mathrm{gt}}$ and the model output $I_{\mathrm{out}}$
- Iterative erosion with a structuring element $B$: $M^{(k)} = M^{(k-1)} \ominus B$ for $k = 1, \dots, K$ (where $M^{(0)}$ retains the full mask)
- At each scale $k$, cosine similarity is computed between DINOv3 features of the masked regions: $s_k = \cos\big(f(I_{\mathrm{gt}} \odot M_{\mathrm{gt}}^{(k)}),\ f(I_{\mathrm{out}} \odot M_{\mathrm{out}}^{(k)})\big)$
- Final garment fidelity is the average over scales: $S_{\mathrm{rep}} = \frac{1}{K+1} \sum_{k=0}^{K} s_k$
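A minimal computational sketch of this metric is shown below. The `encode` callable stands in for a DINOv3 image encoder returning one feature vector per masked image, and the masks are assumed to come from SAM3; the kernel size and number of scales are placeholders.

```python
# Sketch of the multi-scale garment-fidelity metric described above.
# `encode` stands in for a DINOv3 image encoder; masks would come from SAM3.
import numpy as np
from scipy.ndimage import binary_erosion

def masked_feature(image: np.ndarray, mask: np.ndarray, encode) -> np.ndarray:
    """Embed the image with everything outside the mask zeroed out."""
    return encode(image * mask[..., None])

def garment_fidelity(img_gt, img_out, mask_gt, mask_out, encode,
                     num_scales: int = 4, kernel: int = 15) -> float:
    """Average cosine similarity in feature space over progressively eroded masks."""
    structure = np.ones((kernel, kernel), dtype=bool)  # fixed erosion kernel
    sims = []
    for _ in range(num_scales + 1):  # scale 0 uses the full mask
        f_gt = masked_feature(img_gt, mask_gt, encode)
        f_out = masked_feature(img_out, mask_out, encode)
        sims.append(float(np.dot(f_gt, f_out) /
                          (np.linalg.norm(f_gt) * np.linalg.norm(f_out) + 1e-8)))
        # Shrink both masks before the next scale to discount boundary pixels.
        mask_gt = binary_erosion(mask_gt, structure=structure)
        mask_out = binary_erosion(mask_out, structure=structure)
    return float(np.mean(sims))
```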
Auxiliary Pixel-Based Metrics
PSNR, SSIM, LPIPS, and FID are reported for backward compatibility but are known to underrepresent semantic errors and fine texture fidelity.
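For reference, these auxiliary metrics can be computed with standard open-source packages; the sketch below uses scikit-image for PSNR/SSIM and the `lpips` package, assuming aligned uint8 image arrays. FID, being a set-level statistic, would be computed separately (e.g., with torchmetrics).

```python
# Sketch of the auxiliary pixel-based metrics using common open-source packages.
# Inputs are assumed to be aligned HxWx3 uint8 arrays.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips_net = lpips.LPIPS(net="alex")  # perceptual distance network

def pixel_metrics(img_gt: np.ndarray, img_out: np.ndarray) -> dict:
    psnr = peak_signal_noise_ratio(img_gt, img_out, data_range=255)
    ssim = structural_similarity(img_gt, img_out, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = _lpips_net(to_tensor(img_gt), to_tensor(img_out)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```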
3. Benchmarking and Correlation with Human Judgments
A human annotation study (76 raters, ≈90K judgments) established robust correspondence between OpenVTON-Bench metrics and subjective quality:
| Metric | Spearman $\rho$ | Kendall $\tau$ | Pearson $r$ |
|---|---|---|---|
| Avg. VLM score | 0.850 | 0.722 | 0.828 |
| Representation | 0.933 | 0.833 | 0.701 |
| PSNR | 0.767 | 0.611 | 0.819 |
| SSIM | 0.767 | 0.611 | 0.801 |
| -LPIPS | 0.833 | 0.667 | 0.782 |
| -FID | 0.867 | 0.722 | 0.588 |
The multi-scale representation metric demonstrates superior agreement with human perception (Kendall's $\tau = 0.833$ vs. $0.611$ for SSIM). VLM-based scores also show strong reliability (Spearman $\rho = 0.850$), supporting their role as a substitute for manual annotation.
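These agreement statistics can be reproduced from paired metric scores and human ratings with standard correlation routines; a minimal sketch, assuming the two score lists are already aligned:

```python
# Minimal sketch of the metric-vs-human agreement analysis, assuming
# `metric_scores` and `human_scores` are paired lists (e.g., per model or per sample).
from scipy.stats import spearmanr, kendalltau, pearsonr

def agreement(metric_scores, human_scores) -> dict:
    # Index [0] extracts the correlation coefficient from each result object.
    return {
        "spearman": spearmanr(metric_scores, human_scores)[0],
        "kendall": kendalltau(metric_scores, human_scores)[0],
        "pearson": pearsonr(metric_scores, human_scores)[0],
    }
```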
Figure 1 of the reference illustrates model-level qualitative differences: diffusion-based VTON models typically excel in overall realism but often fall short in reproducing intricate textures (e.g., logos, embroidery), a gap the new metrics highlight decisively.
4. Best Practices and Limitations
Optimal use of OpenVTON-Bench for research entails:
- Reporting both semantic (VLM) and structural (representation) scores jointly with pixelwise metrics
- Per-dimension error analysis (e.g., distinguishing texture failure from alignment overfitting)
- Leveraging the category-balanced splits for robust generalization assessment
- Utilizing Gemini dense captions for text-conditioned ablation or editing studies
Key limitations include potential semantic and segmentation errors due to upstream model biases (Gemini, DINOv3, SAM3), under-representation of rare occlusion/extreme pose cases, and reliance on a fixed mask erosion kernel.
5. Applications and Future Directions
OpenVTON-Bench offers a reproducible, extensible platform for evaluation in:
- High-fidelity VTON model development and benchmarking
- Text-guided and prompt-based VTON systems using three-million-word detailed captions
- Investigations into boundary/texture error tradeoffs and cross-category model robustness
Future directions proposed include integration of new foundation models (e.g., InternVL3, LLaVA) to mitigate model-specific biases, extension to video-based and multi-layer VTON scenarios (temporal consistency, layered outfits), and enhanced automated annotation pipelines for increased garment topological diversity.
Comprehensive data, annotation details, and evaluation code are publicly available at https://github.com/OpenVTON/OpenVTON-Bench (Li et al., 30 Jan 2026). This resource establishes a rigorous, high-resolution standard for future research in virtual try-on evaluation.