OpenVTON-Bench: High-Res VTON Benchmark

Updated 6 February 2026
  • OpenVTON-Bench is a large-scale, high-resolution benchmark that defines a new standard for evaluating controllable Virtual Try-On systems.
  • It features a comprehensive multi-modal evaluation protocol using VLM-based scores and multi-scale representation metrics to assess garment fidelity and texture detail.
  • The benchmark leverages hybrid human–AI annotations across 20 balanced garment categories and reproducible data splits to drive robust VTON research.

OpenVTON-Bench is a large-scale, high-resolution benchmark designed for the evaluation of controllable Virtual Try-On (VTON) systems, addressing persistent limitations in existing datasets and metrics by emphasizing semantic rigor, fine-grained detail, and methodological reproducibility. It comprises approximately 100,000 paired samples of garment and person images, annotated via hybrid human–AI protocols and semantically balanced across 20 garment categories, alongside a multi-modal evaluation suite that quantifies VTON quality on interpretable dimensions using both structural and semantic measures (Li et al., 30 Jan 2026).

1. Dataset Construction and Structure

OpenVTON-Bench contains $N = 99{,}925$ image pairs, each consisting of a high-quality garment image $I_g$ and its corresponding person image $I_{gt}$. Images are constrained to high-fidelity resolutions:

$$1024 \leq \min(H, W) \leq \max(H, W) \leq 1536$$

This range enables the evaluation of fine-grained pattern and texture fidelity that is critical for commercial VTON applications.
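
As a minimal sketch of this constraint (not part of the released tooling), candidate images could be filtered to the stated resolution band with a simple check; the directory name below is a placeholder:

```python
from pathlib import Path

from PIL import Image

def in_resolution_band(path, lo=1024, hi=1536):
    """Return True if 1024 <= min(H, W) <= max(H, W) <= 1536 for the image at `path`."""
    with Image.open(path) as img:
        w, h = img.size
    return lo <= min(h, w) and max(h, w) <= hi

# Hypothetical usage: keep only candidates inside the band.
kept = [p for p in Path("raw_images").glob("*.jpg") if in_resolution_band(p)]
```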

Hybrid Annotation and Captioning

Data collection began from ≈3 million web-scale and open-source images. Human annotators performed pair verification, discarding no-match, occluded, or low-quality samples, reducing the candidate set to ≈300,000 images post-filtering. Gemini-2.0-Flash was then applied for deterministic, dense captioning:

  • Coarse garment classification (upper vs. lower body) via prompt engineering
  • Category-aware structured prompts for extracting garment structure (e.g., sleeve length), texture (fabric/pattern), and design details (logos, pockets, embroidery); a prompt sketch follows this list
  • Resulting in over 3 million words of dense, unambiguous garment descriptions
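
As an illustration of such a category-aware structured prompt, a minimal sketch is shown below; the wording and field names are assumptions, not the prompts actually used with Gemini-2.0-Flash:

```python
def build_caption_prompt(coarse_category: str) -> str:
    """Assemble a category-aware structured captioning prompt (hypothetical wording)."""
    return (
        f"You are annotating a {coarse_category} garment image.\n"
        "Describe the garment in three labelled fields:\n"
        "1. structure (e.g., sleeve length, neckline, closure type)\n"
        "2. texture (fabric and pattern)\n"
        "3. design details (logos, pockets, embroidery)\n"
        "Be specific and unambiguous; do not describe the person or background."
    )

prompt = build_caption_prompt("upper-body")  # fed to the captioning VLM
```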

Semantic-Aware Balancing

DINOv3-generated image embeddings underpin hierarchical clustering:

$$\mathbf{e}_i = \mathrm{DINOv3}(I_g^i), \quad i = 1, \ldots, 300{,}000$$

Clusters (20 in total) correspond to fine-grained garment types (e.g., Cropped Knit Tops, Pleated Skirts). Stratified sampling ensures balanced representation ($n_c \approx 4{,}996$ per category, $\sum_c n_c = 99{,}925$), ameliorating common-class bias and the under-sampling of rare patterns.
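
A minimal sketch of this balancing step, assuming precomputed embeddings (the toy random array below stands in for actual DINOv3 features) and scikit-learn's agglomerative clustering:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Stand-in for DINOv3 embeddings e_i of the garment images, shape (num_images, dim).
embeddings = rng.standard_normal((3000, 768)).astype(np.float32)

# Hierarchical clustering into 20 fine-grained garment groups.
labels = AgglomerativeClustering(n_clusters=20).fit_predict(embeddings)

# Stratified sampling: draw roughly the same number of images from every cluster.
per_cluster = 50  # the benchmark targets n_c ~ 4,996 per category
selected = np.concatenate([
    rng.choice(np.flatnonzero(labels == c),
               size=min(per_cluster, int((labels == c).sum())),
               replace=False)
    for c in range(20)
])
```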

Data Splitting

Train, validation, and test sets are split 50%/25%/25% with near-identical category distributions. An overview:

Category              Count (Test)
Crew Neck T-shirts    5,008
Button-Front Coats    4,983
Wide-Leg Pants        5,012
Pleated Skirts        5,005
...                   ...
A-Line Dresses        4,998
Total                 49,962

Test set ≈50K pairs; proportions are preserved across all splits.
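
A sketch of the stratified 50%/25%/25% split described above, assuming per-pair category labels such as the cluster ids from the previous sketch (toy data here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.repeat(np.arange(20), 50)   # toy per-pair category ids
indices = np.arange(len(labels))

# 50% train, then the remainder split evenly into validation and test,
# stratifying on category so the distributions stay near-identical.
train_idx, rest_idx = train_test_split(indices, train_size=0.5,
                                       stratify=labels, random_state=0)
val_idx, test_idx = train_test_split(rest_idx, test_size=0.5,
                                     stratify=labels[rest_idx], random_state=0)
```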

2. Multi-Modal Evaluation Protocol

For each sample:

  • $I_p$: Cloth-agnostic (masked) person input
  • $I_g$: Clean garment image
  • $I_{gt}$: Ground-truth try-on
  • $\hat{I} = G(I_p, I_g)$: Model output

Five Quality Dimensions (VLM-Based)

Using Qwen-VL-Plus as a VLM-based semantic judge, every sample receives scores $s_{bg}$, $s_{id}$, $s_{tex}$, $s_{shape}$, and $s_{real}$ in $[1, 5]$:

  • $s_{bg}$: Background consistency (non-edited regions remain unaltered)
  • $s_{id}$: Identity fidelity (face, skin tone, body structure)
  • $s_{tex}$: Texture fidelity (pattern, logo, fabric transfer)
  • $s_{shape}$: Shape plausibility (garment geometry, fit)
  • $s_{real}$: Overall realism (natural appearance, lighting, shadows)

Formally,

$$\mathbf{s} = [s_{bg}, s_{id}, s_{tex}, s_{shape}, s_{real}] = \mathcal{V}(I_g, I_{gt}, \hat{I}; \mathcal{T}) \in [1, 5]^5$$
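
A sketch of how such scores might be collected and parsed; `query_vlm` is a hypothetical wrapper around whichever VLM serves as the judge (Qwen-VL-Plus in the benchmark), and the rubric wording is an assumption:

```python
import json
import re

DIMENSIONS = ["background", "identity", "texture", "shape", "realism"]

RUBRIC = (
    "Given the garment image, the ground-truth try-on, and the generated try-on, "
    "rate the generated image from 1 to 5 on background consistency, identity fidelity, "
    "texture fidelity, shape plausibility, and overall realism. Answer as JSON, e.g. "
    '{"background": 4, "identity": 5, "texture": 3, "shape": 4, "realism": 4}.'
)

def score_sample(query_vlm, garment_img, gt_img, gen_img):
    """Return [s_bg, s_id, s_tex, s_shape, s_real] parsed from the judge's answer."""
    answer = query_vlm(images=[garment_img, gt_img, gen_img], prompt=RUBRIC)
    payload = re.search(r"\{.*\}", answer, flags=re.DOTALL).group(0)
    scores = json.loads(payload)
    return [float(scores[d]) for d in DIMENSIONS]
```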

Multi-Scale Representation Metric

To decouple boundary alignment from texture artifacts, a multi-scale approach combines SAM3 segmentation and progressive morphological erosion (a Python sketch follows the steps below):

  1. Binary masks:

$$M_{gt} = \mathrm{SAM}(I_{gt}), \quad \hat{M} = \mathrm{SAM}(\hat{I})$$

  2. Iterative erosion with a $3 \times 3$ structuring element $B$:

$$M_*^{(k)} = M_* \ominus (B \oplus \dots \oplus B)_{k\ \text{times}}$$

($k = 0, 1, 2, 3$; $k = 0$ retains the full mask)

  3. Cosine similarity in DINOv3 feature space:

$$S_{\mathrm{rep}}^{(k)} = \frac{\Phi(\hat{I} \odot \hat{M}^{(k)})^\top \, \Phi(I_{gt} \odot M_{gt}^{(k)})}{\|\Phi(\hat{I} \odot \hat{M}^{(k)})\|_2 \, \|\Phi(I_{gt} \odot M_{gt}^{(k)})\|_2}$$

  4. Final garment fidelity:

$$\bar{S}_{\mathrm{rep}} = \frac{1}{4} \sum_{k=0}^{3} S_{\mathrm{rep}}^{(k)}$$
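
A minimal sketch of the four steps above, assuming precomputed SAM masks and a callable `phi` standing in for the DINOv3 feature extractor (both are placeholders, not the benchmark's released code):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def multiscale_rep_score(phi, gen_img, gt_img, gen_mask, gt_mask, ks=(0, 1, 2, 3)):
    """Mean cosine similarity of masked features over progressively eroded garment masks.

    phi:     callable mapping an HxWx3 image to a 1-D feature vector (DINOv3 stand-in)
    *_img:   HxWx3 arrays; *_mask: HxW boolean garment masks (e.g., SAM outputs)
    """
    struct = np.ones((3, 3), dtype=bool)  # 3x3 structuring element B
    sims = []
    for k in ks:
        m_gen = binary_erosion(gen_mask, struct, iterations=k) if k > 0 else gen_mask
        m_gt = binary_erosion(gt_mask, struct, iterations=k) if k > 0 else gt_mask
        f_gen = phi(gen_img * m_gen[..., None])  # zero out everything outside the eroded mask
        f_gt = phi(gt_img * m_gt[..., None])
        sims.append(float(np.dot(f_gen, f_gt) /
                          (np.linalg.norm(f_gen) * np.linalg.norm(f_gt) + 1e-8)))
    return float(np.mean(sims))  # \bar{S}_rep: average over k = 0..3
```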

Auxiliary Pixel-Based Metrics

PSNR, SSIM, LPIPS, and FID are reported for backward compatibility but are known to underrepresent semantic errors and fine texture fidelity.
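
For completeness, a sketch of the per-pair pixel metrics using scikit-image and the lpips package (FID is a set-level statistic and is omitted here):

```python
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips_net = lpips.LPIPS(net="alex")  # loaded once; expects tensors in [-1, 1]

def pixel_metrics(gen_img: np.ndarray, gt_img: np.ndarray):
    """PSNR / SSIM / LPIPS for a single pair of uint8 HxWx3 images."""
    psnr = peak_signal_noise_ratio(gt_img, gen_img, data_range=255)
    ssim = structural_similarity(gt_img, gen_img, channel_axis=-1, data_range=255)
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = _lpips_net(to_tensor(gen_img), to_tensor(gt_img)).item()
    return psnr, ssim, lp
```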

3. Benchmarking and Correlation with Human Judgments

A human annotation study (76 raters, ≈90K judgments) established robust correspondence between OpenVTON-Bench metrics and subjective quality:

Metric                                   Spearman $\rho_s$   Kendall $\rho_k$   Pearson $\rho_p$
Avg. VLM score $s_{avg}$                 0.850               0.722              0.828
Representation $\bar{S}_{\mathrm{rep}}$  0.933               0.833              0.701
PSNR                                     0.767               0.611              0.819
SSIM                                     0.767               0.611              0.801
-LPIPS                                   0.833               0.667              0.782
-FID                                     0.867               0.722              0.588

(LPIPS and FID are negated so that higher values indicate better quality in every row.)

The multi-scale representation metric demonstrates superior agreement with human perception (Kendall's $\rho_k = 0.833$ vs. $0.611$ for SSIM). VLM-based scores show strong reliability ($\rho_k = 0.722$), supporting their role as a substitute for manual annotation.
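
The agreement statistics in the table can be reproduced for any metric with scipy; the arrays below are placeholders, not the benchmark's ratings:

```python
import numpy as np
from scipy import stats

metric_scores = np.array([0.81, 0.74, 0.69, 0.88, 0.77])  # hypothetical automatic metric values
human_scores = np.array([4.2, 3.6, 3.1, 4.5, 3.9])         # hypothetical mean human ratings

rho_s, _ = stats.spearmanr(metric_scores, human_scores)   # Spearman rho_s
rho_k, _ = stats.kendalltau(metric_scores, human_scores)   # Kendall rho_k (Kendall's tau)
rho_p, _ = stats.pearsonr(metric_scores, human_scores)     # Pearson rho_p
```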

Figure 1 of the reference illustrates model-level qualitative differences: diffusion-based VTON models typically excel in overall realism but often fall short in reproducing intricate textures (e.g., logos, embroidery), a gap the new metrics highlight decisively.

4. Best Practices and Limitations

Optimal use of OpenVTON-Bench for research entails:

  • Reporting both semantic (VLM) and structural (representation) scores jointly with pixelwise metrics
  • Per-dimension error analysis (e.g., distinguishing texture failure from alignment overfitting)
  • Leveraging the category-balanced splits for robust generalization assessment
  • Utilizing Gemini dense captions for text-conditioned ablation or editing studies

Key limitations include potential semantic and segmentation errors due to upstream model biases (Gemini, DINOv3, SAM3), under-representation of rare occlusion/extreme pose cases, and reliance on a fixed $3 \times 3$ mask erosion kernel.

5. Applications and Future Directions

OpenVTON-Bench offers a reproducible, extensible platform for evaluation in:

  • High-fidelity VTON model development and benchmarking
  • Text-guided and prompt-based VTON systems using three-million-word detailed captions
  • Investigations into boundary/texture error tradeoffs and cross-category model robustness

Future directions proposed include integration of new foundation models (e.g., InternVL3, LLaVA) to mitigate model-specific biases, extension to video-based and multi-layer VTON scenarios (temporal consistency, layered outfits), and enhanced automated annotation pipelines for increased garment topological diversity.

Comprehensive data, annotation details, and evaluation code are publicly available at https://github.com/OpenVTON/OpenVTON-Bench (Li et al., 30 Jan 2026). This resource establishes a rigorous, high-resolution standard for future research in virtual try-on evaluation.
