Virtual Try-Off (VTOFF) Overview

Updated 1 June 2026

Virtual Try-Off (VTOFF) is a computer vision task that generates flat-lay garment images from photographs of dressed individuals, disentangling garment appearance from deformations and occlusions.
Recent methods use diffusion models, transformer architectures, and unified frameworks for high-fidelity garment reconstruction, evaluated with metrics such as SSIM and FID.
Advances in VTOFF support product digitization, data augmentation, and retrieval tasks, while open challenges remain in fine detail, multi-garment setups, and instruction-guided editing.

Virtual Try-Off (VTOFF) is a computer vision and generative modeling task defined as the inverse of traditional Virtual Try-On (VTON). Instead of rendering a garment onto a target person (as in VTON), VTOFF reconstructs a canonical, catalog-style garment image from an in-the-wild photograph of a dressed person. The canonical output is typically a flat-lay or frontal product image, enabling high-fidelity product digitization, data augmentation, and downstream compositional and retrieval tasks. The core challenge in VTOFF lies in disentangling garment appearance from body-induced deformations, pose, occlusions, and background while consistently inferring unobserved details and preserving fine-grained textures. The field has rapidly evolved from GAN-based pipelines toward diffusion and transformer architectures, motivating a new taxonomy of problem definitions, model paradigms, data curation strategies, and evaluation metrics.

1. Problem Formalization and Taxonomy

Formally, VTOFF seeks a mapping from an observed image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ of a clothed person to a standardized garment image $\hat{\mathbf{G}} \in \mathbb{R}^{H \times W \times 3}$, ideally matching the unknown ground-truth flat-lay product image $\mathbf{G}$:

$\hat{\mathbf{G}} = F(\mathbf{I}, M, c)$

where $M$ is an optional garment segmentation mask and $c$ is an optional text prompt describing structural attributes. In practice, $F$ must recover geometric structure, high-frequency appearance, and semantic category from a single, often occluded, observation of a posed person.
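As a rough illustration of the interface implied by $\hat{\mathbf{G}} = F(\mathbf{I}, M, c)$, the sketch below wraps a generic generator behind a function with an optional mask and prompt. The names `try_off`, `Generator`, and `dummy_generator` are hypothetical and stand in for any trained model; nothing here is taken from a specific paper.

```python
from typing import Callable, Optional

import numpy as np

# Hypothetical generator signature: (image, mask, prompt) -> garment image.
Generator = Callable[[np.ndarray, np.ndarray, str], np.ndarray]


def try_off(
    image: np.ndarray,
    generator: Generator,
    mask: Optional[np.ndarray] = None,
    prompt: Optional[str] = None,
) -> np.ndarray:
    """Map a person image I (H, W, 3) to a garment estimate G_hat (H, W, 3)."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("image must have shape (H, W, 3)")
    h, w, _ = image.shape
    # M is optional: fall back to an all-ones mask (no localization).
    if mask is None:
        mask = np.ones((h, w), dtype=image.dtype)
    if mask.shape != (h, w):
        raise ValueError("mask must have shape (H, W)")
    # c is optional: fall back to an empty prompt.
    garment = generator(image, mask, prompt or "")
    if garment.shape != image.shape:
        raise ValueError("generator must return an (H, W, 3) image")
    return garment


def dummy_generator(image: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for a trained model: keeps only the masked pixels."""
    return image * mask[..., None]
```

Calling `try_off(image, dummy_generator)` with no mask or prompt exercises the "optional" part of the formulation; a real system would replace `dummy_generator` with a diffusion or transformer model.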

Primary VTOFF settings include:

Upper/lower/dress garment categories: Single- and multi-category (Lobba et al., 27 May 2025).
Multi-garment extension: Simultaneous reconstruction of several garments worn or layered (Lobba et al., 27 May 2025).
Conditional and instruction-driven editing: Generating not only the original but also an edited (e.g., color/pattern-altered) version, typically by integrating a natural language prompt (Sanguigni et al., 23 Mar 2026).

VTOFF contrasts with related tasks:

Classic VTON: person + (flat) garment → person-in-garment. VTOFF does the reverse (Velioglu et al., 2024, Truong et al., 9 Apr 2026).
Person-to-person try-on: Uses VTOFF as an intermediate step to move from a source wearer’s photo to a clean product representation, decoupling unwanted attribute transfer (Wang et al., 21 Jul 2025, Velioglu et al., 17 Apr 2025).
Model-free vs. model-based: Some VTOFF setups do not assume access to the flat-lay garment, or they rely on mask-free pipelines (Liu et al., 6 Aug 2025).

2. Core Methodological Paradigms

2.1. Diffusion-based VTOFF

Diffusion models (DDPMs, latent diffusion, flow-matching transformers) have become the prevailing backbone for VTOFF because they can recover both global structure and local detail:

Single-UNet latent diffusion: Adapts architectures such as Stable Diffusion v1.x by substituting text prompts with image-encoder (CLIP, SigLIP) embeddings as conditioning inputs (Velioglu et al., 2024). Conditioning tokens are injected into cross-attention of the denoising UNet.
Dual-UNet and Parallel UNet architectures: Employ an auxiliary "reference" UNet or a feature extractor to preserve high-frequency detail and geometric cues; these features are fused at multiple scales into the main (denoising) UNet (Zhu et al., 5 Jan 2026, Truong et al., 9 Apr 2026).
DiT and flow matching: Flow-based diffusion transformers (DiTs), e.g., OmniDiT (Zeng et al., 20 Mar 2026), Voost (Lee et al., 6 Aug 2025), and PROMO (Chen et al., 12 Mar 2026), model VTOFF and VTON jointly, learning a bidirectional relational mapping with a flow-matching loss or rectified ODE solvers. Conditioning is flexibly tokenized, with task tokens indicating try-on/try-off mode and garment category.
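The flow-matching objective mentioned for DiT-based models can be sketched as follows. This is a minimal NumPy illustration of the common rectified-flow (linear interpolation) formulation, not the training code of OmniDiT, Voost, or PROMO. `velocity_model` is a hypothetical callable, and conditioning on the person image and task tokens is omitted for brevity.

```python
import numpy as np


def flow_matching_loss(velocity_model, x0, x1, t):
    """Rectified-flow style loss for a batch.

    x0: noise samples, shape (B, D)
    x1: target (e.g. garment latent) samples, shape (B, D)
    t:  times in [0, 1], shape (B,)
    """
    t = t[:, None]
    # Straight-line path between noise and data.
    x_t = (1.0 - t) * x0 + t * x1
    # Along that path the velocity is constant.
    target_velocity = x1 - x0
    predicted = velocity_model(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))


def euler_sample(velocity_model, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_model(x, t)
    return x
```

In this convention `x0` is noise and `x1` is data; some papers use the opposite labeling, so the sign of the target velocity should be checked against whichever formulation is being reproduced.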

2.2. Bidirectional and Unified Frameworks

Recent research emphasizes joint modeling of VTON and VTOFF:

Unified dual-purpose backbones (e.g., Voost, OmniDiT, OMFA) encode both garment-to-person and person-to-garment transformations with minimal architectural changes, using mode tokens or partial diffusion masks (Lee et al., 6 Aug 2025, Zeng et al., 20 Mar 2026, Liu et al., 6 Aug 2025).
Partial diffusion: Dynamically applies noise and denoising to specific input regions, allowing a single model to flexibly perform try-on, try-off, or hybrid transformations without segmentation masks (Liu et al., 6 Aug 2025).
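One way to read "partial diffusion" is that the forward noising step is applied only to the region that should be generated, while the remaining region stays clean and acts as conditioning. The sketch below assumes a standard DDPM-style noising rule and a hypothetical binary `region` array; it is an illustration of the idea, not the OMFA implementation.

```python
import numpy as np


def partial_noise(x, region, alpha_bar_t, rng):
    """Noise only the entries where region == 1; keep the rest clean.

    x:           clean input, any shape
    region:      binary array broadcastable to x (1 = generate, 0 = keep)
    alpha_bar_t: cumulative signal level in (0, 1] at the chosen timestep
    rng:         a numpy.random.Generator
    """
    eps = rng.standard_normal(x.shape)
    noised = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    return region * noised + (1.0 - region) * x
```

If the person and garment images are laid out side by side on one canvas (an assumption made here for illustration), setting `region` to 1 over the garment half corresponds to try-off and setting it to 1 over the person half corresponds to try-on, with the same model handling both cases.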

2.3. Multimodal and Instruction-Guided VTOFF

Instruction-driven diffusion models unify VTOFF, VTON, and garment editing by integrating natural-language prompts:

Dress-EM, introduced alongside the Dress-ED dataset, encodes (image, instruction) pairs with a multimodal large language model (MLLM, e.g., InternVL-3.5); the fused visual-linguistic tokens guide the diffusion process (Sanguigni et al., 23 Mar 2026).
Alignment and structure losses further regularize the output for edit faithfulness in addition to appearance fidelity.

3. Conditioning, Feature Extraction, and Losses

Effective VTOFF solutions rely on expressive conditioning and specialized auxiliary objectives:

Visual encoding: SigLIP (Velioglu et al., 2024), CLIP, DINOv2 (Zeng et al., 20 Mar 2026, Lobba et al., 27 May 2025), and OpenCLIP encode the wearer’s input image to provide rich, spatially aligned feature maps.
Attention mechanisms: Hybrid attention blocks that fuse reference features into the main decoder at multiple scales are critical for transferring both semantic (category, class) and high-frequency (logo, print) detail (Zhu et al., 5 Jan 2026, Velioglu et al., 17 Apr 2025, Liu et al., 10 Mar 2026).
Mask and text guidance: Garment segmentation masks (precise or dilated) or category tokens (for mask-free methods) are used to localize and type-constrain generation (Truong et al., 9 Apr 2026, Lobba et al., 27 May 2025). Structured text prompts resolve ambiguous or occluded regions and guide structural modifications (Sanguigni et al., 23 Mar 2026).
Auxiliary losses: Perceptual similarity (LPIPS), region-specific alignment via VGG/DINO features, and explicit alignment losses on the predicted garment region ensure fidelity to reference structure and appearance (Lobba et al., 27 May 2025, Zeng et al., 20 Mar 2026, Liu et al., 10 Mar 2026).
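The region-specific alignment idea in the last item can be sketched as a masked cosine distance between feature maps. Feature extraction (VGG, DINO) is assumed to have happened already; `masked_alignment_loss` and its arguments are illustrative names, and the exact losses used in the cited papers may differ.

```python
import numpy as np


def masked_alignment_loss(pred_feats, ref_feats, mask, eps=1e-8):
    """Mean cosine distance between features inside the garment region.

    pred_feats, ref_feats: feature maps of shape (H, W, C)
    mask:                  garment region of shape (H, W), values in {0, 1}
    """
    dot = np.sum(pred_feats * ref_feats, axis=-1)
    norms = np.linalg.norm(pred_feats, axis=-1) * np.linalg.norm(ref_feats, axis=-1)
    cosine = dot / (norms + eps)
    distance = 1.0 - cosine
    # Average only over garment locations; eps guards against an empty mask.
    return float(np.sum(distance * mask) / (np.sum(mask) + eps))
```

Restricting the average to the mask keeps background and body features from dominating the loss, which is the motivation the text gives for region-specific alignment.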

4. Quantitative and Qualitative Evaluation

VTOFF models are benchmarked chiefly on paired datasets with catalog and wearer images: VITON-HD, DressCode, and Omni-TryOn (recent mega-dataset, >380k triplets) (Zeng et al., 20 Mar 2026).

Metrics:

Structural similarity: SSIM, MS-SSIM, CW-SSIM (higher is better).
Perceptual and distributional distance: LPIPS and DISTS (per-image), and FID, CLIP-FID, and KID (distribution-level); lower is better for all. DISTS is increasingly preferred for measuring correspondence of structure and texture (Velioglu et al., 2024, Truong et al., 9 Apr 2026).
Semantic identity: DINO-I cosine similarity; edit correctness in instruction-driven pipelines.
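For intuition about the structural metric, the sketch below computes SSIM from whole-image statistics. Standard SSIM implementations average the same expression over local (typically Gaussian-weighted) windows, so this simplified single-window version will not reproduce published numbers; `global_ssim` is an illustrative name.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (single-window simplification)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Identical images score 1, and the score drops as luminance, contrast, or structure diverge, which is why higher SSIM is read as better reconstruction.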

State-of-the-art examples (VITON-HD, DressCode, or Dress-ED):

AlignVTOFF on VITON-HD: SSIM 78.1, DISTS 21.6, LPIPS 24.7, FID 14.7, KID 4.4 (Zhu et al., 5 Jan 2026).
BridgeDiff on VITON-HD: FID 9.08, KID 1.53, SSIM 77.42, DISTS 18.69 (Liu et al., 10 Mar 2026).
TEMU-VTOFF on DressCode: FID 5.74, KID 0.65, LPIPS 31.46 (Lobba et al., 27 May 2025).
Dress-EM on Dress-ED (instruction-driven VTOFF): SSIM 0.88, LPIPS 0.206, DISTS 0.189, FID 5.06 (Sanguigni et al., 23 Mar 2026).

Qualitative inspection remains essential, particularly for assessing recovery under strong occlusion, texture detail reconstruction, fine category distinctions, and robustness to unusual poses and multi-garment wear.

5. Data, Scalability, and Practical Considerations

Large-scale, high-fidelity datasets are foundational:

Omni-TryOn (380k samples) and Dress-ED (146k quadruplets) incorporate large-scale fabric and pose diversity, editable variants, and natural-language edits, curated via multimodal LLM pipelines and inpainting/generation methods (Zeng et al., 20 Mar 2026, Sanguigni et al., 23 Mar 2026).
Data curation pipelines involve VLM-based filtering, garment identification, mask generation, and pose augmentation via DensePose/SMPL-X (Zeng et al., 20 Mar 2026, Liu et al., 6 Aug 2025).

Efficiency:

Single-UNet architectures (e.g., Re-CatVTON) approach dual-UNet performance at far lower computational overhead (≈1.3 s/img, 2.3 GB VRAM) (Na et al., 24 Nov 2025).
OMFA and EfficientVITON demonstrate real-time or near-real-time inference on commodity hardware, through architectural simplification and non-uniform denoising step schedules (Liu et al., 6 Aug 2025, Atef et al., 20 Jan 2025).
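A non-uniform denoising schedule can be illustrated by choosing sampling timesteps on a power-law grid rather than an even one. The function name, the exponent, and the choice to concentrate steps near low noise are assumptions made for illustration; the schedules actually used by OMFA or EfficientVITON may differ.

```python
def nonuniform_timesteps(num_train_steps: int, num_sample_steps: int, power: float = 2.0):
    """Return descending timesteps, spaced more densely near t = 0.

    power = 1.0 gives approximately uniform spacing; larger values
    concentrate more of the sampling budget at low noise levels.
    """
    if num_sample_steps < 2:
        raise ValueError("need at least two sampling steps")
    last = num_train_steps - 1
    steps = []
    for i in range(num_sample_steps):
        frac = i / (num_sample_steps - 1)  # runs from 0 to 1
        steps.append(round(last * frac**power))
    # Remove duplicates created by rounding, then run from noisy to clean.
    return sorted(set(steps), reverse=True)
```

With `nonuniform_timesteps(1000, 10)`, the first jumps from the noisiest timestep are large and the final steps near zero are small, so a fixed step budget is spent where fine detail is resolved.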

6. Limitations, Open Problems, and Future Directions

Despite rapid advances, VTOFF presents persistent challenges:

Fine detail/occlusion: Complete inference of complex textures, small logos, multi-layer details, or features obscured by pose remains non-trivial, especially for lower-body garments and accessories (Lobba et al., 27 May 2025, Liu et al., 10 Mar 2026).
Multi-garment and 3D: Methods extend to multi-garment settings (e.g., layered outfits) but joint garment factorization, interaction, or 3D shape recovery is still nascent (Liu et al., 10 Mar 2026, Lobba et al., 27 May 2025).
Instruction and generalization: True instruction-following (structure and appearance edits) currently depends on large, multimodally annotated datasets and transformer architectures. Robustness to in-the-wild photos and rare garment types is not yet universal (Sanguigni et al., 23 Mar 2026).
Cycle consistency and bidirectional optimization: Whether closed-loop training (person→garment→person) or consistency regularization can further tighten correspondence and reduce identity/attribute drift is an open question (Velioglu et al., 2024, Zeng et al., 20 Mar 2026).
Metric alignment: Ongoing need for perceptually and semantically aligned evaluation metrics, possibly learned or user-study-calibrated, beyond DISTS and LPIPS (Velioglu et al., 2024).
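The closed-loop idea raised under cycle consistency can be written down as a simple reconstruction penalty. This is a sketch of one possible formulation, with `try_off` and `try_on` as hypothetical callables standing in for trained models and `agnostic` as an assumed precomputed person image with the garment region removed. Whether such a term actually helps is, as noted above, an open question.

```python
import numpy as np


def cycle_consistency_loss(person, agnostic, try_off, try_on):
    """L1 penalty on the person -> garment -> person round trip.

    person:   original person image, shape (H, W, 3)
    agnostic: same person with the garment region removed (assumed given)
    try_off:  callable mapping a person image to a garment image
    try_on:   callable mapping (agnostic person, garment) to a person image
    """
    garment = try_off(person)
    # Re-dress the garment-free person with the garment just extracted.
    reconstructed = try_on(agnostic, garment)
    return float(np.mean(np.abs(reconstructed - person)))
```

Passing the garment-agnostic image rather than the original person to `try_on` avoids the trivial solution in which the try-on model simply copies its input.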

Potential research avenues include integrating adversarial or hybrid GAN+diffusion losses for sharper detail, task-specific prompt tuning, multi-view garment reconstruction, and interactive personalization modules. Deploying VTOFF at scale will require advances in inference speed, domain robustness, and automated data curation.

Selected Comparative Table: Leading VTOFF Models and Benchmarks

Model	Backbone/Cond.	FID ↓	DISTS ↓	LPIPS ↓	SSIM ↑	Dataset
AlignVTOFF (Zhu et al., 5 Jan 2026)	Parallel U-Net, CLIP	14.7	21.6	24.7	78.1	VITON-HD
BridgeDiff (Liu et al., 10 Mar 2026)	Dual-Stage, OCLIP	9.08	18.69	24.38	77.4	VITON-HD
TEMU-VTOFF (Lobba et al., 27 May 2025)	Dual DiT, CLIP+TXT	5.74	—	31.46	—	DressCode
TryOffDiff (Velioglu et al., 2024)	SD1.4, SigLIP	25.1	23.0	32.4	79.5	VITON-HD
Dress-EM (Sanguigni et al., 23 Mar 2026)	DiT, MLLM	5.06	0.189	0.206	0.88	Dress-ED

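All results are as reported in each respective work. SSIM, DISTS, and LPIPS are given in each paper's own scale (some report values multiplied by 100, others in the [0, 1] range), so rows are not directly comparable without rescaling, and results on different datasets are not comparable at all; consult the original publications for metric normalization.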
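When rows mix reporting scales, one pragmatic option is to rescale the bounded metrics to a common range before tabulating. The helper below assumes that SSIM, LPIPS, and DISTS values greater than 1 were reported multiplied by 100; that is a heuristic guess about reporting conventions rather than something stated in the cited papers, so the original publications remain the authority.

```python
# Metrics that typically lie in [0, 1] in their native definition.
BOUNDED_METRICS = {"SSIM", "LPIPS", "DISTS"}


def to_unit_scale(metric: str, value: float) -> float:
    """Rescale a bounded metric to [0, 1] if it looks like a x100 value.

    Assumption: values above 1 for bounded metrics were reported x100.
    Other metrics such as FID and KID are returned unchanged.
    """
    if metric.upper() in BOUNDED_METRICS and value > 1.0:
        return value / 100.0
    return value


# Two rows copied from the table above, in their reported scales.
rows = {
    "AlignVTOFF": {"SSIM": 78.1, "LPIPS": 24.7, "DISTS": 21.6, "FID": 14.7},
    "Dress-EM": {"SSIM": 0.88, "LPIPS": 0.206, "DISTS": 0.189, "FID": 5.06},
}
normalized = {
    model: {name: to_unit_scale(name, v) for name, v in metrics.items()}
    for model, metrics in rows.items()
}
```

Even after rescaling, these two rows come from different datasets (VITON-HD and Dress-ED), so the normalized values only make the scales legible; they do not rank the models.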

7. References

(Velioglu et al., 2024) TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
(Zhu et al., 5 Jan 2026) AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off
(Liu et al., 10 Mar 2026) BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off
(Truong et al., 9 Apr 2026) What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction
(Liu et al., 6 Aug 2025) One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose
(Lee et al., 6 Aug 2025) Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
(Lobba et al., 27 May 2025) Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
(Na et al., 24 Nov 2025) Rethinking Garment Conditioning in Diffusion-based Virtual Try-On
(Zeng et al., 20 Mar 2026) OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
(Sanguigni et al., 23 Mar 2026) Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
(Atef et al., 20 Jan 2025) EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process