Virtual Try-Off: Canonical Garment Imaging

Updated 25 November 2025
  • Virtual Try-Off is a computational imaging task that extracts standardized, catalog-style garment images from photographs to enable precise evaluations.
  • It leverages latent diffusion architectures, multimodal attention mechanisms, and tailored loss functions to overcome occlusion, deformation, and variable illumination challenges.
  • State-of-the-art models like TryOffDiff and TEMU-VTOFF achieve strong results on metrics such as DISTS, FID, and SSIM, benefiting e-commerce imagery pipelines.

Virtual Try-Off (VTOFF) is a recently established computational imaging task that focuses on extracting standardized, catalog-style images of garments from photographs of clothed individuals. In contrast to classical Virtual Try-On (VTON), which synthesizes an image of a person wearing a garment supplied as a product image, VTOFF inverts this mapping—removing the person, pose, and background in order to recover the garment in a canonical, neutral-pose presentation. VTOFF enables strict, instance-level evaluation of garment reconstruction fidelity, underpins new evaluation protocols for generative models, and streamlines e-commerce imagery pipelines. The task poses significant algorithmic challenges due to garment occlusion, deformation, and imaging variability. Key technical advances in VTOFF research include latent diffusion architectures with high-capacity visual conditioning, multimodal attention mechanisms, and new metrics for fine-grained evaluation.

1. Formal Definition and Task Structure

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote an input photograph of a real person wearing a garment, and $G \in \{0, \ldots, 255\}^{H \times W \times 3}$ the desired catalog-style product image of that garment. The VTOFF model aims to learn the conditional distribution $Q(G \mid C = I)$ that closely approximates the real distribution $P(G \mid C)$, where $C$ represents the context provided in the observed photograph. The model is tasked to reconstruct both the global garment silhouette and fine local details (logos, patterns, stitching) under conditions of occlusion, non-rigid deformation, and variable illumination.

Precise quantitative evaluation is possible in VTOFF, as each input photograph is paired with a unique, well-aligned target garment image. This contrasts with VTON, where many plausible outputs may exist for the same conditioning pair, complicating metric selection and benchmarking (Velioglu et al., 27 Nov 2024).

2. Technical Challenges and Design Considerations

Key sources of difficulty in VTOFF include:

  • Occlusion and Deformation: Real-world reference photos may show only partial garment visibility, with critical features hidden by body pose, hair, or accessories.
  • Appearance Variability: Heterogeneous lighting, background complexity, and camera artifacts complicate feature extraction and consistent canonicalization.
  • Strict Output Standardization: Outputs are required to fit tightly defined visual protocols (e.g., centered, white background, canonical pose) as found in e-commerce catalogs, limiting permissible stylistic variation.

These constraints drive the design of VTOFF architectures toward strong spatial reasoning, occlusion-tolerant encodings, and precise low-level feature pathways.

3. State-of-the-Art Models: TryOffDiff and TEMU-VTOFF

TryOffDiff (Velioglu et al., 27 Nov 2024, Velioglu et al., 17 Apr 2025) exemplifies the diffusion-based approach to VTOFF:

  • Architecture: TryOffDiff adapts a Stable Diffusion v1.4 latent diffusion model for image-to-image translation, replacing text conditioning with SigLIP-derived visual tokens. The SigLIP visual encoder, a variant of the CLIP-ViT family, produces 1024 spatial tokens per image; these are passed through a trainable lightweight Transformer and projection layers, yielding 77 conditioning tokens for cross-attention in the denoising U-Net (a minimal sketch of this adapter follows this list).
  • Training: The U-Net and adapters are trained with the standard diffusion noise reconstruction loss, while SigLIP, the VAE, and other backbone modules remain frozen.
  • Performance: On a cleaned VITON-HD subset, TryOffDiff achieves DISTS=23.0, outperforming VTON-adapted baselines (CatVTON DISTS=28.2). On FID, TryOffDiff reports 25.1 vs 31.4, with similarly strong gains in SSIM.
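
A minimal PyTorch sketch of such an adapter is shown below. Only the token counts (1024 in, 77 out) and the SD v1.x cross-attention context width of 768 come from the description above; the SigLIP hidden width, the adapter depth, and the compression mechanism (a linear map over the token axis) are illustrative assumptions, not the released TryOffDiff code.

```python
import torch
import torch.nn as nn

class SigLIPAdapter(nn.Module):
    """Compresses SigLIP spatial tokens into a short conditioning sequence.

    Token counts (1024 in, 77 out) follow the description above; the
    SigLIP width, adapter depth, and compression mechanism are
    illustrative assumptions.
    """
    def __init__(self, siglip_dim=1152, unet_ctx_dim=768,
                 n_in_tokens=1024, n_out_tokens=77,
                 n_layers=1, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=siglip_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Mix information across the token axis: 1024 tokens -> 77 tokens.
        self.token_proj = nn.Linear(n_in_tokens, n_out_tokens)
        # Match the U-Net cross-attention context width (768 for SD v1.x).
        self.feat_proj = nn.Linear(siglip_dim, unet_ctx_dim)

    def forward(self, siglip_tokens):              # (B, 1024, siglip_dim)
        x = self.encoder(siglip_tokens)            # refine spatial tokens
        x = self.token_proj(x.transpose(1, 2))     # (B, siglip_dim, 77)
        return self.feat_proj(x.transpose(1, 2))   # (B, 77, unet_ctx_dim)
```

In a diffusers-style pipeline, the 77 output tokens would then be passed to the U-Net as `encoder_hidden_states` in place of the text-encoder sequence (again an interface assumption).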

TEMU-VTOFF (Lobba et al., 27 May 2025) extends VTOFF to the multi-category and multimodal setting:

  • Dual DiT Backbone: TEMU-VTOFF employs two Diffusion Transformers—one to extract garment-aware features (via concatenation of masked latents and segmentation masks) and one as the main generator, both based on the SD3 Medium DiT design. Cross-modality attention is implemented as a three-way merge of visual, text, and extractor tokens in each block.
  • Alignment and Losses: A garment alignment module projects intermediate generator features and DINOv2 image-encoder outputs to a common space, optimizing a cosine similarity loss for high-frequency texture and structure preservation (a sketch of this loss follows this list).
  • Results: On the DressCode multi-category dataset, TEMU-VTOFF attains DISTS = 18.66 and FID = 5.74, a significant improvement over prior art (Any2AnyTryon: DISTS = 25.17, FID = 12.32; TryOffDiff: DISTS = 29.88, FID = 70.02).
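
The alignment objective admits a compact sketch. The hypothetical `GarmentAlignmentLoss` below projects both feature streams to a shared width and penalizes per-token cosine dissimilarity; the widths and the assumption of matching token counts are illustrative, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class GarmentAlignmentLoss(nn.Module):
    """Hypothetical alignment loss: project generator and DINOv2 features
    to a shared space and penalize per-token cosine dissimilarity.
    All widths are assumptions; the published module may pool or match
    tokens differently."""
    def __init__(self, gen_dim=1536, dino_dim=1024, shared_dim=512):
        super().__init__()
        self.gen_proj = nn.Linear(gen_dim, shared_dim)
        self.dino_proj = nn.Linear(dino_dim, shared_dim)

    def forward(self, gen_feats, dino_feats):
        # gen_feats: (B, N, gen_dim); dino_feats: (B, N, dino_dim),
        # assumed to have matching token counts N.
        g = F.normalize(self.gen_proj(gen_feats), dim=-1)
        d = F.normalize(self.dino_proj(dino_feats), dim=-1)
        # Cosine similarity per token; minimizing (1 - cos) maximizes it.
        return (1.0 - (g * d).sum(dim=-1)).mean()
```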

These models demonstrate that latent diffusion with high-capacity, visually-grounded cross-attention can accurately reconstruct detailed garment imagery and generalize across garment classes and real-world scenes.

4. Model Architecture Components

Visual Conditioning

High-fidelity visual embedders such as SigLIP or CLIP-ViT extract spatially detailed features from the input image. In TryOffDiff, these are projected to cross-attention keys/values at every U-Net stage, anchoring generation to localized garment cues.
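
A generic layer of this kind can be sketched as follows, with queries drawn from the spatial latents and keys/values from the image-derived tokens. Channel widths are illustrative (SD v1.x U-Net stages use different widths at each resolution), and this is not the actual TryOffDiff code.

```python
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    """Generic cross-attention: U-Net latent tokens attend to image-derived
    conditioning tokens. Channel widths are illustrative."""
    def __init__(self, latent_dim=320, ctx_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=n_heads,
            kdim=ctx_dim, vdim=ctx_dim, batch_first=True)

    def forward(self, latents, image_tokens):
        # latents: (B, H*W, latent_dim); image_tokens: (B, 77, ctx_dim)
        out, _ = self.attn(query=latents, key=image_tokens, value=image_tokens)
        return latents + out  # residual connection, as in standard blocks
```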

Adapter and Fusion Modules

Lightweight Transformer adapters and projection layers bridge high-dimensional image features to the input size required by the diffusion backbone. In TEMU-VTOFF, multimodal hybrid attention (MHA) blocks concatenate and blend diffusion latents, text features (zero-shot attribute captions, T5/CLIP), and feature extractor outputs.
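
One plausible realization of such a fusion block, sketched below under the assumption that all three streams are pre-projected to a shared width, concatenates them along the sequence axis, applies joint self-attention, and returns only the latent positions to the diffusion stream; the published TEMU-VTOFF block layout may differ.

```python
import torch
import torch.nn as nn

class HybridAttentionBlock(nn.Module):
    """Joint self-attention over concatenated latent, text, and extractor
    tokens. All streams are assumed pre-projected to a shared width; the
    published block layout may differ."""
    def __init__(self, dim=1536, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, latent_tokens, text_tokens, extractor_tokens):
        n_latent = latent_tokens.shape[1]
        seq = torch.cat([latent_tokens, text_tokens, extractor_tokens], dim=1)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        seq = seq + out
        # Only the latent positions continue through the diffusion stream.
        return seq[:, :n_latent]
```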

Loss Functions and Training

All models implement the canonical diffusion loss:

$$
\mathcal{L} = \mathbb{E}_{x_0,\, t,\, \varepsilon} \Big[ \lVert \varepsilon - \varepsilon_\theta(x_t, t, c_\text{img}) \rVert^2 \Big],
$$

where $x_t$ is the noised latent at step $t$, $\varepsilon$ is Gaussian noise, $c_\text{img}$ is the visual condition, and $\varepsilon_\theta$ is the predicted noise. TEMU-VTOFF incorporates an additional alignment loss for high-frequency detail.
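
In code, this objective reduces to a few lines. The sketch below assumes diffusers-style `unet` and `scheduler` interfaces (`.sample`, `add_noise`); names should be adapted to the codebase at hand.

```python
import torch
import torch.nn.functional as F

def diffusion_step_loss(unet, latents, cond_tokens, scheduler):
    """One epsilon-prediction training step. `unet` and `scheduler` are
    assumed to follow diffusers-style interfaces."""
    b = latents.shape[0]
    # Sample a random timestep per example.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (b,), device=latents.device)
    noise = torch.randn_like(latents)
    # Forward-diffuse the clean latents to step t.
    noisy = scheduler.add_noise(latents, noise, t)
    # Predict the injected noise given the visual conditioning tokens.
    pred = unet(noisy, t, encoder_hidden_states=cond_tokens).sample
    return F.mse_loss(pred, noise)
```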

Evaluation Metrics

Conventional metrics (SSIM, FID, LPIPS) are found to be insufficient for VTOFF, as they do not reliably penalize structural errors (e.g., missing sleeves or incorrect patterns). The Deep Image Structure and Texture Similarity (DISTS) metric (Ding et al., 2020) is recommended, as it jointly assesses shape and texture differences in VGG-derived feature spaces. DISTS correlates better with human perception of garment fidelity (Velioglu et al., 27 Nov 2024).
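
For practical evaluation, DISTS is available in open-source image-quality packages. The snippet below assumes the `piq` package, which ships a DISTS implementation; inputs are image batches scaled to [0, 1].

```python
import torch
import piq  # assumption: the open-source `piq` package, which ships DISTS

# Placeholder batches; in practice, generated vs. ground-truth garment
# images in [0, 1] with shape (B, 3, H, W).
x = torch.rand(4, 3, 256, 256)
y = torch.rand(4, 3, 256, 256)

dists = piq.DISTS(reduction='mean')
score = dists(x, y)  # lower is better; identical images score 0
print(f"DISTS: {score.item():.4f}")
```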

5. Applications, Benchmarks, and Future Directions

VTOFF has high-impact applications in e-commerce, allowing automatic, large-scale generation of standardized product images from consumer-submitted photos or web imagery. These outputs can seed product catalogs or serve as intermediate representations for downstream try-on or person-to-person try-on pipelines (Velioglu et al., 27 Nov 2024, Velioglu et al., 17 Apr 2025). In particular, TryOffDiff garment reconstructions can replace ground-truth images in OOTDiffusion pipelines with negligible perceptual quality loss.

The multi-modal extension (TEMU-VTOFF) generalizes VTOFF to lower-body, dress, and multi-garment cases, using mask and text guidance to resolve hidden or ambiguous areas (Lobba et al., 27 May 2025). The well-defined reference image in VTOFF enables rigorous benchmarking of generative reconstruction models, facilitating reproducible research and more precise capability assessments.

Proposed future directions focus on:

  • Incorporating geometric or 3D priors to better recover occluded or complex garment structures.
  • Integrating perceptual/adversarial losses for ultra-fine detail recovery.
  • Leveraging new diffusion backbones (e.g., SD3 Large) and richer multimodal conditioning (joint text/image prompts) to enhance controllability.
  • Developing better evaluation protocols and user studies that assess visual quality and practical usability.

Known limitations of current VTOFF models include residual difficulty under severe garment occlusion, limited granularity for ultra-fine logos or text, and class imbalance effects, particularly on lower-body garment reconstruction (Lobba et al., 27 May 2025).

6. Quantitative Performance Comparison Table

| Model | Dataset | DISTS ↓ | FID ↓ | SSIM ↑ | Multi-category support |
|---|---|---|---|---|---|
| TryOffDiff | VITON-HD | 23.0 | 25.1 | 79.5 | No |
| Any2AnyTryon | DressCode | 25.17 | 12.32 | 77.56 | Yes |
| TEMU-VTOFF | DressCode | 18.66 | 5.74 | 75.95 | Yes |

For DISTS and FID, lower is better; for SSIM, higher is better. Multi-category support indicates whether the model handles upper-body, lower-body, and dress garments within a unified architecture. Note that TryOffDiff is reported on VITON-HD while the other rows use DressCode, so figures are not directly comparable across datasets.

7. Benchmarking and Reference Use Cases

The VTOFF framework has enabled new evaluation tracks for generative models, providing a well-defined, pixel-aligned ground truth for each generated output. This addresses a core limitation of VTON, where subjective plausibility is a confounding factor in metric selection. In comparative studies, compositional and single-stage diffusion methods (TryOffDiff, TEMU-VTOFF) demonstrate robust generalization and superior recovery of complex garment patterns relative to VTON-adapted baselines and GAN-based pose-transfer methods.

In contemporary deployments, integrating VTOFF reconstructions into multi-stage fashion pipelines enhances input standardization, reduces attribute leakage (e.g., skin color or pose transfer), and improves the reliability of high-level person-to-person try-on and catalog synthesis (Velioglu et al., 27 Nov 2024, Velioglu et al., 17 Apr 2025, Lobba et al., 27 May 2025).
