Person-to-Person Virtual Try-On (p2p-VTON)

Updated 1 June 2026

p2p-VTON is a conditional image synthesis task that generates a photorealistic image of a target person wearing a garment from a source person.
It tackles challenges like garment disentanglement, pose misalignment, occlusions, and attribute leakage to ensure accurate garment transfer.
Modern approaches leverage dual-encoder diffusion models, transformer-based attention, and modular pipelines to enhance realism and efficiency.

Person-to-Person Virtual Try-On (p2p-VTON) refers to the conditional image synthesis task of generating a photorealistic image of a target person wearing a garment observed only as worn by a source person, without requiring access to a clean, shop-style garment image. This scenario, which generalizes classical garment-to-person virtual try-on, introduces unique challenges in precise garment extraction, deformation, and identity preservation, especially under variable poses, occlusions, and background conditions.

1. Problem Definition and Core Challenges

The p2p-VTON task is formalized as: given a source person image $I^s$ (displaying the desired garment in an arbitrary pose) and a target person image $I^t$, synthesize $\hat{I}^{t,s}$: the target person rendered in the appearance and draping of the source person’s garment, while retaining all non-garment-specific attributes (body shape, face, background) from the target (Velioglu et al., 17 Apr 2025).
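One compact way to write this objective uses notation introduced here for illustration, not taken from the cited work. Let $G_\theta$ be a learned generator and $M$ a binary mask covering the garment region in either $I^t$ or $\hat{I}^{t,s}$. The task then asks for

$$\hat{I}^{t,s} = G_\theta(I^s, I^t) \quad \text{such that} \quad (1 - M) \odot \hat{I}^{t,s} \approx (1 - M) \odot I^t,$$

where $\odot$ denotes element-wise multiplication. The constraint covers only the pixels outside the garment region (face, background); preserving body shape under the garment is not expressed by it, and how $M$ is obtained depends on the method.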

Key distinguishing challenges include:

Garment disentanglement: Unlike catalog try-on, the garment is observed in situ, occluded, wrinkled, and entangled with source-specific body and lighting cues.
Pose and shape misalignment: Source and target persons may differ significantly in pose, build, or viewpoint, complicating direct pixel transfer.
Unwanted attribute leakage: Skin tone, background, or contextual artifacts may “leak” from the source into the synthesized output, degrading realism (Velioglu et al., 17 Apr 2025).
Scarcity of paired data: There is a lack of large-scale, high-quality datasets containing the same garment worn by different people in distinct settings (Wang et al., 21 Jul 2025, Shen et al., 3 Feb 2025).

2. Architectural Paradigms and Methodological Advances

a. Two-Stream and Dual-Encoder Diffusion Architectures

OutfitAnyone (Sun et al., 2024) introduced a two-stream conditional diffusion model which extends Stable Diffusion’s latent diffusion framework. Separate UNet branches encode the person and garment (as observed on the source), fusing their features via mid-level cross-attention and ending with a post-hoc diffusion-based refiner. This architecture dispenses with explicit geometric warping, instead learning garment deformation through cross-attention between latent representations.
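The cross-attention fusion described above can be illustrated with a minimal NumPy sketch. It is not OutfitAnyone's implementation: the learned query/key/value projections, multiple heads, and the UNet branches are all omitted, and the names `cross_attention`, `person_tokens`, and `garment_tokens` are hypothetical.

```python
import numpy as np

def cross_attention(person_tokens, garment_tokens):
    """Scaled dot-product cross-attention (single head, no learned projections).

    person_tokens:  (Np, d) array used as queries.
    garment_tokens: (Ng, d) array used as keys and values.
    Returns the fused person features (Np, d) and the attention weights (Np, Ng).
    """
    d = person_tokens.shape[-1]
    scores = person_tokens @ garment_tokens.T / np.sqrt(d)
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ garment_tokens, weights
```

Each person token receives a weighted mixture of garment tokens, which is how garment appearance can reach a spatial location without an explicit warp.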

FW-VTON (Wang et al., 21 Jul 2025) explicitly decomposes the task into three modules—flattening, warping, and integration—each operationalized by dual UNets. Flattening reconstructs a clean, “de-occluded” garment image from the partial, warped observation; warping aligns this proxy to the target pose; integration fuses garment structure and high-frequency detail onto the target.
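The flattening, warping, and integration decomposition can be sketched as a composition of three stages. The sketch below shows only the data flow; each stage is a placeholder callable standing in for a trained network, and the class name, argument order, and stage signatures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Image = Any  # placeholder type; a real system would pass tensors or arrays

@dataclass
class ThreeStageTryOn:
    # Worn garment on the source person -> clean, de-occluded garment image.
    flatten: Callable[[Image], Image]
    # (flat garment, target person) -> garment aligned to the target pose.
    warp: Callable[[Image, Image], Image]
    # (target person, warped garment) -> final try-on image.
    integrate: Callable[[Image, Image], Image]

    def __call__(self, source_person: Image, target_person: Image) -> Image:
        flat = self.flatten(source_person)
        warped = self.warp(flat, target_person)
        return self.integrate(target_person, warped)
```

With string stubs in place of networks, the call order becomes visible: `ThreeStageTryOn(lambda s: f"flat({s})", lambda g, t: f"warp({g},{t})", lambda t, w: f"fuse({t},{w})")("src", "tgt")` evaluates to `"fuse(tgt,warp(flat(src),tgt))"`.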

b. Attention-Driven and Transformer-Based Diffusion Models

Recent mask-free architectures employ diffusion transformers to avoid explicit masking or parsing at inference. MFP-VTON (Shen et al., 3 Feb 2025) concatenates the source and target images (plus a blank inpainting panel) across the conditioning width, directing the diffusion model to perform synthesis only in desired regions. A novel Focus Attention loss reinforces attention on the correct garment (reference) and context (target) areas.
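A minimal sketch of the width-wise panel layout is given below. The panel order (source, target, blank), the zero fill for the blank panel, and the function name `build_panels` are assumptions for illustration; the source text only states that the three panels are concatenated across the width and that synthesis is restricted to the desired region.

```python
import numpy as np

def build_panels(source, target):
    """Concatenate source, target, and a blank panel along the width axis.

    source, target: (H, W, C) arrays of identical shape.
    Returns the (H, 3W, C) canvas and an (H, 3W) boolean mask that is True
    only over the blank panel, i.e. the region to be synthesized.
    """
    if source.shape != target.shape:
        raise ValueError("source and target must have the same shape")
    width = target.shape[1]
    blank = np.zeros_like(target)
    canvas = np.concatenate([source, target, blank], axis=1)
    mask = np.zeros(canvas.shape[:2], dtype=bool)
    mask[:, 2 * width:] = True
    return canvas, mask
```

The mask here belongs to the canvas layout rather than to the person: it marks a whole panel, so no garment parsing or person segmentation is needed to build it.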

Any2AnyTryon (Guo et al., 27 Jan 2025) leverages a large precompiled dataset (LAION-Garment) and introduces Adaptive Position Embeddings, which explicitly communicate spatial and conditioning context to a transformer-based generator, while enabling flexible task instructions via a text encoder. This supports both mask-free and pose-free synthesis within a unified architecture, allowing arbitrary conditioning and layering.

c. Attribute-Aware and Warping Modules

PL-VTON (Han et al., 16 Mar 2025) instantiates progressive, attribute-driven warping using a two-stage Multi-attribute Clothing Warping (MCW) module: global pre-alignment (shift/scale) followed by multi-scale, appearance-flow-based pixel deformation. Subsequent human parsing and limb-aware fusion ensure accurate spatial structure and correct skin-cloth boundaries.
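To make the idea of a global shift/scale pre-alignment concrete, the sketch below estimates a per-axis scale and offset by matching the bounding boxes of two binary masks. This is a simple stand-in heuristic, not PL-VTON's learned MCW module, and the function names and the bounding-box criterion are assumptions for illustration. A dense, flow-based deformation would follow as a second stage.

```python
import numpy as np

def bbox(mask):
    """Return (x0, y0, x1, y1) of the True region, with exclusive upper bounds."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def global_prealign(garment_mask, target_region_mask):
    """Estimate (sx, sy, dx, dy) so that x' = sx * x + dx and y' = sy * y + dy
    map the garment bounding box onto the target-region bounding box."""
    gx0, gy0, gx1, gy1 = bbox(garment_mask)
    tx0, ty0, tx1, ty1 = bbox(target_region_mask)
    sx = (tx1 - tx0) / (gx1 - gx0)
    sy = (ty1 - ty0) / (gy1 - gy0)
    dx = tx0 - sx * gx0
    dy = ty0 - sy * gy0
    return float(sx), float(sy), float(dx), float(dy)
```

A coarse alignment of this kind removes most of the translation and size mismatch, so the subsequent pixel-level flow only has to model local deformation.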

LGVTON (Roy et al., 2020) and classical TPS-based systems combine explicit pose and landmark alignment (thin-plate spline warp with human and fashion landmarks) with mask refinement, but these approaches are now mostly superseded by attention- and diffusion-based alignment in SOTA methods.

d. Hybrid Modular Pipelines

TryOffDiff (Velioglu et al., 17 Apr 2025) further separates garment extraction (“virtual try-off”) from garment application (“virtual try-on”), training a SigLIP-conditioned diffusion model to reconstruct a clean, de-personalized garment image from the source, thereby improving subsequent try-on when composited by a VTON module (e.g. OOTDiffusion), and reducing attribute leakage.

3. Data Regimes, Construction, and Preprocessing

The lack of large-scale, paired p2p datasets motivated multiple strategies:

Synthetic Pseudo-Pairs: MFP-VTON (Shen et al., 3 Feb 2025) generates pseudo triplets by running a garment-swap model on garment-to-person pairs, ensuring that resulting person-to-person mappings support principled supervised learning.
Custom p2p-VTON datasets: FW-VTON (Wang et al., 21 Jul 2025) collected images of the same individual in multiple poses and outfits, enabling intra-group pairing for training (≈15k) and inter-group pairing for testing generalization.
Unpaired and Augmented Pipelines: BVTON (Yang et al., 2024) proposes a compositional canonicalizing flow to synthesize in-shop–like garment proxies from in-situ clothing and leverages 50k unpaired images, substantially outscaling past paired setups.
Automation Tools: Many recent methods include mask generation (AutoMasker, DensePose, SAM), pose heatmaps, and synthetic instruction or prompt generation to diversify the training signals (Guo et al., 27 Jan 2025).

4. Loss Functions and Optimization Strategies

Loss functions range from pixelwise (L1/L2), adversarial (GAN-based), and perceptual (VGG) terms to domain-specific constraints such as attention regularization and semantic consistency:

Diffusion Losses: Variants of standard denoising score-matching and flow-matching losses dominate latent diffusion and transformer-based frameworks (Sun et al., 2024, Shen et al., 3 Feb 2025, Guo et al., 27 Jan 2025).
Attribute/Attention Regularization: Focus Attention loss in MFP-VTON encourages model attention to concentrate on relevant garment regions.
Geometric Flow and Parsing Losses: Appearance flow and mask regularizers penalize artifacts in warping modules (PL-VTON, BVTON), while cross-entropy and layer-wise GAN losses ensure semantic correspondence and mask quality.
Class-/Category-Conditional Embeddings: TryOffDiff introduces category-specific embeddings injected at each residual block for multi-garment generalization (Velioglu et al., 17 Apr 2025).
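As a reference point for the diffusion losses listed above, the following is a minimal sketch of the standard noise-prediction (epsilon-prediction) objective for a conditional model. It is the generic textbook form, not the exact loss of any cited method; `eps_model` is a hypothetical callable standing in for the denoising network, and `cond` stands for whatever person and garment conditioning a method supplies.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoising_loss(eps_model, x0, cond, eps, alpha_bar_t):
    """Mean squared error between the sampled noise and the model's prediction.

    eps_model(x_t, alpha_bar_t, cond) must return an array shaped like x_t.
    """
    x_t = add_noise(x0, eps, alpha_bar_t)
    eps_pred = eps_model(x_t, alpha_bar_t, cond)
    return float(np.mean((eps_pred - eps) ** 2))
```

In training, `eps` and the noise level are resampled for every example; method-specific terms such as attention regularizers are added on top of this base loss.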

5. Evaluation Protocols and Comparative Results

Evaluation references multiple quantitative and qualitative metrics:

Model	Dataset	FID↓	SSIM↑	LPIPS↓	User Study (%)
PL-VTON	VITON	12.16	0.92	–	75–83
FW-VTON	VITON-HD	8.53	–	0.363	66.24
BVTON	p2p split	12.4	0.815	0.098	–
MFP-VTON	VITON-HD	9.32	0.869	0.112	–
LGVTON	MPV	56.11	0.89	–	76.2+

PL-VTON and FW-VTON report state-of-the-art scores on their respective splits (for example, SSIM 0.92 for PL-VTON on VITON and FID 8.53 for FW-VTON on VITON-HD); ablation studies in each demonstrate the impact of warping modules, dual encoders, and guidance losses (Han et al., 16 Mar 2025, Wang et al., 21 Jul 2025). MFP-VTON and Any2AnyTryon exploit more recent transformer backbones and mask-free approaches and report further advances, although values are not always directly comparable because resolutions and datasets differ across papers.
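To show what the SSIM column measures, here is a simplified, single-window version of the structural similarity index. Published SSIM scores are normally computed over local sliding windows and averaged, so this global variant will not reproduce the table's numbers; the function name `global_ssim` is hypothetical, while the stabilizing constants follow the usual $(0.01L)^2$ and $(0.03L)^2$ convention.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over the whole image instead of over local windows."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x = ((x - mu_x) ** 2).mean()
    var_y = ((y - mu_y) ** 2).mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

An image compared with itself scores 1, and the score drops as luminance, contrast, or structure diverge. Differences in window size and resolution across papers are one reason reported values are hard to compare directly.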

Qualitative assessments emphasize: accurate sleeve/cuff localization, preservation of skin tone and background, clean edge transitions, and reduction of hallucinations or attribute mixing.

6. Video Person-to-Person Try-On

PEMF-VTO (Chang et al., 2024) extends p2p-VTON to video, proposing a mask-free, point-enhanced architecture that introduces sparse point-based correspondences to guide spatial and temporal attention during diffusion. This approach preserves garment integrity, suppresses frame-to-frame “flicker”, and enables explicit user or algorithmic control via keypoint supervision, improving both frame quality and temporal coherence over prior mask-based pipelines.

7. Limitations, Open Challenges, and Future Directions

Despite significant advances, p2p-VTON still faces the following limitations:

Extreme Pose/Garment Deformations: Robustness to foreshortening, self-occlusion, or loose garments remains challenging, especially in highly articulated or atypical scenarios (Sun et al., 2024, Guo et al., 27 Jan 2025).
Resolution and Real-Time Constraints: While SDXL and recent diffusion models support up to 1920×1080, inference remains costly; work on cheaper inference (e.g., DiT-based backbones, knowledge distillation) is underway (Guo et al., 27 Jan 2025).
Unpaired and Multi-Modal Data: Progress is constrained by available training data; expanding large-scale unpaired, multi-outfit/person datasets with finely annotated semantics and multi-view coverage is a pressing priority (Yang et al., 2024, Wang et al., 21 Jul 2025).
Layered and Multi-Garment Editing: Full support for multi-layered outfits, accessories, and interactive editing (by user prompt or point) is only nascent (Any2AnyTryon, TryOffDiff) (Velioglu et al., 17 Apr 2025, Guo et al., 27 Jan 2025).
3D and Multi-View Consistency: Most methods remain 2D and single-view; integration of mesh guidance, SMPL tracking, or 3D-aware features is expected to improve both visual fidelity and generalizability (Sun et al., 2024, Chang et al., 2024).

Further research is focusing on mask-free, user-controllable interaction, higher-order garment/scene reasoning, and end-to-end unified architectures capable of efficient, robust, and identity-faithful p2p virtual try-on under arbitrary conditions.