Virtual Fitting Room Systems

Updated 4 July 2026

Virtual Fitting Room (VFR) is a digital system that uses anthropometric data, avatar generation, and cloth simulation to visualize garments without physical trials.
The technology integrates diverse methods including image-based try-on, parametric 3D modeling, GAN/diffusion synthesis, and video generation to address realistic fit and appearance.
Recent advancements emphasize interactive deployment and personalized design by combining contour-based segmentation with diffusion-based generative models to improve fit fidelity.

Virtual fitting room (VFR) denotes a class of systems that digitally visualize garments or other wearables on a specific person without physical trial. In the literature covered here, VFR includes image-based virtual try-on, fully 3D animatable fit-on, accessory-specific trial systems, and long-video generation. The field has evolved from anthropometry, avatar generation, and cloth simulation pipelines toward diffusion-based conditional generation, fit-aware supervision, and interactive deployment, while a recurring research distinction has emerged between photorealistic appearance and authentic fitting fidelity (Gunatilake et al., 2022, Kuruppu et al., 2022, Ning et al., 10 Jun 2026).

1. Research scope and system families

A survey-oriented formulation characterizes VFR as an integration of anthropometric data usage, avatar generation, real-time tracking technologies, cloth modeling and simulation, and ease allowance modeling. The same review discusses generic body models, laser scanning, markers, depth cameras, and geometrical, physical, and hybrid virtual clothing methods, and treats ease allowance as a major factor in virtual cloth fitting (Gunatilake et al., 2022).

Contemporary work instantiates this agenda through several distinct technical families.

Family	Representative papers	Core representation
Anthropometry- and mask-driven image VFR	(Kuribayashi et al., 2023, Chen et al., 2024)	OpenPose or DensePose regions with explicit size-controlled masks
Parametric 3D and animatable VFR	(Kuruppu et al., 2022, Joshi et al., 2024)	SMPL, FLAME, or DECA meshes with rigid alignment and rendering
GAN- and diffusion-based image synthesis	(Attallah et al., 2024, Sun et al., 2024, Jiang et al., 2024)	Conditional image generation with garment-conditioned warping or denoising
Video VFR	(Li et al., 15 Jan 2025, Chen et al., 4 Sep 2025)	Spatio-temporal diffusion or segment-by-segment auto-regressive generation

This taxonomy clarifies a common misconception: VFR is not reducible to 2D texture replacement. Some systems explicitly reconstruct an animatable 3D human model from a single RGB image and fit 3D garment meshes in Unity or Blender, whereas others remain image-based but introduce explicit size, pose, or temporal control (Kuruppu et al., 2022, Li et al., 15 Jan 2025). A second misconception is that VFR is limited to apparel; one system builds a virtual trial room for glasses by reconstructing a 3D head with DECA and fitting a custom 3D glasses model exported as glb/glTF with PBR materials (Joshi et al., 2024).

2. Human representation and personalization

The representation of the wearer is a central design choice. In anthropometry-driven pipelines, the body is parameterized by length and girth measurements such as stature, crotch length, arm length, neck girth, chest or bust girth, waist girth, hip girth, and thigh girth, and mapped to a deformable template through a low-dimensional shape vector $\alpha$ with

$V(\alpha) = V_0 + \sum_i \alpha_i S_i.$

The same formulation can be fit by minimizing a regularized measurement error, and the survey literature treats this as a foundation for scalable avatar generation (Gunatilake et al., 2022).
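
The regularized fit mentioned above can be sketched as a small ridge least-squares problem. This is a minimal illustration, assuming each measurement is approximately linear in the shape coefficients, $m(\alpha) \approx m_0 + A\alpha$; the names `fit_shape`, `m0`, `A`, and `lam` are hypothetical and do not come from the survey.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_shape(m0, A, target, lam=1e-3):
    """Minimize ||m0 + A alpha - target||^2 + lam * ||alpha||^2 (assumed model).

    m0:     measurements of the template body V_0
    A:      A[r][i] = change of measurement r per unit of alpha_i
    target: measured lengths and girths of the user
    """
    k = len(A[0])
    resid = [t - b for t, b in zip(target, m0)]
    # Normal equations: (A^T A + lam I) alpha = A^T (target - m0)
    ata = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    atb = [sum(row[i] * r for row, r in zip(A, resid)) for i in range(k)]
    return solve_linear(ata, atb)
```

The fitted coefficients can then be substituted into $V(\alpha)$ to obtain the avatar vertices. A larger `lam` pulls the result toward the template, which is one way to keep implausible shapes out when only a few measurements are available.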

A fully personalized 3D route is exemplified by the single-image pipeline of Kuruppu et al. Their body model uses SMPL, an articulated triangle-mesh model $M$ with $N=6890$ vertices, controlled by a shape vector $\beta \in \mathbb{R}^{|\beta|}$ with $|\beta| = 10$–$20$ principal components and a pose vector $\theta \in \mathbb{R}^{|\theta|}$ with $|\theta|=72$ joint axis-angle parameters. The posed body is given by

$M(\beta,\theta) = W(\bar T + B_s(\beta) + B_p(\theta), J(\beta), \theta, \mathcal{W}),$

where $W$ is the skinning function and $\mathcal{W}$ the blend weights, and $\beta$ and $\theta$ are extracted by running SMPLify on the input RGB image to minimize a 2D-joint reprojection loss plus pose and shape priors (Kuruppu et al., 2022). To restore identity beyond the body mesh, the same pipeline fits FLAME on the image to recover a head mesh with 5023 vertices, aligns the FLAME neck rim to the SMPL neck cut by a rigid transform in $SE(3)$, then stitches the meshes and propagates UV texture to hide seams (Kuruppu et al., 2022).

Not all VFR systems use parametric meshes. SiCo explicitly encodes the body as semantic segmentation regions from DensePose rather than as a parametric mesh such as SMPL, and forms a garment-type-specific body mask by taking the union of selected regions (Chen et al., 2024). This is a materially different design choice: personalization is represented in the image plane through body-aware segmentation and contour preservation rather than through a deformable 3D body prior.

Head-centric accessory trial systems occupy another point in this design space. The DECA-based virtual trial room reconstructs a 3D head mesh, albedo map, UV displacement map, camera parameters, and lighting parameters from a cropped face image, and models the face with FLAME-based coefficients for shape, expression, and pose:

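$M(\beta,\theta,\psi) = W(\bar T + B_s(\beta) + B_p(\theta) + B_e(\psi), J(\beta), \theta, \mathcal{W}),$

where $\beta$, $\theta$, and $\psi$ are the shape, pose, and expression coefficients, $B_e$ is the expression blendshape term, and $\mathcal{W}$ denotes the skinning weights.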

This enables direct alignment of a product model, here glasses, using eye-center anchors and rigid optimization by Procrustes or Gauss-Newton (Joshi et al., 2024).
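
To make the Procrustes step concrete, the following is a minimal sketch of closed-form rigid alignment, simplified to 2D anchor points. The cited system aligns 3D meshes; the 2D restriction, the function names, and the choice of anchors are assumptions made for illustration only.

```python
import math


def rigid_align_2d(source, target):
    """Return (theta, shift) minimizing sum ||R(theta) p + shift - q||^2.

    source, target: equally long lists of corresponding (x, y) anchors,
    e.g. points on the product model and matching points on the face.
    """
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(q[0] for q in target) / n
    ty = sum(q[1] for q in target) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        ax, ay = px - sx, py - sy          # centered source point
        bx, by = qx - tx, qy - ty          # centered target point
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)         # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    shift = (tx - (c * sx - s * sy), ty - (s * sx + c * sy))
    return theta, shift


def apply_rigid_2d(points, theta, shift):
    """Rotate points by theta and translate them by shift."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]
```

In 3D the same least-squares problem is usually solved with an SVD of the cross-covariance matrix, and an iterative solver such as Gauss-Newton becomes useful when extra terms (for example, contact constraints) are added to the objective.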

Across these lines of work, personalization increasingly means more than recovering body dimensions. The explicit stitching of FLAME onto SMPL was motivated by the observation that many virtual fit-on systems lack realism because they are predominantly 2D or do not use the user’s facial features in the dressed model (Kuruppu et al., 2022). This suggests that identity preservation is not merely a perceptual refinement but a structural requirement for trust in VFR outputs.

3. Garment modeling, deformation, and fit control

Garment modeling in VFR ranges from classical physical simulation to learned mask and prompt control. The survey literature distinguishes geometric models, physical models, and hybrid models, and treats ease allowance as a first-order fitting variable through

[…]

It also notes that fit quality can be summarized by minimum distances over critical cross-sections and ease violations, thereby grounding VFR in garment-body geometry rather than image realism alone (Gunatilake et al., 2022).
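
A per-section ease check of this kind can be sketched as follows. The sketch assumes the common pattern-making definition of ease as garment girth minus body girth at the same cross-section, which may differ from the survey's own formulation; the section names and minimum-ease thresholds are hypothetical.

```python
def ease_report(body_girths_cm, garment_girths_cm, min_ease_cm):
    """Return {section: (ease_cm, violated)} for each body cross-section.

    Ease is taken as garment girth minus body girth (an assumption).
    A section is flagged when its ease falls below the required minimum.
    """
    report = {}
    for section, body in body_girths_cm.items():
        ease = garment_girths_cm[section] - body
        required = min_ease_cm.get(section, 0.0)
        report[section] = (ease, ease < required)
    return report


body = {"chest": 96.0, "waist": 82.0}
garment = {"chest": 104.0, "waist": 83.0}
required = {"chest": 6.0, "waist": 4.0}
print(ease_report(body, garment, required))
```

Here the chest has 8 cm of ease and passes, while the waist has only 1 cm and is flagged, which is the kind of geometric signal that image-realism metrics do not capture.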

In a 3D animatable setting, each garment can be represented as a separate tri-mesh with its own UV and physical properties, including mass and stiffness. Kuruppu et al. initialize garment placement by regressing scale and translation from body shape, then run Unity or Blender’s cloth simulator under gravity with collision against the fitted body until rest. They also write cloth fitting as a constrained optimization with

[…]

where body attraction, collision penalties, and internal spring smoothness govern the final state (Kuruppu et al., 2022). This representation is animatable because the combined avatar reuses the SMPL skeleton and nearest-vertex skinning weights.
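
The three named terms can be illustrated with a toy energy function. This is a sketch under stated assumptions, not the formulation of Kuruppu et al.: the body is replaced by a sphere, the terms are combined as a weighted sum, and the weights `w_collision` and `w_spring` are hypothetical.

```python
import math


def cloth_energy(verts, anchors, springs, center, radius,
                 w_collision=10.0, w_spring=1.0):
    """Evaluate a toy three-term cloth energy for 3D garment vertices.

    verts:   list of (x, y, z) garment vertex positions
    anchors: dict vertex index -> (x, y, z) target point on the body
    springs: dict (i, j) -> rest length of the edge between vertices i and j
    center, radius: sphere standing in for the body surface (assumption)
    """
    # Body attraction: squared distance of anchored vertices to their targets.
    attraction = sum(math.dist(verts[i], p) ** 2 for i, p in anchors.items())
    # Collision penalty: squared penetration depth inside the body proxy.
    collision = sum(max(0.0, radius - math.dist(v, center)) ** 2 for v in verts)
    # Internal springs: squared deviation of each edge from its rest length.
    spring = sum((math.dist(verts[i], verts[j]) - rest) ** 2
                 for (i, j), rest in springs.items())
    return attraction + w_collision * collision + w_spring * spring
```

A simulator or optimizer would move `verts` to reduce this value; a rest state corresponds to a configuration where no term can be lowered without raising another.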

Image-based size control takes a different route. The 2023 clothing-size-adjustment system detects 25 body keypoints with OpenPose and measures torso height, shoulder width, and an arm-span component in image space. Real-world garment dimensions are converted to scaling ratios […] and […], after which only the clothing region of the VITON-HD segmentation map is resized. Visual defects are then corrected with collar-gap erosion, arm-cloth overlap realignment, and thin-plate spline warping of the garment image (Kuribayashi et al., 2023). The method works best for T-shirts or tops, assumes a reasonably frontal pose with OpenPose confidence […], and assumes uniform fabric elasticity (Kuribayashi et al., 2023).

SiCo makes the size-control mechanism explicit. Users provide a true size index and select a garment size index, with […] and […], and the system defines […]. Starting from a regular-fit body mask, the try-on mask is computed analytically as

[…]

The top edge is never dilated, specifically to avoid “levitation” (Chen et al., 2024). An important clarification follows directly from the paper: SiCo does not automatically recommend a size via scoring functions; it provides explicit […] displays and side-by-side renderings to support informed comparison (Chen et al., 2024). VFR should therefore not be conflated with automated size recommendation.
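
The idea of growing a mask without moving its top edge can be sketched with a one-sided dilation. This is an illustrative approximation, not SiCo's published formula: the pixel radius `k` stands in for whatever the size difference maps to, and only enlargement (not shrinking) is shown.

```python
def dilate_keep_top(mask, k):
    """Dilate a binary mask by k pixels sideways and downward only.

    mask: list of rows, each a list of 0/1 values.
    Every mask pixel spreads to columns c-k..c+k and rows r..r+k, so no
    pixel above the original top edge is ever switched on.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(r, min(h, r + k + 1)):
                    for cc in range(max(0, c - k), min(w, c + k + 1)):
                        out[rr][cc] = 1
    return out
```

Because the structuring element never reaches upward, a garment anchored at the shoulders or waistband keeps its attachment line while the sides and hem expand.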

More recent work pushes fit control into the generative model itself. FitVTON encodes garment-body size through structured text prompts of the form “[Gender], [body-height category], [body-size category], [wearing style]”, learns from simulated try-on triplets generated with GarmentCode procedural sewing patterns and SMPL-X bodies, and supervises training with auxiliary garment and exposed-body mask heads (Ning et al., 10 Jun 2026). The training set spans approximately […] garments, […] bodies, and […] poses, for […] triplets, and the goal is explicitly to correct the common failure mode of diffusion VTON systems that prioritize texture preservation over physical plausibility (Ning et al., 10 Jun 2026).
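
The four-field prompt template quoted above can be assembled mechanically. The helper below is a small sketch; the function name and the example category values are hypothetical, since the paper's actual vocabularies are not listed here.

```python
def fit_prompt(gender, height_category, size_category, wearing_style):
    """Build a prompt of the form
    "[Gender], [body-height category], [body-size category], [wearing style]".
    """
    fields = [f.strip() for f in
              (gender, height_category, size_category, wearing_style)]
    # Commas are the field separator, so they may not appear inside a field.
    if any(f == "" or "," in f for f in fields):
        raise ValueError("each field must be non-empty and contain no comma")
    return ", ".join(fields)


print(fit_prompt("female", "tall", "slim", "loose fit"))
```

Keeping the fields in a fixed order with a fixed separator makes the size signal easy for a text encoder to pick up consistently across training samples.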

4. Image synthesis architectures

Early image-based VFR treated the task as localized detection and appearance transfer. One 2021 system first used Mask R-CNN with a ResNet-101-FPN backbone to detect and segment fashion items and then applied Neural Style Transfer only within the selected item’s mask. On PaperDoll images with ModaNet annotations, the best model, M5, reported […] mAP and […] ASDR, with style transfer preserving fabric wrinkles and leaving non-target regions unchanged (Huang et al., 2021). This formulation is useful historically because it isolates segmentation quality from garment appearance editing.

GAN-based VFR subsequently integrated alignment and synthesis more tightly. A cost-efficient approach uses a two-stage conditional GAN in which a conditional generator predicts a refined segmentation map and an appearance flow field, and an image generator with SPADE normalization fuses these with an agnostic body representation. The full pipeline reported FID […], SSIM […], preprocessing plus GAN synthesis of […] s per try-on on a Tesla T4, and an end-to-end response of […] s including offline preprocessing; it currently supports only upper-body garments (Attallah et al., 2024).

Diffusion models redefined the architecture of image VFR. OutfitAnyone splits a pretrained Stable Diffusion U-Net into a model stream and a garment stream, injects garment features through cross-attention, and learns non-rigid garment deformation implicitly rather than through explicit TPS or flow. On VITON-HD it reports FID […], LPIPS […], and SSIM […], together with inference times of […] s for […] in […] DDIM steps and […] s for […] in […] steps on a single A100 (Sun et al., 2024).

Two prominent 2024 extensions target breadth and fidelity. AnyFit introduces the Hydra Block, in which only self-attention layers are parallelized across garment branches while the rest of the U-Net is shared, increasing parameters by only […] per extra branch. It combines this with residual synthesis of multiple pretrained models and a mask region boost strategy, reports FID/KID improvements of […]–[…] over strong baselines on single-garment benchmarks, and reaches the best multi-garment FID of […] with only […] inference time (Li et al., 2024). FitDiT instead centers the architecture on Diffusion Transformers, allocating more attention-related parameters to a latent of size […] for […] inputs, adds a garment texture extractor with garment priors evolution, and introduces a frequency-distance loss

[…]

On DressCode paired evaluation it reports SSIM […], LPIPS […], FID […], KID […]; on VITON-HD paired evaluation it reports SSIM […], LPIPS […], FID […], KID […]; and its inference time is […] s for a single […] image after DiT structure slimming (Jiang et al., 2024).
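
The general idea behind a frequency-domain loss can be sketched as the distance between the 2D Fourier spectra of a predicted and a reference image. This is a generic illustration and not FitDiT's exact loss: the naive DFT, the squared-magnitude distance, and the normalization by pixel count are all assumptions chosen for clarity.

```python
import cmath


def dft2(img):
    """Naive 2D discrete Fourier transform of a small grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(angle)
            out[u][v] = total
    return out


def frequency_distance(pred, target):
    """Mean squared magnitude of the spectral difference (assumed form)."""
    fp, ft = dft2(pred), dft2(target)
    h, w = len(pred), len(pred[0])
    return sum(abs(fp[u][v] - ft[u][v]) ** 2
               for u in range(h) for v in range(w)) / (h * w)
```

In practice such a loss would be computed with a fast Fourier transform on tensors and often reweighted so that high-frequency bands, where fine garment texture lives, contribute more than this unweighted version allows.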

A separate line embeds VFR inside a broader shopping loop. HMaVTON combines retrieval-based and generative matching with an enhanced virtual try-on module. The retrieval branch uses a VBPR-style matching score, the generative branch uses a shape-constrained ControlNet-based diffusion model, and the try-on branch refines cloth alignment with flow fields and denoising inpainting. On VITON-HD it reports SSIM […], FID […], and LPIPS […], while a professional evaluation with fashion designers yields a weighted match score of […] versus […] for the next best hybrid baseline (Yu et al., 2024).

Taken together, these systems show a clear architectural transition: explicit warp fields and segmentation pipelines remain important, but much of garment deformation has migrated into conditional attention, high-resolution latent modeling, and multimodal conditioning. This suggests that “virtual fitting” in current image VFR is often implemented as a controlled generative process rather than as a purely geometric overlay.

5. Interaction, deployment, and commercial integration

User interaction is no longer an afterthought in VFR research. SiCo is organized as a two-page web interface. On Page 1, users upload a frontal “regular-fit” self-image and select their “true size” for tops or bottoms from […]. On Page 2, they browse garments with precomputed metadata, choose an item, select a test size […], and append results to a “Try-On Results” area with a “Continue From Here” button for sequential multi-garment styling (Chen et al., 2024). In a user study with […], chi-square tests on […] contingency tables produced Cramér’s $V$ values above […]; all six Likert questions favored size-controllable VTO over baseline with […] for sense or look, suitability, and future-use; NASA-TLX showed one significant performance change with […] when the self-image was removed; and SUS showed one significant drop in consistency with […] when model images replaced self-images (Chen et al., 2024). The same study reports that identity preservation via contour guidance was repeatedly cited as critical for trust (Chen et al., 2024).
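
For readers unfamiliar with the effect-size statistic used in the study, Cramér's $V$ can be computed from a contingency table as follows. The example tables are made up for illustration and are not data from the study; the sketch assumes every row and column total is nonzero.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1   # degrees-of-freedom scaling
    return math.sqrt(chi2 / (n * k))


print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated table
print(cramers_v([[5, 5], [5, 5]]))    # no association
```

The statistic ranges from 0 (no association) to 1 (perfect association), which makes it easier to compare across tables of different size than the raw chi-square value.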

Mobile deployment changes both systems engineering and privacy assumptions. Mobile Fitting Room fine-tunes Stable Diffusion v1-5 with DreamBooth for garment-specific conditioning, compresses the model by 6-bit palettization, U-Net chunking, split-einsum attention, and Core ML compilation, and performs inpainting directly on-device from a user photo, a user-drawn mask, and a garment token. After 6-bit palettization, model size decreases by […], end-to-end inference is typically […]–[…] seconds per […] image on an iPad Pro, and all processing stays on the user’s device, enabling offline availability and avoiding server-side storage of personal images (Blalock et al., 2024). This directly counters the assumption that diffusion-based VFR must be cloud-native.
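
The principle of palettization, replacing each weight by an index into a small lookup table, can be sketched with one-dimensional k-means. This is a toy illustration of the idea, not the Core ML toolchain used by the paper; the function name, the uniform initialization, and the iteration count are assumptions.

```python
def palettize(weights, n_bits=6, iters=20):
    """Cluster weights into at most 2**n_bits palette entries.

    Returns (palette, indices) so that palette[indices[i]] approximates
    weights[i]. With n_bits=6 each weight is stored as a 6-bit index.
    """
    k = min(2 ** n_bits, len(set(weights)))
    lo, hi = min(weights), max(weights)
    # Initialize palette entries uniformly over the weight range.
    palette = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]

    def nearest(w):
        return min(range(k), key=lambda j: abs(w - palette[j]))

    for _ in range(iters):
        indices = [nearest(w) for w in weights]
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:                      # keep empty clusters unchanged
                palette[j] = sum(members) / len(members)
    indices = [nearest(w) for w in weights]
    return palette, indices
```

Storage then consists of the index array plus a table of at most 64 full-precision values per palettized tensor, which is where the size reduction comes from.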

Texture-only customization is another commercially relevant branch. The AIGC-based custom cloth creation system starts from pretrained 3D garment meshes with UV maps, uses semantic UV masks for sleeves, collars, and other parts, and exposes three editing modes: color modification, texture modification through a text prompt, and logo printing. It reports texture generation latency of […] s for a single […] patch, a UI frame rate of […] FPS during camera or pose adjustment, […] faster end-to-end customization than manual 3D texture painting, and […] reduction in artist labor hours for each seasonal apparel update, but it includes no dynamic cloth simulation and no formal user study (Chen et al., 2024).

Accessory-oriented VFR shows how these ideas generalize beyond garments. The DECA-based virtual trial room for glasses uses Flask on the back end, React and Three.js on the front end, and exports fitted assets as glb for an interactive web viewer. It reports DECA-based 3D reconstruction accuracy on the Feng et al. benchmark with median error […] mm, mean error […] mm, and std […] mm, as well as typical latency of […] s end-to-end and an informal user study with […] in which mean satisfaction is […] and […] would consider purchasing with such a preview (Joshi et al., 2024).

Commercial integration also increasingly includes recommendation logic. HMaVTON explicitly frames VFR as a one-stop shopping service that recommends available items or generated alternatives and then produces the try-on result, rather than treating recommendation and visualization as separate systems (Yu et al., 2024). A plausible implication is that future VFR platforms will be judged not only by rendering quality but also by how effectively they connect visualization, inventory, and decision support.

6. Video generation, evaluation, and unresolved issues

Video VFR introduces a qualitatively different problem: maintaining garment identity and body-garment consistency across extended temporal horizons. RealVVT builds on Stable Video Diffusion with a Spatial-Temporal U-Net and a Reference U-Net, introduces Clothing & Temporal Consistency by concatenating current-frame, random distant-frame, and garment-reference features in attention, and adds an Agnostic-guided Attention Focus Loss with […] and […] to force spatial focus inside the agnostic mask (Li et al., 15 Jan 2025). On the VVT dataset at […], it reports SSIM […], LPIPS […], VFID […], and VFID […]; on VITON-HD at […], it reports SSIM […], LPIPS […], FID […], and KID […] (Li et al., 15 Jan 2025).

A more ambitious long-video formulation models VFR as segment-by-segment auto-regressive generation. The technical preview titled “Virtual Fitting Room” takes a single user image, a reference garment image, and a long motion video, and generates minute-scale try-on videos by conditioning each segment on an overlapping prefix and on an anchor video, specifically a 360-degree A-pose clip of the person wearing the garment (Chen et al., 4 Sep 2025). It reports […] resolution at up to […] FPS, subject consistency […], background consistency […], motion smoothness […], and GPT-based try-on quality […] on a […] s setting, but generation of a […] s video still takes […]–[…] hours and depends on a high-quality anchor (Chen et al., 4 Sep 2025).
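
The segment-by-segment scheme with an overlapping prefix can be sketched as a scheduling loop. This is a structural illustration only: `generate_segment` is a hypothetical callback standing in for the video model, and the anchor-video conditioning is omitted.

```python
def segment_windows(total_frames, segment_len, overlap):
    """Return (start, end) frame windows; each window after the first
    begins `overlap` frames before the previous window ended."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    if total_frames <= 0:
        return []
    windows, start = [], 0
    while True:
        end = min(start + segment_len, total_frames)
        windows.append((start, end))
        if end == total_frames:
            return windows
        start = end - overlap


def generate_long_video(total_frames, segment_len, overlap, generate_segment):
    """Auto-regressively extend a frame list one segment at a time.

    generate_segment(prefix, first, end) must return frames first..end-1,
    conditioned on the already generated overlapping `prefix` frames.
    """
    frames = []
    for start, end in segment_windows(total_frames, segment_len, overlap):
        prefix = frames[start:]              # frames shared with the last segment
        frames.extend(generate_segment(prefix, start + len(prefix), end))
    return frames
```

With `total_frames=10`, `segment_len=4`, and `overlap=1`, the windows are `(0, 4)`, `(3, 7)`, and `(6, 10)`, so each new segment sees one previously generated frame as context. Larger overlaps give the model more context per segment at the cost of more segments overall.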

Evaluation methodology is also shifting. FitVTON argues that many diffusion-based systems generate plausible-looking images that fail to reflect authentic garment fit, then introduces FittingEffect3K with […] real try-on triplets built from […] medium-sized real garments, […] human participants, and […] poses each (Ning et al., 10 Jun 2026). Its VLM protocol uses GPT-5.2 with temperature […] and scores four fit dimensions on a […]–[…] scale: Garment–Body Alignment, Tightness/Looseness Consistency, Silhouette Consistency, and Local Fit Artifacts. The reported repeated-run stability is […] with explanation similarity […], the whole-average fit score is […], and ablating dual-branch mask supervision reduces the score to […], a gain of […] from the masks (Ning et al., 10 Jun 2026). This suggests that standard image metrics such as FID, KID, SSIM, or LPIPS are no longer sufficient as standalone proxies for fit fidelity.

Several limitations recur across the literature. The single-image 3D animatable pipeline reports that the back of the head is untextured, hair modeling is not yet supported, and cloth fitting currently relies on simulation-heavy cloth physics (Kuruppu et al., 2022). FitDiT occasionally struggles with very complex hand and finger poses and can still lose detail for heavily textured rare fabrics or out-of-distribution backgrounds (Jiang et al., 2024). RealVVT notes that extremely loose or highly deformed garments such as billowing coats remain difficult in long sequences (Li et al., 15 Jan 2025). The long-video autoregressive VFR remains far from real-time and has limited ablation on anchor length, overlap size, and segment length (Chen et al., 4 Sep 2025).

The convergence of these findings points to a mature but unsettled research area. VFR is no longer defined by a single pipeline template: it now encompasses anthropometric avatar fitting, contour-preserving image editing, high-resolution diffusion, structured fit supervision, and long-horizon video generation. The dominant open question is not whether VFR can produce realistic images, but how faithfully it can encode garment-body interaction, size semantics, and temporal stability under the computational and interface constraints of real deployment.