Vanast: Unified Virtual Try-On Animation
- Vanast is a unified framework for generating human animation videos with garment transfer from a single image and pose guidance.
- It employs a dual-module design, combining the Human Animation Module for pose and identity with the Garment Transfer Module for garment fidelity.
- The approach unifies virtual try-on and pose-driven animation in a single diffusion-based framework, reducing issues like identity drift and garment distortion.
Vanast is a unified framework for generating garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. It formulates virtual try-on and pose-driven animation as a single-step video generation problem rather than as a two-stage composition of image-based try-on followed by animation. In the reported formulation, the inputs are a human image showing the target subject in an alternative outfit, one or more garment images, a pose guidance video represented as $2$D keypoints, and an optional text prompt for actions or background; the output is a multi-frame video of the same person wearing the target garments and following the target motion (Cha et al., 6 Apr 2026).
1. Problem formulation and motivation
Vanast addresses a failure mode of conventional virtual try-on pipelines that decompose the task into two stages: first, transferring a target garment onto a still photograph, and second, animating the synthesized image using a pose-driven animation model. In the reported analysis, this decomposition introduces three chronic issues: identity drift, garment distortion and loss of details, and front–back inconsistency (Cha et al., 6 Apr 2026).
Identity drift arises because the animation model is trained on a different distribution and receives a synthetic input from the try-on stage. Garment distortion and detail loss occur because the fine garment cues established in the first stage are not preserved in the second stage, so compounding artifacts accumulate. Front–back inconsistency reflects the fact that a single input image cannot fully encode front, back, and side garment geometry; two-stage pipelines therefore lose garment-specific multi-view cues needed for consistent rendering across viewpoints during motion.
The unified formulation is intended to keep identity, garment, and motion constraints active throughout generation. A plausible implication is that removing the interface between a still-image try-on model and a separate animation model reduces distribution shift and allows motion cues to contribute directly to viewpoint-consistent garment synthesis.
2. Unified generative framework
Vanast is implemented as a video-native diffusion transformer framework that conditions jointly on person identity, garment appearance, pose, and optional text. The conditioning signals are a single human image, garment images, a pose guidance video converted to $2$D keypoints with DWPose, and an optional text prompt. All conditioning signals are encoded to latent tokens and fused into generation through a dedicated Dual Module design that augments a pretrained text-to-video Diffusion Transformer backbone (Cha et al., 6 Apr 2026).
The backbone is described as a pretrained text-to-video DiT akin to WAN’s T2V model. Its temporal modeling is inherited from the underlying text-to-video backbone through spatiotemporal self-attention over frame tokens with learned positional encodings. No architectural changes are made to the backbone; it is frozen during training to preserve generation quality and stabilize optimization.
The central design decision is to split conditioning into two trainable pathways. The Human Animation Module, or HAM, is responsible for identity and pose. It takes the human image and pose keypoint video, encodes a motion-conditioned appearance context, and produces residual updates intended to strengthen pose adherence while preserving facial identity and hair. The Garment Transfer Module, or GTM, is responsible for garment fidelity and detail. It receives one or more garment images and injects residual signals enforcing coherent garment transfer across time.
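The hidden-state update rule can be written schematically, in notation introduced here, as
$$h_{i+1} = \mathrm{Block}_i(h_i) + \Delta h_i^{\mathrm{HAM}} + \Delta h_i^{\mathrm{GTM}}, \qquad i \in \{0, 2, 4, \dots\},$$
with $\Delta h_i^{\mathrm{HAM}}$ and $\Delta h_i^{\mathrm{GTM}}$ denoting the residuals produced by HAM and GTM, and with the remaining blocks applied without injection. Residual injections are therefore applied at every even backbone block. The paper characterizes this as distributed, cascaded residual injection that stabilizes training, preserves the pretrained generative prior, and prevents either pose conditioning or garment conditioning from dominating or collapsing.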
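The even-block injection pattern can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `run_backbone`, the use of plain Python lists as hidden states, zero-based block indexing, and the choice to compute the residuals from the block output. The source only states that HAM and GTM residuals are added at every even backbone block while the backbone stays frozen.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def run_backbone(
    h: Vector,
    blocks: Sequence[Layer],
    ham_branches: Sequence[Layer],
    gtm_branches: Sequence[Layer],
) -> Vector:
    """Apply the (frozen) blocks in order and add HAM and GTM residuals
    after every even-indexed block (hypothetical indexing convention)."""
    for i, block in enumerate(blocks):
        h = block(h)
        if i % 2 == 0:
            k = i // 2  # one HAM/GTM branch per even block
            h = add(h, add(ham_branches[k](h), gtm_branches[k](h)))
    return h


# Toy usage: identity blocks and constant residual branches.
identity: Layer = lambda v: list(v)
ham: Layer = lambda v: [1.0 for _ in v]
gtm: Layer = lambda v: [0.5 for _ in v]
result = run_backbone([0.0, 0.0], [identity] * 4, [ham] * 2, [gtm] * 2)
```

In this toy setting only blocks 0 and 2 receive an injection, so each coordinate accumulates the two residuals twice. Keeping the branches outside the blocks mirrors the idea that the pretrained backbone is left untouched and only the side modules are trained.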
A notable negative design choice is equally explicit: no ControlNet branches are used. Conditioning flows through HAM and GTM residual injections, and the paper does not specify whether $\epsilon$-prediction or $v$-prediction is used in diffusion, stating only that the method follows the backbone’s default.
3. Conditioning, tokenization, and inference mechanics
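The condition inputs are encoded with the pretrained VAE encoder $\mathcal{E}$ from the text-to-video backbone. Writing $I_H$ for the human image, $I_G$ for the garment images, and $P$ for the pose keypoint video, the corresponding latents are
$$z_H = \mathcal{E}(I_H), \qquad z_G = \mathcal{E}(I_G), \qquad z_P = \mathcal{E}(P).$$
These latents are then mapped into the two specialized modules (Cha et al., 6 Apr 2026).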
For HAM, the model constructs a motion-conditioned appearance context by concatenating the human latent $z_H$ and the pose latent $z_P$ along the temporal axis in a frame-wise manner, then projecting the result with a convolutional layer into token embeddings. HAM uses attention to fuse its condition tokens with the backbone hidden state and outputs a residual update $\Delta h^{\mathrm{HAM}}$.
For GTM, the model uses the garment latent $z_G$ alone. To match the temporal dimension, zero temporal placeholders are appended and the resulting tensor is projected via convolutions into tokens. GTM then attends to garment tokens and outputs a residual update $\Delta h^{\mathrm{GTM}}$.
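The two temporal assembly steps can be sketched on toy data. This is an illustration under stated assumptions, not the paper's implementation: latents are modeled as lists of per-frame feature lists, the function names `ham_context` and `gtm_context` are hypothetical, the human latent is assumed to be placed before the pose frames, and the convolutional projection into tokens is omitted.

```python
from typing import List

Frame = List[float]
Latent = List[Frame]  # temporal axis first: [frames][features]


def ham_context(human_latent: Latent, pose_latent: Latent) -> Latent:
    """Concatenate the human latent and the pose latent along time."""
    return [list(f) for f in human_latent] + [list(f) for f in pose_latent]


def gtm_context(garment_latent: Latent, target_frames: int) -> Latent:
    """Append all-zero frames so the garment latent spans target_frames."""
    if not garment_latent:
        raise ValueError("garment latent must contain at least one frame")
    if len(garment_latent) > target_frames:
        raise ValueError("garment latent is longer than the target length")
    feature_dim = len(garment_latent[0])
    missing = target_frames - len(garment_latent)
    padding = [[0.0] * feature_dim for _ in range(missing)]
    return [list(f) for f in garment_latent] + padding


# Toy usage: one human frame, three pose frames, one garment frame.
human = [[1.0, 2.0]]
pose = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
garment = [[5.0, 6.0]]
ham_tokens_in = ham_context(human, pose)
gtm_tokens_in = gtm_context(garment, target_frames=len(ham_tokens_in))
```

Padding the garment latent to the same temporal length as the HAM context is one way to read "match temporal dimension"; the target length used here is an assumption.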
The text prompt is passed to the frozen text-to-video backbone in a typical text-to-video manner and is used to control background or action. For multi-garment cases, all selected garment images are supported; the description states that GTM internally enforces garment category and appearance fidelity.
The inference procedure is correspondingly direct. First, $2$D keypoints are extracted from the pose video using DWPose. Second, the human image, garment images, and keypoints are encoded by the backbone VAE to produce the human, garment, and pose latents. Third, HAM tokens are formed by temporal concatenation of the human and pose latents followed by convolutional projection, while GTM tokens are formed by temporally padding the garment latent and projecting it. Finally, the diffusion sampler of the text-to-video backbone generates the output video. The paper does not specify the sampler type, the number of steps, or runtime characteristics, and it reports that no post-processing is required.
4. Synthetic triplet supervision
A central contribution of Vanast is its synthetic triplet supervision strategy. The required training tuple consists of a video of a person wearing a target garment, an image of that garment, and a single human reference image of the same person in a different outfit. The paper states that no public dataset has this form, and therefore introduces three complementary pipelines to construct such supervision at scale (Cha et al., 6 Apr 2026).
The first pipeline starts from online shopping videos paired with garment images. A reference frame is selected by sampling a set of frames and using Qwen2.5-VL to pick a representative frame satisfying the criteria “face unoccluded, both eyes open, near-frontal” together with a quality threshold; if none is found, the first frame is used. Adaptive cropping is then applied by detecting the largest face and full-body boxes, expanding them, randomly interpolating between them, and computing a crop that encloses the interpolated box and is clipped to image bounds. To synthesize an alternative-outfit human image, the method constructs an inpainting mask designed to break silhouette bias. A text-to-image model such as SDXL synthesizes an auxiliary image in the same pose but with arbitrary identity and garment, SegFormer segments the garment in that auxiliary image, and the resulting mask is used for inpainting. Garment prompts are generated with ChatGPT from a pool of garment types and colors, with gender consistency enforced via Qwen2.5-VL gender detection. FLUX inpainting is then applied to the reference frame with this mask and the text prompt to produce the alternative-outfit human image.
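The reference-frame fallback and the adaptive cropping step can be sketched as plain box arithmetic. The sketch is illustrative only: the function names, the `(x0, y0, x1, y1)` box convention, the expansion factor of 1.2, and the `is_acceptable` callable standing in for the vision-language check are all assumptions, and any aspect-ratio constraint on the final crop is left out because it is not recoverable from the description.

```python
import random
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
T = TypeVar("T")


def pick_reference_frame(frames: Sequence[T], is_acceptable: Callable[[T], bool]) -> T:
    """Return the first acceptable frame, falling back to the first frame."""
    for frame in frames:
        if is_acceptable(frame):
            return frame
    return frames[0]


def expand(box: Box, scale: float) -> Box:
    """Scale a box about its center."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    hw, hh = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)


def interpolate(a: Box, b: Box, t: float) -> Box:
    """Linear interpolation between two boxes; t=0 gives a, t=1 gives b."""
    x0, y0, x1, y1 = ((1 - t) * p + t * q for p, q in zip(a, b))
    return (x0, y0, x1, y1)


def clip(box: Box, width: float, height: float) -> Box:
    """Clip a box to the image rectangle [0, width] x [0, height]."""
    x0, y0, x1, y1 = box
    return (max(0.0, x0), max(0.0, y0), min(float(width), x1), min(float(height), y1))


def adaptive_crop(face_box: Box, body_box: Box, width: float, height: float,
                  scale: float = 1.2, rng: Optional[random.Random] = None) -> Box:
    """Blend the expanded face and body boxes at a random ratio, then clip."""
    rng = rng or random.Random()
    t = rng.random()
    blended = interpolate(expand(face_box, scale), expand(body_box, scale), t)
    return clip(blended, width, height)


crop = adaptive_crop((40, 10, 60, 30), (20, 5, 80, 95), 100, 100, rng=random.Random(0))
```

Sampling the blend ratio per example varies the framing between face close-ups and full-body views, which is one plausible reading of the random interpolation described above.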
The second pipeline starts from in-the-wild videos only and synthesizes both the garment image and the human image from the video itself. For garment image synthesis, the method samples a set of frames, scores frontalness with Qwen2.5-VL, selects the best frame from the top-scoring candidates using criteria including full-body visibility, sharpness, minimal occlusion, lighting and contrast, and composition, then segments the upper clothing with SegFormer. The background is whitened to highlight the garment, and the segmented garment is randomly translated within the garment bounding box to avoid position bias. Qwen2.5-VL is then used to validate the segmented garment and filter out poor masks. Human image synthesis uses the same FLUX inpainting procedure as above.
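The two-stage frame selection can be sketched as a rank-then-rerank step. The function name `select_garment_frame`, the dictionary-based frames, and the two scoring callables are hypothetical stand-ins for the vision-language model's frontalness score and its combined quality judgment; the number of candidates kept is a free parameter because its value is not recoverable from the description.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_garment_frame(
    frames: Sequence[T],
    frontalness: Callable[[T], float],
    quality: Callable[[T], float],
    k: int,
) -> T:
    """Keep the k most frontal frames, then return the one with best quality."""
    if not frames:
        raise ValueError("at least one frame is required")
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = sorted(frames, key=frontalness, reverse=True)[:k]
    return max(candidates, key=quality)


# Toy usage with precomputed scores.
frames = [
    {"id": 0, "front": 0.9, "quality": 0.2},
    {"id": 1, "front": 0.8, "quality": 0.9},
    {"id": 2, "front": 0.1, "quality": 1.0},
]
best = select_garment_frame(
    frames,
    frontalness=lambda f: f["front"],
    quality=lambda f: f["quality"],
    k=2,
)
```

Frame 2 has the highest quality score but is filtered out by the frontalness stage, so the sketch returns frame 1.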
The third pipeline consists of studio capture for multi-garment triplets. Its stated purpose is to overcome the single-garment-per-clip limitation of online videos by providing high-quality studio data with full upper and lower garment triplets, enabling simultaneous multi-garment transfer training such as shirts plus pants.
The resulting training corpus comprises 9,135 videos of 3–10 seconds each, sourced from shopping sites, in-the-wild videos, and captured studio data. Evaluation uses two disjoint test sets: an Internet dataset of 80 examples from public shopping sites with product garment images, and a ViViD dataset of 50 examples built from the official ViViD test split, where the human image is created via outpainting because ViViD hides faces.
5. Objective, training strategy, and zero-shot interpolation
Vanast is fine-tuned on synthetic triplets using the standard diffusion training objective of the text-to-video backbone. The paper does not report auxiliary identity, pose, segmentation, perceptual, or temporal losses, and it does not describe classifier-free guidance. The total loss is therefore reported as the diffusion objective alone (Cha et al., 6 Apr 2026):
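$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{x_0, \epsilon, t}\left[ w(t)\, \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2 \right],$$
where $x_0$ is a clean video, $x_t$ is the noised sample at step $t$, $\epsilon$ is Gaussian noise, $\epsilon_\theta$ is the model’s noise estimator conditioned on $c$, and $w(t)$ is a noise-level weighting. The paper states accordingly that
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}}.$$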
This choice is paired with a frozen pretrained backbone and trainable HAM/GTM branches only. The stated rationale is that the model converges quickly, maintains generative fidelity, and improves garment accuracy, pose faithfulness, and identity preservation.
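A weighted noise-prediction loss of this kind can be written out on toy vectors. The sketch below is a generic illustration, not the backbone's actual objective: the DDPM-style forward step in `add_noise`, the `alpha_bar` parameter, and the function names are assumptions, since the source only says that the method follows the backbone's default diffusion training.

```python
import math
from typing import List, Sequence


def add_noise(x0: Sequence[float], eps: Sequence[float], alpha_bar: float) -> List[float]:
    """Assumed DDPM-style forward step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * e for x, e in zip(x0, eps)]


def diffusion_loss(eps: Sequence[float], eps_pred: Sequence[float], weight: float = 1.0) -> float:
    """Weighted mean squared error between true and predicted noise."""
    if len(eps) != len(eps_pred) or not eps:
        raise ValueError("noise vectors must be non-empty and equally sized")
    mse = sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)
    return weight * mse


# Toy usage: a perfect prediction gives zero loss.
x_t = add_noise([1.0, -1.0], [0.5, 0.5], alpha_bar=0.25)
perfect = diffusion_loss([0.5, 0.5], [0.5, 0.5])
weighted = diffusion_loss([1.0, 0.0], [0.0, 0.0], weight=2.0)
```

In a setup like the one described, gradients of this scalar would update only the HAM and GTM parameters while the backbone weights stay fixed; that training loop is outside the scope of the sketch.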
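Vanast also supports zero-shot garment interpolation between two garments of the same category, $G_A$ and $G_B$, without fine-tuning. At each even block, the GTM residual can be written schematically, in notation introduced here, as
$$\Delta h_i^{\mathrm{GTM}} = (1 - \lambda)\, \mathrm{GTM}_i(h_i; G_A) + \lambda\, \mathrm{GTM}_i(h_i; G_B),$$
with $\lambda \in [0, 1]$. The reported effect is smooth, semantically coherent garment morphing while preserving identity and motion. This suggests that GTM features act as a controllable garment subspace, although the paper does not formalize that interpretation.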
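The blending step can be sketched as a convex combination of two residual vectors, one per garment. The function name `interpolate_residuals`, the list representation, and the convention that a weight of 0 selects the first garment are assumptions made for illustration; the source describes only a mixing of GTM signals at each even block with a weight between 0 and 1.

```python
from typing import List, Sequence


def interpolate_residuals(
    residual_a: Sequence[float],
    residual_b: Sequence[float],
    lam: float,
) -> List[float]:
    """Convex combination: lam=0 returns residual_a, lam=1 returns residual_b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(residual_a) != len(residual_b):
        raise ValueError("residuals must have the same length")
    return [(1.0 - lam) * a + lam * b for a, b in zip(residual_a, residual_b)]


# Toy usage: sweep the weight to morph from garment A to garment B.
res_a = [0.0, 2.0]
res_b = [2.0, 4.0]
sweep = [interpolate_residuals(res_a, res_b, lam) for lam in (0.0, 0.5, 1.0)]
```

Sweeping the weight across generated videos would give the kind of gradual morph described above, with the HAM pathway left unchanged so identity and motion are not affected by the blend.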
6. Empirical evaluation and ablations
Because no prior method is described as performing single-step try-on plus animation from a human image, garment images, and a pose video, the evaluation constructs two-stage baselines. The first stage is either image virtual try-on or subject-to-image generation, and the second stage is human image animation. Baseline components include OOTDiffusion, CatVTON, OmniTry, Any2AnyTryon, VisualCloze, MOSAIC, UNO, StableAnimator, DisPose, and VACE in both one-stage and two-stage configurations. Evaluation metrics include frame-level $L_1$, PSNR, SSIM, and LPIPS, and video-level FID and VFID with I3D and ResNeXt variants (Cha et al., 6 Apr 2026).
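Two of the simpler frame-level quantities, mean absolute error and PSNR, have closed forms that can be sketched directly. The function names, the flat-list pixel representation, and the assumption that pixel values lie in the range 0 to `max_value` are illustrative choices; SSIM, LPIPS, FID, and VFID need windowed statistics or pretrained networks and are not covered here.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute (L1) difference between two equally sized frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and equally sized")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and equally sized")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy usage on two-pixel "frames".
l1_value = mean_absolute_error([0.0, 1.0], [1.0, 1.0])
psnr_value = psnr([0.0, 0.0], [0.1, 0.1])
```

Lower mean absolute error and higher PSNR both indicate a closer match to the ground-truth frame, which is why the two move in opposite directions when a method improves.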
The main reported results on the Internet dataset compare Vanast against the best subject-to-image plus animation baseline on $L_1$, PSNR, LPIPS, FID, VFID$_{\mathrm{I3D}}$, and VFID$_{\mathrm{ResNeXt}}$, and the reported changes are interpreted as gains for Vanast. Against virtual try-on plus animation baselines, Vanast is compared on the same six metrics, with SSIM described as comparable to the best baseline.
On the ViViD dataset, Vanast is evaluated on $L_1$, PSNR, LPIPS, FID, VFID$_{\mathrm{I3D}}$, and VFID$_{\mathrm{ResNeXt}}$, and is described as consistently surpassing baselines.
| Setting | Metrics reported | Reported interpretation |
|---|---|---|
| Internet, best subject-to-image + animation baseline | $L_1$, PSNR, LPIPS, FID, VFID$_{\mathrm{I3D}}$, VFID$_{\mathrm{ResNeXt}}$ | Lower pose adherence and garment detail |
| Internet, Vanast | $L_1$, PSNR, LPIPS, FID, VFID$_{\mathrm{I3D}}$, VFID$_{\mathrm{ResNeXt}}$ | Better pose adherence, garment detail, and identity preservation |
| ViViD, Vanast | $L_1$, PSNR, LPIPS, FID, VFID$_{\mathrm{I3D}}$, VFID$_{\mathrm{ResNeXt}}$ | Similar gains over baselines |
The ablations are organized around architectural and supervision choices. A Single Module variant, which concatenates all conditions into one context module, is reported to struggle with pose control and to perform worse across all metrics. A Backbone-LoRA variant, which concatenates conditions into the backbone and applies LoRA on all blocks, is reported to converge faster but to have weak garment transfer and worse metrics. A “w/o SynthHuman” setting, trained without the synthesized alternative-outfit human images, degrades garment transfer and identity faithfulness. The full Vanast model gives the best aggregate ablation results across $L_1$, PSNR, SSIM, LPIPS, FID, VFID$_{\mathrm{I3D}}$, and VFID$_{\mathrm{ResNeXt}}$.
7. Significance, scope, and limitations
The reported novelty of Vanast is threefold. First, it is presented as the first unified framework for virtual try-on with pose-driven animation from a single human image, garment images, and a pose video in one pass. Second, it introduces large-scale synthetic triplet supervision, including identity-preserving alternative-outfit images generated by inpainting, studio-captured upper-and-lower-garment triplets, and an in-the-wild pipeline that synthesizes garment images directly from video frames. Third, it proposes a Dual Module architecture for video DiT, with a frozen text-to-video backbone and specialized HAM and GTM residual branches, and it extends naturally to multi-garment transfer and zero-shot garment interpolation (Cha et al., 6 Apr 2026).
The method’s practical scope includes full-body or near-full-body subject images, DWPose-derived pose guidance, and one or more garment images. The implementation details reported in the paper are intentionally sparse beyond architectural essentials: the backbone is pretrained and frozen, residual injection occurs at every even block, tokenization uses the backbone VAE and convolutional projections, and the data comprise 9,135 videos from three sources. Training hyperparameters, model size, optimizer, batch size, resolution, sampler, and runtime are not reported.
The paper does not enumerate explicit limitations and includes no dedicated failure-case section beyond the general discussion. Potential challenges listed in the provided description include extreme or rare poses and self-occlusions, complex garment topology such as long flowing skirts or wide loose sleeves, reflective or highly specular materials, large out-of-plane rotations that expose unseen garment backs when only front-view garment images are available, and identity fidelity under severe lighting or large appearance changes. Ethical considerations and potential misuse, including non-consensual deepfake try-on, are also identified as relevant concerns for any subject-driven generation system.
Taken together, the reported evidence positions Vanast as a video-native, condition-factored virtual try-on system in which identity anchoring, garment transfer, and motion control are optimized jointly rather than sequentially. The empirical gains over two-stage baselines and unified alternatives suggest that the combination of synthetic triplet supervision and specialized residual conditioning is effective for reducing distribution-gap artifacts while preserving temporal coherence, garment detail, and subject identity.