OmniTryOn: Video Try-On Anything at Once!

Published 7 Jun 2026 in cs.CV | (2606.08514v1)

Abstract: Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.


Authors (6)

Summary

The paper introduces a unified video try-on framework that transfers any wearable object in a single inference pass.
It employs a Diffusion Transformer backbone with innovations like the First Frame Wearable Cache and STC-RoPE to ensure high temporal consistency.
Experiments on the new TryAny-Bench benchmark, with the model trained using the Gradual Try-On strategy, show state-of-the-art performance on realistic multi-object transfer.

OmniTryOn: A Unified Framework for Video Try-On Anything

Motivation and Problem Formulation

OmniTryOn proposes a new direction in video virtual try-on (VVT): the simultaneous transfer of arbitrary wearable objects, including garments, handbags, shoes, and facial identities, onto a person in a video within a single inference pass. Existing VVT systems are limited in scope. They focus on single-garment transfer and depend on garment- or body-part-specific priors such as masks or pose estimates, which introduce spatiotemporal artifacts and make it hard to transfer several wearables, or more complex ones, at once. OmniTryOn recasts the problem as the Try-On Anything task, which targets scalable, high-fidelity video synthesis across diverse wearable categories without external priors.

TryAny-Bench Benchmark: Dataset and Protocol Innovations

To enable research in this paradigm, TryAny-Bench is introduced as a comprehensive dataset and evaluation suite, specifically designed for the multi-object video try-on challenge. The construction pipeline (Figure 1) leverages over 1,500 e-commerce videos to generate paired reference/target sequences, systematically replacing and perturbing a variety of wearable objects using sequential application of state-of-the-art VVT and editing models. A unique advantage of TryAny-Bench is the elimination of destructive mask-based training pairs, instead providing direct paired samples across a heterogeneous set of wearables.

Figure 1: TryAny-Bench data construction automates the extraction and augmentation of diverse wearables, yielding physics-preserving paired videos; the evaluation protocol addresses fine-grained dimensions of video realism and object integrity.

Further, TryAny-Bench establishes a tailored multidimensional evaluation protocol based on specialized Video Question Answering (VQA) for rigorous, aspect-specific quality measurement. This protocol covers Video Quality (visual fidelity, action, background consistency), Try-On Stability (object integrity, material fidelity, scale), and Physical Realism (temporal coherence, anatomical correctness, dynamic plausibility). Such a VQA-driven metric suite exposes model performance beyond superficial similarity metrics, enabling precise diagnosis of artifacts and failures.
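The three-dimension protocol above can be pictured as a small aggregation step. The following is only a sketch: the dimension and aspect names come from the description above, but the per-aspect score scale (assumed here to lie in [0, 1]), the plain averaging, and the function name `aggregate_vqa_scores` are assumptions for illustration, since the summary does not say how VQA answers are turned into numbers.

```python
# Dimensions and aspects as listed in the protocol description.
PROTOCOL = {
    "Video Quality": ("visual fidelity", "action", "background consistency"),
    "Try-On Stability": ("object integrity", "material fidelity", "scale"),
    "Physical Realism": (
        "temporal coherence",
        "anatomical correctness",
        "dynamic plausibility",
    ),
}


def aggregate_vqa_scores(aspect_scores):
    """Average per-aspect scores into one score per dimension.

    `aspect_scores` maps each aspect name to a number in [0, 1]
    (an assumed scale, not one stated in the source).
    """
    return {
        dimension: sum(aspect_scores[a] for a in aspects) / len(aspects)
        for dimension, aspects in PROTOCOL.items()
    }


# Hypothetical scores for a single generated video.
example_scores = {
    "visual fidelity": 1.0, "action": 0.5, "background consistency": 0.0,
    "object integrity": 1.0, "material fidelity": 1.0, "scale": 1.0,
    "temporal coherence": 0.5, "anatomical correctness": 0.5,
    "dynamic plausibility": 0.5,
}
per_dimension = aggregate_vqa_scores(example_scores)
```

Keeping the aspect-level scores alongside the per-dimension averages is what would let a radar-style comparison isolate a specific failure such as poor material fidelity.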

OmniTryOn Architecture

The proposed OmniTryOn framework is built upon the Diffusion Transformer (DiT) backbone, integrating several architectural innovations to address the task's demands (Figure 2):

First Frame Wearable Cache: The image of all target wearable objects is encoded and directly prepended as the first frame in the latent sequence of the target, serving as a persistent object-attribute cache. This strategy exploits in-context attention for fine-grained propagation of object information across temporal denoising steps, bypassing the need for explicit object priors and enabling scalable multi-object handling.
Spatiotemporally Consistent Rotary Position Embedding (STC-RoPE): Both reference and target latents are grounded in the same 3D positional encoding, forcibly anchoring complex motion and background dynamics. In contrast to prior methods that bias RoPE to separate conditions and targets, STC-RoPE creates a strict identity mapping in spatiotemporal space, supporting detailed and consistent attribute transfer.
Figure 2: The OmniTryOn pipeline injects diverse wearable object information in the first latent frame and enforces shared spatiotemporal position anchoring via STC-RoPE.
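The two components above can be illustrated with a minimal NumPy sketch. It assumes latents shaped (frames, height, width, channels), that the wearable frame occupies temporal index 0 of the target sequence, and that reference frame k reuses the (t, h, w) position of target frame k. The helper names `position_grid` and `build_inputs` are hypothetical, the rotary embedding itself is omitted, and none of these layout details are confirmed by the summary.

```python
import numpy as np


def position_grid(num_frames, height, width, t_start=0):
    """Integer (t, h, w) position ids, shape (num_frames*height*width, 3)."""
    t, h, w = np.meshgrid(
        np.arange(t_start, t_start + num_frames),
        np.arange(height),
        np.arange(width),
        indexing="ij",
    )
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)


def build_inputs(wearable_latent, reference_latents, noisy_target_latents):
    """Prepend the wearable frame to the target and share position ids."""
    num_frames, height, width, _ = reference_latents.shape
    # First Frame Wearable Cache: the wearable latent becomes frame 0
    # of the target sequence.
    target_seq = np.concatenate([wearable_latent, noisy_target_latents], axis=0)
    target_pos = position_grid(num_frames + 1, height, width, t_start=0)
    # Shared anchoring (assumed layout): reference frames start at t = 1,
    # so they get the same ids as the target video frames, with no offset.
    reference_pos = position_grid(num_frames, height, width, t_start=1)
    return target_seq, target_pos, reference_latents, reference_pos


rng = np.random.default_rng(0)
T, H, W, C = 4, 2, 3, 8
wearable = rng.normal(size=(1, H, W, C))
reference = rng.normal(size=(T, H, W, C))
noisy_target = rng.normal(size=(T, H, W, C))
target_seq, target_pos, ref_seq, ref_pos = build_inputs(
    wearable, reference, noisy_target
)
```

In this layout, a biased variant would correspond to giving the reference frames a different `t_start`, so that reference and target tokens no longer share coordinates.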

Textual description encodings (via LLM-based captioning) and CLIP embeddings further guide semantics. The architecture thus achieves high-fidelity, consistent synthesis in a unified, prior-free fashion.

Training Regime: Gradual Try-On Strategy

To address the complexity of multi-object transfer (especially for objects with large and deformable spatial support, like garments), the Gradual Try-On (GTO) training strategy is adopted. Training proceeds in two stages:

Stage 1: Restricted to garment-only try-on, focusing model capacity on non-rigid transformations and detailed attribute transfer.
Stage 2: Progressive expansion to all wearables (handbags, shoes, faces) for simultaneous multi-object transfer.
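The two stages can be sketched as a step-dependent choice of which wearable categories a training sample may contain. This is an illustrative sketch only: the switch at a fixed step count, the uniform sampling, and the names `categories_for_step` and `sample_wearables` are assumptions, and the summary does not describe the actual schedule or sampling scheme.

```python
import random

STAGE1_CATEGORIES = ("garment",)
STAGE2_CATEGORIES = ("garment", "handbag", "shoes", "face")


def categories_for_step(step, stage1_steps):
    """Garment-only during stage 1, all wearable categories afterwards."""
    if step < stage1_steps:
        return STAGE1_CATEGORIES
    return STAGE2_CATEGORIES


def sample_wearables(step, stage1_steps, rng):
    """Pick a non-empty subset of the categories allowed at this step."""
    pool = categories_for_step(step, stage1_steps)
    count = rng.randint(1, len(pool))  # randint is inclusive on both ends
    return sorted(rng.sample(pool, count))


rng = random.Random(0)
early = sample_wearables(step=0, stage1_steps=1000, rng=rng)
late = sample_wearables(step=1000, stage1_steps=1000, rng=rng)
```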

Model optimization uses a flow matching-based objective over latent space interpolations between Gaussian noise and the target sequence, conditioned on reference video and object semantics.
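A flow matching objective of this kind can be sketched as follows. The sketch assumes the common linear path x_t = (1 - t) * noise + t * target with velocity target (target - noise). The summary only states that the objective interpolates between Gaussian noise and the target sequence, so the exact parameterization, the timestep handling, and the `velocity_model` callable are assumptions.

```python
import numpy as np


def flow_matching_loss(velocity_model, target_latents, condition, noise, t):
    """Mean squared error between predicted and target velocity at time t."""
    # Point on the straight line from noise (t = 0) to the target (t = 1).
    x_t = (1.0 - t) * noise + t * target_latents
    # Along that line the velocity is constant.
    target_velocity = target_latents - noise
    predicted_velocity = velocity_model(x_t, t, condition)
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def zero_model(x_t, t, condition):
    """Untrained stand-in that always predicts zero velocity."""
    return np.zeros_like(x_t)


def oracle_model(x_t, t, condition):
    """Cheating stand-in: reads the clean target from `condition`."""
    return (condition - x_t) / (1.0 - t)


rng = np.random.default_rng(0)
target = rng.normal(size=(4, 2, 3, 8))  # (frames, height, width, channels)
noise = rng.normal(size=target.shape)
t = 0.3

baseline_loss = flow_matching_loss(zero_model, target, None, noise, t)
oracle_loss = flow_matching_loss(oracle_model, target, target, noise, t)
```

The oracle only checks the algebra, since (target - x_t) / (1 - t) equals target - noise on this path. In a real setup `condition` would carry the reference video and object semantics rather than the target itself.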

Experimental Results

Quantitative Evaluation

OmniTryOn is evaluated on TryAny-Bench against multiple state-of-the-art VVT baselines (MagicTryOn, CatV2TON, ViViD) and general video editing frameworks (VACE, Video-As-Prompt). Across all reported metrics (SSIM, LPIPS, VFID-I, VFID-R), OmniTryOn outperforms the baselines, achieving higher SSIM and lower LPIPS and VFID (lower is better for both LPIPS and VFID).

Ablation studies indicate that STC-RoPE is critical: biased variants, which break the shared positional anchoring between reference and target, degrade consistency and realism. The GTO training strategy also improves optimization stability and final performance.

A VQA-based multidimensional radar analysis (Figure 3) confirms that OmniTryOn dominates baselines across all axes, with pronounced gains in object integrity and material fidelity—criteria where prior methods collapse due to either mask-based artifacts or error compounding in sequential inference for multi-garment synthesis.

Figure 3: Radar plots of VQA-based evaluation indicate that OmniTryOn outperforms baselines in all pivotal dimensions of try-on synthesis quality.

Qualitative Analysis

OmniTryOn yields visually robust results (Figure 4), generating physically plausible, temporally stable video with multiple wearables. Competing VVT models fail to transfer anything beyond garments and struggle with coherence and color fidelity, often hallucinating undesirable artifacts (e.g., ghosting, unnatural textures, misplaced semantic elements). General video editing systems lack local alignment, with outputs suffering from implausible deformations and background drift.

Figure 4: OmniTryOn delivers higher realism and maintains spatiotemporal consistency compared to specialized VVT and general video baselines.

Theoretical and Practical Implications

OmniTryOn's architectural design demonstrates that joint in-context injection (via the First Frame Wearable Cache) coupled with strict spatiotemporal anchoring (STC-RoPE) is sufficient for multi-object try-on in videos, removing the reliance on brittle, error-prone external priors. This validates that recent advances in DiT-based video models, when properly conditioned, are capable of supporting realistic and controllable video customization tasks beyond the current VVT paradigm. The results suggest a generalizable path for prior-free, simultaneous multi-object control in generative video modeling.

Practically, this framework—together with the TryAny-Bench task standard—enables richer and more immersive video-driven applications, e.g., in e-commerce (allowing real-time, arbitrary visual customization), digital content creation, and mixed reality. Future extensions may focus on scaling object categories, supporting dynamic interactions, and further improving semantic controllability or editability via more expressive conditioning.

Conclusion

OmniTryOn establishes a new state-of-the-art for video try-on by introducing the Try-On Anything task, a robust benchmark, and a high-fidelity, prior-free generative approach. Its architectural innovations—the First Frame Wearable Cache and Spatiotemporally Consistent RoPE—address the intricate technical challenges of multi-object, consistent transfer. The strong empirical results on TryAny-Bench, supported by comprehensive ablations and qualitative evaluations, mark a significant advancement in controllable, scalable video synthesis. This framework will likely serve as a new cornerstone for subsequent research in unified video customization and multi-object editing (2606.08514).
