
Fashion-VDM: Video Diffusion Model for Virtual Try-On (2411.00225v2)

Published 31 Oct 2024 in cs.CV

Abstract: We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods still lack garment detail and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM.
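The abstract's "split classifier-free guidance" generalizes standard classifier-free guidance by applying a separate guidance weight to each conditioning input, rather than a single weight for the joint condition. A minimal sketch of this idea is below; it assumes two conditioning signals (person and garment) combined in a nested form, and the function name `split_cfg`, the weights `w_person`/`w_garment`, and the specific decomposition are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def split_cfg(eps_uncond, eps_person, eps_both, w_person=2.0, w_garment=2.5):
    """Sketch of split classifier-free guidance over two conditions.

    eps_uncond : denoiser output with all conditions dropped
    eps_person : denoiser output conditioned on the person video only
    eps_both   : denoiser output conditioned on person video + garment image
    w_person, w_garment : per-condition guidance weights (illustrative values)

    Each conditioning signal gets its own weight, so the strength of
    identity/motion preservation and garment fidelity can be tuned
    independently, unlike single-weight CFG.
    """
    return (eps_uncond
            + w_person * (eps_person - eps_uncond)
            + w_garment * (eps_both - eps_person))

# With both weights set to 1.0, the guided prediction reduces to the
# fully conditioned output eps_both, as expected.
pred = split_cfg(np.zeros(4), np.ones(4), 2.0 * np.ones(4), 1.0, 1.0)
```

Training such a model typically involves randomly dropping each condition independently, so that all three denoiser outputs above are available at sampling time.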
