VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers (2405.18326v2)

Published 28 May 2024 in cs.CV

Abstract: Video try-on is a promising area with tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, and underperform on casually captured videos. Recently, Sora revealed the scalability of the Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a spatial-temporal denoising DiT, and an identity-preservation ControlNet. To faithfully recover clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, which severely limits their scalability, VITON-DiT relies solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporally consistent try-on results for in-the-wild videos with complicated human poses.
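The abstract's central mechanism is injecting garment features into the self-attention of the denoising DiT (and its ControlNet). No code accompanies this page, so the snippet below is only a minimal PyTorch sketch of one plausible reading of that fusion: tokens from the garment extractor are concatenated into the keys and values of a block's self-attention, so every spatio-temporal token can attend to clothing details. The module name, tensor shapes, and residual placement are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GarmentFusedSelfAttention(nn.Module):
    # Hypothetical module (not the paper's code): garment tokens are appended
    # to the keys/values of a DiT block's self-attention so that denoising
    # tokens can attend to clothing details.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, garment: torch.Tensor) -> torch.Tensor:
        # x:       (B, N, C) spatio-temporal tokens of the denoising DiT
        # garment: (B, M, C) tokens from the garment extractor (assumed shape)
        kv = torch.cat([x, garment], dim=1)           # extend keys/values only
        out, _ = self.attn(query=x, key=kv, value=kv)
        return x + out                                 # residual, standard in DiT blocks

# Toy usage: 2 clips, 256 tokens each, 64 garment tokens, width 512.
x = torch.randn(2, 256, 512)
g = torch.randn(2, 64, 512)
y = GarmentFusedSelfAttention(dim=512)(x, g)
assert y.shape == x.shape
```

Concatenating into the keys/values rather than replacing them keeps the block a strict superset of ordinary self-attention, a common pattern in reference-image injection methods.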

Authors (5)
  1. Jun Zheng (40 papers)
  2. Fuwei Zhao (10 papers)
  3. Youjiang Xu (10 papers)
  4. Xin Dong (90 papers)
  5. Xiaodan Liang (318 papers)
Citations (2)
