PG-VTON: A Novel Image-Based Virtual Try-On Method via Progressive Inference Paradigm (2304.08956v2)
Abstract: Virtual try-on is a promising computer vision topic with high commercial value, in which a new garment is visually worn on a person with a photo-realistic effect. Previous studies conduct shape and content inference in a single stage, relying on a single-scale warping mechanism and a relatively unsophisticated content inference mechanism. These choices lead to suboptimal garment warping and skin preservation under challenging try-on scenarios. To address these limitations, we propose a novel virtual try-on method via a progressive inference paradigm (PG-VTON) that leverages a top-down inference pipeline and a general garment try-on strategy. Specifically, we propose a robust try-on parsing inference method by disentangling semantic categories and introducing consistency. Using the try-on parsing as shape guidance, we implement garment try-on via warping-mapping-composition. To facilitate adaptation to a wide range of try-on scenarios, we adopt a covering-more-and-selecting-one warping strategy and explicitly distinguish tasks based on alignment. Additionally, we adapt StyleGAN2 to perform re-naked skin inpainting, conditioned on the target skin shape and spatial-agnostic skin features. Experiments demonstrate that our method achieves state-of-the-art performance under two challenging scenarios. The code will be available at https://github.com/NerdFNY/PGVTON.
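To make the warping-mapping-composition idea concrete, below is a minimal NumPy sketch of only the final composition step: the try-on parsing map acts as shape guidance, and the output is assembled region by region from the warped garment, the inpainted skin, and the preserved person pixels. This is an illustrative sketch under our own assumptions, not the authors' implementation; the label ids (`GARMENT`, `SKIN`), function names, and array layout are hypothetical.

```python
# Illustrative composition step for a parsing-guided try-on pipeline (sketch only).
# Assumptions: images are H x W x 3 uint8 arrays, the try-on parsing is an H x W
# integer label map, and GARMENT/SKIN are hypothetical label ids.
import numpy as np

GARMENT, SKIN = 1, 2  # hypothetical semantic labels in the try-on parsing map


def compose_tryon(person, warped_garment, inpainted_skin, tryon_parsing):
    """Assemble the try-on image by selecting each pixel's source from its parsing label."""
    result = person.copy()
    garment_mask = tryon_parsing == GARMENT
    skin_mask = tryon_parsing == SKIN
    result[garment_mask] = warped_garment[garment_mask]  # paste the warped garment region
    result[skin_mask] = inpainted_skin[skin_mask]        # fill newly exposed (re-naked) skin
    return result


if __name__ == "__main__":
    h, w = 256, 192
    person = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
    garment = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
    skin = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
    parsing = np.random.randint(0, 3, (h, w))
    print(compose_tryon(person, garment, skin, parsing).shape)
```

In the actual method the warped garment and inpainted skin would come from the learned warping and StyleGAN2-based inpainting modules; the sketch only shows how the predicted try-on parsing can route pixels during composition.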