Sketch2Human: Deep Human Generation with Disentangled Geometry and Appearance Control (2404.15889v1)
Abstract: Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control over the body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on text prompts to generate appearance, and they struggle to balance the realism of their results against faithfulness to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human, with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled geometry and texture information in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method over state-of-the-art methods.
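To make the latent-space control concrete, below is a minimal, hypothetical sketch of the geometry/appearance mixing idea described in the abstract: a sketch encoder predicts W+ geometry codes from a semantic sketch, a reference image is inverted into appearance codes, and the two are combined before being fed to the StyleGAN-Human generator. All names (SketchEncoder, mix_geometry_appearance, the layer-split index) are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only: module names, the W+ depth, and the layer split
# are assumptions for exposition, not the paper's released code.
import torch
import torch.nn as nn

NUM_WPLUS_LAYERS = 18   # typical W+ depth for a StyleGAN2-style generator
GEOMETRY_LAYERS = 8     # assumed split: coarse layers carry pose / body / garment shape


class SketchEncoder(nn.Module):
    """Maps a semantic sketch (B x 1 x H x W) to W+ geometry latent codes."""

    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.latent_dim = latent_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, NUM_WPLUS_LAYERS * latent_dim),
        )

    def forward(self, sketch: torch.Tensor) -> torch.Tensor:
        w = self.backbone(sketch)
        return w.view(-1, NUM_WPLUS_LAYERS, self.latent_dim)  # (B, 18, 512)


def mix_geometry_appearance(w_geo: torch.Tensor, w_app: torch.Tensor) -> torch.Tensor:
    """Style mixing: coarse layers come from the sketch-derived geometry code,
    the remaining layers from the reference image's appearance code."""
    return torch.cat([w_geo[:, :GEOMETRY_LAYERS], w_app[:, GEOMETRY_LAYERS:]], dim=1)


# Usage (hypothetical): the generator and the appearance inversion would come from
# a pretrained StyleGAN-Human checkpoint and a GAN-inversion encoder such as pSp/e4e.
# w_geo = sketch_encoder(semantic_sketch)        # geometry control
# w_app = appearance_encoder(reference_image)    # appearance control
# image = generator.synthesis(mix_geometry_appearance(w_geo, w_app))
```

In a StyleGAN2-style W+ space, coarse layers mostly govern pose and shape while finer layers govern color and texture, which is why a split of this kind (wherever the paper actually places it) can yield disentangled geometry and appearance control.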
Authors: Linzi Qu, Jiaxiang Shang, Hui Ye, Xiaoguang Han, Hongbo Fu