3D-aware Image Generation and Editing with Multi-modal Conditions (2403.06470v1)

Published 11 Mar 2024 in cs.CV

Abstract: 3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although related works have made great progress in this field, most existing methods suffer from poor disentanglement of shape and appearance and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model that incorporates multiple types of conditional inputs, including pure noise, text, and a reference image. On the one hand, we dive into the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. On the other hand, we propose a unified framework for flexible image generation and editing with multi-modal conditions. Our method can generate diverse images from distinct noise inputs, edit attributes through a text description, and perform style transfer given a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.
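To make the shape/appearance disentanglement described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of the general idea rather than the authors' architecture: the generator consumes separate shape and appearance latent codes plus a camera pose, so resampling or editing one code (for example, from a text or reference-image encoder) leaves the other untouched. All class names, dimensions, and the toy rendering head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Disentangled3DAwareGenerator(nn.Module):
    """Conceptual sketch: a 3D-aware generator that keeps separate shape and
    appearance latent codes, so either can be edited independently."""

    def __init__(self, z_dim=64, feat_dim=128, img_size=32):
        super().__init__()
        # Separate branches: one latent controls geometry, the other style/texture.
        self.shape_mlp = nn.Sequential(nn.Linear(z_dim, feat_dim), nn.ReLU())
        self.appearance_mlp = nn.Sequential(nn.Linear(z_dim, feat_dim), nn.ReLU())
        # Toy "renderer": maps fused features + 3-D camera pose to an RGB image.
        self.render_head = nn.Linear(feat_dim * 2 + 3, 3 * img_size * img_size)
        self.img_size = img_size

    def forward(self, z_shape, z_appearance, camera_pose):
        f_shape = self.shape_mlp(z_shape)          # geometry-only features
        f_app = self.appearance_mlp(z_appearance)  # appearance-only features
        fused = torch.cat([f_shape, f_app, camera_pose], dim=-1)
        rgb = self.render_head(fused)
        return rgb.view(-1, 3, self.img_size, self.img_size)

# Editing appearance while keeping shape fixed: resample only z_appearance.
gen = Disentangled3DAwareGenerator()
z_shape = torch.randn(1, 64)
pose = torch.tensor([[0.0, 0.2, 0.0]])            # toy yaw/pitch/roll
img_a = gen(z_shape, torch.randn(1, 64), pose)
img_b = gen(z_shape, torch.randn(1, 64), pose)    # same geometry, new style
print(img_a.shape, img_b.shape)                    # torch.Size([1, 3, 32, 32]) each
```

In the paper's actual setting, the rendering step would be a volume-rendered neural field in an EG3D-style 3D-aware GAN and the conditions would come from learned text and image encoders; the toy linear head here only illustrates the disentangled interface.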
