FashionEngine: Interactive 3D Human Generation and Editing via Multimodal Controls (2404.01655v3)
Abstract: We present FashionEngine, an interactive 3D human generation and editing system that creates 3D digital humans via user-friendly multimodal controls such as natural language, reference images, and hand-drawn sketches. FashionEngine automates 3D human production with three key components: 1) a pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, providing strong priors for diverse generation and editing tasks; 2) a Multimodality-UV Space that encodes the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, faithfully aligning the user's multimodal inputs with the implicit UV latent space for controllable 3D human editing. The Multimodality-UV Space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks; 3) a Multimodality-UV Aligned Sampler that learns to sample high-quality and diverse 3D humans from the diffusion prior. Extensive experiments validate FashionEngine's state-of-the-art performance on conditional generation and editing tasks. In addition, we present an interactive user interface for FashionEngine that supports, in a unified framework, both conditional and unconditional generation as well as editing tasks including pose/view/shape control; text-, image-, and sketch-driven 3D human editing; and 3D virtual try-on. Our project page is at: https://taohuumd.github.io/projects/FashionEngine.
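The abstract describes a pipeline in which inputs from different modalities (text, images, sketches) are encoded into one shared UV-aligned latent space, and a sampler then draws 3D humans from a diffusion prior guided by those embeddings. The toy sketch below illustrates only that high-level idea with stand-in components; every function here (`encode_text`, `encode_sketch`, `guided_sample`), the latent dimensionality, and the "denoising" loop are hypothetical simplifications, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

UV_DIM = 16  # toy dimensionality of the shared UV-aligned latent space


def encode_text(prompt: str) -> np.ndarray:
    """Hypothetical text encoder: hash tokens into the shared UV space."""
    vec = np.zeros(UV_DIM)
    for tok in prompt.lower().split():
        vec[hash(tok) % UV_DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)


def encode_sketch(strokes: np.ndarray) -> np.ndarray:
    """Hypothetical sketch encoder: project stroke coordinates into UV space."""
    vec = np.resize(strokes.flatten(), UV_DIM)
    return vec / (np.linalg.norm(vec) + 1e-8)


def guided_sample(conditions: list, steps: int = 50) -> np.ndarray:
    """Toy stand-in for the Multimodality-UV Aligned Sampler: start from
    noise and drift toward the mean of the condition embeddings, with a
    noise term that decays over the denoising steps."""
    target = np.mean(conditions, axis=0)
    z = rng.standard_normal(UV_DIM)
    for t in range(steps):
        noise_scale = 1.0 - (t + 1) / steps
        z = z + 0.2 * (target - z) + 0.05 * noise_scale * rng.standard_normal(UV_DIM)
    return z


# Joint text- and sketch-conditioned sampling in the shared space.
text_z = encode_text("red dress with long sleeves")
sketch_z = encode_sketch(rng.uniform(0.0, 1.0, size=(8, 2)))
latent = guided_sample([text_z, sketch_z])
print(latent.shape)
```

Because both encoders emit vectors in the same space, adding a new modality only requires a new encoder; the sampler is unchanged, which mirrors the "shared across different user inputs" property the abstract claims.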