
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation (2310.01406v2)

Published 2 Oct 2023 in cs.CV

Abstract: Recent text-to-3D methods employing diffusion models have made significant advancements in 3D human generation. However, these approaches face challenges due to the limitations of text-to-image diffusion models, which lack an understanding of 3D structures. Consequently, these methods struggle to achieve high-quality human generation, resulting in smooth geometry and cartoon-like appearances. In this paper, we propose HumanNorm, a novel approach for high-quality and realistic 3D human generation. The main idea is to enhance the model's 2D perception of 3D geometry by learning a normal-adapted diffusion model and a normal-aligned diffusion model. The normal-adapted diffusion model can generate high-fidelity normal maps corresponding to user prompts with view-dependent and body-aware text. The normal-aligned diffusion model learns to generate color images aligned with the normal maps, thereby transforming physical geometry details into realistic appearance. Leveraging the proposed normal diffusion model, we devise a progressive geometry generation strategy and a multi-step Score Distillation Sampling (SDS) loss to enhance the performance of 3D human generation. Comprehensive experiments substantiate HumanNorm's ability to generate 3D humans with intricate geometry and realistic appearances. HumanNorm outperforms existing text-to-3D methods in both geometry and texture quality. The project page of HumanNorm is https://humannorm.github.io/.
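The multi-step SDS loss mentioned in the abstract builds on the standard Score Distillation Sampling objective introduced by DreamFusion. As a reminder of that baseline (HumanNorm's multi-step variant is detailed in the paper itself), the standard SDS gradient with respect to the 3D representation parameters $\theta$ is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi,\, x = g(\theta)\bigr)
= \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
\right]
```

Here $g(\theta)$ renders an image $x$ from the 3D representation, $x_t$ is its noised version at timestep $t$, $\hat{\epsilon}_\phi$ is the noise predicted by the pretrained diffusion model conditioned on the text prompt $y$, $\epsilon$ is the sampled Gaussian noise, and $w(t)$ is a timestep-dependent weighting. HumanNorm applies this style of supervision with its normal-adapted and normal-aligned diffusion models over multiple denoising steps rather than a single one.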
