
X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation (2405.00954v1)

Published 2 May 2024 in cs.CV

Abstract: Automatic text-guided 3D avatar generation has recently made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/.
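The abstract's three ideas (the progressive Geometry->Texture->Animation schedule, AVP's distributional avatar representation, and ASDS's avatar-aware noise) can be sketched as a toy optimization loop. Everything below is an illustrative reading, not the paper's actual implementation: the renderer, the noise-mixing coefficients, and the stand-in diffusion prediction are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    # Toy "renderer": maps avatar parameters to a flat image vector.
    return np.tanh(params)

def sample_avp(mu, log_sigma):
    """Adaptive Variational Parameter (AVP), as we read it: treat the
    avatar as a distribution rather than a point estimate, drawing a
    sample via the reparameterization trick at each step."""
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)

def asds_grad(params, t=0.5, avatar_feat=None):
    """Toy Avatar-aware Score Distillation Sampling (ASDS) gradient.
    Plain SDS perturbs the render with Gaussian noise; here the noise
    is mixed with an avatar feature (our guess at 'avatar-aware')."""
    x = render(params)
    eps = rng.standard_normal(x.shape)
    if avatar_feat is not None:
        eps = 0.7 * eps + 0.3 * avatar_feat  # avatar-aware noise (hypothetical mix)
    x_noisy = np.sqrt(1.0 - t) * x + np.sqrt(t) * eps
    eps_hat = x_noisy - x                    # stand-in for the diffusion model's prediction
    return eps_hat - eps                     # SDS-style gradient (eps_hat - eps)

# Progressive Geometry -> Texture -> Animation: optimize one stage at a time.
feat = rng.standard_normal(8)
stages = {}
for name in ("geometry", "texture", "animation"):
    mu, log_sigma = np.zeros(8), np.full(8, -2.0)
    for _ in range(100):
        p = sample_avp(mu, log_sigma)
        mu -= 0.05 * asds_grad(p, avatar_feat=feat)
    stages[name] = mu
```

The sequential loop mirrors the paper's claim that optimizing geometry, texture, and animation one at a time simplifies the overall objective relative to joint optimization.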


