X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation (2405.00954v1)
Abstract: Automatic text-guided 3D avatar generation has recently made significant progress. However, existing methods still suffer from oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), which represents avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that injects avatar-aware noise into rendered images to improve generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/.
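The core optimization described above builds on Score Distillation Sampling: noise a rendered image, ask a pretrained diffusion model to predict that noise, and use the prediction error as a gradient on the renderer's parameters. Below is a minimal NumPy sketch of that loop under stated assumptions: `denoise_fn` stands in for a real pretrained denoiser, and `avatar_aware_noise` is only an illustrative placeholder for the paper's actual ASDS noise construction (here, plain Gaussian noise mixed with a normalized rendering-derived cue), not the authors' implementation.

```python
import numpy as np

def sds_grad(render, denoise_fn, t, alphas_cumprod, noise):
    """One Score Distillation Sampling (SDS) step on a rendered image.

    render: (H, W, C) array from a differentiable renderer.
    denoise_fn(x, t): predicts the noise that was added at timestep t.
    alphas_cumprod: per-timestep cumulative noise schedule, in (0, 1).
    noise: the noise actually injected (Gaussian for vanilla SDS,
           avatar-aware for ASDS).
    Returns the gradient signal w.r.t. the rendered image.
    """
    a_t = alphas_cumprod[t]
    noisy = np.sqrt(a_t) * render + np.sqrt(1.0 - a_t) * noise
    eps_pred = denoise_fn(noisy, t)
    w_t = 1.0 - a_t  # one common timestep weighting choice
    return w_t * (eps_pred - noise)

def avatar_aware_noise(render, rng, mix=0.3):
    """Hypothetical stand-in for ASDS noise: blend Gaussian noise with a
    cue derived from the current rendering, then renormalize to unit std."""
    eps = rng.standard_normal(render.shape)
    cue = (render - render.mean()) / (render.std() + 1e-8)
    blended = (1.0 - mix) * eps + mix * cue
    return blended / (blended.std() + 1e-8)
```

In a full pipeline this gradient would be backpropagated through the renderer into the avatar's geometry, texture, or animation parameters at each stage of the progressive framework.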