CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization (2402.17214v3)
Abstract: In digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexity of varied body poses and the problems of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework for efficiently generating 3D characters. CharacterGen introduces a streamlined generation pipeline built around an image-conditioned multi-view diffusion model, which calibrates the input pose to a canonical form while retaining key attributes of the input image, thereby addressing the challenge of diverse input poses. The other core component is a transformer-based, generalizable sparse-view reconstruction model that produces detailed 3D models from the multi-view images. We also adopt a texture back-projection strategy to produce high-quality texture maps. In addition, we curate a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Quantitative and qualitative experiments show that our approach generates 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.
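To make the pipeline described in the abstract concrete, the sketch below outlines its three stages in Python: multi-view pose canonicalization with an image-conditioned diffusion model, transformer-based sparse-view reconstruction, and texture back-projection. This is a minimal sketch under assumed interfaces; every class and function name here (CanonicalPoseDiffusion, SparseViewReconstructor, back_project_texture, character_gen_pipeline) is a hypothetical placeholder for illustration, not the authors' actual code or API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data containers; the real pipeline operates on image tensors and meshes.
@dataclass
class Image:
    pixels: list  # stand-in for an HxWx3 array

@dataclass
class Mesh:
    vertices: list
    faces: list
    texture: Optional[Image] = None


class CanonicalPoseDiffusion:
    """Stage 1 (hypothetical interface): an image-conditioned multi-view diffusion
    model that re-poses the input character into a canonical form and renders it
    from several fixed viewpoints."""

    def canonicalize(self, reference: Image, num_views: int = 4) -> List[Image]:
        # Placeholder: a trained model would jointly denoise num_views images,
        # conditioned on the reference image to preserve identity and appearance.
        return [reference for _ in range(num_views)]


class SparseViewReconstructor:
    """Stage 2 (hypothetical interface): a transformer-based, generalizable
    sparse-view reconstruction model that lifts the canonical multi-view images
    to a coarse 3D mesh."""

    def reconstruct(self, views: List[Image]) -> Mesh:
        # Placeholder: the real model predicts a 3D representation from the
        # sparse views and extracts a mesh from it.
        return Mesh(vertices=[], faces=[])


def back_project_texture(mesh: Mesh, views: List[Image]) -> Mesh:
    """Stage 3 (hypothetical): project the multi-view images back onto the mesh
    surface to assemble a high-quality texture map."""
    mesh.texture = views[0]  # placeholder for per-texel aggregation across views
    return mesh


def character_gen_pipeline(reference: Image) -> Mesh:
    """End-to-end sketch of the three-stage pipeline described in the abstract."""
    views = CanonicalPoseDiffusion().canonicalize(reference)
    mesh = SparseViewReconstructor().reconstruct(views)
    return back_project_texture(mesh, views)


if __name__ == "__main__":
    textured_mesh = character_gen_pipeline(Image(pixels=[]))
    print(type(textured_mesh).__name__)
```

The staging mirrors the abstract's description: canonicalization removes pose ambiguity before reconstruction, so the sparse-view model only ever sees characters in a fixed canonical pose, and texturing reuses the same multi-view images rather than regenerating appearance.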