TEDRA: Text-based Editing of Dynamic and Photoreal Actors (2408.15995v1)
Abstract: In recent years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained, user-friendly editing of clothing styles via textual descriptions. To this end, we present TEDRA, the first method for text-based editing of an avatar that maintains the avatar's high fidelity, space-time coherency, and dynamics while enabling skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Using this personalized diffusion model, we modify the dynamic avatar according to a provided text prompt via our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a timestep annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in both functionality and visual quality.
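The abstract does not spell out PNA-SDS or the annealing schedule, but the general shape of score distillation with an annealed timestep range can be sketched. Below is a hedged, generic illustration: `annealed_t_range`, `sds_grad`, the linear schedule, and the toy `text_score` callback are all hypothetical placeholders standing in for a text-conditioned diffusion model, not the paper's actual implementation.

```python
import numpy as np

def annealed_t_range(step, total_steps, t_min=20, t_max=980, floor=200):
    """Hypothetical annealing schedule: linearly shrink the upper bound
    of the diffusion timestep range as optimization proceeds, so later
    iterations inject less noise and refine rather than restructure."""
    frac = step / max(total_steps - 1, 1)
    hi = int(t_max - frac * (t_max - floor))
    return t_min, hi

def sds_grad(x, text_score, step, total_steps, rng, alphas_bar):
    """Generic SDS-style gradient on a rendered image `x`: add noise at
    a timestep drawn from the annealed range, query a (toy) denoiser for
    the predicted noise, and return the weighted noise residual.
    `text_score(x_t, t)` stands in for a personalized, text-conditioned
    diffusion model; this is a sketch, not PNA-SDS itself."""
    t_lo, t_hi = annealed_t_range(step, total_steps)
    t = int(rng.integers(t_lo, t_hi + 1))     # inclusive upper bound
    a = alphas_bar[t]                          # cumulative alpha at t
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps  # forward diffusion
    eps_hat = text_score(x_t, t)               # model's noise prediction
    w = 1.0 - a                                # common SDS weighting w(t)
    return w * (eps_hat - eps), t
```

In an editing loop, this gradient would be backpropagated through the differentiable avatar renderer into the avatar parameters; the annealing keeps early steps exploratory (large t, strong edits) and late steps conservative (small t, detail preservation).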