
TEDRA: Text-based Editing of Dynamic and Photoreal Actors (2408.15995v1)

Published 28 Aug 2024 in cs.CV

Abstract: Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar's high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.


Summary

  • The paper introduces a two-stage approach that combines a high-fidelity, controllable avatar model with a personalized, fine-tuned diffusion model for detailed text-based editing.
  • It employs Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) and a time step annealing strategy to ensure visual fidelity and temporal coherence.
  • Experimental evaluations demonstrate TEDRA's superior performance on CLIP text-image direction similarity and FID metrics, highlighting its potential in AR/VR and synthetic content creation.

An Insightful Overview of TEDRA: Text-based Editing of Dynamic and Photoreal Actors

In this essay, we delve into the notable work titled "TEDRA: Text-based Editing of Dynamic and Photoreal Actors" authored by researchers from the Max Planck Institute for Informatics and the University of Freiburg. This paper addresses persistent challenges in the domain of 3D avatar manipulation, particularly the task of text-based fine-grained editing of photorealistic dynamic avatars, by proposing a method named TEDRA.

Introduction and Problem Statement

Over recent years, significant progress has been made in generating animated, high-fidelity 3D avatars from video data. Despite these advances, the detailed editing of these avatars using natural language descriptions has remained a formidable challenge. The accurate translation of textual descriptions into avatar modifications, while preserving the original spatio-temporal coherence, visual consistency, and fidelity, calls for sophisticated approaches that combine the strengths of neural rendering and generative models.

Technical Approach

The presented approach, TEDRA, comprises two primary stages. Initially, a controllable, high-fidelity digital replica of the actor is created by leveraging a pre-trained avatar model, in this case TriHuman, known for its ability to comprehensively represent dynamic human geometry and appearance. This model serves as the foundational avatar representation.

Following the establishment of the base avatar, a pretrained generative diffusion model is fine-tuned on a multitude of frames capturing the subject from diverse camera angles. This personalization stage ensures the diffusion prior faithfully mirrors the dynamic and intricate details of the real actor. With the personalized model in place, text-based modifications are performed via Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework.
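
To make the personalization stage concrete, the following minimal sketch shows a generic DreamBooth-style fine-tuning loop for a latent-diffusion denoiser on actor frames. Everything here is a stand-in for illustration: `TinyDenoiser` is a toy stub (the paper fine-tunes a full pretrained text-conditioned model such as Stable Diffusion), and `actor_latents` replaces the VAE-encoded multi-view captures of the real subject.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a pretrained latent-diffusion denoiser; the actual
# model would be a full text-conditioned network (e.g., Stable
# Diffusion's UNet). This stub only illustrates the training loop.
class TinyDenoiser(nn.Module):
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, z_t, t):
        # Inject the normalized timestep as a constant extra channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_t.shape[2:])
        return self.net(torch.cat([z_t, t_map], dim=1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # DDPM noise schedule

model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder for VAE-encoded multi-view frames of the real actor.
actor_latents = torch.randn(16, 4, 32, 32)

for step in range(100):
    z0 = actor_latents[torch.randint(0, len(actor_latents), (4,))]
    t = torch.randint(0, T, (4,))
    eps = torch.randn_like(z0)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps  # forward diffusion q(z_t | z0)
    pred = model(z_t, t.float() / T)                # predict the added noise
    loss = F.mse_loss(pred, eps)                    # standard eps-prediction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```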

To maintain the visual and temporal consistency of the edited avatar, TEDRA introduces a time step annealing strategy during the editing optimization. This strategy, combined with the PNA-SDS method, yields gradual, controlled adaptations of the avatar while honoring the nuanced clothing and movement details specified in the text prompt.
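
As a rough illustration of how score distillation with an annealed timestep window might look, the sketch below (reusing `model` and `alphas_bar` from the previous snippet) computes a generic DreamFusion-style SDS gradient and draws timesteps from a window whose upper bound shrinks over the optimization. The exact PNA-SDS formulation, including normal alignment and model-based guidance, is the paper's contribution and is not reproduced here; `rendered_latent` and the linear schedule are assumptions for illustration.

```python
def sds_grad(denoiser, z0, t, alphas_bar, T=1000):
    """Generic score-distillation gradient (Poole et al., 2022):
    w(t) * (eps_pred - eps), to be injected into the rendered latent z0."""
    eps = torch.randn_like(z0)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps
    with torch.no_grad():
        eps_pred = denoiser(z_t, t.float() / T)
    w = 1.0 - ab                      # a common SDS weighting choice
    return w * (eps_pred - eps)

def annealed_timestep(step, total_steps, t_max=980, t_min=20):
    """Shrink the maximum noise level linearly so late iterations
    refine details instead of restructuring the edit."""
    hi = int(t_max - (t_max - t_min) * step / total_steps)
    return torch.randint(t_min, max(hi, t_min + 1), (1,))

# Stand-in for a differentiably rendered view of the avatar; in the
# real pipeline this latent comes from rendering the editable avatar.
rendered_latent = torch.randn(1, 4, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([rendered_latent], lr=1e-2)

total_steps = 200
for step in range(total_steps):
    t = annealed_timestep(step, total_steps)
    grad = sds_grad(model, rendered_latent, t, alphas_bar)
    optimizer.zero_grad()
    rendered_latent.backward(gradient=grad)  # inject the SDS gradient
    optimizer.step()
```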

Experimental Evaluation

The empirical evaluations in the paper substantiate significant improvements over preceding methods in both functionality and visual quality. Through qualitative and quantitative assessments, including a user study and specific metrics (e.g., CLIP text-image direction similarity and FID scores), the paper demonstrates that TEDRA consistently produces higher-quality edits while preserving the original character identity and dynamic features. The user study, for instance, revealed a decisive preference for TEDRA across multiple fronts, including subject consistency, prompt preservation, and temporal coherence.
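
The paper does not spell out its exact metric implementation, but CLIP text-image direction similarity is commonly computed as the cosine between the image-embedding edit direction and the text-embedding edit direction (as popularized by StyleGAN-NADA and used in Instruct-NeRF2NeRF). A hedged sketch of that common formulation, using the Hugging Face `transformers` CLIP wrappers:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_direction_similarity(img_before, img_after, text_before, text_after):
    """Cosine similarity between the CLIP-space image edit direction
    and the CLIP-space text edit direction."""
    with torch.no_grad():
        im = proc(images=[img_before, img_after], return_tensors="pt")
        tx = proc(text=[text_before, text_after], return_tensors="pt", padding=True)
        img_emb = clip.get_image_features(**im)
        txt_emb = clip.get_text_features(**tx)
    d_img = F.normalize(img_emb[1] - img_emb[0], dim=-1)
    d_txt = F.normalize(txt_emb[1] - txt_emb[0], dim=-1)
    return (d_img * d_txt).sum().item()

# Example usage with (placeholder) frames of the original and edited avatar.
before = Image.new("RGB", (224, 224))
after = Image.new("RGB", (224, 224))
score = clip_direction_similarity(before, after,
                                  "a person in a t-shirt",
                                  "a person in a leather jacket")
print(f"direction similarity: {score:.3f}")
```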

Implications and Future Directions

TEDRA's contributions are multifaceted. Practically, it offers an intuitive interface for professionals in the domains of AR/VR, gaming, and synthetic data generation, who can now effortlessly manipulate avatars using natural language descriptions. This opens up new possibilities for personalized and dynamic content creation, enabling more interactive and realistic virtual experiences.

Theoretically, TEDRA paves the way for further explorations into the seamless integration of large-scale diffusion models with 3D avatar representations. The innovative use of PNA-SDS and the windowed time-step annealing strategy highlight promising directions for improving text-to-3D editing mechanisms. Future research may focus on refining these strategies to handle even more detailed and realistic edits, expanding the scope of applications to cover a broader range of dynamic cloth and skin deformations.

Moreover, improvements could aim at optimizing the computational efficiency to allow for faster training and fine-tuning processes, making the technology more accessible. Potential extensions might also explore training models with monocular data inputs, reducing the reliance on multi-camera setups and thereby democratizing access to high-fidelity 3D avatar editing.

Conclusion

TEDRA represents a significant advancement in the field of 3D avatar manipulation, blending neural rendering with sophisticated generative models to achieve intuitively controlled, highly detailed, and temporally coherent avatar edits. By addressing the core challenges of maintaining spatio-temporal consistency and high fidelity, TEDRA not only enhances current capabilities but also sets the stage for future innovations in AI-driven content creation.