- The paper introduces PortraitGen, a method that transitions portrait video editing from 2D to a dynamic 3D domain using a unified 3D Gaussian field.
- It leverages neural Gaussian textures and multimodal inputs to achieve fast (100+ FPS) and style-consistent editing while addressing 3D and temporal coherence challenges.
- Experimental validations demonstrate that PortraitGen outperforms existing methods in prompt preservation, identity accuracy, and temporal consistency.
Portrait Video Editing Empowered by Multimodal Generative Priors
The paper "Portrait Video Editing Empowered by Multimodal Generative Priors" introduces PortraitGen, a method for high-quality portrait video editing that leverages multimodal prompts. Traditional methods in portrait video editing frequently encounter challenges pertaining to 3D and temporal consistency, often falling short in rendering quality and operational efficiency. This research addresses these challenges by transitioning the editing framework from a 2D to a 3D domain, employing a unified dynamic 3D Gaussian field.
Methodology and Technical Innovations
Central to PortraitGen is a novel Neural Gaussian Texture mechanism that significantly enhances style editing capabilities while achieving rendering speeds above 100 FPS. The approach begins by lifting portrait video frames into 3D with 3D Gaussian Splatting (3DGS). By embedding the 3D Gaussian field on the SMPL-X body model, the system maintains structural and temporal coherence across frames, addressing the core weaknesses of previous methods. The Neural Gaussian Texture, inspired by neural texturing techniques, supports richer style representations by splatting per-Gaussian feature vectors into a 2D feature map and decoding it into RGB with a lightweight 2D neural renderer.
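To make the Neural Gaussian Texture idea concrete, the sketch below shows the general pattern in PyTorch: each Gaussian carries a learnable latent feature instead of an RGB color, the features are splatted into a 2D feature map, and a small CNN decodes that map into an image. The dimensions, network layout, and the `splat_fn` placeholder are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a "Neural Gaussian Texture": per-Gaussian latent features
# decoded to RGB by a 2D neural renderer. Layer sizes are assumptions.
import torch
import torch.nn as nn

class NeuralGaussianTexture(nn.Module):
    def __init__(self, num_gaussians: int, feat_dim: int = 32):
        super().__init__()
        # Per-Gaussian latent features (replacing the usual RGB/SH coefficients).
        self.features = nn.Parameter(torch.randn(num_gaussians, feat_dim) * 0.01)
        # Lightweight 2D neural renderer: splatted feature map -> RGB image.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 1), nn.Sigmoid(),
        )

    def forward(self, splat_fn) -> torch.Tensor:
        # `splat_fn` stands in for a differentiable Gaussian rasterizer that
        # alpha-composites per-Gaussian features into a (1, C, H, W) map.
        feature_map = splat_fn(self.features)
        return self.decoder(feature_map)

if __name__ == "__main__":
    model = NeuralGaussianTexture(num_gaussians=10_000, feat_dim=32)
    # Dummy splat; a real pipeline would use a 3DGS rasterizer here.
    dummy_splat = lambda feats: torch.zeros(1, feats.shape[1], 256, 256)
    rgb = model(dummy_splat)  # -> (1, 3, 256, 256)
    print(rgb.shape)
```

The appeal of this design is that style information lives in the latent features and the decoder, so stylized appearances that are hard to express with plain per-Gaussian colors can still be rendered efficiently.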
The authors further advance the editing process by incorporating multimodal inputs, distilling knowledge from large-scale 2D generative models. These priors enable text-driven editing, image-driven editing, and relighting, positioning PortraitGen as a versatile video-editing tool. An expression similarity guidance mechanism and a face-aware portrait editing module target the degradation typically introduced by iterative dataset updates, keeping personalized facial structure and expression consistent and accurate throughout editing.
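The sketch below illustrates how such a pipeline could combine an iterative dataset update with an expression similarity term: rendered frames are periodically re-edited by the 2D generative model, and each optimization step adds a loss that keeps the rendered expression close to the original frame's. The module interfaces (`render_fn`, `edit_frame`, `expression_encoder`), loss weights, and refresh schedule are assumptions for illustration; they loosely follow the Instruct-NeRF2NeRF-style update pattern rather than the paper's exact recipe.

```python
# Hedged sketch: iterative dataset update + expression similarity guidance.
# All callables and hyperparameters are placeholders, not the paper's values.
import torch
import torch.nn.functional as F

def training_step(render_fn, gaussians_opt, original_frame, edited_target,
                  expression_encoder, lambda_expr: float = 0.1) -> float:
    """One update of the 3D Gaussian portrait against a 2D edited target."""
    rendered = render_fn()                          # current render, (1, 3, H, W)
    photometric = F.l1_loss(rendered, edited_target)

    # Expression similarity guidance: keep the edited render's expression
    # close to the original frame's expression despite the stylistic change.
    with torch.no_grad():
        expr_ref = expression_encoder(original_frame)
    expr_pred = expression_encoder(rendered)
    expr_loss = 1.0 - F.cosine_similarity(expr_pred, expr_ref, dim=-1).mean()

    loss = photometric + lambda_expr * expr_loss
    gaussians_opt.zero_grad()
    loss.backward()
    gaussians_opt.step()
    return loss.item()

def edit_and_fit(frames, render_views, edit_frame, expression_encoder,
                 gaussians_opt, num_steps: int = 3000, refresh_every: int = 500):
    """Iterative dataset update: periodically re-edit current renders so the
    2D editor and the 3D Gaussian field converge on a consistent result."""
    targets = [edit_frame(f) for f in frames]        # initial edited targets
    for step in range(num_steps):
        i = step % len(frames)
        if step > 0 and step % refresh_every == 0:
            with torch.no_grad():
                targets[i] = edit_frame(render_views[i]())
        training_step(render_views[i], gaussians_opt, frames[i], targets[i],
                      expression_encoder)
```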
Experimental Validation and Applications
The efficacy of PortraitGen is extensively demonstrated through experiments highlighting its temporal consistency, editing efficiency, and superior rendering quality. Comparative analyses with methods such as TokenFlow, CoDeF, and AnyV2V reveal PortraitGen's significant advantages in maintaining 3D and temporal consistency while adhering more closely to editing prompts. In user studies, PortraitGen consistently outperforms other methods across metrics such as prompt preservation, identity preservation, temporal consistency, and overall quality, as indicated by participant preferences.
Applications of the proposed methodology span various domains including text-driven portrait editing, image-driven editing, and dynamic lighting adjustments. The capability to perform intricate stylistic edits, such as converting portraits into Lego or pixel art styles, showcases its robustness and flexibility.
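As a minimal illustration of the text-driven path, the snippet below uses an off-the-shelf instruction-following image editor (InstructPix2Pix via the diffusers library) to produce an edited target frame from a prompt; in a PortraitGen-style pipeline such edited frames would then supervise the 3D Gaussian field. The specific model checkpoint, file names, and guidance settings are illustrative assumptions, not the paper's configuration.

```python
# Sketch: prompt-driven editing of a single video frame with InstructPix2Pix.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("frame_0001.png").convert("RGB")   # hypothetical input frame
edited = pipe(
    "turn the person into a Lego figure",
    image=frame,
    num_inference_steps=20,
    image_guidance_scale=1.5,   # how strongly to stay close to the input frame
    guidance_scale=7.5,         # how strongly to follow the text prompt
).images[0]
edited.save("frame_0001_edited.png")
```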
Implications and Future Directions
This research carries substantial theoretical and practical implications. Theoretically, it underscores the value of integrating 3D domain knowledge into video editing frameworks, paving the way for more coherent and efficient editing methodologies. Practically, the significant improvements in rendering speed and quality suggest wide applicability in areas such as film production, virtual reality, and augmented reality.
Looking forward, future work could focus on making the SMPL-X tracking more robust to reduce errors in expression and posture estimation. Moreover, the continuing progress of 2D generative models promises further gains in the editing capabilities of systems like PortraitGen. Expanding the scope of multimodal inputs and improving the fidelity of complex styles should further broaden the potential applications of this approach.
In conclusion, the paper offers a well-articulated and technically rigorous framework for portrait video editing, contributing valuable insights and tools to the computer graphics and AI communities.