- The paper introduces a text-driven pipeline that transforms single-view video into toonified head avatars using a conditional embedding Tri-plane.
- It employs a novel Tri-plane Gaussian deformation field to capture nonlinear facial motions and improve animation accuracy.
- The real-time system achieves 48 FPS on GPUs and 15–18 FPS on mobile devices, enabling practical applications in AR, gaming, and virtual meetings.
TextToon: Real-Time Text Toonify Head Avatar from Single Video
This paper introduces TextToon, a method for generating drivable, toonified head avatars from a single monocular video, with the target style specified by a descriptive text prompt. The approach bridges the gap between text prompts and video-based facial stylization: where prior geometry-based methods typically rely on multi-view capture, TextToon offers a more versatile solution that works from single-view input.
Methodology
TextToon employs a conditional embedding Tri-plane to represent both realistic and stylized facial appearance within a Gaussian deformation field. This design sidesteps the usual requirements of multi-view capture and static texture embeddings in avatar generation. The pipeline runs in real time, reaching 48 FPS on GPU systems and 15-18 FPS on mobile devices. Training proceeds in two phases: the model is first pre-trained on the subject's realistic appearance, then fine-tuned against the outputs of a text-driven image stylization model. This dual-phase schedule yields high-fidelity stylization while preserving real-time capability.
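To make the two-phase schedule concrete, here is a minimal sketch in PyTorch. Everything in it is illustrative: `ToyAvatar`, `run_phase`, and the random tensors are assumptions standing in for the paper's Gaussian-splatting renderer, captured video frames, and stylized text-to-image targets, not the authors' code.

```python
import torch

# Toy stand-in for the avatar renderer: maps an expression/pose code to an
# image-shaped tensor. (The actual model renders deformed 3D Gaussians.)
class ToyAvatar(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 3 * 64 * 64)

    def forward(self, code):
        return self.net(code).view(-1, 3, 64, 64)

def run_phase(model, opt, frames, steps):
    """One training phase: fit rendered output to target frames."""
    for step in range(steps):
        code, target = frames[step % len(frames)]
        loss = torch.nn.functional.l1_loss(model(code), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyAvatar()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy (expression code, target image) pairs. In the paper, phase-2 targets
# come from a text-driven image stylization model; random tensors stand in.
real_frames = [(torch.randn(1, 16), torch.rand(1, 3, 64, 64)) for _ in range(8)]
toon_frames = [(torch.randn(1, 16), torch.rand(1, 3, 64, 64)) for _ in range(8)]

run_phase(model, opt, real_frames, steps=100)  # Phase 1: realistic pre-training
run_phase(model, opt, toon_frames, steps=50)   # Phase 2: stylized fine-tuning
```

The appeal of such a schedule is that the stylization phase starts from geometry and motion already fit in phase 1, so presumably only appearance needs to adapt, which keeps per-style fine-tuning cheap.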
Technical Contributions
The paper's main contributions can be summarized as follows:
- Text-Driven Stylization: The integration of text as a primary driver for avatar appearance offers a user-friendly interaction paradigm that avoids the complexities of data pre-collection and annotation. This simplification is crucial for the broader adoption of stylized avatars in various consumer applications.
- Tri-plane Gaussian Deformation Field: The method introduces a novel architectural component for capturing nonlinear facial motion, improving on the linear deformations of traditional 3DMM approaches and raising animation accuracy; a minimal sketch of the underlying Tri-plane lookup appears after this list.
- Real-Time Processing: Achieving 48 FPS on GPU hardware and 15-18 FPS on mobile, the system demonstrates efficient implementation and optimization, making real-time stylization feasible even on edge devices.
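As a rough illustration of the Tri-plane representation named above (a sketch of the general technique, not the paper's implementation, which additionally conditions the planes on expression and feeds the features into a Gaussian deformation field), the code below looks up a feature vector for each 3D point by projecting it onto three learned orthogonal planes:

```python
import torch
import torch.nn.functional as F

class TriPlane(torch.nn.Module):
    """Generic tri-plane feature field: three orthogonal 2D feature grids."""

    def __init__(self, channels: int = 32, resolution: int = 128):
        super().__init__()
        # Learnable XY, XZ, and YZ feature planes.
        self.planes = torch.nn.Parameter(
            torch.randn(3, channels, resolution, resolution) * 0.01
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points normalized to [-1, 1].
        coords = torch.stack(
            [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]], dim=0
        )  # (3, N, 2): each point projected onto the three planes
        feats = F.grid_sample(
            self.planes, coords.unsqueeze(1), align_corners=True
        )  # (3, C, 1, N): bilinear lookup in each plane
        # Aggregate across planes -> one (N, C) feature per point.
        return feats.squeeze(2).sum(dim=0).permute(1, 0)

points = torch.rand(1024, 3) * 2 - 1   # random query points
features = TriPlane()(points)          # (1024, 32) feature vectors
```

In a setup like TextToon's, features of this kind would then be decoded (e.g., by a small MLP) into per-Gaussian deformations and appearance; that decoder and the conditioning mechanism are omitted here.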
Implications and Future Directions
Practically, TextToon has potential applications in social media augmented reality features, video game character creation, and virtual meeting avatars, among others. Theoretically, the use of Gaussian Splatting with conditional input embeddings opens new avenues for exploring more complex and nuanced stylizations in real-time settings.
Future research could extend this work by addressing the noted difficulty of keeping head and shoulder motion synchronized, potentially through more sophisticated non-rigid deformation models or advanced skinning techniques. Large-scale text-image datasets also offer a clear path to stronger text-to-image stylization, broadening the range of styles the system can achieve.
In conclusion, TextToon stands as an innovative advance in real-time avatar generation, pairing practical applicability with solid technical foundations. Its emphasis on user-directed stylization through text prompts makes avatar personalization more accessible, paving the way for more immersive digital personas.