- The paper introduces a text-driven pipeline that transforms single-view video into toonified head avatars using a conditional embedding Tri-plane.
- It employs a novel Tri-plane Gaussian deformation field to capture nonlinear facial motions and improve animation accuracy.
- The real-time system achieves 48 FPS on GPUs and 15–18 FPS on mobile devices, enabling practical applications in AR, gaming, and virtual meetings.
TextToon: Real-Time Text Toonify Head Avatar from Single Video
This paper introduces TextToon, a method for generating drivable, toonified head avatars from a single monocular video, with the target style specified by a descriptive text prompt. The approach bridges the gap between text prompts and video-based facial stylization: where prior geometry-based methods typically rely on multi-view capture, TextToon offers a more versatile solution that works from single-view input.
Methodology
TextToon employs a conditional embedding Tri-plane to represent both realistic and stylized facial appearance within a Gaussian deformation field. This design sidesteps the usual requirements of multi-view capture and static texture embeddings in avatar generation. The pipeline runs in real time, reaching 48 FPS on GPU systems and 15-18 FPS on mobile devices. Training proceeds in two phases: the model is first pre-trained on the subject's realistic appearance, then fine-tuned against the outputs of a text-driven image stylization model. This dual-phase schedule yields high-fidelity stylization while preserving real-time capability.
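To make the two-phase schedule concrete, here is a minimal sketch in PyTorch. Everything in it is illustrative: `ToyAvatar`, `run_phase`, and the random tensors are assumptions standing in for the paper's Gaussian-splatting renderer, captured video frames, and stylized text-to-image targets, not the authors' code.

```python
import torch

# Toy stand-in for the avatar renderer: maps an expression/pose code to an
# image-shaped tensor. (The actual model renders deformed 3D Gaussians.)
class ToyAvatar(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 3 * 64 * 64)

    def forward(self, code):
        return self.net(code).view(-1, 3, 64, 64)

def run_phase(model, opt, frames, steps):
    """One training phase: fit rendered output to target frames."""
    for step in range(steps):
        code, target = frames[step % len(frames)]
        loss = torch.nn.functional.l1_loss(model(code), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyAvatar()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy (expression code, target image) pairs. In the paper, phase-2 targets
# come from a text-driven image stylization model; random tensors stand in.
real_frames = [(torch.randn(1, 16), torch.rand(1, 3, 64, 64)) for _ in range(8)]
toon_frames = [(torch.randn(1, 16), torch.rand(1, 3, 64, 64)) for _ in range(8)]

run_phase(model, opt, real_frames, steps=100)  # Phase 1: realistic pre-training
run_phase(model, opt, toon_frames, steps=50)   # Phase 2: stylized fine-tuning
```

The appeal of such a schedule is that the stylization phase starts from geometry and motion already fit in phase 1, so presumably only appearance needs to adapt, which keeps per-style fine-tuning cheap.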
Technical Contributions
The paper's main contributions can be summarized as follows:
- Text-Driven Stylization: The integration of text as a primary driver for avatar appearance offers a user-friendly interaction paradigm that avoids the complexities of data pre-collection and annotation. This simplification is crucial for the broader adoption of stylized avatars in various consumer applications.
- Tri-plane Gaussian Deformation Field: The method introduces a novel architectural component for capturing nonlinear facial motion, improving on the linear deformations of traditional 3DMM approaches and raising animation accuracy; a minimal sketch of the underlying Tri-plane lookup appears after this list.
- Real-Time Processing: Achieving 48 FPS on GPU hardware and 15-18 FPS on mobile, the system demonstrates efficient implementation and optimization, making real-time stylization feasible even on edge devices.
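As a rough illustration of the Tri-plane representation named above (a sketch of the general technique, not the paper's implementation, which additionally conditions the planes on expression and feeds the features into a Gaussian deformation field), the code below looks up a feature vector for each 3D point by projecting it onto three learned orthogonal planes:

```python
import torch
import torch.nn.functional as F

class TriPlane(torch.nn.Module):
    """Generic tri-plane feature field: three orthogonal 2D feature grids."""

    def __init__(self, channels: int = 32, resolution: int = 128):
        super().__init__()
        # Learnable XY, XZ, and YZ feature planes.
        self.planes = torch.nn.Parameter(
            torch.randn(3, channels, resolution, resolution) * 0.01
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points normalized to [-1, 1].
        coords = torch.stack(
            [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]], dim=0
        )  # (3, N, 2): each point projected onto the three planes
        feats = F.grid_sample(
            self.planes, coords.unsqueeze(1), align_corners=True
        )  # (3, C, 1, N): bilinear lookup in each plane
        # Aggregate across planes -> one (N, C) feature per point.
        return feats.squeeze(2).sum(dim=0).permute(1, 0)

points = torch.rand(1024, 3) * 2 - 1   # random query points
features = TriPlane()(points)          # (1024, 32) feature vectors
```

In a setup like TextToon's, features of this kind would then be decoded (e.g., by a small MLP) into per-Gaussian deformations and appearance; that decoder and the conditioning mechanism are omitted here.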
Implications and Future Directions
Practically, TextToon has potential applications in social media augmented reality features, video game character creation, and virtual meeting avatars, among others. Theoretically, the use of Gaussian Splatting with conditional input embeddings opens new avenues for exploring more complex and nuanced stylizations in real-time settings.
Future research could extend this work by addressing the noted difficulty of keeping head and shoulder motion synchronized, potentially through more sophisticated non-rigid deformation models or advanced skinning techniques. Large-scale text-image datasets also offer a clear path to stronger text-to-image stylization, broadening the range of styles the system can achieve.
In conclusion, TextToon stands as an innovative advance in real-time avatar generation, pairing practical applicability with solid technical foundations. Its emphasis on user-directed stylization through text prompts makes avatar personalization more accessible, paving the way for more immersive digital personas.