Lipschitz-Driven Noise Robustness in VQ-AE for High-Frequency Texture Repair in ID-Specific Talking Heads (2410.00990v3)
Abstract: Audio-driven IDentity-specific Talking Head Generation (ID-specific THG) has shown increasing promise for applications in filmmaking and virtual reality. Existing approaches are generally built as end-to-end pipelines and have achieved significant progress, yet they often struggle to capture high-frequency textures due to limited model capacity. To address this limitation, and unlike previous studies that focus solely on end-to-end training, we adopt a simple yet efficient post-processing framework guided by our theoretical insights. Specifically, leveraging the \textit{Lipschitz Continuity Theory} of neural networks, we prove a crucial noise tolerance property for the Vector Quantized AutoEncoder (VQ-AE) and establish the existence of a Noise Robustness Upper Bound (NRoUB). This insight reveals that an identity-specific denoiser can be obtained efficiently by training an identity-specific neural discrete representation, without requiring an extra network. Building on this theoretical foundation, we propose a plug-and-play Space-Optimized VQ-AE (SOVQAE) with an enhanced NRoUB to achieve temporally consistent denoising. For practical deployment, we further introduce a cascade pipeline that combines a pretrained Wav2Lip model with SOVQAE to perform ID-specific THG. Our experiments demonstrate that this pipeline achieves \textit{state-of-the-art} video quality and robustness for out-of-distribution lip synchronization, surpassing existing identity-specific THG methods. Moreover, the pipeline requires only a few consumer GPU hours of training and runs in real time, making it efficient and practical for industrial applications.
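To make the cascade idea concrete, below is a minimal sketch (not the authors' implementation) of how a pretrained lip-sync generator's output could be passed through an identity-specific VQ-AE acting as a denoiser: the quantization step snaps encoder features back onto a learned codebook, which is the mechanism the NRoUB analysis concerns. All class and variable names (TinyVQAE, noisy_frame, codebook size, layer shapes) are illustrative assumptions, not the paper's SOVQAE architecture.

```python
# Minimal sketch of the "generator -> ID-specific VQ-AE denoiser" cascade.
# Names and layer sizes are placeholders, not the paper's SOVQAE.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (B, C, H, W) -> flatten spatial positions to (B*H*W, C)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        # Squared L2 distance from each feature vector to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through: gradients pass to the encoder as if quantization were identity
        z_q = z + (z_q - z).detach()
        return z_q, idx


class TinyVQAE(nn.Module):
    """Toy encoder/quantizer/decoder standing in for an identity-specific VQ-AE."""

    def __init__(self, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, code_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(code_dim, code_dim, 4, stride=2, padding=1),
        )
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(code_dim, code_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(code_dim, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z_q, _ = self.quantizer(self.encoder(x))
        return self.decoder(z_q)


if __name__ == "__main__":
    # `noisy_frame` stands in for a frame produced by a pretrained lip-sync generator.
    denoiser = TinyVQAE()
    noisy_frame = torch.rand(1, 3, 96, 96)
    restored = denoiser(noisy_frame)  # quantization projects features onto the ID-specific codebook
    print(restored.shape)  # torch.Size([1, 3, 96, 96])
```

The intuition the sketch captures is that small perturbations in the encoder features (bounded via the Lipschitz constants of the encoder) do not change the selected codebook indices, so the decoded frame is unchanged; the NRoUB in the paper formalizes how large that tolerated noise can be.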