Adaptive Super Resolution For One-Shot Talking-Head Generation (2403.15944v1)
Abstract: One-shot talking-head generation synthesizes a talking-head video from a single source portrait image, driven by a video of the same or a different identity. These methods typically rely on plane-based pixel transformations via Jacobian matrices, or on facial image warping, to generate novel poses. The constraints of a single source image and pixel-displacement-based synthesis often compromise the clarity of the generated frames. Some methods attempt to improve video quality by attaching additional super-resolution modules, but this inevitably increases computational cost and disturbs the original data distribution. In this work, we propose an adaptive high-quality talking-head video generation method that synthesizes high-resolution video without additional pre-trained modules. Specifically, inspired by existing super-resolution methods, we down-sample the one-shot source image and then adaptively reconstruct high-frequency details via an encoder-decoder module, yielding clearer video. Our method consistently improves the quality of generated videos through a straightforward yet effective strategy, as substantiated by quantitative and qualitative evaluations. The code and demo video are available at: \url{https://github.com/Songluchuan/AdaSR-TalkingHead/}.
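The core strategy the abstract describes, down-sampling the one-shot source image and letting an encoder-decoder adaptively restore high-frequency detail, can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the paper's architecture: the module name, layer sizes, and the residual-over-bilinear-upsampling formulation are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqEncoderDecoder(nn.Module):
    """Toy encoder-decoder that maps a down-sampled portrait back to full
    resolution. Layer widths and depths are illustrative, not the paper's."""

    def __init__(self, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels * 2, channels, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1),
        )

    def forward(self, low_res, target_size):
        # Bilinear upsampling supplies the low-frequency base ...
        base = F.interpolate(low_res, size=target_size, mode="bilinear",
                             align_corners=False)
        # ... and the encoder-decoder predicts a high-frequency residual
        # (hypothetical formulation) that is added back onto the base.
        return base + self.decoder(self.encoder(base))

# One-shot source portrait (batch of 1, 3x256x256), down-sampled by 4x.
source = torch.rand(1, 3, 256, 256)
low_res = F.interpolate(source, scale_factor=0.25, mode="bilinear",
                        align_corners=False)
model = HighFreqEncoderDecoder()
restored = model(low_res, target_size=(256, 256))
print(restored.shape)  # torch.Size([1, 3, 256, 256])
```

In this sketch the network only has to learn the missing high-frequency residual rather than the full image, which is a common design choice in super-resolution models; how the actual method conditions this reconstruction on the driving video is detailed in the paper itself.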