S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing (2404.08111v1)
Abstract: Face attribute editing plays a pivotal role in various applications. However, existing methods struggle to produce high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. First, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Second, we propose a semantic-disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Third, we present a structured sparse optimization schema that identifies and deactivates malicious neurons, further disentangling the impact of untargeted attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results confirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.
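The third contribution, deactivating groups of neurons via structured sparsity, can be sketched as follows. This is a minimal illustration under an assumed group-L2-norm importance criterion; the paper's actual procedure for identifying "malicious" neurons, and the `deactivate_groups`, `group_size`, and `keep_ratio` names, are not taken from the source.

```python
import numpy as np

def deactivate_groups(weight, group_size=4, keep_ratio=0.75):
    """Structured-sparsity sketch: zero out entire blocks of neurons
    whose collective L2 norm ranks below the keep threshold.

    weight: (out_features, in_features) matrix; consecutive rows are
    grouped into blocks of `group_size` neurons.
    Returns the pruned weight and a boolean keep-mask per group.
    """
    out_features = weight.shape[0]
    assert out_features % group_size == 0
    groups = weight.reshape(out_features // group_size, group_size, -1)
    # Group importance = L2 norm over each block of neurons.
    norms = np.linalg.norm(groups.reshape(groups.shape[0], -1), axis=1)
    n_keep = int(np.ceil(keep_ratio * len(norms)))
    keep = np.argsort(norms)[::-1][:n_keep]
    mask = np.zeros(len(norms), dtype=bool)
    mask[keep] = True
    # Deactivate (zero) every group not selected for keeping.
    pruned = (groups * mask[:, None, None]).reshape(out_features, -1)
    return pruned, mask
```

Because whole neuron groups are zeroed rather than individual weights, the result is hardware-friendly structured sparsity; in practice such masks are typically learned jointly with the task loss via group-sparse regularization rather than computed post hoc as here.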
- Guangzhi Wang
- Tianyi Chen
- Kamran Ghasedi
- HsiangTao Wu
- Tianyu Ding
- Chris Nuesmeyer
- Ilya Zharkov
- Mohan Kankanhalli
- Luming Liang