CoIE: Chain-of-Instruct Editing for Multi-Attribute Face Manipulation (2312.07879v2)
Abstract: Current text-to-image editing models often struggle to manipulate multiple attributes smoothly with a single instruction. Inspired by Chain-of-Thought prompting in LLMs, we present Chain-of-Instruct Editing (CoIE), which enhances these models by editing step by step through a series of instructions. For face manipulation in particular, we leverage the in-context learning abilities of a pretrained LLM, such as GPT-4, to generate a sequence of instructions from the original input using a purpose-designed 1-shot template. To further improve the precision of each editing step, we fine-tune the editing models on our self-constructed instruction-guided face editing dataset, Instruct-CelebA. We additionally incorporate a super-resolution module to mitigate the degradation of editability and image quality. Experimental results across various challenging cases confirm that chain-of-instruct editing significantly improves multi-attribute facial image manipulation: editing success rates, measured by the CLIPSim and Coverage metrics, improve by 17.86% and 85.45% respectively, and controllability, indicated by the Preserve L1 and Quality metrics, improves by 11.58% and 4.93% respectively.
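The step-by-step idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: `split_instruction` is a naive string split standing in for the LLM that generates the instruction sequence, `toy_edit` stands in for the fine-tuned editing model, an "image" is modeled as a set of attribute strings, and applying the optional `upscale` hook after every step is a guess about where a super-resolution module would sit.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

# Toy stand-in: an "image" is just the set of edits it currently reflects.
Image = FrozenSet[str]


def split_instruction(instruction: str) -> List[str]:
    """Hypothetical stand-in for the LLM step: break a compound
    instruction into single-attribute instructions by splitting on ' and '."""
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def toy_edit(image: Image, instruction: str) -> Image:
    """Hypothetical stand-in for an instruction-guided editing model:
    returns a new 'image' that also reflects the given instruction."""
    return image | {instruction}


def chain_of_instruct_edit(
    image: Image,
    instructions: List[str],
    edit_step: Callable[[Image, str], Image],
    upscale: Optional[Callable[[Image], Image]] = None,
) -> Tuple[Image, List[Image]]:
    """Apply the instructions one at a time, feeding each result into the
    next step. Returns the final image and all intermediate images."""
    current = image
    history = [current]
    for instruction in instructions:
        current = edit_step(current, instruction)
        if upscale is not None:
            # Assumed placement: restore quality before the next edit.
            current = upscale(current)
        history.append(current)
    return current, history


steps = split_instruction("add glasses and make her smile and make the hair blond")
final_image, intermediates = chain_of_instruct_edit(frozenset(), steps, toy_edit)
```

In a real pipeline, `edit_step` would wrap an instruction-guided editing model and `upscale` a super-resolution model. The loop stays the same, and keeping the intermediates makes it easy to see which step an edit failed at.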