
CoIE: Chain-of-Instruct Editing for Multi-Attribute Face Manipulation (2312.07879v2)

Published 13 Dec 2023 in cs.CV and cs.AI

Abstract: Current text-to-image editing models often struggle to manipulate multiple attributes smoothly with a single instruction. Taking inspiration from the Chain-of-Thought prompting technique used in LLMs, we present Chain-of-Instruct Editing (CoIE), which enhances the capabilities of these models through step-by-step editing using a series of instructions. In particular, in the context of face manipulation, we leverage the in-context learning abilities of a pretrained LLM, such as GPT-4, to generate a sequence of instructions from the original input, using a purpose-designed 1-shot template. To further improve the precision of each editing step, we fine-tune the editing models on our self-constructed instruction-guided face editing dataset, Instruct-CelebA. Additionally, we incorporate a super-resolution module to mitigate the degradation in editability and quality. Experimental results across various challenging cases confirm a significant boost in multi-attribute facial image manipulation using chain-of-instruct editing. This is evident in higher editing success rates, measured by the CLIPSim and Coverage metrics, which improve by 17.86% and 85.45% respectively, and in greater controllability, indicated by the Preserve L1 and Quality metrics, which improve by 11.58% and 4.93% respectively.
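The step-by-step editing described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function name `chain_of_instruct_edit` and the three callables (`plan_instructions`, standing in for the LLM with its 1-shot template; `apply_edit`, standing in for the fine-tuned editing model; `super_resolve`, standing in for the super-resolution module) are hypothetical placeholders. Running super-resolution after every step is also an assumption, since the abstract does not say where in the chain the module is applied.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def chain_of_instruct_edit(
    image: Image,
    request: str,
    plan_instructions: Callable[[str], List[str]],  # hypothetical: LLM planner
    apply_edit: Callable[[Image, str], Image],      # hypothetical: editing model
    super_resolve: Callable[[Image], Image],        # hypothetical: SR module
) -> Image:
    """Apply a multi-attribute request as a chain of single-instruction edits."""
    # Break the original request into an ordered list of simpler instructions.
    steps = plan_instructions(request)
    current = image
    for instruction in steps:
        # Each step edits the output of the previous step, not the original.
        current = apply_edit(current, instruction)
        # Assumption: restore quality after every edit to limit degradation.
        current = super_resolve(current)
    return current


# Toy stand-ins so the sketch runs without any model: the "image" is a list
# that records which instructions were applied, in order.
def split_on_and(request: str) -> List[str]:
    return [part.strip() for part in request.split(" and ") if part.strip()]


def record_edit(image: List[str], instruction: str) -> List[str]:
    return image + [instruction]


def identity(image: List[str]) -> List[str]:
    return image


result = chain_of_instruct_edit(
    [], "add glasses and make the hair blond", split_on_and, record_edit, identity
)
print(result)  # ['add glasses', 'make the hair blond']
```

In a real setting the three callables would wrap an LLM call, an instruction-guided image editor, and a super-resolution model; the loop itself only encodes the idea that each instruction is applied to the result of the previous one.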

