Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code (2310.01506v2)

Published 2 Oct 2023 in cs.CV

Abstract: Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process begins by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both the preservation of the source image's essential content and the fidelity of the edit to the target prompt. Prior inversion techniques aimed to find a single solution shared by both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order-of-magnitude speed-up.

Authors (5)
  1. Xuan Ju
  2. Ailing Zeng
  3. Yuxuan Bian
  4. Shaoteng Liu
  5. Qiang Xu