Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model (2306.07596v1)

Published 13 Jun 2023 in cs.CV and cs.AI

Abstract: Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called *Paste, Inpaint and Harmonize via Denoising* (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity in one. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module to guide the pre-trained diffusion model to blend the inserted subject into the scene naturally. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven editability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at https://sites.google.com/view/phd-demo-page.
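At its core, the pasting step described above is a masked composite: the segmented subject replaces the corresponding background pixels, producing an initialization that the inpainting and harmonizing module then refines with the frozen diffusion model. A minimal NumPy sketch of that composite follows; the function name and interface are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def paste_subject(background, exemplar, mask, top_left):
    """Composite a segmented subject from `exemplar` into `background`.

    background: (H, W, 3) float array, the target scene.
    exemplar:   (h, w, 3) float array containing the subject.
    mask:       (h, w) binary array, 1 where the subject is.
    top_left:   (row, col) insertion point in the background.
    Returns the composite image plus a full-size mask marking the
    pasted region, which a downstream inpainting/harmonizing stage
    would use to blend the subject into the scene.
    """
    out = background.copy()
    full_mask = np.zeros(background.shape[:2], dtype=mask.dtype)
    r, c = top_left
    h, w = mask.shape
    m = mask[..., None].astype(bool)  # broadcast mask over channels
    out[r:r + h, c:c + w] = np.where(m, exemplar, out[r:r + h, c:c + w])
    full_mask[r:r + h, c:c + w] = mask
    return out, full_mask
```

In the actual PhD pipeline, the mask would come from an off-the-shelf segmentation model and the composite would seed the denoising process rather than serve as the final output; this sketch only makes the initialization step concrete.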
