
HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models (2312.14091v3)

Published 21 Dec 2023 in cs.CV

Abstract: Recent progress in text-guided image inpainting, based on the unprecedented success of text-to-image diffusion models, has led to exceptionally realistic and visually plausible results. However, there is still significant potential for improvement in current text-to-image inpainting models, particularly in better aligning the inpainted area with user prompts and performing high-resolution inpainting. Therefore, we introduce HD-Painter, a training-free approach that accurately follows prompts and coherently scales to high-resolution image inpainting. To this end, we design the Prompt-Aware Introverted Attention (PAIntA) layer, enhancing self-attention scores with prompt information and resulting in better text-aligned generations. To further improve prompt coherence, we introduce the Reweighting Attention Score Guidance (RASG) mechanism, seamlessly integrating a post-hoc sampling strategy into the general form of DDIM to prevent out-of-distribution latent shifts. Moreover, HD-Painter allows extension to larger scales by introducing a specialized super-resolution technique customized for inpainting, enabling the completion of missing regions in images of up to 2K resolution. Our experiments demonstrate that HD-Painter surpasses existing state-of-the-art approaches quantitatively and qualitatively across multiple metrics and a user study. Code is publicly available at: https://github.com/Picsart-AI-Research/HD-Painter


Summary

  • The paper introduces HD-Painter, a training-free, high-resolution inpainting approach whose novel PAIntA and RASG mechanisms enhance text-image alignment, raising prompt-alignment accuracy from 51.9% to 61.4%.
  • The PAIntA layer dynamically adjusts attention scores based on the relevance of user prompts, ensuring that inpainted regions coherently reflect the provided textual guidance.
  • The model scales to 2K resolution using a specialized super-resolution module, seamlessly integrating inpainted content with surrounding image details.

Introducing HD-Painter

High-Resolution Image Inpainting

The process of image inpainting involves filling missing regions of an image in a consistent and visually plausible way. A key challenge for current inpainting models is ensuring that the filled content aligns well with user prompts, particularly in high-resolution images. The paper introduces HD-Painter, a training-free approach that improves the quality of text-guided image inpainting at resolutions as high as 2K while following the user's text prompt more faithfully.
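For context, standard text-guided inpainting is typically invoked as below. This is a minimal illustration using the Hugging Face diffusers library with a stock Stable Diffusion inpainting model, not HD-Painter itself; the file paths and prompt are hypothetical.

```python
# Baseline text-guided inpainting (illustrative context, not HD-Painter).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("scene.png").convert("RGB")  # hypothetical input image
mask = Image.open("mask.png").convert("L")      # white pixels = region to fill

result = pipe(prompt="a red vintage car", image=image, mask_image=mask).images[0]
result.save("inpainted.png")
```

HD-Painter's PAIntA and RASG components are designed to plug into this kind of diffusion pipeline without any retraining.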

Enhanced Attention with PAIntA

The proposed HD-Painter utilizes a Prompt-Aware Introverted Attention (PAIntA) layer. This innovation augments standard self-attention mechanisms by considering the given textual prompt. It increases or reduces the impact of attention scores based on their relevance to the prompt. This attention modulation improves the coherence between the inpainted area and the textual instructions provided by the user. By focusing on prompt-related aspects of the image, PAIntA reduces the undue impact of the background or adjacent objects that may otherwise overshadow the user’s input.
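The sketch below conveys the idea, not the paper's exact formulation: self-attention logits toward known-region keys are down-weighted when those locations are irrelevant to the prompt. The `prompt_relevance` input (e.g., normalized cross-attention between each known-region token and the prompt tokens) and the masking convention are assumptions here.

```python
# A sketch of prompt-aware self-attention rescaling in the spirit of PAIntA.
# Assumptions: known_mask marks known (unmasked) pixels, and prompt_relevance
# holds scores in [0, 1] from cross-attention between image and prompt tokens.
import torch

def painta_self_attention(q, k, v, known_mask, prompt_relevance, scale):
    # q, k, v: (batch, tokens, dim); known_mask, prompt_relevance: (tokens,)
    attn = (q @ k.transpose(-1, -2)) * scale  # raw attention logits

    # Down-weight keys in the known region that are irrelevant to the prompt,
    # so background content cannot dominate generation inside the hole.
    key_weight = torch.where(known_mask.bool(),
                             prompt_relevance,
                             torch.ones_like(prompt_relevance))
    attn = attn + torch.log(key_weight.clamp_min(1e-6))  # multiplicative after softmax

    return attn.softmax(dim=-1) @ v
```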

More Focused Guidance with RASG

To enhance text alignment even further, the paper introduces a Reweighting Attention Score Guidance (RASG) mechanism. This post-hoc method integrates into the diffusion sampling process, aligning the generation more tightly with the text prompt while preserving image quality. RASG uses gradient-based guidance, reweighted so that it respects the original latent-space distribution. By preventing out-of-distribution shifts during sampling, it ensures that the inpainted regions not only match the text prompt but also remain within the distribution the underlying model was trained to reproduce.
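A minimal sketch of one such guided DDIM step follows, assuming a scalar `guidance_loss` (the paper uses an attention-based objective) and scalar tensors for the noise-schedule values; it is illustrative rather than the repository's implementation. The key step is standardizing the guidance gradient so it can take the place of the unit-variance noise term in the general DDIM update.

```python
# RASG-style DDIM step (illustrative sketch, not the official implementation).
import torch

def rasg_ddim_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t, guidance_loss):
    # x_t: current latent with requires_grad=True; eps_pred: predicted noise;
    # alpha_t, alpha_prev, sigma_t: scalar tensors from the noise schedule.
    grad = torch.autograd.grad(guidance_loss(x_t), x_t)[0]

    # Reweight the gradient to zero mean and unit variance so it mimics the
    # stochastic noise DDIM expects, avoiding out-of-distribution latents.
    grad = (grad - grad.mean()) / grad.std().clamp_min(1e-8)

    # Standard DDIM decomposition: predicted clean latent, then the update.
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps_pred) / alpha_t.sqrt()
    return (alpha_prev.sqrt() * x0_pred
            + (1 - alpha_prev - sigma_t**2).clamp_min(0.0).sqrt() * eps_pred
            + sigma_t * grad.detach())
```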

Scaling to Higher Resolutions

HD-Painter also features a specialized super-resolution technique crafted specifically for inpainting. This component is crucial for high-resolution image completion because it leverages the detailed information already present in the known regions. The lower-resolution inpainted image serves as conditional input to a diffusion-based upscaling process, allowing smooth transitions between generated and original content at resolutions up to 2048 × 2048.
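One way to realize this, sketched below with placeholder helpers (`denoise_step`, `add_noise` are assumptions, not the paper's API), is to re-inject the appropriately noised high-resolution known region at every step of the upscaling diffusion, so only the hole is actually synthesized at high resolution.

```python
# Sketch of blended upscaling for inpainting (placeholder helper functions).
def upscale_inpaint_step(latent, t, known_latent_hr, mask_hr, denoise_step, add_noise):
    # latent: current high-res latent; known_latent_hr: encoded high-res
    # original; mask_hr: 1 inside the hole, 0 over known pixels.
    latent = denoise_step(latent, t)  # one reverse step, conditioned on the
                                      # low-resolution inpainted result
    noised_known = add_noise(known_latent_hr, t)  # forward process q(x_t | x_0)

    # Keep generated content in the hole, (noised) original content elsewhere,
    # so the boundary between the two stays consistent at every timestep.
    return mask_hr * latent + (1 - mask_hr) * noised_known
```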

Performance and Contributions

The experiments show HD-Painter outperforming existing methods in both qualitative and quantitative evaluations, improving prompt-alignment accuracy to 61.4% from the 51.9% of the best competing method. This capability stems from combining PAIntA and RASG, both of which are plug-and-play and can enhance any diffusion-based inpainting model. The code is publicly available, enabling further research and development in this space.
