Shape-Guided Diffusion with Inside-Outside Attention
Abstract: We introduce the precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion. Our training-free method uses an Inside-Outside Attention mechanism during the inversion and generation process to apply a shape constraint to the cross- and self-attention maps. The mechanism designates which spatial region is the object (inside) and which is the background (outside), then associates edits with the correct region. We demonstrate the efficacy of our method on the shape-guided editing task, where the model must replace an object according to a text prompt and an object mask. We curate a new ShapePrompts benchmark derived from MS-COCO and achieve state-of-the-art results in shape faithfulness without degrading text alignment or image realism, according to both automatic metrics and annotator ratings. Our data and code will be made available at https://shape-guided-diffusion.github.io.
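The shape constraint described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the constraint amounts to zeroing attention weights that cross the object boundary, and it does not renormalize the remaining weights. The function names, the list-of-lists layout, and the `inside_tokens` argument are all hypothetical and chosen only for illustration.

```python
def inside_outside_cross_attention(attn, inside_mask, inside_tokens):
    """Constrain a cross-attention map with an object mask (illustrative only).

    attn[p][t]     : attention weight from pixel p to text token t
    inside_mask[p] : 1 if pixel p lies inside the object silhouette, else 0
    inside_tokens  : set of token indices assumed to describe the object
    """
    constrained = []
    for p, row in enumerate(attn):
        pixel_is_inside = inside_mask[p] == 1
        new_row = []
        for t, weight in enumerate(row):
            token_is_inside = t in inside_tokens
            # Keep a weight only when token and pixel belong to the same region.
            new_row.append(weight if token_is_inside == pixel_is_inside else 0.0)
        constrained.append(new_row)
    return constrained


def inside_outside_self_attention(attn, inside_mask):
    """Constrain a self-attention map so pixels attend only within their region.

    attn[q][k] : attention weight from query pixel q to key pixel k
    """
    return [
        [w if inside_mask[q] == inside_mask[k] else 0.0 for k, w in enumerate(row)]
        for q, row in enumerate(attn)
    ]


# Toy example: 4 pixels, 3 tokens. Pixels 1 and 2 form the object; token 1 names it.
mask = [0, 1, 1, 0]
cross = [[0.2, 0.5, 0.3] for _ in range(4)]
cross_result = inside_outside_cross_attention(cross, mask, inside_tokens={1})

self_attn = [[0.25] * 4 for _ in range(4)]
self_result = inside_outside_self_attention(self_attn, mask)
```

In this toy setup, the object token keeps its weight only at the two object pixels, while the background tokens keep theirs only at the remaining pixels. A real pipeline would apply such a constraint to the attention tensors inside the diffusion model's layers at each inversion and generation step.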