Salient Object-Aware Background Generation using Text-Guided Diffusion Models (2404.10157v1)
Abstract: Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
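To make the "outpainting by mask inversion" idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a binary salient-object mask stored as a NumPy array and the common inpainting convention that `True` marks pixels the model should generate. The names `outpainting_mask` and `expansion_proxy` are hypothetical. The second function is only a simple area-based illustration of what "object expansion" means; the abstract does not define the proposed metric, so it should not be read as that metric.

```python
import numpy as np


def outpainting_mask(object_mask: np.ndarray) -> np.ndarray:
    """Invert a salient-object mask so the background becomes the fill region.

    Assumption: True = object pixel in the input, True = pixel to generate
    in the output (the usual inpainting-mask convention).
    """
    return ~object_mask.astype(bool)


def expansion_proxy(original_mask: np.ndarray, generated_mask: np.ndarray) -> float:
    """Illustrative proxy for object expansion (not the paper's metric).

    Counts object pixels in the generated image that lie outside the
    original object, relative to the original object's area.
    """
    original = original_mask.astype(bool)
    generated = generated_mask.astype(bool)
    area = int(original.sum())
    if area == 0:
        raise ValueError("original_mask contains no object pixels")
    extra = int(np.logical_and(generated, ~original).sum())
    return extra / area


# Toy 4x4 example: a 2x2 object in the centre of the canvas.
original = np.zeros((4, 4), dtype=bool)
original[1:3, 1:3] = True

# Region an inpainting model would be asked to fill: everything but the object.
fill_region = outpainting_mask(original)

# Hypothetical object mask re-estimated on a generated image, in which the
# object has grown by one column to the right.
grown = original.copy()
grown[1:3, 3] = True

ratio = expansion_proxy(original, grown)
```

In this toy case the object covers 4 pixels and gains 2 pixels outside its original boundary, so `ratio` is 0.5. In practice, the generated-image mask would have to come from a salient object segmentation model, which is an assumption of this sketch.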
Authors:
- Amir Erfan Eshratifar
- Kapil Thadani
- Shaunak Mishra
- Mikhail Kuznetsov
- Yueh-Ning Ku
- Paloma de Juan
- Joao V. B. Soares