Zero-Shot Image Harmonization with Generative Model Prior (2307.08182v2)
Abstract: We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with a user study validating the effectiveness of our approach.
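The three stages described in the abstract (describe, harmonize, evaluate) can be read as a simple control loop. The following is a minimal sketch of that loop only, not the authors' implementation: the function name `harmonize`, the callables `describe`, `edit`, and `evaluate`, their signatures, the `max_iters` cap, and the choice to re-edit the latest result at each step are all assumptions made for illustration. In practice `describe` would wrap a VLM, `edit` a T2I model with structure-preserving guidance, and `evaluate` the evaluator.

```python
from typing import Any, Callable, Optional, Tuple

Image = Any  # placeholder type; a real pipeline would use an array or tensor


def harmonize(
    composite: Image,
    mask: Any,
    describe: Callable[[Image, Any], str],
    edit: Callable[[Image, Any, str], Image],
    evaluate: Callable[[Image, Any, str], Tuple[bool, Optional[str]]],
    max_iters: int = 5,
) -> Tuple[Image, int]:
    """Run a describe -> edit -> evaluate loop (hypothetical interface).

    describe(image, mask)            -> text giving the harmonization direction
    edit(image, mask, direction)     -> image with the foreground adjusted
    evaluate(image, mask, direction) -> (done, revised_direction or None)

    Returns the final image and the number of edit iterations performed.
    """
    # Stage 1: obtain a textual harmonization direction for the composite.
    direction = describe(composite, mask)
    current = composite

    for step in range(max_iters):
        # Stage 2: apply one harmonization iteration guided by the text.
        current = edit(current, mask, direction)

        # Stage 3: the evaluator either concludes or revises the direction.
        done, revised = evaluate(current, mask, direction)
        if done:
            return current, step + 1
        if revised is not None:
            direction = revised

    # Assumed fallback: stop after max_iters even without a "done" verdict.
    return current, max_iters
```

Passing the three stages in as callables keeps the sketch in line with the modular framing of the abstract: any stage can be swapped without touching the loop itself.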