Training-and-Prompt-Free General Painterly Harmonization via Zero-Shot Disentanglement on Style and Content References (2404.12900v2)
Abstract: Painterly image harmonization aims at seamlessly blending disparate visual elements within a single image. However, previous approaches often struggle due to limitations in training data or reliance on additional prompts, leading to inharmonious and content-disrupted output. To surmount these hurdles, we design a Training-and-Prompt-Free General Painterly Harmonization method (TF-GPH). TF-GPH incorporates a novel "Similarity Disentangle Mask", which disentangles the foreground content and background image by redirecting their attention to corresponding reference images, enhancing the attention mechanism for multi-image inputs. Additionally, we propose a "Similarity Reweighting" mechanism to balance harmonization between stylization and content preservation. This mechanism minimizes content disruption by prioritizing the content-similar features within the given background style reference. Finally, we address the deficiencies in existing benchmarks by proposing novel range-based evaluation metrics and a new benchmark that better reflects real-world applications. Extensive experiments demonstrate the efficacy of our method on all benchmarks. More details are available at https://github.com/BlueDyee/TF-GPH.
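The abstract describes two attention-level mechanisms: a mask that redirects foreground and background queries to their corresponding reference images, and a reweighting that favors content-similar features in the style reference. The sketch below is a minimal illustration of how such a scheme could be wired into a single attention call; it is an interpretation of the abstract, not the paper's implementation. The function name `disentangled_attention`, the tensor shapes, the cosine-similarity reweighting, and the temperature `tau` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(q, k, v, fg_mask, ref_id, tau=1.0):
    """Illustrative sketch (not the paper's code).

    Composite-image queries attend only to the keys/values of their
    *corresponding* reference (content reference for foreground tokens,
    style/background reference for background tokens), and the attention
    logits are reweighted by a similarity term so content-similar features
    in the style reference dominate the stylization.

    q:        (Nq, d)  queries from the composite image
    k, v:     (Nk, d)  keys/values concatenated from [content ref | style ref]
    fg_mask:  (Nq,)    True where a query token belongs to the pasted foreground
    ref_id:   (Nk,)    0 for content-reference tokens, 1 for style-reference tokens
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (Nq, Nk) raw attention logits

    # "Similarity Disentangle Mask" (sketch): foreground queries may only see
    # the content reference, background queries only the style reference.
    allowed = np.where(fg_mask[:, None], ref_id[None, :] == 0, ref_id[None, :] == 1)
    scores = np.where(allowed, scores, -np.inf)

    # "Similarity Reweighting" (sketch): bias attention toward keys whose
    # features are most similar to the query, here via cosine similarity.
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-8)
    weights = softmax(scores + tau * (qn @ kn.T), axis=-1)

    return weights @ v  # (Nq, d) harmonized features

# Toy usage with random features, just to exercise the routine:
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(6, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
fg_mask = np.array([True, True, False, False, False, False])
ref_id = np.array([0] * 4 + [1] * 6)
print(disentangled_attention(q, k, v, fg_mask, ref_id).shape)  # (6, 8)
```

In this reading, the mask is applied to the logits before the softmax so that each query's attention renormalizes only over the tokens of its permitted reference, which is what makes the foreground and background streams disentangled rather than merely downweighted.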