Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion
Abstract: The recent success of text-to-image diffusion models has also revolutionized semantic image editing, enabling images to be manipulated according to query/target texts. Despite these advances, a significant challenge remains: pre-trained models can introduce contextual prior bias during editing, e.g., making unexpected modifications to inappropriate regions. To address this issue, we present Dual-Cycle Diffusion, a novel approach that generates an unbiased mask to guide image editing. The proposed model incorporates a Bias Elimination Cycle consisting of a forward path and an inverted path, each featuring a Structural Consistency Cycle to preserve image content during editing. The forward path uses the pre-trained model to produce the edited image, while the inverted path converts the result back to the source image. The unbiased mask is generated by comparing the differences between the processed source image and the edited image, ensuring that both conform to the same distribution. Our experiments demonstrate the effectiveness of the proposed method, which improves the D-CLIP score from 0.272 to 0.283. The code will be available at https://github.com/JohnDreamer/DualCycleDiffsion.
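The sketch below illustrates the masked-editing idea summarized in the abstract: run a forward edit, run an inverted edit back toward the source description so that both images have passed through the same pre-trained model, and derive the mask from their difference. It is a minimal illustration only, not the authors' implementation; the editing function `edit_fn` and the threshold value are hypothetical placeholders for a pre-trained text-guided diffusion editor.

```python
# Minimal sketch of the unbiased-mask idea (assumption: edit_fn wraps a
# pre-trained text-guided diffusion editor; it is NOT the authors' code).
import torch

def dual_cycle_mask(source, query_text, target_text, edit_fn, threshold=0.1):
    """Estimate an editing mask by comparing two images drawn from the
    same model distribution: the edited image (forward path) and the
    source re-processed through the inverted path."""
    # Forward path: edit the source image toward the target text.
    edited = edit_fn(source, src_text=query_text, tgt_text=target_text)
    # Inverted path: map the edited result back toward the query text,
    # so both images to be compared have passed through the same model.
    processed_source = edit_fn(edited, src_text=target_text, tgt_text=query_text)
    # Per-pixel difference between the two comparable images.
    diff = (edited - processed_source).abs().mean(dim=1, keepdim=True)
    # Normalize to [0, 1] and threshold to obtain a binary editing mask.
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff > threshold).float()

def masked_edit(source, edited, mask):
    # Keep source content outside the mask; adopt the edit inside it.
    return mask * edited + (1.0 - mask) * source
```

Under this reading, the mask confines changes to the regions the text edit actually affects, which is how unexpected modifications to unrelated regions are suppressed.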