Training-free Content Injection using h-space in Diffusion Models (2303.15403v2)

Published 27 Mar 2023 in cs.CV

Abstract: Diffusion models (DMs) synthesize high-quality images in various domains. However, controlling their generative process is still hazy because the intermediate variables in the process are not rigorously studied. Recently, the bottleneck feature of the U-Net, namely $h$-space, is found to convey the semantics of the resulting image. It enables StyleCLIP-like latent editing within DMs. In this paper, we explore further usage of $h$-space beyond attribute editing, and introduce a method to inject the content of one image into another image by combining their features in the generative processes. Briefly, given the original generative process of the other image, 1) we gradually blend the bottleneck feature of the content with proper normalization, and 2) we calibrate the skip connections to match the injected content. Unlike custom-diffusion approaches, our method does not require time-consuming optimization or fine-tuning. Instead, our method manipulates intermediate features within a feed-forward generative process. Furthermore, our method does not require supervision from external networks. The code is available at https://curryjung.github.io/InjectFusion/
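Step 1 of the abstract (blending the content's bottleneck feature "with proper normalization") can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the blending rule, so the code assumes a spherical interpolation between the two $h$-space features, rescaled to the norm of the original feature. The function name `blend_h`, the ratio `gamma`, and the array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def blend_h(h_orig, h_content, gamma):
    """Blend two bottleneck features by spherical interpolation.

    Assumption (not stated in the abstract): the result is rescaled to
    the norm of h_orig, so the blended feature keeps the magnitude the
    rest of the network saw for the original image.
    """
    a, b = h_orig.ravel(), h_content.ravel()
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    # Angle between the two flattened features.
    cos_theta = np.clip(np.dot(a, b) / (norm_a * norm_b), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)
    if sin_theta < 1e-6:
        # Features are (anti)parallel; interpolation is ill-defined,
        # so fall back to the original feature unchanged.
        return h_orig.copy()
    w_a = np.sin((1.0 - gamma) * theta) / sin_theta
    w_b = np.sin(gamma * theta) / sin_theta
    # Interpolating unit vectors this way yields a unit vector.
    direction = w_a * (a / norm_a) + w_b * (b / norm_b)
    return (direction * norm_a).reshape(h_orig.shape)

# Toy features standing in for U-Net bottleneck activations.
rng = np.random.default_rng(0)
h_orig = rng.standard_normal((8, 4, 4))
h_content = 3.0 * rng.standard_normal((8, 4, 4))

# "Gradual" blending: a hypothetical schedule that raises the
# content ratio over successive denoising steps.
for gamma in np.linspace(0.0, 0.6, 4):
    h_mix = blend_h(h_orig, h_content, gamma)
```

In this sketch, `gamma = 0` returns the original feature and `gamma = 1` returns the content feature's direction at the original feature's norm. The skip-connection calibration in step 2 is not covered here.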
