Diffusion Self-Guidance for Controllable Image Generation (2306.00986v3)
Abstract: Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page at https://dave.ml/selfguidance/
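The steering mechanism described above can be pictured as a classifier-guidance-style update in which the "classifier" is replaced by a differentiable property computed from the model's own internal representations. Below is a minimal PyTorch-style sketch under stated assumptions: the denoiser `eps_model`, its `return_attention` flag, and `centroid_guidance` are all hypothetical names for illustration, not the authors' API; the attention maps are assumed to have shape (H, W, tokens).

```python
# Minimal sketch of one self-guidance sampling step (hypothetical names).
import torch

def self_guided_step(eps_model, x_t, t, prompt_emb, guidance_fn,
                     scale=1.0, sigma_t=1.0):
    """Shift the noise prediction along the gradient of a scalar 'energy'
    computed from the model's internal representations, in the spirit of
    classifier guidance (no extra model or training required)."""
    x_t = x_t.detach().requires_grad_(True)
    # Assumed interface: the denoiser also returns its intermediate
    # cross-attention maps.
    eps, attn_maps = eps_model(x_t, t, prompt_emb, return_attention=True)
    energy = guidance_fn(attn_maps)           # scalar property to minimize
    grad = torch.autograd.grad(energy, x_t)[0]
    return eps + scale * sigma_t * grad       # guided noise prediction

def centroid_guidance(attn_maps, token_idx=5, target=(0.25, 0.75)):
    """Example property: pull one token's attention centroid toward a
    target (y, x) position in normalized coordinates, i.e. move an object."""
    a = attn_maps[-1][..., token_idx]          # assumed (H, W) map per token
    a = a / a.sum()
    h, w = a.shape[-2:]
    ys = torch.linspace(0, 1, h, device=a.device)
    xs = torch.linspace(0, 1, w, device=a.device)
    cy = (a.sum(dim=-1) * ys).sum()            # attention-weighted row centroid
    cx = (a.sum(dim=-2) * xs).sum()            # attention-weighted column centroid
    ty, tx = target
    return (cy - ty) ** 2 + (cx - tx) ** 2
```

Composing several such energy terms (e.g. one for position, one for appearance) is what enables the compound edits described in the abstract; the sketch shows only the single-term case.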
Authors: Dave Epstein, Allan Jabri, Ben Poole, Aleksander Holynski, Alexei A. Efros