Contextualized Diffusion Models for Text-Guided Image and Video Generation (2402.16627v3)
Abstract: Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between the forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) that incorporates the cross-modal context, encompassing interactions and alignments between the text condition and the visual sample, into both the forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model on two challenging tasks: text-to-image generation and text-to-video editing. In each task, ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between the text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at https://github.com/YangLing0818/ContextDiff
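As a rough illustration of what placing context in the forward process could look like, the sketch below adds a context-dependent shift to the mean of a standard DDPM noising step. The abstract does not give the exact formulation, so the additive form, the weight `k_t`, the linear beta schedule, and the names `make_alphas_bar`, `forward_sample`, and `context_shift` are all assumptions made for illustration. In a real model the shift would presumably come from a learned network that sees the text and the image; here it is just an array passed in by the caller.

```python
import numpy as np

def make_alphas_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (illustrative choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_sample(x0, t, alphas_bar, rng, context_shift=None):
    """Draw x_t given x_0 at timestep index t.

    Without context_shift this is the usual DDPM forward marginal:
        x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    With context_shift, the mean is moved by k_t * context_shift.
    Both the additive form and the weight k_t are assumptions of this sketch.
    """
    a = alphas_bar[t]
    mean = np.sqrt(a) * x0
    if context_shift is not None:
        # Hypothetical weight: goes to zero as a_bar_t approaches 1 (clean data)
        # and as a_bar_t approaches 0 (pure noise).
        k_t = np.sqrt(a) * (1.0 - np.sqrt(a))
        mean = mean + k_t * context_shift
    noise = rng.standard_normal(x0.shape)
    return mean + np.sqrt(1.0 - a) * noise
```

Calling `forward_sample` twice with identically seeded generators, once with and once without `context_shift`, yields samples that differ by exactly `k_t * context_shift`, which makes the effect of the context term on the trajectory easy to inspect.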
Authors: Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui