
In-Context Learning Unlocked for Diffusion Models (2305.01115v2)

Published 1 May 2023 in cs.CV

Abstract: We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.

Overview of "In-Context Learning Unlocked for Diffusion Models"

The paper "In-Context Learning Unlocked for Diffusion Models" introduces a novel framework called Prompt Diffusion that extends in-context learning capabilities to diffusion-based generative models in computer vision. The research addresses the challenge of integrating vision-language tasks using a unified model capable of performing multiple tasks through a prompt-based approach.

Key Contributions and Methodology

This research presents several meaningful contributions:

  • Vision-Language Prompt Design: The authors propose a vision-language prompt structure that encompasses text guidance, example image pairs, and a query image. This facilitates the model's ability to interpret and perform a variety of vision-language tasks.
  • Model Architecture: The Prompt Diffusion framework builds on Stable Diffusion and ControlNet. It processes the image components of the vision-language prompt through convolutional encoder layers and encodes the text guidance with CLIP, integrating both within a diffusion model for image generation (a minimal sketch of this prompt encoding follows this list).
  • Joint Training Across Tasks: The model is jointly trained on six distinct tasks—three inverse tasks involving image generation from conditions (like depth maps) and three forward tasks that generate conditions from images. This multi-task training approach is crucial for endowing the model with the flexibility and adaptability inherent in in-context learning.
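
The paper does not prescribe a single reference implementation for the prompt-encoding step, but the idea can be illustrated with a minimal PyTorch sketch: three lightweight convolutional stems, one each for the example source image, the example target image, and the query image, produce feature maps that are fused into a single conditioning signal for the denoising network. The layer sizes, module names, and the simple additive fusion below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class VisionPromptEncoder(nn.Module):
    """Illustrative encoder for the visual part of a vision-language prompt:
    an example (source, target) image pair plus a new query image. All layer
    sizes are placeholders, not the paper's exact configuration."""

    def __init__(self, in_channels: int = 3, hidden: int = 64, out_channels: int = 320):
        super().__init__()

        # One lightweight conv stem per prompt image, mirroring ControlNet-style
        # condition encoders that downsample into the denoiser's feature space.
        def stem():
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(hidden, out_channels, kernel_size=3, stride=2, padding=1),
            )

        self.example_src = stem()
        self.example_tgt = stem()
        self.query = stem()

    def forward(self, src, tgt, query):
        # Fuse the three image embeddings into a single conditioning map that,
        # together with the CLIP-encoded text guidance, conditions the diffusion model.
        return self.example_src(src) + self.example_tgt(tgt) + self.query(query)

# Example: a batch of 512x512 RGB prompt images yields 128x128 conditioning features.
enc = VisionPromptEncoder()
cond = enc(torch.randn(1, 3, 512, 512),
           torch.randn(1, 3, 512, 512),
           torch.randn(1, 3, 512, 512))
print(cond.shape)  # torch.Size([1, 320, 128, 128])
```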

Empirical Evaluation

The paper provides a thorough evaluation of Prompt Diffusion, highlighting several important findings:

  • Performance: The paper demonstrates that Prompt Diffusion performs comparably to independently trained models (such as ControlNet) on specific tasks, while also enabling generalization across a diverse range of tasks.
  • Generalization: The model generalizes promisingly to unseen tasks, such as generating images from scribbles or canny-edge maps, attesting to its robust in-context learning ability (a sketch of building such an example pair follows this list).
  • Image Editing: Beyond task-specific generation, Prompt Diffusion also supports text-guided image editing, enabling nuanced edits without extensive additional training.
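
As a concrete illustration of how an unseen task could be specified purely through the prompt, the snippet below builds an example (condition, image) pair for a canny-edge-to-image task using OpenCV. The helper name and thresholds are illustrative assumptions and not part of the paper's released code.

```python
import cv2
import numpy as np

def make_edge_example(image_path: str, low: int = 100, high: int = 200):
    """Hypothetical helper: construct an in-context example pair for an unseen
    canny-edge-to-image task from any RGB photograph."""
    image = cv2.imread(image_path)                  # H x W x 3, BGR
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)              # H x W, uint8 edge map
    edges_rgb = np.stack([edges] * 3, axis=-1)      # replicate to 3 channels
    return edges_rgb, image                         # (source condition, target image)

# At inference time, (edges_rgb, image) would serve as the example pair in the
# vision-language prompt, with a new edge map supplied as the query image.
```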

Numerical Evaluation

The research quantitatively supports its claims with Fréchet Inception Distance (FID) and Root Mean Square Error (RMSE). These metrics assess, respectively, the fidelity of synthesized images and the pixel-wise accuracy of predicted condition maps across the inverse and forward tasks, with results showing competitive performance.
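
FID requires Inception-network feature statistics and is typically computed with an existing library, while RMSE is a simple pixel-wise measure. Below is a minimal sketch of the latter for comparing a predicted condition map against its ground truth; the function name and dummy data are illustrative, not taken from the paper.

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean square error between a predicted condition map (e.g. a depth
    or edge map from a forward task) and its ground truth; assumes both arrays
    share the same shape and value scale."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

# Example with dummy 256x256 maps
print(rmse(np.random.rand(256, 256), np.random.rand(256, 256)))
```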

Implications and Future Directions

This work represents a significant step in advancing the application of diffusion models in computer vision by adapting in-context learning principles from NLP. Practically, the framework could serve as a versatile tool for tasks ranging from content generation to interactive editing.

Theoretically, the paper opens avenues for further exploration:

  • Extending Task Diversity: Future work could incorporate a broader array of vision-language tasks, potentially enhancing the adaptability of such models.
  • Scalability: Investigating the scalability of the framework when trained from scratch on larger, more diverse datasets could yield further insights.

Conclusion

The introduction of Prompt Diffusion marks a pioneering effort to adapt in-context learning mechanisms to diffusion-based models in the visual domain. The research achieves notable success in demonstrating multi-task adaptability and potential generalization to new domains. As noted, there remain challenges in expanding training data scope and task diversity, but this work lays a foundational approach for future developments in AI-driven computer vision models.

Authors (8)
  1. Zhendong Wang (60 papers)
  2. Yifan Jiang (79 papers)
  3. Yadong Lu (19 papers)
  4. Yelong Shen (83 papers)
  5. Pengcheng He (60 papers)
  6. Weizhu Chen (128 papers)
  7. Zhangyang Wang (374 papers)
  8. Mingyuan Zhou (161 papers)
Citations (62)