
In-Context Learning Unlocked for Diffusion Models (2305.01115v2)

Published 1 May 2023 in cs.CV

Abstract: We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.

Overview of "In-Context Learning Unlocked for Diffusion Models"

The paper "In-Context Learning Unlocked for Diffusion Models" introduces a novel framework called Prompt Diffusion that extends in-context learning capabilities to diffusion-based generative models in computer vision. The research addresses the challenge of integrating vision-language tasks using a unified model capable of performing multiple tasks through a prompt-based approach.

Key Contributions and Methodology

This research presents several meaningful contributions:

  • Vision-Language Prompt Design: The authors propose a vision-language prompt structure that encompasses text guidance, example image pairs, and a query image. This facilitates the model's ability to interpret and perform a variety of vision-language tasks.
  • Model Architecture: The Prompt Diffusion framework builds on Stable Diffusion and ControlNet. It processes the image components of the vision-language prompt through convolutional encoder layers and encodes the text guidance with CLIP, integrating both within a diffusion model for image generation (a minimal sketch of this prompt encoding follows this list).
  • Joint Training Across Tasks: The model is jointly trained on six distinct tasks—three inverse tasks involving image generation from conditions (like depth maps) and three forward tasks that generate conditions from images. This multi-task training approach is crucial for endowing the model with the flexibility and adaptability inherent in in-context learning.
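
The paper does not prescribe a single reference implementation for the prompt-encoding step, but the idea can be illustrated with a minimal PyTorch sketch: three lightweight convolutional stems, one each for the example source image, the example target image, and the query image, produce feature maps that are fused into a single conditioning signal for the denoising network. The layer sizes, module names, and the simple additive fusion below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class VisionPromptEncoder(nn.Module):
    """Illustrative encoder for the visual part of a vision-language prompt:
    an example (source, target) image pair plus a new query image. All layer
    sizes are placeholders, not the paper's exact configuration."""

    def __init__(self, in_channels: int = 3, hidden: int = 64, out_channels: int = 320):
        super().__init__()

        # One lightweight conv stem per prompt image, mirroring ControlNet-style
        # condition encoders that downsample into the denoiser's feature space.
        def stem():
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(hidden, out_channels, kernel_size=3, stride=2, padding=1),
            )

        self.example_src = stem()
        self.example_tgt = stem()
        self.query = stem()

    def forward(self, src, tgt, query):
        # Fuse the three image embeddings into a single conditioning map that,
        # together with the CLIP-encoded text guidance, conditions the diffusion model.
        return self.example_src(src) + self.example_tgt(tgt) + self.query(query)

# Example: a batch of 512x512 RGB prompt images yields 128x128 conditioning features.
enc = VisionPromptEncoder()
cond = enc(torch.randn(1, 3, 512, 512),
           torch.randn(1, 3, 512, 512),
           torch.randn(1, 3, 512, 512))
print(cond.shape)  # torch.Size([1, 320, 128, 128])
```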

Empirical Evaluation

The paper provides a thorough evaluation of Prompt Diffusion, highlighting several important findings:

  • Performance: The paper demonstrates that Prompt Diffusion performs comparably to independently trained models (such as ControlNet) on specific tasks, while also enabling generalization across a diverse range of tasks.
  • Generalization: The model generalizes promisingly to unseen tasks, such as generating images from scribbles or canny-edge maps, attesting to its robust in-context learning ability (a sketch of building such an example pair follows this list).
  • Image Editing: Beyond task-specific generation, Prompt Diffusion also supports text-guided image editing, enabling nuanced edits without extensive additional training.
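
As a concrete illustration of how an unseen task could be specified purely through the prompt, the snippet below builds an example (condition, image) pair for a canny-edge-to-image task using OpenCV. The helper name and thresholds are illustrative assumptions and not part of the paper's released code.

```python
import cv2
import numpy as np

def make_edge_example(image_path: str, low: int = 100, high: int = 200):
    """Hypothetical helper: construct an in-context example pair for an unseen
    canny-edge-to-image task from any RGB photograph."""
    image = cv2.imread(image_path)                  # H x W x 3, BGR
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)              # H x W, uint8 edge map
    edges_rgb = np.stack([edges] * 3, axis=-1)      # replicate to 3 channels
    return edges_rgb, image                         # (source condition, target image)

# At inference time, (edges_rgb, image) would serve as the example pair in the
# vision-language prompt, with a new edge map supplied as the query image.
```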

Numerical Evaluation

The research quantitatively supports its claims with Fréchet Inception Distance (FID) and Root Mean Square Error (RMSE). These metrics assess, respectively, the fidelity of synthesized images and the pixel-wise accuracy of predicted condition maps across the inverse and forward tasks, with results showing competitive performance.
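
FID requires Inception-network feature statistics and is typically computed with an existing library, while RMSE is a simple pixel-wise measure. Below is a minimal sketch of the latter for comparing a predicted condition map against its ground truth; the function name and dummy data are illustrative, not taken from the paper.

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean square error between a predicted condition map (e.g. a depth
    or edge map from a forward task) and its ground truth; assumes both arrays
    share the same shape and value scale."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

# Example with dummy 256x256 maps
print(rmse(np.random.rand(256, 256), np.random.rand(256, 256)))
```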

Implications and Future Directions

This work represents a significant step in advancing the application of diffusion models in computer vision by adapting in-context learning principles from NLP. Practically, the framework could serve as a versatile tool for tasks ranging from content generation to interactive editing.

Theoretically, the paper opens avenues for further exploration:

  • Extending Task Diversity: Future work could incorporate a broader array of vision-language tasks, potentially enhancing the adaptability of such models.
  • Scalability: Investigating the scalability of the framework when trained from scratch on larger, more diverse datasets could yield further insights.

Conclusion

The introduction of Prompt Diffusion marks a pioneering effort to adapt in-context learning mechanisms to diffusion-based models in the visual domain. The research achieves notable success in demonstrating multi-task adaptability and potential generalization to new domains. As noted, there remain challenges in expanding training data scope and task diversity, but this work lays a foundational approach for future developments in AI-driven computer vision models.

Authors (8)
  1. Zhendong Wang (60 papers)
  2. Yifan Jiang (79 papers)
  3. Yadong Lu (19 papers)
  4. Yelong Shen (83 papers)
  5. Pengcheng He (60 papers)
  6. Weizhu Chen (128 papers)
  7. Zhangyang Wang (374 papers)
  8. Mingyuan Zhou (161 papers)
Citations (62)