Instruct-Imagen: Image Generation with Multi-modal Instruction (2401.01952v1)
Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce the multi-modal instruction for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model in a two-stage framework. First, we adapt the model with retrieval-augmented training to enhance its ability to ground generation in external multi-modal context. We then fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
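The multi-modal instruction is the paper's central representation: a natural-language sentence whose references point at accompanying payloads (a style image, an edge map, subject photos, etc.). No reference implementation is given here, so the sketch below shows one plausible way to encode such an instruction, assuming a bracketed placeholder convention like `[style image]` inside the text; the class name, field names, and placeholder syntax are illustrative assumptions, not the paper's actual interface.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MultiModalInstruction:
    """Hypothetical container for one multi-modal instruction (not from the paper).

    `text` carries the generation intent in natural language; placeholders of
    the form "[name]" refer to entries in `contexts`, which maps each
    placeholder name to its raw payload (image bytes, edge map, etc.).
    """
    text: str
    contexts: Dict[str, bytes] = field(default_factory=dict)

    def placeholders(self) -> List[str]:
        # Extract every "[name]" reference from the instruction text.
        return re.findall(r"\[([^\]]+)\]", self.text)

    def validate(self) -> None:
        # Every placeholder mentioned in the text must have a payload.
        missing = [p for p in self.placeholders() if p not in self.contexts]
        if missing:
            raise ValueError(f"missing multi-modal payloads for: {missing}")

# Example: a style-driven generation intent expressed in this uniform format.
instr = MultiModalInstruction(
    text="Paint a house by the lake in the style of [style image].",
    contexts={"style image": b"...png bytes..."},  # placeholder payload
)
instr.validate()
print(instr.placeholders())  # ['style image']
```

Under this framing, tasks as different as style transfer, subject-driven generation, and edge-conditioned synthesis all reduce to the same (text, contexts) pair, which is what lets a single model be instruction-tuned across them.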
Authors: Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia