Emergent Mind

InstructPix2Pix: Learning to Follow Image Editing Instructions

Published Nov 17, 2022 in cs.CV, cs.AI, cs.CL, cs.GR, and cs.LG


We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.
In short, the method uses GPT-3 and Stable Diffusion to generate a paired training dataset, then trains a conditional diffusion model to perform the edits.


  • InstructPix2Pix is a novel method developed by UC Berkeley researchers using conditional diffusion models for instruction-based image editing.

  • The method utilizes GPT-3 and Stable Diffusion models to create a large training dataset of over 450,000 image editing examples from text instructions.

  • The model performs well across editing tasks such as style transformation, object replacement, and scenario alteration, producing edits with high fidelity and realism.

  • Though promising, InstructPix2Pix faces challenges with edits requiring spatial reasoning and complex interactions, and ethical concerns arise from potential biases.

InstructPix2Pix: A Novel Approach for Instruction-Based Image Editing through Diffusion Models


Researchers at UC Berkeley have developed a method, dubbed InstructPix2Pix, that employs conditional diffusion models to edit images based on textual instructions. This work addresses the challenge of acquiring large-scale, precisely paired training data for instruction-based image editing. By combining the capabilities of two pre-existing models, GPT-3 for language understanding and Stable Diffusion for text-to-image synthesis, the authors generate a large dataset of image editing examples. This dataset is used to train InstructPix2Pix, enabling it to apply edits to real images from user-provided instructions without per-example fine-tuning or inversion.

Generating a Multi-modal Training Dataset

Dataset creation begins with GPT-3, which, given an initial image caption, generates an editing instruction and a corresponding edited caption. Stable Diffusion, combined with the Prompt-to-Prompt method, then turns each caption pair into a pair of before-and-after images. This coupling of text and image generative models yields a diverse training set of over 450,000 examples. The pipeline also applies several filtering steps, including CLIP-based similarity checks, to ensure relevance and consistency between generated image pairs and their respective captions.
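The CLIP-based filtering can be sketched with plain cosine similarities over precomputed embeddings. The following is an illustrative sketch rather than the paper's actual code: the function names and threshold values are assumptions, and real embeddings would come from a CLIP model.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_pair(img0, img1, cap0, cap1,
              min_img_sim=0.75, min_dir_sim=0.2):
    """Decide whether a generated before/after example passes filtering.

    img0, img1: CLIP image embeddings of the original and edited images.
    cap0, cap1: CLIP text embeddings of the original and edited captions.
    Thresholds are illustrative, not taken from the paper.
    """
    # Reject pairs where the edited image drifted too far from the original.
    if _cos(img0, img1) < min_img_sim:
        return False
    # Directional similarity: the change between the images should point
    # in the same direction as the change between the captions.
    dir_sim = _cos(img1 - img0, cap1 - cap0)
    return dir_sim >= min_dir_sim
```

The directional term compares the change between the two images with the change between the two captions, so an example is kept only when the image actually moved in the direction the caption edit describes.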

Model Architecture and Training

InstructPix2Pix is a conditional diffusion model, a class of models with a strong track record in image synthesis. The architecture ingests both the editing instruction and the reference image, directly outputting the edited image. Classifier-free guidance is applied over the two conditional inputs, image and text, with separate guidance scales that control how strongly the generated image adheres to the input image and to the editing instruction, respectively.
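The two-scale classifier-free guidance combines three denoiser outputs: the unconditional prediction is shifted toward the image-conditioned prediction (image scale) and then toward the fully conditioned prediction (text scale). A minimal sketch, treating the three noise predictions as precomputed arrays; the default scale values here are illustrative:

```python
import numpy as np

def guided_noise(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    """Combine three denoiser outputs with two guidance scales.

    eps_uncond: prediction with neither conditioning, e(z_t, None, None)
    eps_img:    prediction with image conditioning only, e(z_t, c_I, None)
    eps_full:   prediction with image and instruction, e(z_t, c_I, c_T)
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```

Setting both scales to 1 recovers the fully conditioned prediction; raising the text scale pushes the output to follow the instruction more aggressively, while raising the image scale keeps it closer to the input image.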

Performance and Comparison

The efficacy of InstructPix2Pix is reflected in its ability to handle a broad spectrum of editing tasks, including style transformation, object replacement, and scenario alteration, with a high degree of fidelity and realism. When qualitatively compared to existing methods such as SDEdit and Text2Live, InstructPix2Pix demonstrates superior capability in maintaining image consistency while accurately applying the desired edits. Quantitative assessments further validate its performance, specifically highlighting its favorable trade-off between edit accuracy and consistency with the input image.

Limitations and Future Work

Despite its promising capabilities, InstructPix2Pix is not without limitations. Challenges remain in handling edits that require spatial reasoning, counting, or complex object interactions. Furthermore, the generated dataset's quality, directly tied to the pre-existing models' limitations, implicitly caps the system's performance. Potential biases inherited from training data and foundational models also present ethical and operational concerns that warrant careful consideration and future remediation efforts.


InstructPix2Pix represents a significant step forward in the domain of instruction-based image editing. By leveraging the strengths of large language and image generative models, it opens up new possibilities for intuitive and detailed image manipulation. The approach not only showcases the potential of combining disparate AI models for task-specific training data generation but also sets a precedent for future developments in generative AI and its application in creative domains. As this technology evolves, it holds promise for broadening the accessibility and versatility of image editing, offering both practical and artistic opportunities.
