Add-SD: Rational Generation without Manual Reference (2407.21016v1)
Abstract: Diffusion models have exhibited remarkable prowess in visual generation. Building on this success, we introduce an instruction-based object-addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes at rational sizes and positions. Unlike layout-conditioned methods, Add-SD is conditioned solely on simple text prompts rather than costly human-annotated references such as bounding boxes. Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks. The first aspect involves creating RemovalDataset, a collection of original-edited image pairs with textual instructions, where an object has been removed from the original image while strong pixel consistency is maintained in the background. These data pairs are then used to fine-tune the Stable Diffusion (SD) model. The resulting pretrained Add-SD model can then insert the desired object into an image at a plausible location and scale. Additionally, we generate synthetic instances at scale for downstream task datasets, particularly for tail classes, to alleviate the long-tailed problem. Downstream tasks benefit from the enriched dataset's greater diversity and rationality. Experiments on LVIS val demonstrate that Add-SD yields an improvement of 4.3 mAP on rare classes over the baseline. Code and models are available at https://github.com/ylingfeng/Add-SD.
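Because Add-SD is fine-tuned from Stable Diffusion on instruction-paired images (in the spirit of InstructPix2Pix), inference reduces to a standard instruction-conditioned image-to-image call. The following is a minimal sketch using Hugging Face `diffusers`; the checkpoint id `ylingfeng/Add-SD`, the instruction phrasing, and the guidance values are illustrative assumptions rather than the authors' released configuration — see the GitHub repository above for the actual weights and usage.

```python
# Minimal sketch: text-instruction object addition with an
# InstructPix2Pix-style fine-tuned Stable Diffusion checkpoint.
# Checkpoint id and hyperparameters below are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "ylingfeng/Add-SD",          # hypothetical checkpoint id; see the repo
    torch_dtype=torch.float16,
).to("cuda")

scene = Image.open("street.jpg").convert("RGB").resize((512, 512))

# Add-SD needs only a plain-text instruction: no bounding box, mask,
# or other layout reference is supplied.
edited = pipe(
    prompt="add a dog",
    image=scene,
    num_inference_steps=50,
    guidance_scale=7.5,          # text guidance strength
    image_guidance_scale=1.5,    # preserves background pixel consistency
).images[0]

edited.save("street_with_dog.png")
```

Looping a call like this over rare-class names (e.g., from the LVIS vocabulary) is one plausible way to realize the paper's synthetic-data augmentation for tail classes, with the model choosing object placement and scale on its own.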