
Add-SD: Rational Generation without Manual Reference (2407.21016v1)

Published 30 Jul 2024 in cs.CV

Abstract: Diffusion models have exhibited remarkable prowess in visual generalization. Building on this success, we introduce an instruction-based object addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes with rational sizes and positions. Different from layout-conditioned methods, Add-SD is solely conditioned on simple text prompts rather than any other human-costly references like bounding boxes. Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks. The first aspect involves creating a RemovalDataset consisting of original-edited image pairs with textual instructions, where an object has been removed from the original image while maintaining strong pixel consistency in the background. These data pairs are then used for fine-tuning the Stable Diffusion (SD) model. Subsequently, the pretrained Add-SD model allows for the insertion of expected objects into an image with good rationale. Additionally, we generate synthetic instances for downstream task datasets at scale, particularly for tail classes, to alleviate the long-tailed problem. Downstream tasks benefit from the enriched dataset with enhanced diversity and rationale. Experiments on LVIS val demonstrate that Add-SD yields an improvement of 4.3 mAP on rare classes over the baseline. Code and models are available at https://github.com/ylingfeng/Add-SD.


Summary

  • The paper introduces a novel diffusion model that generates and adds objects based solely on text, eliminating manual reference inputs.
  • The approach builds its training set via object removal, fine-tunes Stable Diffusion on the resulting instruction pairs, and generates synthetic data to enrich downstream datasets.
  • The model achieves a 4.3 mAP improvement on rare classes in LVIS, highlighting its potential for efficient data augmentation and realistic image editing.

An Analytical Overview of Add-SD: Rational Generation without Manual Reference

The paper "Add-SD: Rational Generation without Manual Reference" introduces an approach to object addition in images built around a diffusion model named Add-SD. This work builds on the capabilities of diffusion models for visual generalization, specifically targeting the insertion of objects into realistic scenes using only textual instructions. The novelty lies in eliminating the need for manual references such as bounding boxes, which are typically costly and labor-intensive to obtain for comparable image-editing tasks.

Methodological Framework

The Add-SD model is developed through a structured methodology encompassing three main stages:

  1. Dataset Creation via Object Removal: The authors construct original-edited image pairs by removing objects from real images, using the LaMa inpainting model to keep the background pixel-consistent. For the addition task the roles are reversed: the inpainted (object-removed) image serves as the "original" input and the untouched photo as the "edited" target (see the pair-construction sketch after this list). Textual instructions for each pair are generated from templates with the help of ChatGPT, increasing the dataset's linguistic variety.
  2. Diffusion Model Fine-tuning: The Stable Diffusion framework is adapted to the object-addition task by fine-tuning on the RemovalDataset. The resulting model is an instruction-based generator that adds a specified object with plausible attributes, size, and position from text input alone (an inference sketch follows the list), making scene augmentation both practical and efficient.
  3. Synthetic Data Generation: To address specific challenges in downstream tasks such as object detection and segmentation, the paper incorporates synthetic data generated by Add-SD. These data augmentations are particularly emphasized for addressing the long-tail distribution issues, especially in datasets like COCO and LVIS, which benefit from enhanced diversity in rare classes.

Numerical Contributions and Observations

The efficacy of Add-SD is underscored by its impact on downstream tasks. In practical scenarios, synthetic data generated through this method results in an improvement of 4.3 mAP on rare classes in the LVIS validation set—indicative of the model’s capability to mitigate data scarcity challenges. In addition, comprehensive human evaluation affirms Add-SD's superiority in visual quality, object rationality, and consistency compared to traditional methods such as InstructPix2Pix and MagicBrush.

Implications and Future Directions

The architectural design of Add-SD offers broad implications for automated and efficient scene editing. By reducing the dependency on manual annotations, the method streamlines realistic object insertion, with potential impact across computer vision applications such as personalized content creation and augmented reality.

Furthermore, while Add-SD already shows promising applicability, it raises several potential areas for future exploration. The refinement of text-based instruction interpretation, particularly in scenarios involving complex object relations and attributes, remains a fertile avenue for enhancing the robustness of such models. Additionally, expanding this framework to accommodate other forms of multimodal data could further bolster the versatility and applicability of diffusion models in real-world scenarios.

In conclusion, the Add-SD framework presents a compelling advance in diffusion models for image editing, effectively addressing several limitations of prior methodologies. Its structured approach to instruction-based object addition without manual references marks a notable contribution to the field of computer vision, offering pathways to more efficient and versatile visual content generation.
