ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models (2405.15199v2)
Abstract: Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves up to 25.3% mAP@.50:.95 with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. In addition, we design an evaluation protocol based on COCO-2014 to validate ODGEN in general domains and observe an advantage up to 5.6% in mAP@.50:.95 against existing methods.
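The abstract mentions fine-tuning on both cropped foreground objects and entire images. As a rough illustration only, the sketch below shows one way such training samples might be assembled from bounding-box annotations. It is not the authors' pipeline: the `(x_min, y_min, x_max, y_max)` pixel box format, the nested-list image type, and the names `crop_foreground` and `build_finetune_samples` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # hypothetical image type: rows of pixel values
Box = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max) in pixels


def crop_foreground(image: Image, box: Box) -> Image:
    """Return the pixels inside `box`, with the box clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def build_finetune_samples(
    image: Image, annotations: Sequence[Tuple[str, Box]]
) -> Tuple[List[Tuple[str, Image]], Image]:
    """Collect (class_name, crop) pairs plus the untouched full image.

    Boxes that fall entirely outside the image produce empty crops
    and are skipped.
    """
    crops = []
    for class_name, box in annotations:
        crop = crop_foreground(image, box)
        if crop and crop[0]:
            crops.append((class_name, crop))
    return crops, image


# Tiny 4x4 "image" whose pixel value encodes its position (row * 4 + col).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
annotations = [
    ("apple", (1, 1, 3, 3)),    # fully inside the image
    ("pear", (2, 2, 10, 10)),   # clamped to the image border
    ("plum", (5, 5, 6, 6)),     # entirely outside, skipped
]
crops, full_image = build_finetune_samples(image, annotations)
```

In this toy run, `crops` holds two entries (the `apple` and clamped `pear` crops), while `full_image` is the original image, mirroring the two kinds of fine-tuning input the abstract describes.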
Authors:
- Jingyuan Zhu
- Shiyu Li
- Yuxuan Liu
- Ping Huang
- Jiulong Shan
- Huimin Ma
- Jian Yuan