DODA: Diffusion for Object-detection Domain Adaptation in Agriculture (2403.18334v1)
Abstract: The diverse, high-quality content produced by recent generative models demonstrates the great potential of synthetic data for training downstream models. In vision, however, and especially in object detection, this potential remains largely unexplored: synthetic images are mostly used to rebalance the long tails of existing datasets, the accuracy of the generated labels is low, and the full capability of generative models has not been exploited. In this paper, we propose DODA, a data synthesizer that generates high-quality object detection data for new domains in agriculture. Specifically, we improve the controllability of layout-to-image generation by encoding the layout as an image, which improves label quality, and we use a visual encoder to provide visual clues to the diffusion model, decoupling visual features from it and giving it the ability to generate data in new domains. On the Global Wheat Head Detection (GWHD) Dataset, the largest dataset in agriculture and one that spans diverse domains, data synthesized by DODA improves the performance of the object detector by 12.74-17.76 AP$_{50}$ in the domain that is significantly shifted from the training data.
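As a rough illustration of the layout-as-image idea mentioned in the abstract, the sketch below rasterizes bounding boxes onto a blank canvas so the layout can be passed to a diffusion model as an image-shaped condition. The function name, box format, canvas size, and single-channel encoding are assumptions for illustration, not DODA's actual implementation.

```python
import numpy as np

def layout_to_image(boxes, image_size=(512, 512)):
    """Rasterize a bounding-box layout into a single-channel conditioning image.

    NOTE: the (x_min, y_min, x_max, y_max) pixel-coordinate box format and the
    binary encoding are assumptions for illustration; DODA's layout encoding
    may differ.
    """
    canvas = np.zeros(image_size, dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        # Mark the region covered by each object box; overlapping boxes
        # simply saturate at 1.0 in this simplified encoding.
        canvas[int(y_min):int(y_max), int(x_min):int(x_max)] = 1.0
    return canvas

# Two hypothetical wheat-head boxes rasterized into a 512x512 layout map.
layout = layout_to_image([(40, 60, 120, 140), (300, 320, 380, 400)])
print(layout.shape, layout.max())
```

Such an image-shaped layout condition could, in principle, be consumed the same way other image conditions are handled in ControlNet-style conditioning, which is one plausible reading of how encoding the layout as an image improves label controllability.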