SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation (2403.16605v1)

Published 25 Mar 2024 in cs.CV

Abstract: In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data.

Authors (4)
  1. Aysim Toker (7 papers)
  2. Marvin Eisenberger (17 papers)
  3. Daniel Cremers (274 papers)
  4. Laura Leal-Taixé (74 papers)
Citations (11)

Summary

  • The paper introduces a novel DDPM framework that synthesizes paired satellite images and segmentation masks to augment scarce datasets.
  • It demonstrates significant improvements in semantic segmentation across benchmarks like iSAID, LoveDA, and OpenEarthMap.
  • The method offers a scalable solution for generating high-quality annotated data, benefiting remote sensing and related applications.

Augmenting Aerial Imagery Datasets with Synthetic Image-Mask Pairs via Diffusion Models for Semantic Segmentation

Introduction

The surge in availability and resolution of satellite imagery has ushered in a golden age for Earth observation, enabling advances in numerous humanitarian and environmental sectors. However, the paucity of corresponding annotated data remains a substantial bottleneck for applying machine learning in this domain. The paper "SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation" explores the viability of denoising diffusion probabilistic models (DDPMs) for generating synthetic satellite imagery with corresponding semantic labels, aimed at augmenting existing datasets. The approach targets settings where annotated data are scarce and expensive to produce, addressing a fundamental challenge for supervised learning in satellite imagery analysis.

Methodology

The authors propose a novel framework that uses a DDPM to jointly generate paired satellite images and their corresponding semantic segmentation masks. This is achieved by learning the joint distribution of images and labels in a bit-space formulation, allowing the synthesis of additional, diverse training instances. The core contributions include:

  1. Learning the joint data distribution of images and labels via a diffusion model, thereby enabling the synthesis of novel training data instances for data augmentation.
  2. Demonstrating significant improvements in semantic segmentation tasks on satellite images by incorporating the synthetic data instances into the training process.
  3. Offering a comprehensive evaluation of the proposed method against existing benchmarks, thereby establishing its effectiveness.

Extensive experiments on three satellite imagery benchmarks highlight the method's capability to generate high-quality, diverse synthetic image-mask pairs that, when used in conjunction with original dataset instances, lead to marked improvements in semantic segmentation performance.
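To make the bit-space formulation concrete, the sketch below shows one plausible way to encode integer segmentation labels as "analog bits" and stack them with the image channels so that a standard DDPM can denoise both jointly. The helper names and encoding details are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def labels_to_analog_bits(mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Encode an (H, W) integer label map as analog bits in [-1, 1].

    Each class id is written as ceil(log2(num_classes)) binary digits and
    rescaled so a diffusion model can treat them as continuous channels.
    """
    num_bits = int(np.ceil(np.log2(num_classes)))
    bits = ((mask[..., None] >> np.arange(num_bits)) & 1).astype(np.float32)
    return bits * 2.0 - 1.0  # map {0, 1} -> {-1, +1}

def bits_to_labels(bits: np.ndarray) -> np.ndarray:
    """Recover hard class ids by thresholding each bit channel at zero."""
    hard = (bits > 0).astype(np.int64)
    return (hard << np.arange(hard.shape[-1])).sum(axis=-1)

def joint_diffusion_input(image: np.ndarray, mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Concatenate image channels and label-bit channels into one tensor;
    a DDPM trained on such tensors learns the joint image-label manifold."""
    bits = labels_to_analog_bits(mask, num_classes)
    return np.concatenate([image, bits], axis=-1)  # image assumed scaled to [-1, 1]

# Example: an 8-class mask needs 3 bit channels, giving an (H, W, 6) joint input.
mask = np.random.randint(0, 8, size=(256, 256))
image = np.random.uniform(-1.0, 1.0, size=(256, 256, 3)).astype(np.float32)
x = joint_diffusion_input(image, mask, num_classes=8)
assert (bits_to_labels(labels_to_analog_bits(mask, 8)) == mask).all()
```

At sampling time, the bit channels of a generated tensor can be thresholded back into a discrete mask, yielding a new image-mask pair.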

Experiments and Results

The experiments underscore the utility of the generated synthetic pairs for enhancing semantic segmentation models. Across the iSAID, LoveDA, and OpenEarthMap datasets, including the synthesized samples yielded notable quantitative improvements. The framework outperformed baseline methods, including GAN-based approaches and discriminative diffusion models previously applied to such tasks.

A noteworthy result was the consistent increase in segmentation performance across different baseline segmentation models when trained on the combination of original and synthesized data, as opposed to the original dataset alone. This attests to the quality of the synthetic data and its effectiveness in diversifying the training set. Furthermore, the method's ability to significantly improve object-centric segmentation metrics on iSAID, a dataset with pronounced class imbalance and scale variation, underscores the synthetic data's applicability to complex segmentation scenarios.
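As a rough illustration of this augmentation setup, the sketch below pools original and generated image-mask pairs into a single training set using PyTorch. The PairDataset class and the placeholder tensors are hypothetical stand-ins for illustration, not the paper's training code.

```python
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class PairDataset(Dataset):
    """Minimal (image, mask) dataset. In practice the two instances below
    would hold real satellite tiles and samples drawn from the trained
    joint diffusion model, respectively."""
    def __init__(self, images: torch.Tensor, masks: torch.Tensor):
        self.images, self.masks = images, masks

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.masks[idx]

# Placeholder tensors standing in for real and generated pairs (8 classes).
real = PairDataset(torch.randn(100, 3, 256, 256), torch.randint(0, 8, (100, 256, 256)))
synthetic = PairDataset(torch.randn(50, 3, 256, 256), torch.randint(0, 8, (50, 256, 256)))

# The augmentation strategy: train any off-the-shelf segmentation network on
# the union of original and synthesized pairs rather than the original set alone.
loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=8, shuffle=True)
```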

Theoretical Implications

This work suggests several promising directions for future research. The approach underscores the potential of DDPMs for learning joint image-label distributions, a relatively unexplored direction in aerial image analysis. The technique's success in generating synthetic data realistic enough to improve downstream task performance points to an avenue for producing labeled datasets where manual annotation is impractical.

Practical Applications

Beyond academic interest, this research has direct implications for remote sensing, urban planning, environmental monitoring, and related fields by substantially reducing the bottleneck of annotated data scarcity. The ability to synthesize realistic, diverse training data could accelerate the development of more accurate models for land-use classification, disaster response, and other critical applications.

Conclusion

"SatSynth" presents a compelling case for the role of DDPM in addressing the data scarcity challenge in the domain of satellite image analysis. By generating high-quality, labeled synthetic data, this approach offers a scalable solution to enhance the performance of semantic segmentation models. The method's success across multiple benchmarks and its clear improvements over existing methods mark it as a significant step forward in the ongoing effort to leverage the full potential of satellite imagery for Earth observation.
