MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation from text prompts, owing to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images, which differ substantially from general images in scale and perspective, remains a formidable challenge because no comprehensive RS image generation dataset covers various modalities, ground sample distances (GSDs), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and standardize all samples. To bridge RS images and textual semantic information, we use a large-scale pretrained vision-language model to automatically generate text prompts and then correct them by hand, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design methods to obtain images at different GSDs and under various environmental conditions (e.g., low-light, foggy) from a single sample. After extensive manual screening and annotation refinement, we ultimately obtain an MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that the proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSDs. The dataset is available at https://github.com/ljl5261/MMM-RS.
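The abstract does not say how the multi-GSD and multi-environment variants are derived from a single sample. The following is only an illustrative sketch under stated assumptions, not the authors' pipeline: coarser GSD is simulated by average pooling, fog by the standard atmospheric scattering model I = J*t + A*(1 - t), and low light by a gamma curve with a gain. The function names (`coarsen_gsd`, `add_fog`, `darken`) and all parameter values are hypothetical.

```python
from typing import List

# Grayscale image as rows of floats in [0, 1] (a simplifying assumption).
Image = List[List[float]]


def coarsen_gsd(img: Image, factor: int) -> Image:
    """Average-pool by an integer factor.

    If the input has a GSD of g metres per pixel, the output has
    roughly g * factor metres per pixel.
    """
    h, w = len(img) // factor, len(img[0]) // factor
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            block = [img[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def add_fog(img: Image, transmission: float = 0.6, airlight: float = 0.9) -> Image:
    """Uniform haze via the atmospheric scattering model I = J*t + A*(1 - t)."""
    return [[p * transmission + airlight * (1.0 - transmission) for p in row]
            for row in img]


def darken(img: Image, gamma: float = 2.0, gain: float = 0.5) -> Image:
    """Simulated low light: gamma > 1 compresses mid-tones, gain < 1 dims overall."""
    return [[gain * (p ** gamma) for p in row] for row in img]


# A toy 4x4 "tile" standing in for one RS sample.
tile = [
    [0.0, 1.0, 0.5, 0.5],
    [1.0, 0.0, 0.5, 0.5],
    [0.2, 0.2, 0.8, 0.8],
    [0.2, 0.2, 0.8, 0.8],
]

variants = {
    "gsd_x2": coarsen_gsd(tile, 2),   # half the resolution, double the GSD
    "foggy": add_fog(tile),
    "low_light": darken(tile),
}
```

Each variant would share the source sample's scene description, with the prompt adjusted to name the GSD or condition; how MMM-RS actually words those prompts is not stated in the abstract.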