Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images (2307.15904v2)
Abstract: We propose a weakly supervised approach for creating maps using free-form textual descriptions. We refer to this work of creating textual maps as zero-shot mapping. Prior works have approached mapping tasks by developing models that predict a fixed set of attributes using overhead imagery. However, these models are very restrictive as they can only solve highly specific tasks for which they were trained. Mapping text, on the other hand, allows us to solve a large variety of mapping problems with minimal restrictions. To achieve this, we train a contrastive learning framework called Sat2Cap on a new large-scale dataset with 6.1M pairs of overhead and ground-level images. For a given location and overhead image, our model predicts the expected CLIP embeddings of the ground-level scenery. The predicted CLIP embeddings are then used to learn about the textual space associated with that location. Sat2Cap is also conditioned on date-time information, allowing it to model temporally varying concepts over a location. Our experimental results demonstrate that our models successfully capture ground-level concepts and allow large-scale mapping of fine-grained textual queries. Our approach does not require any text-labeled data, making the training easily scalable. The code, dataset, and models will be made publicly available.
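The training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the one-directional InfoNCE form, the temperature value, and the names `info_nce` and `text_query_map` are all assumptions made for illustration. The date-time conditioning is omitted. Embeddings are plain Python lists standing in for the outputs of the overhead-image encoder and a frozen CLIP image encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(overhead_embs, ground_embs, temperature=0.07):
    """Hypothetical contrastive loss: the i-th overhead embedding should be
    closest to the i-th ground-level CLIP embedding within the batch.
    The temperature of 0.07 is an assumed value, not taken from the paper."""
    a = [l2_normalize(v) for v in overhead_embs]
    b = [l2_normalize(v) for v in ground_embs]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, b_j) / temperature for b_j in b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def text_query_map(location_embs, text_emb):
    """Score every location against one text-query embedding by cosine
    similarity; plotting these scores by location gives a zero-shot map."""
    q = l2_normalize(text_emb)
    return [dot(l2_normalize(e), q) for e in location_embs]


# Toy batch: two locations with 2-D embeddings.
overhead = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = info_nce(overhead, overhead)
shuffled_loss = info_nce(overhead, overhead[::-1])
scores = text_query_map(overhead, [2.0, 0.0])
```

Because the predicted embeddings are trained to live in CLIP space, a text query encoded by the CLIP text encoder can be compared against them directly, which is why no text-labeled data is needed at training time.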