
General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation (2307.03388v1)

Published 7 Jul 2023 in cs.CV

Abstract: The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information and many others has provided us with an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit those complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complex, demand significant effort in model design, and require considerable re-engineering whenever a new modality emerges. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect the presence of cars from a top-down view. To address these issues, even under extreme class imbalance, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing the network's computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with other methods such as 2D convolution and a dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed method achieves competitive results with specialized architectures like UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with minimal compromise on performance.
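
The abstract credits PerceiverIO's cross-attention with reducing the computational burden. A rough way to see why is to count query-key pairs: full self-attention over M input elements scores M × M pairs, whereas a Perceiver-style model attends from N latents to the inputs, processes the latents among themselves, and then decodes back to the outputs. The sketch below is only back-of-the-envelope arithmetic. The function names and all sizes (a 512×512 tile, 512 latents, 8 latent layers) are illustrative assumptions, not values from the paper, and the comparison is against a single full self-attention layer.

```python
def self_attention_scores(num_inputs: int) -> int:
    """Query-key pairs when every input element attends to every input element."""
    return num_inputs * num_inputs


def perceiver_style_scores(num_inputs: int, num_latents: int, num_latent_layers: int) -> int:
    """Query-key pairs for one encode cross-attention, a stack of latent
    self-attention layers, and one decode cross-attention that produces
    one output per input element."""
    encode = num_latents * num_inputs                      # latents query the inputs
    process = num_latent_layers * num_latents * num_latents  # latents attend to latents
    decode = num_inputs * num_latents                      # output queries read the latents
    return encode + process + decode


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
pixels = 512 * 512
full = self_attention_scores(pixels)
latent = perceiver_style_scores(pixels, num_latents=512, num_latent_layers=8)
ratio = full / latent
print(f"full self-attention pairs: {full:,}")
print(f"perceiver-style pairs:     {latent:,}")
print(f"reduction factor:          {ratio:.1f}x")
```

Under these assumed sizes the latent bottleneck cuts the number of attention scores by more than two orders of magnitude, since the cost grows linearly rather than quadratically in the number of input elements.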
