HiT: Building Mapping with Hierarchical Transformers (2309.09643v2)
Abstract: Deep learning-based methods have been extensively explored for automatic building mapping from high-resolution remote sensing images in recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction into several sub-problems, including segmentation, polygonization, and regularization, leading to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose a simple and novel building mapping method with Hierarchical Transformers, called HiT, which improves the quality of polygonal building mapping from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head parallel to the classification and bounding box regression heads. HiT simultaneously outputs building bounding boxes and vector polygons and is fully end-to-end trainable. The polygon head formulates a building polygon as a bidirectional sequence of vertices, a simple and elegant representation that avoids any assumption about the start or end vertex. Under this new perspective, the polygon head adopts a transformer encoder-decoder architecture to predict serialized vertices, supervised by a purpose-designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution is introduced in the encoder of the polygon head, capturing the geometric structure of building polygons at both the vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method achieves new state-of-the-art results on both instance segmentation and polygonal metrics. Moreover, qualitative results verify the superiority and effectiveness of our model in complex scenes.
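The core idea of the bidirectional vertex representation is that a polygon's ground truth is a cyclic sequence with no privileged start vertex or traversal direction, so the supervision should be invariant to both. The paper does not give the loss in this excerpt; as an illustrative assumption, a minimal sketch of such a loss takes the minimum L1 error over both traversal directions and all cyclic start points of the ground-truth sequence (the function name `bidirectional_polygon_loss` is hypothetical):

```python
import numpy as np

def bidirectional_polygon_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L1 distance between predicted and ground-truth vertex
    sequences, minimized over both traversal directions and every
    cyclic start vertex, so no fixed start/end vertex is assumed.

    pred, gt: (N, 2) arrays of polygon vertex coordinates.
    """
    n = len(gt)
    candidates = []
    for direction in (gt, gt[::-1]):        # forward and reversed order
        for s in range(n):                  # every possible start vertex
            rolled = np.roll(direction, s, axis=0)
            candidates.append(np.abs(pred - rolled).mean())
    return float(min(candidates))
```

With this formulation, a prediction that traces the same polygon from a different start vertex or in the opposite direction incurs zero loss, which is the invariance the abstract's "bidirectional characteristic" refers to.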