AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation (2309.14065v7)
Abstract: Understanding indoor scenes is crucial for urban studies. Given the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multi-modal feature extraction, reducing redundant parameters by optimizing the distribution of computational resources. To fuse the asymmetric multi-modal features, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from the two modalities by exploiting their dependencies. A Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module then extracts further cross-modal representations. AsymFormer achieves competitive results of 54.1% mIoU on NYUv2 and 49.1% mIoU on SUN RGB-D. Notably, it reaches an inference speed of 65 FPS (79 FPS after mixed-precision quantization) on an RTX 3090, demonstrating that AsymFormer strikes a balance between high accuracy and efficiency.
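The abstract describes the fusion pipeline only at a high level. The following is a minimal, hypothetical sketch of the general pattern it names — channel-wise feature selection between modalities, followed by cross-modal attention — not the paper's actual LAFS/CMA implementation. It assumes (H, W, C) feature maps from the RGB and depth branches and uses NumPy in place of a deep-learning framework; all function names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_gate(feat):
    # Squeeze-and-excitation-style gate: global average pool over
    # spatial dims, then a sigmoid to get per-channel weights in (0, 1).
    pooled = feat.mean(axis=(0, 1))            # (C,)
    return 1.0 / (1.0 + np.exp(-pooled))

def fuse_rgbd(rgb, depth):
    """Illustrative attention-guided fusion of RGB and depth features.

    rgb, depth: (H, W, C) feature maps from the two backbone branches.
    Returns a fused (H, W, C) map.
    """
    # 1) Feature selection: normalized channel-wise gating decides how
    #    much each modality contributes per channel.
    w_rgb = channel_gate(rgb)
    w_d = channel_gate(depth)
    norm = w_rgb + w_d + 1e-8
    selected = rgb * (w_rgb / norm) + depth * (w_d / norm)

    # 2) Cross-modal attention: RGB pixels act as queries over depth
    #    keys/values to embed cross-modal correlations.
    H, W, C = selected.shape
    q = rgb.reshape(-1, C)                     # (H*W, C)
    k = depth.reshape(-1, C)
    v = depth.reshape(-1, C)
    attn = softmax(q @ k.T / np.sqrt(C))       # (H*W, H*W)
    cross = (attn @ v).reshape(H, W, C)

    # Residual combination of selected and cross-attended features.
    return selected + cross

rgb = np.random.rand(8, 8, 16)
depth = np.random.rand(8, 8, 16)
fused = fuse_rgbd(rgb, depth)
print(fused.shape)  # (8, 8, 16)
```

A real implementation would learn projection matrices for the queries, keys, values, and gates, and would typically restrict the attention to local windows or downsampled maps to keep the quadratic (H*W)² cost compatible with the real-time goal.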