
AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation (2309.14065v7)

Published 25 Sep 2023 in cs.CV

Abstract: Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing computational resource distribution. To fuse asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module is used to selectively fuse features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. The AsymFormer demonstrates competitive results with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS (79 FPS after implementing mixed precision quantization) on RTX3090, demonstrating that AsymFormer can strike a balance between high accuracy and efficiency.

References (51)
  1. Multimae: Multi-modal multi-task masked autoencoders. In European Conference on Computer Vision, pages 348–367. Springer, 2022.
  2. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021.
  3. Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7088–7097, 2021.
  4. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  5. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 561–577. Springer, 2020.
  6. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  7. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1911–1920, 2019.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  9. Pscnet: Efficient rgb-d semantic segmentation parallel network based on spatial and channel attention. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, (1), 2022.
  10. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.
  11. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, pages 213–228. Springer, 2017.
  12. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  13. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13713–13722, 2021.
  14. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  15. Temporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8818–8827, 2020.
  16. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1440–1444. IEEE, 2019.
  17. Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054, 2018.
  18. Multi-scale fusion for rgb-d indoor semantic segmentation. Scientific Reports, 12(1):20305, 2022.
  19. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv preprint arXiv:2207.05501, 2022.
  20. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. arXiv preprint arXiv:2203.04838, 2022a.
  21. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022b.
  22. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  23. Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272, 2018.
  24. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In 2019 International Conference on Robotics and Automation (ICRA), pages 7101–7107. IEEE, 2019.
  25. Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 4980–4989, 2017.
  26. Efficient rgb-d semantic segmentation for indoor scene analysis. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13525–13531. IEEE, 2021.
  27. Efficient multi-task rgb-d scene analysis for indoor environments. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2022.
  28. Indoor segmentation and support inference from rgbd images. ECCV (5), 7576:746–760, 2012.
  29. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  30. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
  31. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  32. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  33. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12186–12195, 2022.
  34. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  35. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  36. Semantic scene segmentation for indoor robot navigation via deep learning. In Proceedings of the 3rd International Conference on Robotics, Control and Automation, pages 112–118, 2018.
  37. Dformer: Rethinking rgbd representation learning for semantic segmentation. arXiv preprint arXiv:2309.09668, 2023.
  38. Indoor ar navigation and emergency evacuation system based on machine learning and iot technologies. IEEE Internet of Things Journal, 9(21):20853–20868, 2022.
  39. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  40. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, 129:3051–3068, 2021.
  41. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022.
  42. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems, 2023.
  43. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5287–5295, 2017.
  44. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  45. Beyond point clouds: Scene understanding by reasoning geometry and physics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3127–3134, 2013.
  46. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
  47. Canet: Co-attention network for rgb-d semantic segmentation. Pattern Recognition, 124:108468, 2022a.
  48. Tsnet: Three-stream self-attention network for rgb-d indoor semantic segmentation. IEEE Intelligent Systems, 36(4):73–78, 2020.
  49. Pgdenet: Progressive guided fusion and depth enhancement network for rgb-d indoor scene parsing. IEEE Transactions on Multimedia, 2022b.
  50. Frnet: Feature reconstruction network for rgb-d indoor scene parsing. IEEE Journal of Selected Topics in Signal Processing, 16(4):677–687, 2022c.
  51. Bcinet: Bilateral cross-modal interaction network for indoor scene understanding in rgb-d images. Information Fusion, 78:84–94, 2023.

Summary

  • The paper introduces AsymFormer, an efficient network for real-time RGB-D segmentation using an asymmetrical backbone and attention-guided modules.
  • The methodology leverages a Local Attention-Guided Feature Selection module and a Cross-Modal Attention module for dynamic, multi-modal feature fusion.
  • Experimental results on NYUv2 and SUNRGBD demonstrate competitive accuracy and inference speed, enabling effective mobile platform deployment.

AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

The paper "AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation" (2309.14065) introduces AsymFormer, a novel and efficient network designed for real-time semantic segmentation using RGB-D multi-modal information. The network employs an asymmetrical backbone for feature extraction, a Local Attention-Guided Feature Selection (LAFS) module for selective feature fusion, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module for cross-modal representation learning. The authors demonstrate the efficacy of AsymFormer on the NYUv2 and SUNRGBD datasets, achieving a balance between accuracy and inference speed suitable for mobile platform deployment.

Motivation and Background

Real-time semantic segmentation of indoor scenes is critical for various applications, including emergency evacuation [yoo2022indoor], robotic navigation [yeboah2018semantic], and virtual reality [zhang2017physically]. While existing methods often struggle to balance accuracy and efficiency, this work addresses the challenge by leveraging RGB-D data and optimizing computational resource allocation. The authors highlight the redundancy introduced by symmetric backbones in existing RGB-D segmentation networks [du2022pscnet] and propose an asymmetric design to mitigate this issue.

AsymFormer Architecture and Key Components

The AsymFormer architecture consists of three main components: an asymmetric backbone, a Local Attention-Guided Feature Selection (LAFS) module, and a Cross-Modal Attention (CMA) module (Figure 1).

Figure 1: An overview of the AsymFormer architecture, illustrating the flow of information from the asymmetric backbone through the LAFS and CMA modules.

Asymmetric Backbone

The asymmetric backbone is a core innovation, employing a larger, more parameter-rich CNN (ConvNeXt [liu2022convnet]) for RGB feature extraction and a lightweight Transformer (Mix-Transformer [xie2021segformer]) for Depth feature extraction. This design choice reflects the observation that RGB information typically plays a more prominent role in semantic segmentation tasks [du2022pscnet]. By allocating more computational resources to the RGB branch, the network reduces redundant parameters and improves efficiency.
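
As a rough illustration of the asymmetric design, the minimal PyTorch sketch below gives the RGB branch wide ConvNeXt-style convolutional blocks and the Depth branch a narrow, lightweight attention block. The block definitions, channel widths, and depths are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an asymmetric RGB-D encoder (not the authors' code).
# The RGB branch gets wider channels and more blocks than the Depth branch,
# mirroring the paper's choice to spend more parameters on RGB features.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Simplified ConvNeXt-style block: depthwise conv + pointwise MLP."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))


class LightAttnBlock(nn.Module):
    """Lightweight transformer-style block for the Depth branch."""
    def __init__(self, dim, heads=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).reshape(b, c, h, w)


class AsymmetricEncoder(nn.Module):
    """Two-branch encoder with unequal capacity for RGB and Depth."""
    def __init__(self, rgb_dim=96, depth_dim=32):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, rgb_dim, 4, stride=4)
        self.depth_stem = nn.Conv2d(1, depth_dim, 4, stride=4)
        self.rgb_branch = nn.Sequential(*[ConvBlock(rgb_dim) for _ in range(3)])
        self.depth_branch = nn.Sequential(LightAttnBlock(depth_dim))

    def forward(self, rgb, depth):
        return (self.rgb_branch(self.rgb_stem(rgb)),
                self.depth_branch(self.depth_stem(depth)))


# Example on a small 120x160 RGB-D crop; features are at 1/4 resolution.
enc = AsymmetricEncoder()
f_rgb, f_d = enc(torch.randn(1, 3, 120, 160), torch.randn(1, 1, 120, 160))
print(f_rgb.shape, f_d.shape)  # (1, 96, 30, 40) and (1, 32, 30, 40)
```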

Local Attention-Guided Feature Selection (LAFS)

The LAFS module is designed to selectively fuse features from different modalities by leveraging their dependencies. Unlike existing attention mechanisms that use fixed strategies for feature compression [woo2018cbam], LAFS employs a learnable method for spatial information compression. This is achieved through a feedforward neural network that learns dynamic spatial information compression rules, enabling the network to adaptively select and fuse features based on their relevance. The LAFS module computes attention weights using Adaptive Average Pooling to extract a global information vector Avg. Spatial attention weights W_S are then calculated using the following formula:

W_{S}=\text{Sigmoid}\left(\frac{\text{Dot}\left(\text{Input}.\text{Reshape}(C, H \times W)^{T},\, R_{Avg}\right)}{C^{2}}\right)

This allows the network to focus on the most informative regions of the feature maps. Figure 2 shows the details of the LAFS module.

Figure 2: An illustration of the LAFS module, detailing the learnable method for spatial information compression.
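
The quoted spatial-weight formula can be sketched in a few lines of PyTorch, shown below. This toy version keeps only the Adaptive Average Pooling path and the sigmoid-normalized dot product; the learnable compression network and the channel-selection part of the full LAFS module are omitted, and all tensor shapes are assumptions.

```python
# Toy re-implementation of the spatial weighting formula above (not the
# authors' code): each pixel's feature vector is compared with the global
# average vector, scaled by 1/C^2, and squashed into a per-pixel weight.
import torch
import torch.nn as nn


class SpatialSelect(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global information vector Avg

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        r_avg = self.pool(x).reshape(b, c, 1)                # (B, C, 1)
        flat = x.reshape(b, c, h * w).transpose(1, 2)        # (B, HW, C) = Input.Reshape(C, HxW)^T
        w_s = torch.sigmoid(flat @ r_avg / c ** 2)           # (B, HW, 1) spatial weights W_S
        return x * w_s.transpose(1, 2).reshape(b, 1, h, w)   # re-weight every pixel


feats = torch.randn(2, 64, 60, 80)
print(SpatialSelect()(feats).shape)  # torch.Size([2, 64, 60, 80])
```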

Cross-Modal Attention (CMA)

The CMA module is introduced to further extract cross-modal representations by embedding cross-modal information into pixel-wise fused features. The key to CMA is defining cross-modal self-similarity using a linear sum and embedding its result into the fused features. For a pixel (i_0, j_0), its cross-modal self-similarity with other pixels (i, j) is defined as:

W(i,j)=\sum_{n=1}^{N}\left(Kr_{n,i,j}\cdot Qr_{n,i_0,j_0}\right)+\sum_{n=1}^{N}\left(Kd_{n,i,j}\cdot Qd_{n,i_0,j_0}\right)

The CMA module has three input features: RGB features, Depth features, and the fused features selected by LAFS. These features are embedded into a vector space, and the embedded features are split into two independent vectors. A shuffle mechanism is introduced to ensure that each vector contains information from both modalities, enabling the network to learn features from multiple subspaces. Figure 3 shows the feature embedding process.

Figure 3: Feature embedding in the CMA module, showcasing the splitting of the output Value into two independent vectors.

Figure 4 illustrates the splitting and mixing of multimodal information.

Figure 4: Splitting and mixing of multimodal information within the CMA module, ensuring that each subspace contains information from both RGB and Depth modalities.
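
As a sketch of the cross-modal self-similarity above, the hypothetical module below forms the affinity matrix as the sum of an RGB query-key product and a Depth query-key product and uses it to re-weight the LAFS-fused features. The multi-head split and shuffle mechanism described in the paper are omitted for brevity, and the projection sizes are assumptions.

```python
# Hedged sketch of cross-modal attention (not the authors' exact layout):
# the pixel-to-pixel affinity W(i, j) adds an RGB term and a Depth term,
# and the resulting attention re-weights values from the fused features.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, dim, key_dim=32):
        super().__init__()
        self.q_rgb = nn.Conv2d(dim, key_dim, 1)
        self.k_rgb = nn.Conv2d(dim, key_dim, 1)
        self.q_d = nn.Conv2d(dim, key_dim, 1)
        self.k_d = nn.Conv2d(dim, key_dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)        # values from the LAFS-fused features

    def forward(self, f_rgb, f_d, f_fused):    # all (B, C, H, W)
        b, c, h, w = f_fused.shape
        qr = self.q_rgb(f_rgb).flatten(2)      # (B, k, HW)
        kr = self.k_rgb(f_rgb).flatten(2)
        qd = self.q_d(f_d).flatten(2)
        kd = self.k_d(f_d).flatten(2)
        # W(i, j) = sum_n Kr * Qr + sum_n Kd * Qd  (RGB and Depth affinities added)
        sim = qr.transpose(1, 2) @ kr + qd.transpose(1, 2) @ kd   # (B, HW, HW)
        attn = sim.softmax(dim=-1)
        out = self.v(f_fused).flatten(2) @ attn.transpose(1, 2)   # (B, C, HW)
        return out.reshape(b, c, h, w)


cma = CrossModalAttention(dim=64)
x = torch.randn(1, 64, 30, 40)
print(cma(x, x, x).shape)  # torch.Size([1, 64, 30, 40])
```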

Experimental Results

The authors evaluated AsymFormer on the NYUv2 and SUNRGBD datasets, demonstrating its competitive performance in terms of accuracy and inference speed. On NYUv2, AsymFormer achieved 54.1% mIoU with an inference speed of 65 FPS on an RTX 3090 GPU; with mixed precision quantization, the inference speed further increased to 79 FPS. On SUNRGBD, AsymFormer achieved 49.1% mIoU. Ablation studies validate the effectiveness of the LAFS and CMA modules. The spatial attention map of the LAFS module is visualized and compared against CBAM in Figure 5.

Figure 5: A comparison of the spatial attention weights generated by CBAM and LAFS, highlighting the improved coverage and consistency of LAFS.

Figure 6 visualizes the semantic segmentation results of AsymFormer on the NYUv2 dataset.

Figure 6: Visualization of semantic segmentation results on the NYUv2 dataset, showcasing the accuracy of AsymFormer in identifying various objects and regions within indoor scenes.
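
Since the reported speedup from mixed precision is central to the efficiency claim, a hedged sketch of how such an FPS measurement could be set up in PyTorch is shown below. It is a generic template assuming a CUDA device and a two-input (RGB, Depth) model, not the authors' benchmarking code or quantization pipeline.

```python
# Hypothetical timing template for half-precision inference in PyTorch; the
# paper's 65/79 FPS figures come from the authors' own RTX 3090 setup, which
# this sketch does not reproduce.
import time
import torch


@torch.no_grad()
def benchmark_fps(model, inputs, iters=100, warmup=10):
    """Measure frames per second for model(*inputs) under float16 autocast."""
    model = model.eval().cuda()
    inputs = [x.cuda() for x in inputs]
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        for _ in range(warmup):                 # warm-up to stabilize kernels
            model(*inputs)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(*inputs)
        torch.cuda.synchronize()
    return iters / (time.time() - start)


# Usage (any RGB-D segmentation model taking an RGB and a Depth tensor):
# fps = benchmark_fps(model, [torch.randn(1, 3, 480, 640),
#                             torch.randn(1, 1, 480, 640)])
```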

Conclusion

The AsymFormer architecture presents a compelling approach to real-time RGB-D semantic segmentation by balancing accuracy and efficiency. The asymmetric backbone, LAFS module, and CMA module collectively contribute to reducing redundant parameters and improving feature representation. Experimental results on NYUv2 and SUNRGBD demonstrate the potential of AsymFormer for deployment on mobile platforms and real-time applications. Future work may focus on self-supervised pre-training and further optimization of the network architecture to achieve even greater improvements in performance.