OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation (2403.14418v1)
Abstract: The boom in 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overtook sparse CNNs and became the state-of-the-art models, especially in 3D semantic segmentation. However, sparse CNNs remain valuable because of their efficiency and ease of application. In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We find that the key factor behind the performance difference is adaptivity. Specifically, we propose two key components, i.e., spatially adaptive receptive fields and adaptive relation, to bridge the gap. This exploration led to Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules, OA-CNNs surpass point transformers in accuracy in both indoor and outdoor scenes, with much lower latency and memory cost. Notably, they achieve 76.1%, 78.9%, and 70.6% mIoU on the ScanNet v2, nuScenes, and SemanticKITTI validation benchmarks, respectively, while running up to 5x faster than transformer counterparts. These results highlight the potential of pure sparse CNNs to outperform transformer-related networks.