Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation (2404.11958v1)
Abstract: Semantic scene completion, also known as semantic occupancy prediction, provides dense geometric and semantic information for autonomous vehicles and has attracted increasing attention from both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat every voxel in 3D space equally during training. Because hard voxels do not receive enough attention, performance in challenging regions is limited. The dense 3D space typically contains a large number of empty voxels, which are easy to learn but still consume a large amount of computation because existing models handle all voxels uniformly. Furthermore, voxels in boundary regions are more difficult to differentiate than those in the interior. In this paper, we propose the HASSC approach, which trains the semantic scene completion model with a hardness-aware design. A global hardness, derived from the network optimization process, is defined for dynamic hard voxel selection. A local hardness based on geometric anisotropy is then adopted for voxel-wise refinement. In addition, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme effectively improves the accuracy of the baseline model without incurring extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC.
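The two hardness notions in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: global hardness is approximated as one minus the predicted probability of the ground-truth class, local hardness is approximated as the number of face-adjacent neighbors whose label differs from the voxel's own, and the function names `global_hardness`, `local_anisotropy`, and `select_hard_voxels` are hypothetical. It is not the authors' implementation; see the linked repository for that.

```python
import numpy as np

def global_hardness(probs, labels):
    """Assumed stand-in: 1 - p(ground-truth class) per voxel.

    probs:  (C, X, Y, Z) softmax output
    labels: (X, Y, Z) integer class ids
    """
    p_gt = np.take_along_axis(probs, labels[None], axis=0)[0]
    return 1.0 - p_gt

def local_anisotropy(labels):
    """Assumed stand-in: count of 6-connected neighbors with a different label.

    Neighbors outside the grid are ignored.
    """
    lga = np.zeros(labels.shape, dtype=np.int64)
    for axis in range(3):
        n = labels.shape[axis]
        lo = [slice(None)] * 3
        hi = [slice(None)] * 3
        lo[axis] = slice(0, n - 1)
        hi[axis] = slice(1, n)
        lo, hi = tuple(lo), tuple(hi)
        # 1 where a voxel and its +1 neighbor along this axis disagree
        diff = (labels[lo] != labels[hi]).astype(np.int64)
        lga[lo] += diff  # the voxel sees its +1 neighbor
        lga[hi] += diff  # the neighbor sees the voxel on its -1 side
    return lga

def select_hard_voxels(hardness, k):
    """Return the grid indices of the k voxels with the largest hardness."""
    flat = hardness.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    return np.unravel_index(idx, hardness.shape)
```

In a training loop, one plausible use is to call `select_hard_voxels` on the global hardness at each step and to weight the loss on the selected voxels by the local value, so that boundary voxels contribute more than interior ones. How HASSC actually combines the two terms is not stated in the abstract.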