Rethinking Dilated Convolution for Real-time Semantic Segmentation (2111.09957v3)
Abstract: The field-of-view is an important metric when designing a model for semantic segmentation. To obtain a large field-of-view, previous approaches generally choose to rapidly downsample the resolution, usually with average pooling or stride-2 convolutions. We take a different approach by using dilated convolutions with large dilation rates throughout the backbone, which lets the backbone tune its field-of-view simply by adjusting its dilation rates, and we show that this approach is competitive with existing ones. To use dilated convolutions effectively, we derive a simple upper bound on the dilation rate that avoids leaving gaps between the convolutional weights, and we design an SE-ResNeXt-inspired block structure that uses two parallel $3\times 3$ convolutions with different dilation rates to preserve local details. Manually tuning the dilation rates for every block can be difficult, so we also introduce a differentiable neural architecture search method that uses gradient descent to optimize the dilation rates. In addition, we propose a lightweight decoder that restores local information better than common alternatives. To demonstrate the effectiveness of our approach, we show that our model, RegSeg, achieves competitive results on the real-time Cityscapes and CamVid benchmarks. Using a T4 GPU with mixed precision, RegSeg achieves 78.3 mIOU on the Cityscapes test set at $37$ FPS and 80.9 mIOU on the CamVid test set at $112$ FPS, both without ImageNet pretraining.
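To make the trade-off between downsampling and dilation concrete, the following minimal sketch computes the field-of-view of a stack of convolutions using the standard receptive-field recurrence. The helper names and the two layer stacks are hypothetical illustrations chosen for this example; they are not RegSeg's actual configuration, and the sketch does not reproduce the paper's upper bound on the dilation rate.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def field_of_view(layers):
    """Field-of-view (in input pixels, along one axis) of stacked convolutions.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples,
    listed from the input side to the output side.
    """
    fov = 1   # a single output pixel sees one input pixel before any layer
    jump = 1  # distance in input pixels between adjacent feature positions
    for kernel_size, stride, dilation in layers:
        fov += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return fov


# Hypothetical stack A: four 3x3 convolutions that each downsample by 2.
downsampling_stack = [(3, 2, 1)] * 4

# Hypothetical stack B: four 3x3 convolutions at full resolution,
# with dilation rates 1, 2, 4, 8 instead of strides.
dilated_stack = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]

print(field_of_view(downsampling_stack))
print(field_of_view(dilated_stack))
```

Under these assumptions both stacks reach the same field-of-view, but the dilated stack keeps the feature map at its input resolution, which illustrates why dilation rates can serve as a tuning knob for the field-of-view.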