The revenge of BiSeNet: Efficient Multi-Task Image Segmentation (2404.09570v1)
Abstract: Recent advances in image segmentation have focused on improving model efficiency to meet the demands of real-time applications, particularly on edge devices. However, existing research has concentrated primarily on single-task settings, especially semantic segmentation, which has led to redundant effort and specialized architectures for each task. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation that handles various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which builds on the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach retains the efficient spatial and context paths, which capture detailed and semantic information respectively, and adds an efficient transformer-based segmentation head that computes binary masks and their class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on two popular datasets, Cityscapes and ADE20K, and show high inference speeds while maintaining accuracy competitive with state-of-the-art architectures. Our results indicate that BiSeNetFormer is a significant step toward fast, efficient, multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
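The abstract says the segmentation head outputs a set of binary masks together with class probabilities, and that these same outputs serve both semantic and panoptic segmentation. As an illustration only, the sketch below shows how mask-classification outputs are commonly converted into a semantic map and a panoptic map, following MaskFormer/Mask2Former-style inference rules. The abstract does not state that BiSeNetFormer uses exactly this post-processing, so treat it as an assumption. The function names, the convention that the last class column means "no object", and the 0.5 score and mask thresholds are hypothetical choices; the panoptic routine also omits the area-overlap filtering and stuff-segment merging that full implementations add. Plain nested lists keep it dependency-free.

```python
import math
from typing import Dict, List, Tuple

Grid = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_inference(class_logits: List[List[float]],
                       mask_logits: List[Grid]) -> List[List[int]]:
    """class_logits: N x (K+1), last column assumed to be 'no object'.
    mask_logits: N x H x W. Returns an H x W map of class ids in [0, K)."""
    # Per-query class probabilities with the 'no object' column dropped.
    probs = [softmax(row)[:-1] for row in class_logits]
    masks = [[[sigmoid(v) for v in r] for r in m] for m in mask_logits]
    num_classes = len(probs[0])
    height, width = len(masks[0]), len(masks[0][0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c = sum over queries of p_i(c) * m_i(y, x).
            scores = [sum(p[c] * m[y][x] for p, m in zip(probs, masks))
                      for c in range(num_classes)]
            row.append(max(range(num_classes), key=scores.__getitem__))
        out.append(row)
    return out


def panoptic_inference(class_logits: List[List[float]],
                       mask_logits: List[Grid],
                       score_threshold: float = 0.5
                       ) -> Tuple[List[List[int]], Dict[int, int]]:
    """Returns (segment_map, segment_classes); segment id 0 means void."""
    num_classes = len(class_logits[0]) - 1
    kept = []  # (query index, class id, class score)
    for q, row in enumerate(class_logits):
        p = softmax(row)
        label = max(range(len(p)), key=p.__getitem__)
        # Drop 'no object' queries and low-confidence queries.
        if label != num_classes and p[label] > score_threshold:
            kept.append((q, label, p[label]))
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    segment_map = [[0] * width for _ in range(height)]
    segment_classes: Dict[int, int] = {}
    for y in range(height):
        for x in range(width):
            best, best_score = -1, 0.0
            for idx, (q, _, score) in enumerate(kept):
                s = score * sigmoid(mask_logits[q][y][x])
                if s > best_score:
                    best, best_score = idx, s
            # Assign the pixel only if the winning query's own mask fires.
            if best >= 0 and sigmoid(mask_logits[kept[best][0]][y][x]) >= 0.5:
                segment_map[y][x] = best + 1
                segment_classes[best + 1] = kept[best][1]
    return segment_map, segment_classes


if __name__ == "__main__":
    # Three queries, two classes plus 'no object', on a 1 x 2 image.
    cls = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
    msk = [[[4.0, -4.0]], [[-4.0, 4.0]], [[4.0, 4.0]]]
    print(semantic_inference(cls, msk))  # [[0, 1]]
    print(panoptic_inference(cls, msk))  # ([[1, 2]], {1: 0, 2: 1})
```

In the semantic path every query casts a probability-weighted vote at each pixel, while the panoptic path assigns each pixel to a single surviving query, so the third ("no object") query is ignored entirely. This is what allows one set of mask and class outputs to serve both tasks.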