BridgeNet: Comprehensive and Effective Feature Interactions via Bridge Feature for Multi-task Dense Predictions (2312.13514v2)
Abstract: Multi-task dense prediction aims to handle multiple pixel-wise prediction tasks simultaneously within a unified network for visual scene understanding. However, cross-task feature interactions in current methods still suffer from incomplete levels of representation, less discriminative semantics in the participating features, and inefficient pair-wise task interactions. To tackle these under-explored issues, we propose a novel BridgeNet framework, which extracts comprehensive and discriminative intermediate Bridge Features and conducts interactions based on them. Specifically, a Task Pattern Propagation (TPP) module is first applied so that highly semantic task-specific features are prepared for subsequent interactions, and a Bridge Feature Extractor (BFE) is specially designed to selectively integrate both high-level and low-level representations into comprehensive bridge features. Then, instead of conducting heavy pair-wise cross-task interactions, a Task-Feature Refiner (TFR) is developed to efficiently take guidance from the bridge features and form the final task predictions. To the best of our knowledge, this is the first work to consider the completeness and quality of feature participants in cross-task interactions. Extensive experiments are conducted on the NYUD-v2, Cityscapes and PASCAL Context benchmarks, and the superior performance shows that the proposed architecture is effective and powerful in improving different dense prediction tasks simultaneously.
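The efficiency argument in the abstract can be illustrated with a simple counting sketch. This is not the authors' implementation. It assumes that a pair-wise scheme performs one directed interaction for every ordered pair of tasks, whereas a bridge-based scheme refines each task once against shared bridge features. The function names are hypothetical, and the abstract does not state the actual cost of the BFE or TFR modules.

```python
def pairwise_interactions(num_tasks: int) -> int:
    """Directed task-to-task interactions when every task exchanges
    features with every other task (assumed pair-wise scheme)."""
    return num_tasks * (num_tasks - 1)


def bridge_interactions(num_tasks: int) -> int:
    """Interactions when each task is refined once against shared
    bridge features (assumed bridge-based scheme)."""
    return num_tasks


# Compare how the two counts grow with the number of tasks.
for t in (2, 3, 4, 5):
    print(f"tasks={t} pairwise={pairwise_interactions(t)} bridge={bridge_interactions(t)}")
```

Under these assumptions the pair-wise count grows quadratically with the number of tasks, while the bridge-based count grows linearly.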