Removal then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection (2401.10731v6)
Abstract: In recent years, object detection using both visible (RGB) and thermal infrared (IR) imagery has attracted extensive attention and has been applied across a wide range of fields. By leveraging the complementary properties of RGB and IR images, object detection can achieve reliable and robust localization across lighting conditions, from daytime to nighttime. Most existing multi-modal object detection methods feed the RGB and IR images directly into deep neural networks, which results in inferior detection performance. We believe this issue arises not only from the difficulty of effectively integrating multimodal information but also from the presence of redundant features in both the RGB and IR modalities. The redundant information in each modality exacerbates fusion imprecision as it propagates through the network. To address this issue, we draw inspiration from the human brain's mechanism for processing multimodal information and propose a novel coarse-to-fine perspective to purify and fuse features from both modalities. Specifically, following this perspective, we design a Redundant Spectrum Removal module that coarsely removes interfering information within each modality and a Dynamic Feature Selection module that finely selects the desired features for fusion. To verify the effectiveness of the coarse-to-fine fusion strategy, we construct a new object detector called the Removal then Selection Detector (RSDet). Extensive experiments on three RGB-IR object detection datasets verify the superior performance of our method.
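The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. The abstract does not specify how either module is implemented, so everything below is an assumption made for illustration: "spectrum" is read as a frequency spectrum, features are short 1-D lists, the frequency mask and gate scores are hand-set rather than learned, and the names `remove_spectrum` and `select_and_fuse` are hypothetical. It is not the authors' RSDet code.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def remove_spectrum(x, keep_mask):
    """Coarse step (assumed): zero the frequency bins where keep_mask is 0."""
    spectrum = dft(x)
    return idft([c * m for c, m in zip(spectrum, keep_mask)])


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(features, gate_scores):
    """Fine step (assumed): softmax-weighted sum of candidate features."""
    weights = softmax(gate_scores)
    length = len(features[0])
    return [
        sum(w * f[t] for w, f in zip(weights, features))
        for t in range(length)
    ]


# Toy per-modality features (hypothetical values).
rgb = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
ir = [2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0]

# Hand-set mask that keeps the mean and the lowest frequency pair.
low_pass = [1, 1, 0, 0, 0, 0, 0, 1]

rgb_clean = remove_spectrum(rgb, low_pass)
ir_clean = remove_spectrum(ir, low_pass)

# Hand-set gate scores; in a trained detector these would be predicted.
fused = select_and_fuse([rgb_clean, ir_clean], gate_scores=[0.2, 1.0])
```

The two steps are kept as separate functions so that the coarse removal runs independently on each modality before any cross-modal weighting, which mirrors the "removal then selection" ordering described above.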
Authors:
- Tianyi Zhao
- Maoxun Yuan
- Xingxing Wei
- Feng Jiang
- Nan Wang