YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework (2506.14696v2)
Abstract: Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and lightweight model design, challenges remain, such as the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework built on YOLOv11. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and to RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, such as LLVIP and FLIR. In particular, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved the mAP of YOLOv11 models by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the effectiveness of the framework and strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Summary
- The paper introduces YOLOv11-RGBT, a comprehensive single-stage framework that optimizes multispectral object detection using a novel P3 mid-fusion strategy.
- It proposes the Multispectral Controllable Fine-tuning (MCF) method, which enhances detection performance by effectively integrating dominant and complementary modalities.
- Extensive experiments on datasets like FLIR and LLVIP demonstrate significant mAP improvements and validate the framework's versatility across various detection models.
Multispectral object detection, which leverages information from multiple spectral bands like visible (RGB) and thermal (infrared), is crucial for robust detection in challenging conditions such as low light or adverse weather. Traditional RGB-based methods struggle in these scenarios, and early multispectral approaches often underutilized the complementary nature of different modalities. While feature fusion has been explored, existing methods face challenges: a lack of a unified single-stage framework applicable across various models, difficulty in balancing detection performance with fusion strategy complexity, and suboptimal allocation of importance to different modalities.
To address these issues, the paper "YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework" introduces YOLOv11-RGBT, a comprehensive single-stage framework built upon the YOLOv11 architecture (2506.14696). This framework is designed to support various multispectral computer vision tasks, with a particular focus on object detection. It features a modular design comprising a backbone for feature extraction, a neck for feature processing and fusion, and a head for task execution. The framework's flexibility allows it to be applied not only to different YOLO versions (YOLOv3-YOLOv12) but also to other single-stage detectors like RT-DETR [zhao_detrs_2024] and PP-YOLOE [xu_pp-yoloe_2022].
A core contribution of YOLOv11-RGBT is the reevaluation and proposal of multispectral feature fusion strategies, particularly focusing on mid-level fusion. While conventional mid-level fusion often involves fusing features at multiple stages (P3 to P5), which can introduce redundancy and interference, the authors propose a simplified P3 mid-level fusion strategy. This approach fuses features from visible and infrared modalities specifically at the P3 layer, processed by a trainable module after concatenation. This single-node fusion aims to effectively utilize early-stage features while reducing computational overhead and parameters compared to multi-node fusion.
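As a concrete illustration of single-node fusion, the following is a minimal PyTorch sketch of the P3 mid-fusion idea, assuming both streams expose P3 feature maps with the same channel count and that the trainable module after concatenation is a 1x1 convolution with normalization and activation; the class and argument names are illustrative, not taken from the released code.

```python
# Minimal sketch of P3 mid-level fusion: features from the RGB and infrared
# streams are fused only at the P3 level via concatenation followed by a
# trainable convolutional block. Names are illustrative.
import torch
import torch.nn as nn

class P3Fusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Trainable module applied after concatenation (here a 1x1 conv + BN + SiLU).
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, p3_rgb: torch.Tensor, p3_ir: torch.Tensor) -> torch.Tensor:
        # Single-node fusion: only the P3 features are combined across modalities;
        # P4/P5 features continue through their own single-modal paths.
        return self.fuse(torch.cat([p3_rgb, p3_ir], dim=1))
```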
Acknowledging that the importance of modalities varies depending on the dataset and conditions (e.g., infrared is more informative in low light), the paper introduces the Multispectral Controllable Fine-tuning (MCF) strategy. Inspired by ControlNet [zhang_adding_2023], MCF addresses the issue of unbalanced modality importance and facilitates leveraging pre-trained knowledge. It involves freezing a pre-trained single-modal model (e.g., trained on infrared images if that modality is dominant) and then introducing features from the complementary modality (e.g., visible) through a trainable Zero Conv2d layer. This allows controlled fine-tuning and integration of the second modality's information while preserving the robust features learned from the dominant modality or pre-trained weights. The paper demonstrates that MCF can enhance detection stability and performance, particularly on datasets where one modality is significantly more informative.
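The following is a minimal sketch of how such a zero-initialized injection might look in PyTorch, assuming additive fusion of the complementary modality's features into the frozen stream at a single layer; `ZeroConv2d` and `MCFInjection` are illustrative names rather than the framework's actual modules.

```python
# Sketch of the MCF idea: a zero-initialized 1x1 convolution injects the
# complementary modality's features into a frozen, pre-trained dominant-modality
# stream, so fine-tuning starts from the unmodified pre-trained behavior.
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at zero, so the
    complementary branch initially contributes nothing to the output."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__(in_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class MCFInjection(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.zero_conv = ZeroConv2d(channels, channels)

    def forward(self, frozen_feat: torch.Tensor, complementary_feat: torch.Tensor) -> torch.Tensor:
        # Additive injection: output equals frozen_feat at initialization, then
        # the zero conv gradually learns how much of the second modality to mix in.
        return frozen_feat + self.zero_conv(complementary_feat)
```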
The YOLOv11-RGBT framework also explores six multispectral fusion modes (early, mid-level, mid-posterior, late, score, and weight-sharing) and demonstrates their applicability to multiple single-stage networks, providing a versatile platform for multispectral tasks.
The effectiveness of the proposed framework and strategies is validated through extensive experiments on prominent multispectral object detection datasets: FLIR [zhang_multispectral_2020], LLVIP [jia_llvip_2021], KAIST [choi_kaist_2018], M3FD [liu_target-aware_2022], and VEDAI [razakarivony_vehicle_2016]. Experiments on FLIR and LLVIP datasets, where infrared is typically more dominant in low-light conditions, show that standard mid-fusion might not always outperform single-modal infrared detection. However, the proposed P3 mid-fusion often yields better results than multi-node mid-fusion, highlighting the benefit of optimized fusion locations. Crucially, the MCF strategy demonstrates significant performance improvements on both FLIR and LLVIP, consistently boosting the mAP of YOLOv11 models by 3.41% to 5.65% on FLIR and achieving state-of-the-art results on LLVIP compared to other multispectral methods. On the M3FD dataset, where visible light can also be highly informative, multispectral fusion generally outperforms single-modal detection, and transfer learning from the COCO dataset [lin_microsoft_2015] is shown to be effective. The experiments on M3FD also compare various fusion stages, indicating that while mid-fusion is often strong, the optimal strategy can be dataset-dependent.
The paper includes visualization of feature maps (Figure 1 in the paper), illustrating how multispectral fusion enhances feature representation by combining information from both visible and infrared images, resulting in better-defined object features compared to single-modal inputs. Qualitative results (Figure 2 in the paper) showcase the YOLOv11-RGBT-MCF model's ability to detect objects robustly in diverse challenging scenarios, including complex backgrounds, low visibility, and uneven lighting.
The discussion highlights that the performance of specific fusion strategies can vary significantly across datasets. The authors also mention exploring other enhancements like Multispectral PGI (Programmable Gradient Information) [wang_yolov9_2024] and lightweight cross-attention mechanisms, noting that their effectiveness can be dataset-dependent, suggesting careful selection based on data characteristics. The impact of hyperparameters like batch size on performance is also observed.
Despite limitations, chiefly computational constraints that narrowed the scope of the experiments (not every model was tested with pre-trained weights or with every fusion mode and module) and the limited generalization ability of certain modules, the YOLOv11-RGBT framework demonstrates significant potential. Its efficiency and accuracy make it suitable for real-time applications such as autonomous driving and security monitoring. Future work involves improving module generalization, developing adaptive fusion strategies, exploring more efficient feature extraction and fusion methods, and extending the framework to other multispectral tasks like instance segmentation and keypoint detection. The authors plan to open-source the code and models to facilitate further research and practical application.
In conclusion, YOLOv11-RGBT provides a comprehensive and flexible framework for single-stage multispectral object detection. By introducing optimized fusion strategies like P3 mid-fusion and the robust Multispectral Controllable Fine-tuning (MCF), the paper effectively addresses key challenges in the field, demonstrating improved performance and adaptability across various datasets and models. The work provides valuable insights and practical tools for researchers and engineers working on real-world multispectral vision applications.
The core components and strategies for implementation are:
- Dual-Stream Backbone: Use separate backbone networks (based on YOLOv11's architecture) to process RGB and thermal images independently in parallel streams.
- Mid-level Fusion: Implement fusion modules in the neck of the network.
- Conventional Mid-fusion: Concatenate or add features from corresponding layers (e.g., P3, P4, P5) of both streams before feeding them into the neck's feature aggregation path.
- P3 Mid-fusion: Concatenate features only at the P3 layer from both streams and process them through a trainable module (e.g., a Conv layer followed by activation/normalization) before integrating into the neck. Features from higher levels (P4, P5) are not fused across modalities directly in this strategy.
- Multispectral Controllable Fine-tuning (MCF):
- Train a base single-modal detection model (e.g., infrared-only) on the target dataset or relevant pre-training data.
- Instantiate the dual-stream model. Load the trained weights into one stream (e.g., the infrared stream) and freeze these weights.
- Introduce the second modality's features (e.g., visible) and fuse them with the frozen stream's features at a chosen layer (e.g., P3 or other mid-level layers).
- Crucially, introduce the second modality's features via a Zero Conv2d layer before fusion. This convolutional layer is initialized with zero weights, allowing it to learn how to best integrate the new modality's information during fine-tuning without drastically altering the pre-trained stream initially.
- Train only the parameters of the Zero Conv2d layer, the initial layers of the second modality's stream (if not frozen), the fusion module, and potentially parts of the neck and head (a training-setup sketch follows this list).
- The choice of optimizer (SGD or Adam) and hyperparameters (learning rate, warmup) should be tuned based on the dataset and model size, as indicated by experimental results.
- Multispectral Transfer Training: Load pre-trained weights from a large dataset (like COCO) into the multispectral model architecture. Handle channel inconsistencies (e.g., 3 channels for RGB/COCO vs. potentially 1 for thermal, or 2x3/2x1 for dual streams) by adapting the first convolutional layer, e.g., by averaging or copying weights, or by using a 1x1 convolution (a channel-adaptation sketch follows this list). The paper notes that direct transfer from a large non-multispectral dataset might not always be optimal, especially for mid-fusion, highlighting the value of MCF or domain-specific pre-training.
- Loss Function: Implement the standard YOLOv11 loss components: Distribution Focal Loss (L_dfl), Binary Cross-Entropy for classification (L_cls), and CIoU loss for localization (L_loc). The total loss is the weighted sum L_all = λ_dfl·L_dfl + λ_cls·L_cls + λ_loc·L_loc, with suggested weights of 1.0, 0.5, and 0.05, respectively.
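For the MCF fine-tuning procedure above, a minimal sketch of the training setup might look as follows, assuming the model exposes its dominant-modality (here infrared) stream as a submodule; attribute names such as `model.ir_stream` are placeholders for whatever the chosen codebase uses, and the SGD hyperparameters are generic defaults rather than the paper's.

```python
# Sketch of the MCF training setup: freeze the pre-trained dominant-modality
# stream and optimize only the newly added components (zero convs, the
# complementary stream, fusion modules, and optionally neck/head parts).
import torch

def setup_mcf_training(model, lr: float = 1e-3):
    # 1) Freeze the pre-trained (dominant-modality) stream.
    for p in model.ir_stream.parameters():
        p.requires_grad = False

    # 2) Everything that still requires gradients (complementary stream,
    #    zero convs, fusion modules, neck/head) stays trainable.
    trainable = [p for p in model.parameters() if p.requires_grad]

    # 3) Optimize only the trainable subset (SGD or Adam, tuned per dataset).
    optimizer = torch.optim.SGD(trainable, lr=lr, momentum=0.937, weight_decay=5e-4)
    return optimizer
```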
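For multispectral transfer training, one common way to handle the channel mismatch when loading COCO-pretrained weights is to rebuild the first convolution and average the pretrained RGB kernels across the input dimension; the sketch below shows this generic trick, which is not necessarily the exact procedure used in the paper.

```python
# Adapt a COCO-pretrained 3-channel first convolution to a different input
# channel count (e.g., 1-channel thermal) by averaging the RGB kernels and
# replicating them across the new input channels.
import torch
import torch.nn as nn

def adapt_first_conv(pretrained_conv: nn.Conv2d, in_channels: int) -> nn.Conv2d:
    new_conv = nn.Conv2d(
        in_channels,
        pretrained_conv.out_channels,
        kernel_size=pretrained_conv.kernel_size,
        stride=pretrained_conv.stride,
        padding=pretrained_conv.padding,
        bias=pretrained_conv.bias is not None,
    )
    with torch.no_grad():
        # Average pretrained kernels over the RGB input dimension, then repeat
        # the result for each new input channel.
        mean_kernel = pretrained_conv.weight.mean(dim=1, keepdim=True)  # (out, 1, k, k)
        new_conv.weight.copy_(mean_kernel.repeat(1, in_channels, 1, 1))
        if pretrained_conv.bias is not None:
            new_conv.bias.copy_(pretrained_conv.bias)
    return new_conv
```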
Implementing these strategies requires careful modification of the chosen single-stage detection framework (e.g., a YOLOv8/v11 codebase), including defining the dual backbone structure, implementing the fusion modules at the specified levels, setting up the MCF training procedure with frozen layers and Zero Conv2d, and managing data loading for paired multispectral images. Computational requirements will be higher than single-modal models, especially for models with two full backbones, but the P3 fusion and MCF strategies aim to mitigate this by reducing redundant computations or leveraging pre-trained efficient models. Deployment strategies should consider the real-time constraints of target applications and select models and fusion strategies that balance performance with inference speed.
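Paired data loading can be kept simple; the following sketch assumes an LLVIP/FLIR-style layout with parallel `visible` and `infrared` folders containing identically named images, and omits label parsing, augmentation, and letterboxing, which a real training pipeline would need.

```python
# Minimal paired RGB-thermal dataset sketch; folder names and file layout are
# assumptions, not the framework's actual data pipeline.
from pathlib import Path
import cv2
import numpy as np
from torch.utils.data import Dataset

class PairedRGBTDataset(Dataset):
    def __init__(self, root: str, img_size: int = 640):
        self.rgb_paths = sorted(Path(root, "visible").glob("*.jpg"))
        self.ir_dir = Path(root, "infrared")
        self.img_size = img_size

    def __len__(self):
        return len(self.rgb_paths)

    def __getitem__(self, idx):
        rgb_path = self.rgb_paths[idx]
        ir_path = self.ir_dir / rgb_path.name  # same filename in both folders
        rgb = cv2.imread(str(rgb_path))
        ir = cv2.imread(str(ir_path), cv2.IMREAD_GRAYSCALE)
        rgb = cv2.resize(rgb, (self.img_size, self.img_size))
        ir = cv2.resize(ir, (self.img_size, self.img_size))
        # HWC uint8 -> CHW float32 in [0, 1]; labels would be loaded alongside.
        rgb = rgb.transpose(2, 0, 1).astype(np.float32) / 255.0
        ir = ir[None].astype(np.float32) / 255.0
        return rgb, ir
```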
Follow-up Questions
- How does the proposed P3 mid-level fusion strategy compare to traditional multi-node mid-fusion in terms of computational efficiency and detection performance across various datasets?
- In what scenarios or datasets does the Multispectral Controllable Fine-tuning (MCF) strategy provide the greatest boost in detection accuracy, and why might its benefits vary across different data conditions?
- Can the YOLOv11-RGBT framework and its fusion strategies be effectively adapted to other multispectral tasks such as semantic segmentation or keypoint detection, and what modifications would be necessary?
- How does the choice of modality dominance (e.g., infrared vs. visible) and corresponding pre-training impact the final performance and generalizability of multispectral detectors?
- Find recent papers about multispectral object detection and fusion strategies.
Related Papers
- ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection (2023)
- Multispectral Deep Neural Networks for Pedestrian Detection (2016)
- Cross-Modality Fusion Transformer for Multispectral Object Detection (2021)
- Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection (2018)
- Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection (2024)