Reproducibility of YOLOv8/YOLOv11 mAP in TensorRT under NMS mismatch

Determine whether the mismatch between the multi-class non-maximum suppression (NMS) used during evaluation and the single-class NMS used during inference explains why the reported mean average precision (mAP) of the Ultralytics YOLOv8 and YOLOv11 object detectors cannot be reproduced when the models are run with NVIDIA TensorRT. Then establish a reproducible TensorRT evaluation configuration that matches the originally reported mAP for these models.

Background

The RF-DETR paper (Robinson et al., 2025) highlights persistent inconsistencies in latency and accuracy benchmarking for real-time detectors and emphasizes the need to measure both on the same model artifact. In this context, the authors report that they could not replicate the YOLOv8 and YOLOv11 mAP results with TensorRT, and they hypothesize that an evaluation–inference mismatch in non-maximum suppression (multi-class vs. single-class) is a likely cause.
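The source does not define "multi-class" and "single-class" NMS. The sketch below illustrates one plausible reading, stated here as an assumption: in the multi-class mode every (box, class) pair above the confidence threshold becomes a candidate, while in the single-class mode each box keeps only its top-scoring class. All names (`candidates`, `nms`, `multi_label`) and the toy numbers are hypothetical; this is not the Ultralytics or TensorRT implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float, int]       # (box, score, class_id)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def candidates(boxes: Sequence[Box], scores: Sequence[Sequence[float]],
               conf_thres: float, multi_label: bool) -> List[Detection]:
    """Turn per-box class scores into (box, score, class) candidates."""
    out: List[Detection] = []
    for box, cls_scores in zip(boxes, scores):
        if multi_label:
            # Assumed "multi-class" mode: one candidate per (box, class)
            # pair whose score exceeds the threshold.
            for c, s in enumerate(cls_scores):
                if s > conf_thres:
                    out.append((box, s, c))
        else:
            # Assumed "single-class" mode: only the top class of each box.
            c = max(range(len(cls_scores)), key=cls_scores.__getitem__)
            if cls_scores[c] > conf_thres:
                out.append((box, cls_scores[c], c))
    return out


def nms(dets: Sequence[Detection], iou_thres: float) -> List[Detection]:
    """Greedy NMS in which a box only suppresses boxes of the same class."""
    kept: List[Detection] = []
    for det in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_thres for k in kept):
            kept.append(det)
    return kept


# Toy input: two non-overlapping boxes, two classes per box.
boxes = [(10.0, 10.0, 50.0, 50.0), (100.0, 100.0, 140.0, 140.0)]
scores = [[0.60, 0.30], [0.05, 0.40]]

eval_style = nms(candidates(boxes, scores, 0.25, multi_label=True), 0.7)
infer_style = nms(candidates(boxes, scores, 0.25, multi_label=False), 0.7)
print(len(eval_style), len(infer_style))  # prints "3 2"
```

With identical thresholds, the two modes already return different detection sets for the same raw outputs: the first box contributes a second, lower-scoring class only in the multi-class mode. Because mAP is computed over the full ranked list of detections per class, a difference of this kind between the evaluation path and the deployed path could shift the measured mAP, which is the effect the open question asks to verify.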

Resolving this issue is important for fair and reproducible comparisons across detectors and hardware. Establishing a verified cause and standardized configuration would help align evaluation and inference behavior for YOLO-based models within TensorRT and improve reproducibility across studies.

References

We are unable to reproduce YOLOv8 and YOLOv11's mAP results in TensorRT, likely because these models evaluate with multi-class NMS but only use single-class NMS in inference.

RF-DETR: Neural Architecture Search for Real-Time Detection Transformers (2511.09554 - Robinson et al., 12 Nov 2025) in Table 1 caption, Experiments (Standardizing Latency Evaluation)