ODverse33 Benchmark Evaluation
- ODverse33 Benchmark is a comprehensive multi-domain evaluation suite that measures YOLO v5–v11 performance across 33 datasets with varied detection challenges.
- It employs COCO-based metrics and diverse imaging conditions to rigorously assess accuracy, latency, and model size trade-offs across different deployment scenarios.
- Empirical results reveal non-monotonic improvements in YOLO architectures, emphasizing the need for domain-specific benchmarking in practical applications.
ODverse33 is a comprehensive multi-domain benchmarking suite devised to systematically evaluate the performance progression of YOLO object detectors from v5 through v11. Designed specifically to reflect the practical diversity encountered in real-world use cases, ODverse33 incorporates 33 datasets spanning 11 distinct domains that collectively tax models across a wide spectrum of class cardinalities, object scales, imaging conditions, and annotation conventions (Jiang et al., 20 Feb 2025). The benchmark facilitates rigorous cross-version and cross-domain analysis, addressing whether and when advances in YOLO models translate into meaningful improvements in detection performance for specific application scenarios.
1. Dataset Composition and Domain Coverage
ODverse33 aggregates datasets representing the following domains and challenges:
| Domain | Sample Datasets | Key Detection Challenges |
|---|---|---|
| Autonomous Driving | BDD100K, KITTI, TSDD | Weather/lighting variation, small object detection, occlusion |
| Agricultural | WeedCrop, HoneyBee, Pear640 | Texture similarity, dense foliage, tiny/in-flight targets |
| Underwater | DUO, RUOD, UWD | Low contrast, turbidity, color shift, background clutter |
| Medical | ChestX-Det, GRAZPEDWRI-DX, BCD, BBD | Low contrast, anatomical overlap, subtle structural variation |
| Videogame | MC, CS2, GTA5 | Synthetic-real domain gap, fast motion, stylization |
| Industrial | DeepPCB, GC10-DET, NEU-SDD | Fine-grained textures, small/low-contrast defects |
| Aerial | DIOR, DOTA, HIT-UAV | Multi-scale, arbitrary orientation, dense scenes, rotated bounding boxes |
| Wildlife | ADID, EAD | Camouflage, illumination, rare species |
| Retail | SFD, Holoselecta, SKU110K | Dense scenes, occlusion, similar designs |
| Microscopic | BCCD, LDD, MIaMIA-SVDS | Overlapping targets, stain variability, tiny mobile objects |
| Security | SIXray, HiXray, MGD | Heavy overlap, fine structure, varied orientation |
Each constituent dataset is selected to stress critical aspects of real-world detection: class cardinality (from single-digit to hundred-class label sets), scale variance, occlusion, illumination, clutter, and intra-class variation. The distribution of object sizes, densities, and image modalities ensures the evaluation reflects the heterogeneous operational landscapes where detector robustness is essential.
2. Evaluation Metrics and Protocol
Performance is assessed with a suite of established object detection metrics consistent with COCO conventions:
- mAP50: Mean Average Precision at IoU threshold 0.50.
- mAP50–95: Averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
- mAPsmall, mAPmedium, mAPlarge: Size-stratified mAP50–95 for small (area < 32² px), medium (32²–96² px), and large (> 96² px) objects, following the COCO area thresholds.
- Inference latency: Milliseconds per image, measured on an NVIDIA A100 GPU (FP16, batch size 1).
- Model size: Number of parameters (in millions).
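The protocol above can be sketched in a few lines of Python. The 32²/96² pixel-area cutoffs are the standard COCO thresholds; the timing harness and its function names are illustrative assumptions, not part of ODverse33 itself:

```python
import time

def size_bucket(box):
    """Assign an (x1, y1, x2, y2) box to a COCO size stratum by pixel area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def latency_ms_per_image(infer, images, warmup=3):
    """Mean inference latency in ms/image at batch size 1, after warm-up runs."""
    for img in images[:warmup]:
        infer(img)  # warm-up: exclude one-time setup costs from the measurement
    start = time.perf_counter()
    for img in images:
        infer(img)
    return (time.perf_counter() - start) * 1000.0 / len(images)
```

Warm-up runs matter in practice because the first few inferences pay one-off costs (kernel compilation, memory allocation) that would otherwise inflate the reported per-image latency.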
Key mathematical definitions:
- Intersection over Union:
  $$\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$
- Average Precision for class $c$:
  $$\mathrm{AP}_c = \int_0^1 p_{\mathrm{interp}}(r)\,dr, \qquad p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}),$$
  with $p_{\mathrm{interp}}(r)$ the interpolated precision at recall $r$.
- Mean Average Precision over $C$ classes:
  $$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c$$
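These definitions translate directly into code. The following is a minimal sketch (function names are illustrative) of IoU for axis-aligned boxes and all-point interpolated AP over a precision-recall curve:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the precision-recall curve,
    with precision made monotonically non-increasing in recall."""
    pts = sorted(zip(recalls, precisions))
    rs = [0.0] + [r for r, _ in pts] + [1.0]
    ps = [0.0] + [p for _, p in pts] + [0.0]
    # p_interp(r) = max precision at any recall >= r
    for i in range(len(ps) - 2, -1, -1):
        ps[i] = max(ps[i], ps[i + 1])
    # Riemann sum over the recall axis
    return sum((rs[i + 1] - rs[i]) * ps[i + 1] for i in range(len(rs) - 1))
```

For example, two detections yielding recall/precision pairs (0.5, 1.0) and (1.0, 0.5) give AP = 0.75; averaging per-class APs then yields mAP, and averaging over IoU thresholds 0.50:0.05:0.95 yields mAP50–95.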
Test-time evaluation adheres to per-dataset splits and augmentation protocols as specified in the respective original dataset sources, ensuring comparability and reproducibility for practitioners.
3. Architectural Advances Across YOLO v1–v11
ODverse33 is deployed specifically to elucidate the empirical impact of major innovations in YOLO architectures:
- YOLOv1: Single-stage, unified grid-based prediction with a custom 24-convolutional-layer, GoogLeNet-inspired backbone.
- YOLOv2: Darknet-19 backbone, k-means anchor boxes, multi-scale training, joint classification/detection training (YOLO9000, “9000 classes”).
- YOLOv3: Darknet-53 backbone, multi-scale prediction via three output heads (FPN-style).
- YOLOv4: CSPDarknet-53, PANet, CIoU loss, “bag of freebies” (augmentation).
- YOLOv5: PyTorch-native, multi-size models (s/m/l/x), dynamic label assignment.
- YOLOv6: EfficientRep backbone, decoupled classification/localization head.
- YOLOv7: E-ELAN, RepConvN re-parametrizable convolutions.
- YOLOv8: Anchor-free, decoupled head refining the v5 design (C2f blocks), unified head for detection/segmentation/pose/OBB.
- YOLOv9: Programmable Gradient Information (PGI), GELAN (CSP+ELAN fusion).
- YOLOv10: Dual label assignment with one-to-many (training) and one-to-one (inference) heads, enabling NMS-free prediction.
- YOLOv11: C3k2 blocks (CSP, k=2), C2PSA (parallel spatial attention), full multi-task suite with enhanced backbone.
This granular chronology enables ODverse33 to interrogate how architectural mechanisms (e.g., attention, head decoupling, advanced aggregation) differentially affect performance on domain-specific detection tasks.
4. Empirical Results: Per-Version and Per-Domain Patterns
ODverse33’s analysis reveals non-monotonic improvements across YOLO releases. Across the 33 datasets, YOLOv11 achieves the top overall means, with mAP50 = 0.8072 and mAP50–95 = 0.5983. The version-wise mAP50 ranking (best to worst) is as follows:
| Rank | YOLO Version | Overall mAP50 |
|---|---|---|
| 1 | v11 | 0.8072 |
| 2 | v9 | — |
| 3 | v5 | — |
| 4 | v7 | — |
| 5 | v8 | — |
| 6 | v10 | — |
| 7 | v6 | — |
Domain-wise best performance:
| Domain | Best Version | mAP50 |
|---|---|---|
| Aerial | v11 | ~0.9130 |
| Agricultural | v11 | ~0.8922 |
| Autonomous Driving | v11 | ~0.7384 |
| Videogame | v11 | ~0.9436 |
| Microscopic | v11 | ~0.7384 |
| Wildlife | v11 | ~0.7959 |
| Industrial | v9 | ~0.7621 |
| Medical | v9 | ~0.9673 |
| Retail | v8 | ~0.8101 |
| Security | v8 | ~0.8701 |
| Underwater | v5 | ~0.7978 |
Latency and model size trade-offs are nontrivial: v11 achieves peak accuracy with around 20 ms/image inference latency and approximately 50 million parameters; by contrast, the v5s variant processes images at ≈5 ms/image with only ~7 million parameters, albeit at reduced mAP.
These results indicate that the interaction between architecture and domain substantially shapes effective accuracy; performance regressions appear in both v6 (vs. v5) and v10 (vs. v8), contradicting the assumption that accuracy improves monotonically with version number.
5. Domain-Specific Recommendations and Trade-Off Analysis
Key observations and guidance derived from ODverse33:
- Aerial, Agricultural: YOLOv11’s multi-scale and parallel spatial attention modules furnish robustness for rotated, variably-oriented, and densely-packed targets.
- Autonomous Driving, Videogame: YOLOv11 supplies highest mAP50, demonstrating resilience to occlusion and synthetic-real domain discrepancies.
- Microscopic, Wildlife: YOLOv11 outperforms on tiny object and camouflage scenarios.
- Industrial, Medical: YOLOv9 (PGI and GELAN) outshines others for precision in small defect/anomaly detection.
- Retail, Security: YOLOv8 offers the optimal speed-accuracy compromise for scenes with high-density and visually similar products.
- Underwater: YOLOv5 maintains performance under adverse imaging with a simpler backbone.
Given resource constraints, v5 and v7 s/m variants are preferable for real-time/embedded deployments due to their sub-10M parameter footprint and low latency. High-variance domains or multi-task pipelines (segmentation, pose, OBB) benefit from the unified heads in v8 and v11.
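The per-domain results and trade-offs above can be distilled into a simple selection rule. The following sketch encodes the best-version table and the sub-10 ms latency fallback; the domain keys, function name, and fallback policy for unlisted domains are illustrative assumptions, not part of the benchmark:

```python
# Best-performing YOLO version per domain, per the ODverse33 results above.
BEST_VERSION = {
    "aerial": "v11", "agricultural": "v11", "autonomous_driving": "v11",
    "videogame": "v11", "microscopic": "v11", "wildlife": "v11",
    "industrial": "v9", "medical": "v9",
    "retail": "v8", "security": "v8",
    "underwater": "v5",
}

def recommend(domain, latency_budget_ms=None):
    """Pick a YOLO version for a domain, falling back to a small v5
    variant (~5 ms/image, ~7M parameters) under tight latency budgets."""
    if latency_budget_ms is not None and latency_budget_ms < 10:
        return "v5s"
    # Fallback to the overall leader for domains outside the benchmark
    return BEST_VERSION.get(domain, "v11")
```

Such a lookup is only a starting point; as the benchmark itself argues, candidate models should still be validated on the target domain's own splits before deployment.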
6. Practitioner Guidance and Benchmark Significance
ODverse33 establishes that “newer ≠ always better.” Empirical benchmarking on target domains with ODverse33 splits is necessary to validate model selection. YOLOv11 is preferable where latency is not restrictive; YOLOv9 is recommended for small-object-centric applications in medical and industrial contexts. For energy- or memory-constrained environments, YOLOv5 s/m offers the strongest speed/accuracy ratio. Multi-task detection and segmentation workflows should employ YOLOv8 or YOLOv11 for unified deployment.
Ultimately, ODverse33 exposes both effective advances and regressions in YOLO evolution, advocating for domain-attuned model benchmarking as a prerequisite to deployment in real-world, multi-domain object detection scenarios (Jiang et al., 20 Feb 2025).