YOLOv10: Efficient Real-Time Object Detection
- YOLOv10 is a one-stage, real-time object detection model characterized by modular variants, an efficient spatial–channel decoupled backbone, and innovative dual-assignment training.
- It employs large-kernel convolutions, partial self-attention, and a unified training-inference framework to boost spatial context and minimize computational cost.
- Empirical studies on COCO and specialized benchmarks confirm YOLOv10's superior accuracy–latency trade-offs, making it well suited to embedded and edge deployments.
YOLOv10 is a one-stage, real-time object detection model that introduces efficiency-driven architectural designs and a consistent dual-assignment (CDA) label strategy to achieve NMS-free inference. Developed as part of the decadal progression of the YOLO (“You Only Look Once”) family, YOLOv10 advances the detection accuracy–speed frontier through holistic module redesign and end-to-end optimization across all model scales. It is characterized by modular variants (nano to extra-large), a highly efficient spatial–channel decoupled backbone, selective use of large-kernel convolutions, and a unified training-inference framework that eliminates the need for post-hoc Non-Maximum Suppression. A variety of empirical studies confirm state-of-the-art accuracy, latency, and parameter trade-offs on both general (COCO 2017) and domain-specific (agriculture, SAR, fisheries) benchmarks (Wang et al., 2024, Hussain, 2024, Saltık et al., 2024, Tariq et al., 14 Apr 2025, Yu et al., 1 Sep 2025, Wuntu et al., 22 Sep 2025, Alif et al., 2024).
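For orientation, the following is a minimal inference sketch assuming the `ultralytics` Python package, which distributes pretrained YOLOv10 checkpoints; the weight file and image path are illustrative. Because the one-to-one head is retained at test time, the returned boxes require no NMS post-processing.

```python
# Minimal inference sketch (assumes the `ultralytics` package and its
# pretrained YOLOv10 checkpoints; file paths are illustrative).
from ultralytics import YOLO

model = YOLO("yolov10n.pt")                          # nano variant
results = model.predict("street.jpg", imgsz=640, conf=0.25)

for r in results:
    print(r.boxes.xyxy)   # box corners (x1, y1, x2, y2)
    print(r.boxes.conf)   # confidence scores, NMS-free by design
    print(r.boxes.cls)    # predicted class indices
```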
1. Core Architectural Features
YOLOv10 adopts a canonical “Stem → Backbone → Neck → Head” topology that is extensible from ultra-lightweight (nano) to large, high-accuracy models. Major innovations introduced at the architectural level include:
- Backbone: Built upon optimized Cross-Stage Partial (CSP) modules, YOLOv10 integrates spatial–channel decoupled downsampling in lieu of standard stride-2 convolutions. Each downsampling module decouples the channel transformation from the spatial reduction, handling them as separate, cheaper operations, which reduces computational cost while preserving feature diversity (a minimal code sketch appears at the end of this section). Rank-guided block allocation concentrates parameters at informative stages, as determined by the intrinsic rank of output convolution weights. At the "small" and "nano" scales, the backbone uses fewer CSP blocks and narrower channel widths (e.g., 2.3 M parameters and 6.7 GFLOPs for YOLOv10-N).
- Large-Kernel Convolutions: Deep stages substitute multiple 3×3 convolutions with large-kernel convolutions (e.g., 7×7 or 15×15) to boost spatial context, particularly for scale-invariant recognition. The deployment is selective to avoid excessive FLOPs in high-resolution layers.
- Partial Self-Attention: Starting from stage four, half of the feature channels pass through one or more multi-head self-attention blocks (batch-normed instead of layer-normed), then fused back to the main path. This yields improved localization with a modest increase in compute and latency.
- Neck: The model uses an efficiency-tuned Path Aggregation Network (PANet) for multi-scale feature fusion, augmented with spatial–channel decoupled downsampling at branch points. Models designed for fish and SAR applications insert a Pyramid Spatial Attention (PSA) block to aggregate across spatial scales.
- Head: YOLOv10 introduces a dual-branch detection head, with anchor-free, per-location predictions. Parallel branches support one-to-many (for recall-rich training) and one-to-one (for NMS-free inference) assignments. The classification head is reduced to two 3×3 depthwise separable convolutions and one 1×1 pointwise convolution, substantially lowering parameter count and compute versus prior YOLO iterations.
- Variants and Model Sizes:
| Variant   | Params (M) | FLOPs (G) | Typical AP (%) | Latency (ms, FP16) |
|-----------|------------|-----------|----------------|--------------------|
| YOLOv10-N | 2.3        | 6.7       | 38.5–39.5      | 1.84               |
| YOLOv10-S | 7.2        | 21.6      | 46.3–46.8      | 2.49               |
| YOLOv10-M | 15.4       | 59.1      | 51.1–51.3      | 4.74               |
| YOLOv10-L | 24.4       | 120.3     | 53.2–53.4      | 7.28               |
| YOLOv10-X | 29.5       | 160.4     | 54.4           | 10.70              |
(Wang et al., 2024, Hussain, 2024, Saltık et al., 2024)
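To make the decoupled downsampling concrete, here is a minimal PyTorch sketch of one common realization: a 1×1 pointwise convolution performs the channel transformation, and a stride-2 depthwise convolution performs the spatial reduction, so neither operation pays for the other's cost. Module and parameter names are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Spatial-channel decoupled downsampling (illustrative sketch).

    Channel mixing (1x1 pointwise conv) and spatial reduction
    (3x3 depthwise conv, stride 2) are applied as separate,
    cheaper operations instead of one dense stride-2 convolution.
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pw = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )
        self.dw = nn.Sequential(
            nn.Conv2d(c_out, c_out, kernel_size=3, stride=2,
                      padding=1, groups=c_out, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(self.pw(x))

x = torch.randn(1, 64, 80, 80)
print(SCDown(64, 128)(x).shape)  # -> torch.Size([1, 128, 40, 40])
```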
2. Consistent Dual-Assignment and NMS-Free Inference
YOLOv10 replaces the non-maximum suppression (NMS) mechanism required by previous YOLO models with a consistent dual-assignment (CDA) label strategy:
- Dual Label Assignment: During training, each ground-truth box receives both one-to-many (o2m; recall-optimized, task-aligned) and one-to-one (o2o; precision-optimized, unique match) assignments, using the matching metric
  $$m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \mathrm{IoU}(\hat{b}, b)^{\beta},$$
  where $p$ is the predicted class probability, $\mathrm{IoU}(\hat{b}, b)$ is the overlap between the predicted box $\hat{b}$ and the ground truth $b$, and $s$ is a spatial prior indicating whether the anchor point lies inside the instance. A toy computation follows this list.
- Loss Aggregation: Branch losses are combined as
  $$\mathcal{L} = \mathcal{L}_{\text{o2m}} + \mathcal{L}_{\text{o2o}},$$
  with each branch comprising a weighted sum of classification (binary cross-entropy), localization (CIoU), and distribution focal loss (DFL) terms for bounding-box bin regression.
- Inference: At test time, only the o2o head—which uniquely assigns boxes to ground truths—is retained, producing NMS-free outputs. This strategy minimizes supervision mismatch between train and test, improves end-to-end latency by 20–40%, and yields a consistent ∼1–2% improvement in mAP (Wang et al., 2024, Hussain, 2024).
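The toy snippet below walks through the dual-assignment metric for a single ground-truth box: the same scores feed both branches, with the o2m branch keeping the top-k candidates and the o2o branch keeping only the argmax. The exponents and top-k value are illustrative, not the tuned hyperparameters.

```python
import torch

def matching_metric(p, iou, s, alpha=0.5, beta=6.0):
    # m(alpha, beta) = s * p**alpha * IoU**beta  (exponents illustrative)
    return s.float() * p.pow(alpha) * iou.pow(beta)

# Five candidate predictions against one ground-truth box.
p   = torch.tensor([0.90, 0.80, 0.70, 0.60, 0.50])  # class probabilities
iou = torch.tensor([0.85, 0.90, 0.40, 0.75, 0.20])  # IoU with the GT box
s   = torch.tensor([1, 1, 1, 0, 1])                 # spatial prior

m = matching_metric(p, iou, s)
o2m_idx = m.topk(3).indices  # one-to-many: top-k positives (training recall)
o2o_idx = m.argmax()         # one-to-one: unique positive (NMS-free head)
print(o2m_idx, o2o_idx)
```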
3. Training Protocols and Hyperparameters
YOLOv10 adopts diverse augmentation, sampling, and learning schedules to optimize both performance and practical deployment:
- Data Augmentation: Mosaic (4-image compositing), MixUp, CutMix, random resizing, random color/hue-space jitter, and random horizontal flips are used throughout training (Hussain, 2024, Alif et al., 2024).
- Anchor Handling: Predominantly anchor-free, with center-based assignments and per-pixel regression. Special cases (e.g., YOLOv10-nano on fish datasets) use nine k-means-clustered anchors distributed across three detection scales (Wuntu et al., 22 Sep 2025).
- Optimizers and Schedules: Stochastic gradient descent (SGD) with momentum 0.937 and weight decay 5e-4 is the standard. Learning-rate schedules follow linear or cosine annealing, with linear warmup over the initial epochs and decay to 1e-4 or 1e-5. Batch sizes are selected to fully utilize available GPU memory (typically 16–64 per GPU) (Wang et al., 2024, Alif et al., 2024). A schedule sketch follows this list.
- Regularization: Consistency between o2m and o2o assignments is enforced by an L2 penalty,
  $$\mathcal{L}_{\text{cons}} = \lambda \,\lVert m_{\text{o2o}} - m_{\text{o2m}} \rVert_2^2,$$
  where $\lambda$ is a balancing coefficient (Alif et al., 2024).
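A PyTorch sketch of the optimizer and schedule quoted above: momentum, weight decay, and the 1e-4 floor follow the text, while the warmup length, base learning rate, and epoch count are illustrative.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the detector
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.937, weight_decay=5e-4)

warmup_epochs, total_epochs = 3, 300  # illustrative values
warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=0.1, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=total_epochs - warmup_epochs, eta_min=1e-4)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... one training epoch over the detection batches ...
    sched.step()
```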
4. Quantitative Performance and Empirical Comparison
YOLOv10 attains state-of-the-art accuracy–efficiency trade-offs across diverse detection domains:
- COCO 2017 Benchmarks (latency on NVIDIA T4, FP16, batch=1):
| Model     | AP (%) | Latency (ms) | Params (M) | FLOPs (G) |
|-----------|--------|--------------|------------|-----------|
| YOLOv10-N | 38.5   | 1.84         | 2.3        | 6.7       |
| YOLOv10-S | 46.3   | 2.49         | 7.2        | 21.6      |
| YOLOv10-M | 51.1   | 4.74         | 15.4       | 59.1      |
| YOLOv10-L | 53.2   | 7.28         | 24.4       | 120.3     |
| YOLOv10-X | 54.4   | 10.70        | 29.5       | 160.4     |
- Small Object Sensitivity: YOLOv10-N achieves 50.74% mAP@0.5 overall and 48.26% on small objects (≈1% of image area), outperforming YOLOv9 and YOLOv8 on small-object recall at matched inference speed (Tariq et al., 14 Apr 2025).
- Smart Agriculture (weed/crop detection dataset, RTX 3080 laptop, 640 px):
- YOLOv10-n: 92.4% mAP50, 21.5 ms latency (≈46 FPS)
- YOLOv10-s: 93.1% mAP50, 21.8 ms
- YOLOv10-m: 93.4% mAP50, 25.7 ms
- Edge/Fish Detection (DeepFish, Intel i7 CPU, YOLOv10-n):
- mAP50 = 0.966, mAP50:95 = 0.606, 2.7 M parameters, 8.4 GFLOPs, 29.3 FPS (Wuntu et al., 22 Sep 2025)
- Synthetic Aperture Radar (SARDet-100K, YOLOv10-N):
- 6.53 GFLOPs, 2.27 M parameters, 59.47% mAP (Yu et al., 1 Sep 2025)
YOLOv10’s empirical results consistently demonstrate a “sweet-spot” on the accuracy–latency curve, especially for real-time and resource-constrained scenarios.
5. Trade-Offs, Ablations, and Deployment
- Ablation Studies:
- Removing rank-guided block allocation reduces AP by 1.3%.
- Reverting to standard stride-2 convolutions adds ≈15% FLOPs while costing 0.9% AP.
- Omitting the dual-assignment regularizer leaves accuracy nearly unchanged but increases latency by ≈10 ms and weakens train–test consistency (Alif et al., 2024).
- Model Scaling and Edge Deployment:
- YOLOv10-N runs at ≈550 FPS (A100), ≈150 FPS (Jetson Xavier NX), and ≥30 FPS on desktop CPUs (OpenVINO, ONNX Runtime).
- All standard variants (N, S, M, L, X) fit into ≤256 MB RAM, with on-disk sizes from ≈9 MB (N) up to ≈35 MB (X).
- For maximum FPS with moderate accuracy, the “nano” variant (YOLOv10-n) is optimal. The “small” variant provides a strong balance (>46% AP, ≈2.5 ms latency).
- Quantization to int8 or pruning can further reduce memory and computational requirements without severe accuracy degradation (Wuntu et al., 22 Sep 2025, Hussain, 2024).
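As one concrete compression path, the sketch below applies post-training dynamic quantization with ONNX Runtime to an exported model; the file names are hypothetical, and calibration-based static quantization would be the heavier-duty alternative.

```python
# Post-training dynamic quantization sketch (ONNX Runtime).
# Assumes a YOLOv10 model already exported to ONNX; paths are hypothetical.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="yolov10n.onnx",        # exported FP32 model (hypothetical)
    model_output="yolov10n-int8.onnx",  # quantized int8 weights
    weight_type=QuantType.QInt8,
)
```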
6. Domain-Specific Adaptation and Specialized Use
YOLOv10 has been successfully adapted and optimized for multiple application-specific domains:
- SAR Object Detection: Neural architecture search (NAS) on the YOLOv10 backbone produces SAR-NAS variants that reduce deep-layer redundancy, yielding mAP improvements of +0.71% at 0.57 G fewer FLOPs on the SARDet-100K dataset (Yu et al., 1 Sep 2025).
- Fish and Marine Biodiversity: YOLOv10-nano, equipped with CSPNet, PAN, and PSA, attains 0.966 mAP50 at <3 M parameters, outperforming much heavier models in resource-constrained aquatic research deployments (Wuntu et al., 22 Sep 2025).
- Smart Agriculture/Weed Management: Across resolutions and model sizes, YOLOv10 delivers >93% mAP50 in real-time weed/crop detection, confirming effective feature extraction for fine-grained, dense-field imagery (Saltık et al., 2024).
A common pattern is the strong transferability of the decoupled, channel-pruned backbone and NMS-free head to domains with small, densely packed, or ambiguous objects.
7. Limitations and Future Directions
- In Dense Scenes: NMS-free inference can yield overlapping detections in extremely crowded settings; soft-NMS or hybrid post-processing may refine outputs (Hussain, 2024). A compact soft-NMS sketch follows this list.
- Partial Self-Attention Overhead: While PSA adds <1% GFLOPs, further pruning may be necessary for sub-watt NPUs.
- Scalable Receptive Field: Dynamic large-kernel convolutions (“big-little” strategies) and learned kernel sizing are prospective enhancements.
- Model Compression: Post-training quantization and cross-scale distillation are active areas for further reduction of memory/compute while maintaining AP.
- Unified Backbone Search: Extensions to multi-task and multimodal backbones via NAS or other hardware-aware search techniques offer promising gains in specialized domains (SAR, medical imaging).
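For the dense-scene caveat above, here is a compact sketch of linear soft-NMS: instead of deleting boxes that overlap the current top detection, their scores are decayed by (1 − IoU), so heavily overlapped duplicates fade while genuinely adjacent objects survive. Thresholds are illustrative.

```python
import torch

def pairwise_iou(box, boxes):
    """IoU between one box and a set of boxes, all (x1, y1, x2, y2)."""
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear soft-NMS: decay, rather than drop, overlapping scores."""
    scores = scores.clone()
    keep = []
    while scores.max() > score_thresh:
        i = scores.argmax()
        keep.append(int(i))
        scores[i] = 0.0  # remove the selected box from future rounds
        overlaps = pairwise_iou(boxes[i], boxes)
        decay = torch.where(overlaps > iou_thresh,
                            1.0 - overlaps, torch.ones_like(overlaps))
        scores = scores * decay
    return keep
```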
YOLOv10’s combination of efficient label assignment, modular architecture, and strong empirical performance positions it as a recommended baseline for real-time, embedded, and edge deployments across a heterogeneous set of object detection tasks (Wang et al., 2024, Hussain, 2024, Saltık et al., 2024, Tariq et al., 14 Apr 2025, Yu et al., 1 Sep 2025, Wuntu et al., 22 Sep 2025, Alif et al., 2024).