
YOLOv10: Efficient Real-Time Object Detection

Updated 13 November 2025
  • YOLOv10 is a one-stage, real-time object detection model characterized by modular variants, an efficient spatial–channel decoupled backbone, and innovative dual-assignment training.
  • It employs large-kernel convolutions, partial self-attention, and a unified training-inference framework to boost spatial context and minimize computational cost.
  • Empirical studies on COCO and specialized benchmarks confirm YOLOv10’s superior accuracy–latency trade-offs, making it ideal for embedded and edge deployments.

YOLOv10 is a one-stage, real-time object detection model that introduces efficiency-driven architectural designs and a consistent dual-assignment (CDA) label strategy to achieve NMS-free inference. Developed as part of the decadal progression of the YOLO (“You Only Look Once”) family, YOLOv10 advances the detection accuracy–speed frontier through holistic module redesign and end-to-end optimization across all model scales. It is characterized by modular variants (nano to extra-large), a highly efficient spatial–channel decoupled backbone, selective use of large-kernel convolutions, and a unified training-inference framework that eliminates the need for post-hoc Non-Maximum Suppression. A variety of empirical studies confirm state-of-the-art accuracy, latency, and parameter trade-offs on both general (COCO 2017) and domain-specific (agriculture, SAR, fisheries) benchmarks (Wang et al., 2024, Hussain, 2024, Saltık et al., 2024, Tariq et al., 14 Apr 2025, Yu et al., 1 Sep 2025, Wuntu et al., 22 Sep 2025, Alif et al., 2024).

1. Core Architectural Features

YOLOv10 adopts a canonical “Stem → Backbone → Neck → Head” topology that is extensible from ultra-lightweight (nano) to large, high-accuracy models. Major innovations introduced at the architectural level include:

  • Backbone: Built upon optimized Cross-Stage Partial (CSP) modules, YOLOv10 integrates spatial–channel decoupled downsampling in lieu of standard stride-2 convolutions. Each downsampling module decouples the two roles of a fused stride-2 convolution, using a pointwise convolution for the channel transformation and a stride-2 depthwise convolution for the spatial reduction, which lowers computational cost while preserving feature diversity (a minimal sketch follows the variant table below). Rank-guided block allocation concentrates parameters at informative locations, as determined by the intrinsic rank of output convolution weights. On the “small” and “nano” scales, the backbone uses fewer CSP blocks and narrower channel widths (e.g., 2.3 M parameters and 6.7 GFLOPs for YOLOv10-N).
  • Large-Kernel Convolutions: Deep stages substitute multiple 3×3 convolutions with large-kernel convolutions (e.g., 7×7 or 15×15) to boost spatial context, particularly for scale-invariant recognition. The deployment is selective to avoid excessive FLOPs in high-resolution layers.
  • Partial Self-Attention: Starting from stage four, half of the feature channels pass through one or more multi-head self-attention blocks (batch-normalized rather than layer-normalized) and are then fused back into the main path. This yields improved localization with a modest increase in compute and latency.
  • Neck: The model uses an efficiency-tuned Path Aggregation Network (PANet) for multi-scale feature fusion, augmented with spatial–channel decoupled downsampling at branch points. Models designed for fish and SAR applications insert a Pyramid Spatial Attention (PSA) block to aggregate across spatial scales.
  • Head: YOLOv10 introduces a dual-branch detection head, with anchor-free, per-location predictions. Parallel branches support one-to-many (for recall-rich training) and one-to-one (for NMS-free inference) assignments. The classification head is reduced to two 3×3 depthwise separable convolutions and one 1×1 pointwise convolution, substantially lowering parameter count and compute versus prior YOLO iterations.
  • Variants and Model Sizes:

| Variant | Params (M) | FLOPs (G) | Typical AP (%) | Latency (ms, FP16) |
|-----------|------------|-----------|----------------|--------------------|
| YOLOv10-N | 2.3 | 6.7 | 38.5–39.5 | 1.84 |
| YOLOv10-S | 7.2 | 21.6 | 46.3–46.8 | 2.49 |
| YOLOv10-M | 15.4 | 59.1 | 51.1–51.3 | 4.74 |
| YOLOv10-L | 24.4 | 120.3 | 53.2–53.4 | 7.28 |
| YOLOv10-X | 29.5 | 160.4 | 54.4 | 10.70 |

(Wang et al., 2024, Hussain, 2024, Saltık et al., 2024)
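
The decoupled downsampling idea can be made concrete with a short PyTorch sketch. This follows the pointwise-then-depthwise reading of the design described above; the module name, channel widths, and normalization/activation choices are illustrative rather than a reference implementation.

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Spatial-channel decoupled downsampling (sketch).

    A 1x1 pointwise conv performs the channel transformation, then a
    stride-2 depthwise 3x3 conv performs the spatial reduction, so the
    two roles of a fused stride-2 3x3 convolution are decoupled.
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.dw = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2,
                            padding=1, groups=c_out, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.dw(self.pw(x))))

# Halve spatial resolution while widening channels:
x = torch.randn(1, 64, 80, 80)
print(SCDown(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```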

2. Consistent Dual-Assignment and NMS-Free Inference

YOLOv10 replaces the non-maximum suppression (NMS) mechanism required by previous YOLO models with a consistent dual-assignment (CDA) label strategy:

  • Dual Label Assignment: At training, each ground-truth box receives both one-to-many (o2m; recall-optimized, task-aligned) and one-to-one (o2o; precision-optimized, unique match) assignments, using a matching metric

m(\alpha, \beta) = s \cdot p^{\alpha} \cdot \mathrm{IoU}(\hat{b}, b)^{\beta}

(where $p$ is the predicted classification score, $\hat{b}$ and $b$ are the predicted and ground-truth boxes, and $s \in \{0,1\}$ is a spatial prior indicating whether the anchor point falls inside the ground-truth box).
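
As a concrete illustration, the metric can be evaluated per candidate prediction as in the NumPy sketch below; the α and β values are placeholders, not the tuned settings from the paper. Because both branches score candidates with the same metric, the one-to-one assignment is simply the argmax of the same ranking that the one-to-many branch uses for its top-k selection.

```python
import numpy as np

def match_metric(p, iou, s, alpha, beta):
    """Task-aligned matching metric m = s * p**alpha * iou**beta.

    p   : predicted score for the ground-truth class, shape (N,)
    iou : IoU between each prediction and the ground-truth box, shape (N,)
    s   : spatial prior in {0, 1}, shape (N,)
    """
    return s * (p ** alpha) * (iou ** beta)

# The one-to-many branch keeps the top-k candidates by this score;
# the one-to-one branch keeps only the single best match.
m = match_metric(np.array([0.9, 0.6, 0.3]),
                 np.array([0.8, 0.7, 0.9]),
                 np.array([1, 1, 0]), alpha=0.5, beta=6.0)
print(m.argmax())  # index of the unique one-to-one assignment
```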

  • Loss Aggregation: Branch losses are combined,

\mathcal{L} = \mathcal{L}_{\mathrm{o2m}} + \mathcal{L}_{\mathrm{o2o}}

with each branch comprising a weighted sum of classification (binary cross-entropy), localization (CIoU), and distributional focal loss (DFL) for bounding-box bin regression.
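
Schematically, each branch's loss can be assembled as below (a sketch: the CIoU and DFL terms are assumed to be computed upstream by the assigner, and the loss weights shown are common YOLO-style defaults used only for illustration).

```python
import torch.nn.functional as F

def branch_loss(cls_logits, cls_targets, l_ciou, l_dfl,
                w_cls=0.5, w_box=7.5, w_dfl=1.5):
    """One branch's loss: weighted BCE (classification) + CIoU
    (localization) + DFL (bounding-box bin regression)."""
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    return w_cls * l_cls + w_box * l_ciou + w_dfl * l_dfl

# Total loss sums the one-to-many and one-to-one branches:
# loss = branch_loss(*o2m_terms) + branch_loss(*o2o_terms)
```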

  • Inference: At test time, only the o2o head—which uniquely assigns boxes to ground truths—is retained, producing NMS-free outputs. This strategy minimizes supervision mismatch between train and test, improves end-to-end latency by 20–40%, and yields a consistent ∼1–2% improvement in mAP (Wang et al., 2024, Hussain, 2024).
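
Concretely, NMS-free decoding reduces to a confidence threshold plus a top-k cut, since duplicate suppression is learned by the one-to-one head rather than applied post hoc. The sketch below assumes typical default thresholds.

```python
import torch

def nms_free_decode(boxes, scores, conf_thres=0.25, max_det=300):
    """Postprocess the one-to-one head without NMS.

    boxes  : (N, 4) predicted boxes
    scores : (N, C) per-class scores
    Because each object receives at most one prediction, decoding is a
    threshold plus top-k; no IoU-based suppression loop is needed.
    """
    conf, cls = scores.max(dim=-1)
    keep = conf > conf_thres
    boxes, conf, cls = boxes[keep], conf[keep], cls[keep]
    order = conf.argsort(descending=True)[:max_det]
    return boxes[order], conf[order], cls[order]
```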

3. Training Protocols and Hyperparameters

YOLOv10 adopts diverse augmentation, sampling, and learning schedules to optimize both performance and practical deployment:

  • Data Augmentation: Mosaic (4-image compositing), MixUp, CutMix, random resizing, random color/hue-space jitter, and random horizontal flips are used throughout training (Hussain, 2024, Alif et al., 2024).
  • Anchor Handling: Predominantly anchor-free, with center-based assignments and per-pixel regression. Special cases (e.g., YOLOv10-nano on fish datasets) use 9 precomputed anchors per three scales, clustered by k-means (Wuntu et al., 22 Sep 2025).
  • Optimizers and Schedules: Stochastic gradient descent (SGD) with momentum 0.937 and weight decay 5e-4 is the standard. Learning rate schedules follow linear or cosine annealing, with linear warmup over initial epochs and decay to 1e-4 or 1e-5. Batch sizes are selected to fully utilize available GPU RAM (typically 16–64 per GPU) (Wang et al., 2024, Alif et al., 2024).
  • Regularization: Consistency between o2m and o2o assignments is enforced by an L2 penalty,

\mathcal{L}_{\mathrm{dual}} = \gamma \sum_i \left\| p_i^{(\text{coarse})} - p_i^{(\text{fine})} \right\|_2^2

where $\gamma$ is a balancing coefficient (Alif et al., 2024).
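
Translated directly into code, the penalty is (a sketch; scores from the two branches are assumed aligned per prediction, and the γ value is illustrative):

```python
import torch

def dual_consistency_penalty(p_coarse: torch.Tensor,
                             p_fine: torch.Tensor,
                             gamma: float = 0.25) -> torch.Tensor:
    """L2 penalty tying the one-to-many ('coarse') and one-to-one
    ('fine') branch scores together; gamma is illustrative."""
    # p_coarse, p_fine: (N, C) per-prediction class scores
    return gamma * (p_coarse - p_fine).pow(2).sum(dim=-1).sum()
```

The optimizer and augmentation settings above map onto a training call such as the following. Argument names follow the Ultralytics API; the specific values mirror the defaults quoted in this section and are not taken from any single cited study.

```python
from ultralytics import YOLO

model = YOLO("yolov10s.pt")
model.train(
    data="coco.yaml",       # dataset config
    epochs=500,
    imgsz=640,
    batch=32,               # sized to fill available GPU memory
    optimizer="SGD",
    momentum=0.937,
    weight_decay=5e-4,
    lr0=0.01,               # initial learning rate
    lrf=0.01,               # final LR fraction (linear/cosine decay)
    warmup_epochs=3,        # linear warmup over initial epochs
    mosaic=1.0,             # mosaic (4-image) augmentation
    mixup=0.1,              # MixUp probability
    fliplr=0.5,             # horizontal flip probability
)
```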

4. Quantitative Performance and Empirical Comparison

YOLOv10 attains state-of-the-art accuracy–efficiency trade-offs across diverse detection domains:

  • COCO 2017 Benchmarks (latency on NVIDIA T4, FP16, batch=1):

| Model | AP (%) | Latency (ms) | Params (M) | FLOPs (G) |
|-----------|--------|--------------|------------|-----------|
| YOLOv10-N | 38.5 | 1.84 | 2.3 | 6.7 |
| YOLOv10-S | 46.3 | 2.49 | 7.2 | 21.6 |
| YOLOv10-M | 51.1 | 4.74 | 15.4 | 59.1 |
| YOLOv10-L | 53.2 | 7.28 | 24.4 | 120.3 |
| YOLOv10-X | 54.4 | 10.70 | 29.5 | 160.4 |

  • Small-Object Sensitivity: YOLOv10-N achieves 50.74% mAP@0.5 overall and 48.26% on small objects (≈1% of image area), outperforming YOLOv9 and YOLOv8 on small-object recall at matched inference speed (Tariq et al., 14 Apr 2025).
  • Smart Agriculture (weed/crop detection dataset, RTX 3080 laptop, 640 px):
    • YOLOv10-n: 92.4% mAP50, 21.5 ms latency (≈46 FPS)
    • YOLOv10-s: 93.1% mAP50, 21.8 ms
    • YOLOv10-m: 93.4% mAP50, 25.7 ms
  • Edge/Fish Detection (DeepFish, Intel i7 CPU, YOLOv10-n): 0.966 mAP50 at <3 M parameters (Wuntu et al., 22 Sep 2025; details in Section 6)
  • Synthetic Aperture Radar (SARDet-100K, YOLOv10-N): NAS-refined backbones improve mAP by +0.71% while using 0.57 G fewer FLOPs (Yu et al., 1 Sep 2025; details in Section 6)

YOLOv10’s empirical results consistently demonstrate a “sweet-spot” on the accuracy–latency curve, especially for real-time and resource-constrained scenarios.

5. Trade-Offs, Ablations, and Deployment

  • Ablation Studies:
    • Removing rank-guided block allocation reduces AP by 1.3% and degrades parameter efficiency.
    • Reverting to standard stride-2 convolutions adds ≈15% FLOPs and costs 0.9% AP.
    • Omission of the dual-assignment regularizer increases latency (by 10 ms) with negligible accuracy change, but reduces consistency (Alif et al., 2024).
  • Model Scaling and Edge Deployment:
    • YOLOv10-N runs at ≈550 FPS (A100), ≈150 FPS (Jetson Xavier NX), and ≥30 FPS on desktop CPUs (OpenVINO, ONNX Runtime).
    • All standard variants (N, S, M, L, X) fit into ≤256 MB RAM, with on-disk sizes from ≈9 MB (N) up to ≈35 MB (X).
    • For maximum FPS with moderate accuracy, the “nano” variant (YOLOv10-n) is optimal. The “small” variant provides a strong balance (>46% AP, ≈2.5 ms latency).
    • Quantization to int8 or pruning can further reduce memory and computational requirements without severe accuracy degradation (Wuntu et al., 22 Sep 2025, Hussain, 2024).
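
For reference, a deployment-oriented export might look like the following. This assumes the Ultralytics package (which distributes YOLOv10 checkpoints); available formats and int8 calibration requirements depend on the installed version and target backend.

```python
from ultralytics import YOLO

model = YOLO("yolov10n.pt")          # nano variant for edge targets

# FP16 ONNX for ONNX Runtime / TensorRT pipelines.
model.export(format="onnx", half=True)

# int8 OpenVINO export; int8 calibration requires representative
# data, configured per the Ultralytics export documentation.
model.export(format="openvino", int8=True)
```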

6. Domain-Specific Adaptation and Specialized Use

YOLOv10 has been successfully adapted and optimized for multiple application-specific domains:

  • SAR Object Detection: Neural architecture search (NAS) on the YOLOv10 backbone produces SAR-NAS variants that reduce deep-layer redundancy, yielding mAP improvements of +0.71% at 0.57 G fewer FLOPs on the SARDet-100K dataset (Yu et al., 1 Sep 2025).
  • Fish and Marine Biodiversity: YOLOv10-nano, equipped with CSPNet, PAN, and PSA, attains 0.966 mAP50 at <3 M parameters, outperforming much heavier models in resource-constrained aquatic research deployments (Wuntu et al., 22 Sep 2025).
  • Smart Agriculture/Weed Management: Across resolutions and model sizes, YOLOv10 delivers >93% mAP50 in real-time weed/crop detection, confirming effective feature extraction for fine-grained, dense-field imagery (Saltık et al., 2024).

A common pattern is the strong transferability of the decoupled, channel-pruned backbone and NMS-free head to domains with small, densely packed, or ambiguous objects.
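
Domain adaptation in these studies typically starts from COCO-pretrained checkpoints. A minimal transfer-learning sketch with the Ultralytics API is shown below; "weeds.yaml" is a hypothetical dataset config used for illustration, not an artifact of the cited papers.

```python
from ultralytics import YOLO

# Fine-tune the nano variant on a custom (hypothetical) weed dataset.
model = YOLO("yolov10n.pt")
model.train(data="weeds.yaml", epochs=100, imgsz=640, batch=32)

# box.map50 reports mAP@0.5, the metric quoted in Section 4.
metrics = model.val()
print(metrics.box.map50)
```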

7. Limitations and Future Directions

  • In Dense Scenes: NMS-free inference can yield overlapping detections in extremely crowded settings; soft-NMS or hybrid post-processing may refine outputs (Hussain, 2024); a soft-NMS sketch follows this list.
  • Partial Self-Attention Overhead: While PSA adds <1% GFLOPs, further pruning may be necessary for sub-watt NPUs.
  • Scalable Receptive Field: Dynamic large-kernel convolutions (“big-little” strategies) and learned kernel sizing are prospective enhancements.
  • Model Compression: Post-training quantization and cross-scale distillation are active areas for further reduction of memory/compute while maintaining AP.
  • Unified Backbone Search: Extensions to multi-task and multimodal backbones via NAS or other hardware-aware search techniques offer promising gains in specialized domains (SAR, medical imaging).
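
For the dense-scene limitation noted above, one candidate hybrid post-process is a Gaussian soft-NMS pass over the raw one-to-one outputs. The sketch below follows the standard soft-NMS formulation (Bodla et al., 2017) and is not part of YOLOv10 itself.

```python
import torch
from torchvision.ops import box_iou

def soft_nms(boxes: torch.Tensor, scores: torch.Tensor,
             sigma: float = 0.5, score_thres: float = 0.001) -> torch.Tensor:
    """Gaussian soft-NMS: decay the scores of overlapping boxes by
    exp(-IoU^2 / sigma) instead of discarding them outright.

    boxes  : (N, 4) boxes in xyxy format
    scores : (N,) confidence scores
    Returns indices of kept boxes, highest confidence first.
    """
    scores = scores.clone()
    idxs = torch.arange(boxes.size(0))
    keep = []
    while idxs.numel() > 0:
        top = scores[idxs].argmax().item()   # best remaining candidate
        best = idxs[top].item()
        keep.append(best)
        idxs = torch.cat([idxs[:top], idxs[top + 1:]])
        if idxs.numel() == 0:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[idxs]).squeeze(0)
        scores[idxs] *= torch.exp(-(ious ** 2) / sigma)  # decay overlaps
        idxs = idxs[scores[idxs] > score_thres]          # drop faint boxes
    return torch.tensor(keep, dtype=torch.long)
```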

YOLOv10’s combination of efficient label assignment, modular architecture, and strong empirical performance positions it as a recommended baseline for real-time, embedded, and edge deployments across a heterogeneous set of object detection tasks (Wang et al., 2024, Hussain, 2024, Saltık et al., 2024, Tariq et al., 14 Apr 2025, Yu et al., 1 Sep 2025, Wuntu et al., 22 Sep 2025, Alif et al., 2024).
