YOLO11n: Nano YOLOv11 Object Detector
- YOLO11n is a nano-scale deep learning model in the YOLOv11 family, using lightweight modules like C3k2, SPPF, and C2PSA for efficient feature extraction.
- It employs a tailored backbone, neck, and head architecture with multi-scale fusion and spatial attention to achieve sub-3 ms inference on high-end GPUs and 30–50 FPS on edge devices.
- Integrated training regimes with advanced augmentations and knowledge distillation boost accuracy by up to +2.5% mAP@0.5 while maintaining minimal computational load.
YOLO11n is the nano-scale member of the YOLOv11 (“You Only Look Once” version 11) object detection family, specifically tailored for high efficiency, minimal model size, and real-time inference on edge and embedded hardware. Building on architectural advances such as the C3k2 Cross-Stage Partial block, SPPF (Spatial Pyramid Pooling – Fast), and C2PSA (Cross-Stage Partial with Parallel Spatial Attention), YOLO11n achieves a favorable trade-off between detection accuracy and computational efficiency across diverse computer vision tasks, including object detection, segmentation, and counting, particularly in resource-constrained deployments (Khanam et al., 23 Oct 2024, Jegham et al., 31 Oct 2024, Jiang et al., 20 Feb 2025, Hidayatullah et al., 23 Jan 2025).
1. Architectural Innovations
YOLO11n’s architecture is defined by three principal components: backbone, neck, and head, each employing lightweight, computation-saving modules.
- Backbone: The main feature extractor initiates with two Conv(3×3, stride=2) layers, followed by a sequence of C3k2 blocks. The C3k2 module replaces larger convolutions with two parallel, smaller kernels (e.g., 2×2 or 3×3) and a cross-stage channel-splitting strategy: the feature channels are split, one partition is processed through small bottleneck convolutions, and the partitions are concatenated and fused with a Conv(1×1). This yields ≈2× speedup and fewer parameters compared to YOLOv8’s C2f (Khanam et al., 23 Oct 2024, Hidayatullah et al., 23 Jan 2025, Jegham et al., 31 Oct 2024).
- SPPF: Spatial Pyramid Pooling – Fast, which recursively max-pools the feature map three times, concatenates, and applies Conv(1×1) to consolidate multi-scale receptive information at negligible FLOPs (Khanam et al., 23 Oct 2024).
- C2PSA: The Cross-Stage Partial with Parallel Spatial Attention injects learned spatial-attention masks into feature maps. The attention map is given by A = σ(f₂(f₁(X))), where f₁ and f₂ are convolutions and σ is the sigmoid function. Final features are modulated element-wise as X′ = X ⊙ A (Khanam et al., 23 Oct 2024, Hidayatullah et al., 23 Jan 2025). A PyTorch sketch of SPPF and this attention gating follows the list below.
- Neck: Feature aggregation leverages FPN and PAN-style paths (top-down and bottom-up, respectively), with multi-scale feature fusion achieved through C3k2 units and upsampling operations.
- Head: The detection head consists of three parallel branches (small, medium, and large object scales), each with decoupled branches for classification, box regression, and objectness. Most implementations use anchor-free detection; some variants retain decoupled anchor-based heads with standard YOLO anchors and strides (8, 16, 32), inheriting the latest improvements found in YOLOv8 and YOLOv10 (Jiang et al., 20 Feb 2025, Hidayatullah et al., 23 Jan 2025, Jegham et al., 31 Oct 2024).
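As a concrete illustration of the SPPF pooling pattern and the C2PSA-style attention gating described above, the following is a minimal PyTorch sketch; the 5×5 pool kernel, the internal channel reduction in SPPF, and the two-convolution attention layout are assumptions chosen for clarity rather than the reference Ultralytics implementation.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three successive max-pools of the same
    feature map, concatenation, then a Conv(1x1) to fuse multi-scale context."""
    def __init__(self, channels: int, pool_kernel: int = 5):
        super().__init__()
        hidden = channels // 2
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(pool_kernel, stride=1, padding=pool_kernel // 2)
        self.fuse = nn.Conv2d(hidden * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        p1 = self.pool(x)        # each pass enlarges the effective receptive field
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))

class SpatialAttentionGate(nn.Module):
    """C2PSA-style gating: two convolutions produce a sigmoid mask A = sigma(f2(f1(X)))
    and the features are modulated element-wise as X' = X * A."""
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.f2(self.f1(x)))
        return x * attn

# Usage on the lowest-resolution stage (20x20 for a 640x640 input at stride 32):
feats = torch.randn(1, 128, 20, 20)
out = SpatialAttentionGate(128)(SPPF(128)(feats))
print(out.shape)  # torch.Size([1, 128, 20, 20])
```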
2. Model Scaling and Parameterization
The YOLO11n variant is distinguished by width and depth multipliers targeting aggressive model compaction:
| Model | Depth Multiplier | Width Multiplier | Approx. Params | Model Size | Compute (GFLOPs at 640²) |
|---|---|---|---|---|---|
| YOLO11n | 0.33 | 0.25 | ~2–3.4M | 2–6.4 MB | 4–6.3 |
| YOLO11s | 0.50 | 0.25 | ~12M | ~12 MB | 15 |
| YOLO11m | 0.50 | 0.50 | ~22M | ~22 MB | 32 |
| YOLO11l | 1.00 | 1.00 | ~48M | ~48 MB | 66 |
Exact layer-by-layer details and feature map sizes are outlined in (Hidayatullah et al., 23 Jan 2025), with YOLO11n typically capped at 64–128 backbone channels. SPPF and C2PSA operate at the lowest resolution stage (e.g., 20×20 for 640² inputs) (Khanam et al., 23 Oct 2024, Hidayatullah et al., 23 Jan 2025).
The result is an ultra-light model (sub-10 MB on disk, 4–8 GFLOPs) capable of sub-3 ms inference on high-end GPUs, and real-time (30–50 FPS) even with limited embedded hardware (Jegham et al., 31 Oct 2024, Sapkota et al., 1 Jul 2024, Saltık et al., 16 Jul 2025).
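To make the scaling concrete, the helper below shows how width and depth multipliers of this kind compact nominal channel counts and block repeats; the base values, the rounding-to-a-multiple-of-8 convention, and the example stage are illustrative assumptions modeled on common YOLO scaling code, not the exact YOLO11 configuration files.

```python
import math

def scale_channels(base_channels: int, width_mult: float, divisor: int = 8) -> int:
    """Scale a nominal channel count by the width multiplier and round up to a
    hardware-friendly multiple (the divisor-of-8 convention is an assumption)."""
    return max(divisor, int(math.ceil(base_channels * width_mult / divisor)) * divisor)

def scale_repeats(base_repeats: int, depth_mult: float) -> int:
    """Scale the number of repeated blocks in a stage by the depth multiplier."""
    return max(1, round(base_repeats * depth_mult))

# Hypothetical full-scale stage: 256 channels and 3 stacked C3k2 blocks.
# With the nano multipliers from the table above (depth 0.33, width 0.25):
print(scale_channels(256, 0.25))  # -> 64 channels (consistent with the 64-128 channel cap)
print(scale_repeats(3, 0.33))     # -> 1 block
```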
3. Training Regimes and Loss Functions
YOLO11n uses a unified training pipeline across the YOLOv11 series, with the following core settings:
- Input: typically RGB images (Khanam et al., 23 Oct 2024, Hidayatullah et al., 23 Jan 2025).
- Data augmentation: mosaic, mixup, multi-scale training, rotation, geometric and photometric transformations (Sapkota et al., 1 Jul 2024, Rasool et al., 7 Jul 2025).
- Optimization: SGD (momentum 0.937) or Adam(W), initial learning rate ≈ 0.01, cosine annealing, weight decay ≈ 0.0005, batch sizes 8–64, training for 300–700 epochs depending on dataset (Khanam et al., 23 Oct 2024, Alif, 30 Oct 2024, Jiang et al., 20 Feb 2025).
- Loss function: multi-term, with the anchor-free loss (when used) taking the form L_total = λ_box·L_CIoU + λ_cls·L_cls + λ_obj·L_obj + λ_dfl·L_DFL, where L_CIoU is the Complete IoU loss, L_obj and L_cls are binary cross-entropy losses for objectness and class predictions, and L_DFL is the Distribution Focal Loss for distance regression (Hidayatullah et al., 23 Jan 2025, Jiang et al., 20 Feb 2025, Sapkota et al., 1 Jul 2024).
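A minimal sketch of how these terms combine is given below; the CIoU term follows the standard Complete IoU definition, while the loss weights and the use of raw logits for the BCE terms are illustrative assumptions rather than published defaults.

```python
import math
import torch
import torch.nn.functional as F

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Complete IoU loss for boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Center-distance penalty, normalized by the enclosing-box diagonal
    cpx, cpy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    ctx, cty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1, ey1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    # Aspect-ratio consistency penalty
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()

def total_loss(box_p, box_t, cls_logits, cls_t, obj_logits, obj_t, dfl_term,
               w_box=7.5, w_cls=0.5, w_obj=1.0, w_dfl=1.5):
    """Weighted sum L_total = w_box*L_CIoU + w_cls*L_cls + w_obj*L_obj + w_dfl*L_DFL.
    The weights here are illustrative assumptions, not the published defaults."""
    l_box = ciou_loss(box_p, box_t)
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_t)
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_t)
    return w_box * l_box + w_cls * l_cls + w_obj * l_obj + w_dfl * dfl_term
```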
4. Benchmark Performance and Comparative Evaluation
Across numerous benchmarks, YOLO11n has consistently demonstrated a favorable trade-off between accuracy and latency:
| Dataset/Domain | Precision | Recall | mAP@0.5 | mAP@[0.5–0.95] | Inference Time | Reference |
|---|---|---|---|---|---|---|
| COCO | — | — | 0.40–0.45 | — | 1–3 ms (A100) | (Khanam et al., 23 Oct 2024) |
| Traffic Signs | 0.768 | 0.695 | 0.757 | 0.668 | 2.2 ms | (Jegham et al., 31 Oct 2024) |
| Africa Wildlife | 0.964 | 0.877 | 0.964 | 0.802 | 2.2 ms | (Jegham et al., 31 Oct 2024) |
| Ships & Vessels (tiny objects) | 0.574 | 0.510 | 0.505 | 0.311 | 2.5 ms | (Jegham et al., 31 Oct 2024) |
| Weed detection (Jetson) | 0.99 | ~1.0 | 0.98 | — | <250 ms/frame | (Rasool et al., 7 Jul 2025) |
| Green fruitlet detection | 0.897 | 0.868 | 0.926 | — | 2.4 ms | (Sapkota et al., 1 Jul 2024) |
| Weed detection (KD, sugar beet) | — | — | 0.838 | — | 20.98 ms (FP16) | (Saltık et al., 16 Jul 2025) |
YOLO11n outpaces prior “nano” or “tiny” YOLOs (YOLOv8n, YOLOv9t, YOLOv10n) by 3–8 mAP@[0.5–0.95] points, while remaining smallest in FLOPs and disk footprint (Jegham et al., 31 Oct 2024, Jiang et al., 20 Feb 2025, Sapkota et al., 1 Jul 2024).
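Such figures are typically obtained with a standard validation pipeline; the snippet below is a minimal sketch of a COCO-style evaluation using the Ultralytics API, assuming the ultralytics package is installed and a yolo11n.pt checkpoint is available, with the dataset YAML path as a placeholder.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                        # pretrained nano checkpoint
metrics = model.val(data="coco.yaml", imgsz=640)  # dataset YAML path is a placeholder
print(metrics.box.map50)                          # mAP@0.5
print(metrics.box.map)                            # mAP@[0.5-0.95]
```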
5. Embedded and Real-Time Deployment
YOLO11n is highly optimized for deployment on embedded platforms, including NVIDIA Jetson Orin Nano, Jetson Xavier NX, and Raspberry Pi 5. The lightweight design (2.2–6.4 MB, 4–6 GFLOPs) enables real-time or near-real-time frame rates, even under resource constraints:
- Jetson Orin Nano (FP16, TensorRT): 20.98 ms/frame (≈47 FPS) (Saltık et al., 16 Jul 2025).
- ARM CPU, FP16 NEON: 63.76 ms/frame (≈15.7 FPS) (Saltık et al., 16 Jul 2025).
- Jetson Orin Nano: sub-250 ms end-to-end cycle including image capture and actuation (Rasool et al., 7 Jul 2025).
- iPhone 14 Pro real-time counting: RMSE 3.06–4.96; MAE 2.33–7.73 (orchard fruit counting) (Sapkota et al., 1 Jul 2024).
- Outperforms YOLOv8n, YOLOv9 g-s, YOLOv10n, and YOLOv12n in end-to-end speed (e.g., YOLOv11n = 2.4 ms vs YOLOv8n = 4.1 ms inference) (Sapkota et al., 1 Jul 2024).
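As a hedged illustration of how such throughput figures are commonly obtained, the snippet below times repeated forward passes with the Ultralytics API; the warm-up count, batch size of one, and synthetic input frame are assumptions about the benchmarking setup rather than the cited papers' exact protocols.

```python
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
frame = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)  # stand-in for a camera frame

# Warm-up passes so lazy initialization and kernel compilation do not skew timing
for _ in range(10):
    model.predict(frame, verbose=False)

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    model.predict(frame, verbose=False)
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / n_iters:.2f} ms/frame, {n_iters / elapsed:.1f} FPS")
```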
The network structure is well-suited to post-training quantization (INT8) and exporter pipelines to ONNX/TensorRT (Khanam et al., 23 Oct 2024, Rasool et al., 7 Jul 2025).
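A typical export path looks like the sketch below; the exact keyword arguments (particularly for INT8 calibration) vary across Ultralytics versions and should be checked against the installed release.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.export(format="onnx", imgsz=640)                       # generic ONNX runtime targets
model.export(format="engine", half=True)                     # TensorRT engine, FP16
model.export(format="engine", int8=True, data="coco8.yaml")  # INT8 needs a calibration dataset
```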
6. Knowledge Distillation and Model Enhancement
Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) can boost YOLO11n accuracy by up to +2.5% mAP@0.5 on weed detection tasks without increasing model size or computational load. CWD aligns channel-wise spatial distributions via KL divergence at a softened temperature, while MGD uses spatial masking and feature projection. Both strategies yield stable improvements across varying seeds and maintain real-time deployment, e.g., 47.7 FPS on Jetson Orin Nano (FP16) (Saltık et al., 16 Jul 2025).
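The channel-wise distillation objective can be sketched as follows: for each channel, teacher and student activations are flattened over the spatial dimensions, softened by a temperature, and aligned with KL divergence. This is a minimal sketch of CWD under its standard formulation; the temperature value, feature shapes, and the assumption that teacher and student channels are already projected to the same width are illustrative.

```python
import torch
import torch.nn.functional as F

def cwd_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor,
             temperature: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation: per-channel spatial distributions of teacher and
    student are aligned via KL divergence at a softened temperature.
    Both feature maps are assumed to share the shape (N, C, H, W)."""
    n, c, h, w = student_feat.shape
    s = student_feat.reshape(n * c, h * w) / temperature
    t = teacher_feat.reshape(n * c, h * w) / temperature
    # KL(teacher || student) over the spatial softmax of each channel
    loss = F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1), reduction="batchmean")
    return loss * (temperature ** 2)

# Example: distilling a 20x20 neck feature map into YOLO11n
student = torch.randn(2, 128, 20, 20)
teacher = torch.randn(2, 128, 20, 20)  # assumes channels already projected to match
print(cwd_loss(student, teacher))
```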
7. Applications and Limitations
YOLO11n has demonstrated best-in-class edge and robotics deployment performance in:
- Variable-rate weed spraying robots (mAP@0.5 = 0.98, 4 FPS including actuation) (Rasool et al., 7 Jul 2025).
- Orchard fruit counting and immature fruitlet detection (mAP@0.5 = 0.926, RMSE as low as 3.06) (Sapkota et al., 1 Jul 2024).
- Precision agriculture weed mapping under knowledge distillation (Saltık et al., 16 Jul 2025).
Strengths include ultra-low latency, compactness (sub-10 MB), and improved robustness to small and occluded objects, attributable to C2PSA (Jegham et al., 31 Oct 2024, Alif, 30 Oct 2024). Limitations persist under extreme lighting or heavy occlusion, and segmentation or counting accuracy may trail heavier models.
Promising directions for improvement include heavier augmentation (shadow simulation and domain adaptation), RGB-D fusion, and hybrid or online knowledge distillation. Model quantization, sensor-adaptive fine-tuning, and enhanced attention mechanisms are plausible routes for further boosting field performance (Saltık et al., 16 Jul 2025, Sapkota et al., 1 Jul 2024).
References: (Khanam et al., 23 Oct 2024, Jegham et al., 31 Oct 2024, Jiang et al., 20 Feb 2025, Hidayatullah et al., 23 Jan 2025, Alif, 30 Oct 2024, Rasool et al., 7 Jul 2025, Saltık et al., 16 Jul 2025, Sapkota et al., 1 Jul 2024, Wong et al., 2019)