
YOLOv11 Architecture: Advanced Object Detection

Updated 13 November 2025
  • YOLOv11 is a deep convolutional object detector that employs C3k2 and C2PSA modules to enhance efficiency and precision in processing diverse object scales.
  • The architecture leverages an anchor-free, multi-scale detection approach with a robust backbone, neck, and detection head to improve mAP and inference speed.
  • YOLOv11 consistently demonstrates higher accuracy and real-time performance, making it effective for challenging applications like vehicle detection and medical imaging.

YOLOv11 is a deep convolutional object detector representing the eleventh major revision of the “You Only Look Once” (YOLO) family. It introduces two architectural modules, C3k2 and C2PSA, alongside optimizations in backbone structure, feature aggregation, and detection heads, yielding higher inference speed, efficiency, and accuracy for dense and small-object detection. The model adopts an anchor-free, multi-scale detection paradigm, systematically enhancing performance on challenging domains like vehicle detection, medical imaging, and multispectral fusion.

1. Network Architecture and Data Flow

YOLOv11 is organized into three principal components: Backbone, Neck, and Detection Head. The network consumes an input image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ (standard: $416 \times 416$) and processes it as follows:

  • Backbone: Deep convolutional extractor leveraging C3k2 blocks, standard strided 3×3 convolutions for spatial downsampling, and SPPF (Spatial Pyramid Pooling—Fast) for receptive field expansion. C2PSA (Cross-Stage Partial with Spatial Attention) is interleaved to strengthen focus on salient regions.
  • Neck: Lightweight upsample-and-concatenate path with additional C3k2 and C2PSA blocks, performing bidirectional feature aggregation at scales $P_3$ (stride 8), $P_4$ (stride 16), and $P_5$ (stride 32).
  • Detection Head: Three parallel heads applied to the $P_3$, $P_4$, and $P_5$ feature maps. Each head predicts, per spatial position: an objectness score, per-class probabilities, and bounding box offsets (4D anchor-free regression).

The block-wise data flow can be summarized:

```
Input (416×416)
  ↓ Conv1/Conv2 downsampling (strided 3×3 convolutions)
  ↓ C3k2 / SPPF / C2PSA (Backbone)
  ↓ Multi-scale feature maps
  ↓ Upsample + Concat + C3k2 + C2PSA (Neck)
  ↓ P3/P4/P5 feature maps
  ┆─ Detect(P3): predicts small objects
  ┆─ Detect(P4): predicts medium objects
  ┆─ Detect(P5): predicts large objects
```

2. Key Architectural Innovations and Mathematical Formulation

C3k2 Block

The C3k2 block is a computationally optimized Cross-Stage Partial (CSP) module that splits the input features, applies a $2\times2$ convolution to each path, and fuses the outputs by concatenation and a $1\times1$ projection:

$$\mathrm{Split}(X) = (X_1, X_2), \quad \mathrm{C3k2}(X) = \mathrm{Concat}\left[\mathrm{Conv}_{2\times2}(X_1),\, \mathrm{Conv}_{2\times2}(X_2)\right] + \mathrm{Conv}_{1\times1}(X)$$

This design reduces per-block FLOPs by $30{-}40\%$ versus prior $3\times3$ kernels, minimizing memory traffic.
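
A minimal PyTorch sketch of this block, following the formula above (class names are illustrative, not the Ultralytics implementation; even-sized kernels need asymmetric padding to preserve spatial dimensions):

```python
import torch
import torch.nn as nn

class Conv2x2(nn.Module):
    """2x2 conv with asymmetric zero-padding (right/bottom) so spatial size is preserved."""
    def __init__(self, channels):
        super().__init__()
        self.pad = nn.ZeroPad2d((0, 1, 0, 1))  # (left, right, top, bottom)
        self.conv = nn.Conv2d(channels, channels, kernel_size=2, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(self.pad(x))))

class C3k2Sketch(nn.Module):
    """Split(X) -> per-path 2x2 convs -> Concat, plus a 1x1 projection of the input."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch1 = Conv2x2(half)
        self.branch2 = Conv2x2(half)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                                # Split(X) = (X1, X2)
        fused = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
        return fused + self.proj(x)                               # Concat[...] + Conv1x1(X)

x = torch.randn(1, 64, 52, 52)
assert C3k2Sketch(64)(x).shape == x.shape
```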

SPPF Block

Spatial Pyramid Pooling—Fast (SPPF) expands the effective receptive field efficiently by chaining three $5\times5$ max-pooling operations and concatenating the input with every pooled output before a $1\times1$ projection:

$$\mathrm{SPPF}(X) = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(X,\, P_1,\, P_2,\, P_3)\big), \qquad P_k = \mathrm{MaxPool}_{5\times5}(P_{k-1}),\quad P_0 = X$$

The chained pools emulate parallel $5\times5$, $9\times9$, and $13\times13$ pooling at lower cost.
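
A sketch of this standard SPPF pattern (as introduced in YOLOv5 and retained by later versions); the halved hidden width is a common convention assumed here:

```python
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Chained 5x5 max-pools; input and all pooled maps are concatenated, then projected."""
    def __init__(self, c_in, c_out):
        super().__init__()
        hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, hidden, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = nn.Conv2d(hidden * 4, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)    # effective 5x5 receptive field
        p2 = self.pool(p1)   # effective 9x9
        p3 = self.pool(p2)   # effective 13x13
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```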

C2PSA Block

C2PSA (Cross-Stage Partial with Spatial Attention) injects lightweight attention over a CSP split. The implementation applies a per-pixel weight map using a $1\times1$ convolution followed by a sigmoid:

$$\mathrm{C2PSA}(X) = \mathrm{Attention}\Big(\mathrm{Concat}\big[X_{\text{path1}},\, X_{\text{path2}}\big]\Big)$$

where “Attention” denotes a channel- or spatial-attention operator.
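
The source specifies only the gating mechanism, so the sketch below assumes plain 3×3 convolutions for the two CSP paths while implementing the 1×1-conv-plus-sigmoid weight map described above:

```python
import torch
import torch.nn as nn

class C2PSASketch(nn.Module):
    """CSP split with a per-pixel spatial-attention gate over the fused paths."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.path1 = nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False)
        self.path2 = nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False)
        # 1x1 conv -> sigmoid yields a single-channel weight map, broadcast over channels
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        fused = torch.cat([self.path1(x1), self.path2(x2)], dim=1)
        return fused * self.gate(fused)  # Attention(Concat[X_path1, X_path2])
```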

Detection Head

YOLOv11 is anchor-free, assigning each ground-truth box to the grid cell containing its center and producing a tuple $(\Delta x, \Delta y, \Delta w, \Delta h)$, an objectness score, and class scores per cell.
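
A sketch of this center-cell assignment at a single scale; the exact target encoding (cell-relative offsets, grid-unit sizes) is an illustrative choice, since YOLO versions differ on that detail:

```python
import torch

def assign_targets(gt_boxes, stride, grid_h, grid_w):
    """Map ground-truth boxes (cx, cy, w, h, in pixels) to grid cells at one scale."""
    obj = torch.zeros(grid_h, grid_w)      # objectness targets
    box = torch.zeros(grid_h, grid_w, 4)   # (dx, dy, dw, dh) regression targets
    for cx, cy, w, h in gt_boxes:
        gx, gy = cx / stride, cy / stride  # center in grid coordinates
        j, i = int(gx), int(gy)            # column/row of the cell containing the center
        if 0 <= i < grid_h and 0 <= j < grid_w:
            obj[i, j] = 1.0
            box[i, j] = torch.tensor([gx - j, gy - i, w / stride, h / stride])
    return obj, box

# Example: one 64x48 box centered at (100, 120), on the stride-8 (P3) map of a 416 input
obj, box = assign_targets([(100.0, 120.0, 64.0, 48.0)], stride=8, grid_h=52, grid_w=52)
```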

3. Loss Functions, Anchor Assignment, and Regularization

YOLOv11 employs a multi-task loss with localization, objectness, and classification components, parameterized as:

$$\mathcal{L} = \lambda_{\text{box}} \sum_{i} \mathbf{1}_i^{\text{obj}}\, \ell_{\text{CIoU}}\big(t_i^{\text{box}}, g_i^{\text{box}}\big) + \sum_{i} \ell_{\text{BCE}}\big(p_i^{\text{obj}}, \mathbf{1}_i^{\text{obj}}\big) + \lambda_{\text{cls}} \sum_{i} \mathbf{1}_i^{\text{obj}}\, \ell_{\text{BCE}}\big(p_i^{\text{cls}}, y_i^{\text{cls}}\big)$$

with the indicator $\mathbf{1}_i^{\text{obj}}$ explicitly defined:

$$\mathbf{1}_i^{\text{obj}} = \begin{cases} 1, & \text{if cell } i \text{ contains a GT center} \\ 0, & \text{otherwise} \end{cases}$$
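
This loss assembles from standard primitives; a minimal sketch using torchvision's CIoU loss for $\ell_{\text{CIoU}}$ (the $\lambda$ defaults are placeholders, not published YOLOv11 settings):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss  # available in recent torchvision

def detection_loss(pred_box, pred_obj, pred_cls, gt_box, obj_mask, gt_cls,
                   lambda_box=7.5, lambda_cls=0.5):
    """Multi-task loss following the formula above.

    Boxes are (x1, y1, x2, y2); pred_obj and pred_cls are raw logits.
    """
    pos = obj_mask.bool()  # cells that contain a ground-truth center
    loss_box = complete_box_iou_loss(pred_box[pos], gt_box[pos], reduction="sum")
    loss_obj = F.binary_cross_entropy_with_logits(pred_obj, obj_mask, reduction="sum")
    loss_cls = F.binary_cross_entropy_with_logits(pred_cls[pos], gt_cls[pos], reduction="sum")
    return lambda_box * loss_box + loss_obj + lambda_cls * loss_cls
```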

Regularization includes:

  • $\ell_2$ weight decay with $\gamma = 0.0005$
  • Batch Normalization after every convolution
  • Cosine-annealed learning rate: $\eta_t = \frac{\eta_0}{2}\left(1+\cos\left(\frac{t}{T}\pi\right)\right)$ for $T = 300$ epochs, $\eta_0 = 0.01$
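
Both the decay and the schedule map onto stock PyTorch utilities; a minimal sketch (SGD is an assumption, as the optimizer is not named above):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
opt = torch.optim.SGD(params, lr=0.01, weight_decay=5e-4)         # eta_0 and the l2 gamma
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=300)

for epoch in range(300):
    # ... per-batch forward/backward and opt.step() calls would go here ...
    opt.step()    # placeholder so the scheduler ordering is valid
    sched.step()  # lr follows (eta_0 / 2) * (1 + cos(pi * epoch / T))
```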

4. Quantitative Performance, Scaling, and Complexity

Although the paper does not supply exhaustive FLOPs or parameter counts per variant, reported metrics are as follows:

  • Inference speed: 290 FPS (YOLOv11) vs. 280 FPS (YOLOv10) on fixed hardware, a 3.6% increase
  • Model size: comparable to YOLOv10; implied 40–60 M parameters in typical configurations
  • Detection accuracy: +2.5% mAP@0.5 (74.3% → 76.8%) and +1.8% mAP@[0.5:0.95] (46.7% → 48.5%), with the strongest gains observed for small, occluded vehicles (bicycles: 0.94 → 0.98 mAP)
  • Computational complexity: C3k2 reduces per-block compute by 30–40%; C2PSA adds minimal overhead while improving discriminative power.

Scaling to smaller or larger variants follows the width/depth multipliers and channel capping conventions of the YOLO family.
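
That convention can be stated compactly; the multiplier values below are hypothetical, as the source does not list per-variant settings:

```python
def scale_width(base_channels: int, width_mult: float, max_channels: int = 512) -> int:
    """Width multiplier with channel capping (divisor rounding is also common)."""
    return min(int(base_channels * width_mult), max_channels)

def scale_depth(base_repeats: int, depth_mult: float) -> int:
    """Depth multiplier applied to the number of block repeats."""
    return max(round(base_repeats * depth_mult), 1)

# e.g. a hypothetical "small" variant: width 0.50, depth 0.33
assert scale_width(256, 0.50) == 128
assert scale_depth(6, 0.33) == 2
```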

5. Comparative Analysis and Evolution

Relative to YOLOv10, YOLOv11's major innovations are:

  • C3k2 replaces C2f (smaller kernels in CSP branches, lower FLOPs)
  • Integrated C2PSA attention in both backbone and neck (robustness to occlusion and small-object detection)
  • Retained SPPF, more tightly fused into deeper backbone layers
  • Maintained anchor-free assignment and multi-head detection, as established in YOLOv8

These changes yield improved accuracy for vehicle categories with challenging occlusions and geometries, while keeping inference time competitive for real-time deployment. Performance improvements derive from the combined effect of reduced computational overhead and enhanced spatial context modeling.

6. Real-world Applications and Practical Considerations

YOLOv11 is well-suited for real-time tasks in intelligent transportation, traffic monitoring, and autonomous driving, specifically due to:

  • Efficient detection of small, heavily occluded vehicles in dense scenes
  • Robustness against complex object geometries and partial visibility
  • Competitive throughput in high-speed monitoring pipelines

Implementers should note:

  • C3k2 and C2PSA modules can be constructed in deep learning frameworks following the provided formulas
  • Anchor-free detection logic requires grid cell assignment algorithms, as detailed above
  • Cosine learning-rate scheduling and weight decay should be preserved for optimal convergence
  • FLOPs and memory footprint must be profiled on target hardware; full model complexity depends on chosen width, depth, and resolution parameters
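
For the last point, a quick host-side measurement loop (CPU wall-clock timing; GPU measurements additionally need torch.cuda.synchronize or CUDA events):

```python
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # stand-in; substitute the real detector
model.eval()
x = torch.randn(1, 3, 416, 416)

n_params = sum(p.numel() for p in model.parameters())
with torch.no_grad():
    model(x)  # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    ms_per_image = (time.perf_counter() - t0) * 1000 / 100

print(f"{n_params / 1e6:.2f} M params, {ms_per_image:.2f} ms/image on this host")
```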

7. Limitations and Future Directions

The absence of exhaustive architectural tables and explicit receptive-field sizes in the primary manuscript means that practitioners should rely on official codebases or additional benchmarking to resolve layer-by-layer details. However, the modularity of C3k2 and C2PSA allows for direct extension or modification. Ongoing research may refine attention mechanisms, further accelerate fused block designs, or enable even finer-grained scaling to edge-constrained hardware.

In summary, YOLOv11 is architecturally distinguished by its use of smaller-kernel CSP blocks and spatial-attention modules, yielding systematically higher accuracy for small-object and occluded-object detection at high throughput, establishing a new performance baseline for real-time vehicle detection systems (Alif, 30 Oct 2024).
