P2 Detection Head in Object Detection
- P2 detection head is a high-resolution branch that operates on 1/4-scale feature maps to improve small object detection.
- It integrates into multi-scale architectures like YOLOv8 and MHD-Net, using convolutional blocks and center-sampling for precise localization.
- Empirical results show enhanced recall and precision in domains such as agronomy and traffic surveillance with moderate computational impact.
The P2 detection head is a specialized high-resolution branch in multi-scale object detection architectures, designed to improve the detection of very small objects. "P2" denotes the feature pyramid level at a stride of 4 pixels relative to the input, which is typically the highest-resolution output produced by the network's feature aggregation module. Adding a P2 head extends the canonical multi-branch design (conventionally P3, P4, and P5 at strides 8, 16, and 32) by attaching an additional detection layer to the 1/4-scale feature map, explicitly targeting small objects whose features are lost at coarser resolutions. This mechanism has found practical application in fields requiring precise small-instance detection, such as phenotyping in agronomy and traffic object detection (Chen et al., 28 Jul 2025, Shi et al., 2022).
1. Architectural Integration and Multi-Scale Context
Modern object detectors, notably the YOLO and MHD-Net families, employ feature pyramid architectures to handle the heterogeneity of object scales. The P2 detection head is introduced as an additional detection branch, interfacing with the 1/4-resolution feature map ($F_2$) provided by a feature fusion neck such as a BiFPN. For example, in the improved YOLOv8 architecture (Chen et al., 28 Jul 2025), four detection heads (P2, P3, P4, P5) are attached to successively downsampled outputs at spatial resolutions of 160×160, 80×80, 40×40, and 20×20 for a 640×640 input. The P2 head processes $F_2$, giving it direct access to the finer-grained spatial cues that are crucial for small object localization. Similarly, MHD-Net (Shi et al., 2022) indexes detection heads H1–H5, with H1 denoting the P2 branch at stride 4, and advocates selective head configuration based on dataset-specific object scale distributions.
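The relationship between stride and grid resolution described above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the names `STRIDES`, `grid_sizes`, and `grid_cells` are hypothetical and do not come from either cited paper.

```python
# Stride of each pyramid level relative to the input image.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_sizes(input_size: int) -> dict:
    """Side length of each level's square prediction grid."""
    return {name: input_size // stride for name, stride in STRIDES.items()}


def grid_cells(input_size: int) -> dict:
    """Number of grid cells (prediction locations) per level."""
    return {name: side * side for name, side in grid_sizes(input_size).items()}


sizes = grid_sizes(640)
cells = grid_cells(640)
coarser_total = sum(count for name, count in cells.items() if name != "P2")

print(sizes)                       # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}
print(cells["P2"], coarser_total)  # 25600 8400
```

For a 640×640 input, the P2 grid alone holds roughly three times as many prediction locations as P3, P4, and P5 combined, which is consistent with both its finer localization and its extra computational cost.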
2. Layer-Wise Structure and Head Design
The P2 head reuses the typical convolutional blocks and detection modules found in anchor-free detectors. In YOLOv8s-p2 (Chen et al., 28 Jul 2025), the block is structured as:
- Input: feature map $F_2$ (post-BiFPN; 160×160 for a 640×640 input).
- Detector head (per grid cell):
  1. First convolution, followed by batch normalization and SiLU activation.
  2. Second convolution, followed by batch normalization and SiLU activation.
  3. Third convolution restoring the channel count, followed by batch normalization and SiLU activation.
  4. For each prediction branch (objectness, class, box): a convolution whose output channel count depends on the branch and on the number of classes.
- Output: raw predictions, to which the sigmoid (or corresponding output) activation is applied at inference.
No additional upsampling or lateral fusion is required beyond what the neck provides. The head operates directly on the high-resolution feature, maximizing retention of localized spatial information.
3. Object-Scale Matching and Head Assignment
Optimal allocation of detection heads to object scales is determined through empirical analysis of the object-size distribution. MHD-Net (Shi et al., 2022) formalizes head-object matching by defining, for each head $H_i$, a coverage ratio: the fraction of ground-truth boxes whose area falls within the effective receptive range of $H_i$. The scale range of each head is specified in terms of the original image width, the network input width, and the ground-truth box area.
Empirical studies indicate that at lower input resolutions, P2/H1 heads capture a significant portion of small objects, while at higher resolutions their marginal benefit diminishes. MHD-Net recommends selecting two "cross-scale" heads whose cumulative coverage ratio accounts for a sufficient share of the ground-truth objects, and discarding the intermediate heads to improve efficiency.
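A minimal sketch of the coverage-ratio idea follows. The per-head size ranges in `HEAD_RANGES`, the use of the square root of the area as a size measure, and the pair-selection rule in `best_head_pair` are all assumptions made for illustration; they are not the ranges or the selection procedure defined by MHD-Net.

```python
import math
from itertools import combinations

# Hypothetical side-length ranges (pixels at network input scale) per head.
HEAD_RANGES = {
    "H1": (0, 32),
    "H2": (32, 64),
    "H3": (64, 128),
    "H4": (128, 256),
    "H5": (256, math.inf),
}


def coverage_ratios(box_areas, orig_width, input_width):
    """Fraction of ground-truth boxes falling in each head's assumed range.

    box_areas are given in original-image pixels and are rescaled to the
    network input resolution before the comparison.
    """
    scale = input_width / orig_width
    counts = {head: 0 for head in HEAD_RANGES}
    for area in box_areas:
        side = math.sqrt(area) * scale  # equivalent square side at input scale
        for head, (lo, hi) in HEAD_RANGES.items():
            if lo <= side < hi:
                counts[head] += 1
                break
    return {head: n / len(box_areas) for head, n in counts.items()}


def best_head_pair(ratios):
    """Simplified selection: the two heads with the largest combined coverage."""
    pair = max(combinations(ratios, 2), key=lambda p: ratios[p[0]] + ratios[p[1]])
    return pair, ratios[pair[0]] + ratios[pair[1]]


areas = [100, 400, 900, 2500, 40000]  # toy ground-truth areas, original pixels
full_res = coverage_ratios(areas, orig_width=1280, input_width=1280)
half_res = coverage_ratios(areas, orig_width=1280, input_width=640)
print(full_res["H1"], half_res["H1"])  # 0.6 0.8
print(best_head_pair(half_res))
```

In this toy example, halving the input width moves one more box into the H1 range, which mirrors the observation that the highest-resolution head matters more at lower input resolutions.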
4. Loss Functions, Assignment, and Mathematical Details
All heads, including P2, share common loss terms:
- Box regression: Complete IoU (CIoU) loss.
- Objectness: binary cross-entropy (BCE).
- Classification: BCE on the class output.
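For reference, the CIoU term can be written out for a single pair of boxes. This is a plain-Python sketch of the standard CIoU definition (IoU minus a normalized center-distance penalty and an aspect-ratio penalty), not code from either cited paper; the function name `ciou_loss` and the `eps` stabilizer are choices made here.

```python
import math


def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (1 - CIoU) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared center distance, normalized by the squared diagonal of the
    # smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw * cw + ch * ch + eps
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)


same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
print(same, apart)
```

Unlike plain IoU, the loss keeps growing as non-overlapping boxes move apart, which gives a useful gradient for very small objects that a prediction can easily miss entirely.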
Dynamic $k$-matching assigns positive samples across all heads, with an IoU threshold of $0.5$ (Chen et al., 28 Jul 2025). For anchor-free prediction, P2 employs center-sampling: each grid point is responsible for boxes whose centers lie within a fixed radius, measured in grid cells, of that point.
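Center-sampling can be illustrated with a small helper that lists the grid cells treated as positive candidates for one ground-truth box. The radius value of 1.5 cells and the per-axis distance test are assumptions for this sketch, and `center_sampling_cells` is a hypothetical name.

```python
def center_sampling_cells(box, stride, grid_size, radius):
    """Return the set of (row, col) cells within `radius` cells of the box center.

    box is (x1, y1, x2, y2) in input pixels. Distance is measured per axis
    between the box center and each cell center.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    limit = radius * stride  # radius converted from cells to pixels
    positives = set()
    for row in range(grid_size):
        for col in range(grid_size):
            # Center of this grid cell in input-pixel coordinates.
            px = (col + 0.5) * stride
            py = (row + 0.5) * stride
            if abs(px - cx) <= limit and abs(py - cy) <= limit:
                positives.add((row, col))
    return positives


# A 12x12 px object on the stride-4 (P2) grid of a 640x640 input.
cells_p2 = center_sampling_cells((100, 100, 112, 112), stride=4, grid_size=160, radius=1.5)
print(len(cells_p2), sorted(cells_p2)[0])  # 9 (25, 25)
```

At stride 4 the nine candidate cells span a 12×12 px region, roughly the size of the object itself, whereas the same radius at a coarser stride would cover a much larger area of the image.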
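Decoding a prediction at a grid location on $F_2$ (stride 4) maps the raw outputs of that cell back to input-image coordinates by scaling with the stride; analogous expressions apply to the remaining box parameters.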
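One common anchor-free parameterization of such a decoding step is sketched below. The specific form (sigmoid offset for the center, exponential for the size) is an assumption chosen for illustration and is not taken from the cited papers; `decode` and its argument names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode(i, j, raw, stride=4):
    """Decode raw outputs (tx, ty, tw, th) at grid row i, column j.

    Assumed parameterization:
      center = (cell index + sigmoid(offset)) * stride
      size   = exp(log-size) * stride
    Returns (cx, cy, w, h) in input-image pixels.
    """
    tx, ty, tw, th = raw
    cx = (j + sigmoid(tx)) * stride
    cy = (i + sigmoid(ty)) * stride
    w = math.exp(tw) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h


decoded = decode(10, 20, (0.0, 0.0, 0.0, 0.0))
print(decoded)  # (82.0, 42.0, 4.0, 4.0)
```

With a stride of 4, a one-cell change in the grid index moves the decoded center by only 4 px, compared with 32 px on the coarsest level.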
5. Empirical Efficacy and Performance Impact
P2 heads are empirically validated to enhance detection of small objects under challenging conditions. In the rice spikelet flowering detection task (Chen et al., 28 Jul 2025), augmenting YOLOv8s with a p2 head yields notable improvements:
| Model | mAP@0.5 (%) | Precision (%) | Recall (%) | F1-score (%) | FPS |
|---|---|---|---|---|---|
| Baseline YOLOv8s | 62.8 | 59.2 | 50.7 | 54.6 | 109 |
| +p2 head | 65.9 | 67.6 | 61.5 | 64.4 | 69 |
Gains are most pronounced in recall and precision (+10.8 and +8.4 percentage points, respectively), attributable to the finer grid resolution for objects smaller than 20×20 px. Inference speed drops from 109 to 69 FPS but remains within real-time constraints.
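The derived figures in the table can be reproduced from the reported precision, recall, and FPS values. The snippet below is only a consistency check on the numbers quoted above; the variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


baseline = {"precision": 59.2, "recall": 50.7, "fps": 109}
with_p2 = {"precision": 67.6, "recall": 61.5, "fps": 69}

f1_base = round(f1_score(baseline["precision"], baseline["recall"]), 1)
f1_p2 = round(f1_score(with_p2["precision"], with_p2["recall"]), 1)
delta_p = round(with_p2["precision"] - baseline["precision"], 1)
delta_r = round(with_p2["recall"] - baseline["recall"], 1)

# Per-frame latency in milliseconds implied by the reported FPS.
latency_base_ms = 1000 / baseline["fps"]
latency_p2_ms = 1000 / with_p2["fps"]

print(f1_base, f1_p2)    # 54.6 64.4
print(delta_p, delta_r)  # 8.4 10.8
print(f"{latency_base_ms:.1f} ms -> {latency_p2_ms:.1f} ms per frame")
```

The F1 values match the table, and the FPS figures correspond to roughly 5 ms of additional latency per frame for the P2 variant.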
Similar findings emerge in MHD-Net (Shi et al., 2022), where judicious use of P2 (H1) and P8 (H3) heads, supplemented with a lightweight dilated-convolutional block, results in a ~30–40% reduction in parameters/FLOPs while preserving or surpassing the baseline mean average precision on challenging datasets such as BDD100K and ETFOD-v2.
6. Architectural Variations and Augmentations
The utility of the P2 head can be increased by expanding the receptive field immediately upstream of it. MHD-Net introduces parallel convolutions with dilation rates 1, 4, and 8, whose outputs are summed element-wise to aggregate multi-context features before the P2 head. This strategy yields a 2.6-point mAP improvement with a negligible increase in parameters and computation (Shi et al., 2022). The combination of high-resolution localization and an enlarged receptive field supports robust discrimination of small, context-sensitive instances.
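The effect of the three dilation rates on spatial extent follows from the standard formula for a dilated kernel, $d(k-1)+1$. The sketch below assumes 3×3 kernels, which the text above does not state; `effective_kernel` and `same_padding` are hypothetical helper names.

```python
KERNEL_SIZE = 3        # assumed kernel size; not specified in the source
DILATIONS = (1, 4, 8)  # dilation rates of the parallel branches
P2_STRIDE = 4


def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel, in feature-map cells."""
    return dilation * (kernel_size - 1) + 1


def same_padding(kernel_size, dilation):
    """Padding that keeps the output the same size as the input (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


extents = {d: effective_kernel(KERNEL_SIZE, d) for d in DILATIONS}
paddings = {d: same_padding(KERNEL_SIZE, d) for d in DILATIONS}
extent_px = {d: e * P2_STRIDE for d, e in extents.items()}

print(extents)    # {1: 3, 4: 9, 8: 17}
print(paddings)   # {1: 1, 4: 4, 8: 8}
print(extent_px)  # {1: 12, 4: 36, 8: 68}
```

Because each branch uses padding equal to its dilation rate, all three outputs keep the same spatial size and can be summed element-wise, while the parameter count per branch stays that of an ordinary 3×3 convolution.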
7. Application Scope and Configuration Trade-Offs
P2 detection heads are especially effective in domains marked by dense, tiny objects recurring at unpredictable locations, such as agricultural monitoring, traffic surveillance, or driver-attention monitoring in crewed vehicles. Empirical findings across studies underscore the importance of matching the number and scale of heads to the true distribution of target sizes in the data. Adding too many heads can inflate the computational footprint without commensurate accuracy gains, while insufficient scale diversity can leave small objects underrepresented. Evaluations suggest that, with proper configuration, architectures using a P2 head can nearly match the accuracy of much larger detectors with substantially lower model complexity and higher inference efficiency (Chen et al., 28 Jul 2025, Shi et al., 2022).