
DAONet-YOLOv8: Occlusion-Aware Object Detection

Updated 6 December 2025
  • The paper introduces a framework that integrates dual-attention fusion, dynamic synthesis convolution, and an occlusion-aware detection head for superior tea pest and disease detection.
  • It employs joint local-global contextual modeling and adaptive receptive fields to accurately delineate irregular lesion boundaries and handle dense occlusions.
  • Experimental results demonstrate improved precision, recall, and mAP with reduced parameters and computational cost compared to baseline YOLOv8n.

DAONet-YOLOv8 is an occlusion-aware object detection framework designed to address the challenges of pest and disease identification in tea plantations, where dense leaf occlusions, irregular lesion boundaries, and complex backgrounds hinder accurate detection. The architecture enhances the YOLOv8 baseline by systematically integrating three modules: a Dual-Attention Fusion Module (DAFM) for joint local-global contextual modeling, an occlusion-aware detection head (Detect-OAHead) for feature reweighting under partial visibility, and a C2f-DSConv unit that uses dynamic synthesis convolution for adaptive receptive fields. Evaluated on a real-world annotated tea leaf dataset, DAONet-YOLOv8 achieves superior precision, recall, and mean average precision while reducing computational cost and parameter count relative to the canonical YOLOv8n (Wu et al., 28 Nov 2025).

1. Network Architecture and Components

DAONet-YOLOv8 processes 640×640×3 input images via a backbone built from a stack of modified C2f-DSConv blocks. Standard C2f bottleneck units are replaced with a dynamic synthesis convolution composed of parallel square, vertical, and horizontal convolutional branches. After multi-stage downsampling and feature extraction, a Spatial Pyramid Pooling–Fast (SPPF) layer aggregates multi-scale signals. The output is fed to the Dual-Attention Fusion Module (DAFM), which merges spatially localized and globally attentive representations.

Subsequently, a neck comprising PANet and FPN modules fuses features across three scales, maintaining the original YOLOv8 neck topology. Each scale-specific fused feature is then processed by the occlusion-aware Detect-OAHead, which includes channel-interactive attention and depthwise separable convolutions for classification and bounding box regression. The final output consists of per-object objectness scores, six tea leaf class predictions, and bounding-box coordinates.
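
To make the three-scale output concrete, the following sketch walks through the grid sizes for a 640×640 input; the strides of 8, 16, and 32 are standard YOLOv8 defaults, assumed here because the text states the original neck topology is retained:

```python
# Grid sizes at the three detection scales for a 640x640 input.
# Strides 8/16/32 are standard YOLOv8 defaults, assumed here since the
# summary says the original neck topology is unchanged.
for level, stride in zip(("P3", "P4", "P5"), (8, 16, 32)):
    h = w = 640 // stride
    print(f"{level}: {h}x{w} grid; per-cell outputs: "
          f"6 class scores + 4 box coordinates + 1 objectness score")
```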

2. Dual-Attention Fusion Module (DAFM)

The DAFM is constructed to simultaneously emphasize local lesion details and global contextual cues. The first branch performs a 1×1 convolution followed by channel shuffling and a 3×3 convolution, enhancing localized discriminative patterns. The second branch generates query, key, and value tensors from lightweight convolutions and computes a global attention map via

A = Softmax(KQ / a)

with a learnable scaling parameter a > 0. Applying the attention map to V produces feature embeddings that, after a 3×3 convolution and residual connection, encode lesion context and interrelations. The outputs from both branches are fused by element-wise summation, F_out = F_conv + F_att, forming a contextually adaptive feature map that supplies the neck and detection head.
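
A minimal PyTorch sketch of the DAFM follows. The 1×1 Q/K/V convolutions, the shuffle group count, and attention over flattened spatial positions are assumptions; only the two-branch layout, the Softmax(KQ/a) attention with learnable scale a, and the element-wise fusion come from the description above:

```python
# Hedged sketch of the DAFM: local conv branch + global attention branch,
# fused by element-wise summation. Internals marked below are assumptions.
import torch
import torch.nn as nn

class DAFM(nn.Module):
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        # Local branch: 1x1 conv -> channel shuffle -> 3x3 conv.
        self.local_in = nn.Conv2d(channels, channels, 1)
        self.local_out = nn.Conv2d(channels, channels, 3, padding=1)
        # Global branch: lightweight 1x1 convs for Q, K, V; 3x3 projection.
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 3, padding=1)
        self.a = nn.Parameter(torch.tensor(channels ** 0.5))  # learnable scale a

    def _shuffle(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        return (x.view(b, self.groups, c // self.groups, h, w)
                 .transpose(1, 2).reshape(b, c, h, w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_conv = self.local_out(self._shuffle(self.local_in(x)))  # local details
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                                  # (b, c, h*w)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax(k.transpose(1, 2) @ q / self.a, dim=1)  # A = Softmax(KQ/a)
        f_att = self.proj((v @ attn).view(b, c, h, w)) + x        # 3x3 conv + residual
        return f_conv + f_att                                     # F_out = F_conv + F_att

print(DAFM(64)(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 20, 20])
```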

3. C2f-DSConv: Dynamic Synthesis Convolution

Unlike conventional static bottlenecks, C2f-DSConv is composed of three parallel depthwise separable convolutions: a k×k square kernel, a vertical m×1 strip, and a horizontal 1×m strip. Each branch is globally pooled to produce a channel-wise descriptor d, from which fusion weights are predicted by a fully connected (FC) layer and softmax normalization, yielding α = [α₁, α₂, α₃] with Σᵢ αᵢ = 1. Synthesized features are computed as

F_synthesis = α₁·F_sq + α₂·F_v + α₃·F_h

enabling dynamic adaptation to the spatial statistics of dense or irregular lesions. The result is concatenated with C2f outputs according to the original highway scheme.
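
The sketch below illustrates the dynamic synthesis convolution under assumed kernel sizes k = 3 and m = 5, with the three branch descriptors concatenated before the FC layer; the parallel depthwise-separable branches and softmax-normalized fusion weights follow the text:

```python
# Sketch of the dynamic synthesis convolution in C2f-DSConv. Kernel sizes
# and the descriptor concatenation scheme are assumptions.
import torch
import torch.nn as nn

def dw_sep(c: int, kernel, padding) -> nn.Sequential:
    """Depthwise separable conv: per-channel spatial conv + 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(c, c, kernel, padding=padding, groups=c),
        nn.Conv2d(c, c, 1),
    )

class DSConv(nn.Module):
    def __init__(self, c: int, k: int = 3, m: int = 5):
        super().__init__()
        self.square = dw_sep(c, k, k // 2)           # k x k square kernel
        self.vert = dw_sep(c, (m, 1), (m // 2, 0))   # m x 1 vertical strip
        self.horiz = dw_sep(c, (1, m), (0, m // 2))  # 1 x m horizontal strip
        self.fc = nn.Linear(3 * c, 3)                # descriptors -> fusion weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_sq, f_v, f_h = self.square(x), self.vert(x), self.horiz(x)
        # Global average pooling of each branch yields channel descriptors d.
        d = torch.cat([f.mean(dim=(2, 3)) for f in (f_sq, f_v, f_h)], dim=1)
        alpha = torch.softmax(self.fc(d), dim=1)     # alpha_1..alpha_3, sum to 1
        a1, a2, a3 = (alpha[:, i].view(-1, 1, 1, 1) for i in range(3))
        return a1 * f_sq + a2 * f_v + a3 * f_h       # F_synthesis

print(DSConv(32)(torch.randn(2, 32, 40, 40)).shape)  # torch.Size([2, 32, 40, 40])
```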

4. Occlusion-Aware Detection Head (Detect-OAHead)

Detect-OAHead introduces per-branch channel weighting and gating to compensate for partial occlusion. Feature maps F are passed through a depthwise separable convolution and global average pooling to produce a vector z. Two FC layers and a nonlinear activation calculate normalized attention weights w per channel. Features are gated by

F′ = F_d ⊙ w

where ⊙ denotes channel-wise multiplication. These reweighted features propagate through a 1×1 convolution to the final prediction layers for classification and regression. This mechanism explicitly attempts to learn relationships between visible and occluded regions in a parameter-efficient manner.
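
A sketch of the Detect-OAHead gating path is given below. The squeeze ratio, sigmoid normalization, and separate 1×1 prediction convolutions are assumptions; the depthwise separable convolution, global average pooling, two FC layers, and channel-wise gating follow the description above:

```python
# Hedged sketch of the Detect-OAHead channel gating; head widths assumed.
import torch
import torch.nn as nn

class DetectOAHead(nn.Module):
    def __init__(self, c: int, num_classes: int = 6, ratio: int = 4):
        super().__init__()
        self.dwsep = nn.Sequential(                  # depthwise separable conv
            nn.Conv2d(c, c, 3, padding=1, groups=c),
            nn.Conv2d(c, c, 1),
        )
        self.gate = nn.Sequential(                   # two FC layers on z
            nn.Linear(c, c // ratio), nn.ReLU(inplace=True),
            nn.Linear(c // ratio, c), nn.Sigmoid(),  # normalized weights w
        )
        self.cls = nn.Conv2d(c, num_classes, 1)      # 1x1 convs to predictions
        self.box = nn.Conv2d(c, 4, 1)
        self.obj = nn.Conv2d(c, 1, 1)

    def forward(self, x: torch.Tensor):
        f_d = self.dwsep(x)
        z = f_d.mean(dim=(2, 3))                     # global average pooling -> z
        w = self.gate(z)[:, :, None, None]           # per-channel gate, (b, c, 1, 1)
        f_prime = f_d * w                            # F' = F_d (.) w
        return self.cls(f_prime), self.box(f_prime), self.obj(f_prime)

cls, box, obj = DetectOAHead(64)(torch.randn(1, 64, 80, 80))
print(cls.shape, box.shape, obj.shape)
```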

5. Training Losses and Experimental Setup

The loss function is not customized beyond the YOLOv8 three-term formulation, L = λ_box·L_box + λ_obj·L_obj + λ_cls·L_cls, where L_box is the CIoU loss for bounding boxes, L_obj is binary cross-entropy for objectness, and L_cls is categorical cross-entropy for class assignment. There is no explicit loss term to compensate for occlusion effects; all occlusion handling is architectural. Training hyperparameters (learning rate, epochs, batch size) are not reported. The implementation uses PyTorch and an RTX 4060 GPU.
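
Because the weighting coefficients are unreported, the sketch below only illustrates how the three stated terms compose; the λ values are placeholders, and the CIoU term is passed in precomputed since target assignment is beyond its scope:

```python
# Illustrative composition of the stated three-term loss. Lambda defaults
# are placeholders (not reported in the paper); l_box stands in for a
# precomputed CIoU loss over matched boxes.
import torch
import torch.nn.functional as F

def detection_loss(l_box: torch.Tensor,
                   obj_logits: torch.Tensor, obj_targets: torch.Tensor,
                   cls_logits: torch.Tensor, cls_targets: torch.Tensor,
                   lam_box: float = 1.0, lam_obj: float = 1.0,
                   lam_cls: float = 1.0) -> torch.Tensor:
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)  # objectness BCE
    l_cls = F.cross_entropy(cls_logits, cls_targets)                     # class CE
    return lam_box * l_box + lam_obj * l_obj + lam_cls * l_cls

# Dummy example: 8 matched predictions, 6 tea leaf classes.
loss = detection_loss(torch.rand(()), torch.randn(8), torch.rand(8),
                      torch.randn(8, 6), torch.randint(0, 6, (8,)))
print(loss)
```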

The evaluation dataset comprises 2,496 images of tea plants, annotated for six label categories (anthracnose, red leaf disease, tea sooty mold, tea geometrid pest, white spot disease, healthy leaves). Images were acquired in July 2025 under natural lighting and resized for model input. No explicit occlusion labeling is present, although partial lesions are implicitly marked.

6. Quantitative Performance and Comparative Analysis

Test results demonstrate DAONet-YOLOv8 achieves 0.9297 precision, 0.9280 recall, 0.9710 mAP@50, and 0.7690 mAP@50:95, with a parameter count of 2.5M and a cost of 5.5 GFLOPs. Compared to YOLOv8n, this constitutes gains of +2.34 percentage points in precision, +4.68 in recall, +1.40 in mAP@50, and +1.80 in mAP@50:95, alongside a 16.7% reduction in parameter count and a 32% decrease in GFLOPs.

Model           Precision  Recall  mAP@50  mAP@50:95  Params (M)  GFLOPs
DAONet-YOLOv8   0.9297     0.9280  0.9710  0.7690     2.5         5.5
YOLOv8n         0.9063     0.8812  0.9570  0.7510     3.0         8.1
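
As a sanity check, the reported deltas can be reproduced directly from the table: the accuracy gains are absolute percentage-point differences, while the parameter and GFLOP savings are relative reductions:

```python
# Reproducing the reported deltas from the table above.
dao = {"P": 0.9297, "R": 0.9280, "mAP50": 0.9710, "mAP50_95": 0.7690,
       "params": 2.5, "gflops": 5.5}
v8n = {"P": 0.9063, "R": 0.8812, "mAP50": 0.9570, "mAP50_95": 0.7510,
       "params": 3.0, "gflops": 8.1}

for k in ("P", "R", "mAP50", "mAP50_95"):
    print(f"{k}: +{100 * (dao[k] - v8n[k]):.2f} pts")  # +2.34, +4.68, +1.40, +1.80
print(f"params: -{100 * (1 - dao['params'] / v8n['params']):.1f}%")  # -16.7%
print(f"GFLOPs: -{100 * (1 - dao['gflops'] / v8n['gflops']):.1f}%")  # -32.1%
```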

Ablation studies confirm the complementary effectiveness of all three modules: DAFM improves lesion focus, DSConv strengthens the delineation of irregular boundaries, and OAHead enhances robustness under occlusion. DAONet-YOLOv8 ranks highest in mAP@50:95 among all evaluated detectors while maintaining real-time performance.

7. Qualitative Evaluation and Module Impact

Visual analysis with Grad-CAM and bounding box outputs indicates DAONet-YOLOv8 consistently generates fewer false positives, more compact high-confidence boxes, and focuses attention sharply on core lesion regions, even in the presence of dense leaf-branch occlusion and complex backgrounds. In contrast, baseline YOLOv8n exhibits diffused activations and overlapping low-confidence predictions. Ablation results show that combining all three modules yields the most robust and precise detection, confirming their distinct contributions to feature aggregation, occlusion compensation, and boundary modeling.

8. Summary and Implications

DAONet-YOLOv8 advances the state of tea leaf pest and disease detection by synergistically integrating dual-attention fusion, occlusion-aware feature reweighting, and dynamic kernel synthesis. These architectural innovations produce substantial, rigorously validated improvements in both accuracy and efficiency metrics over standard YOLOv8n, substantiating the value of hybrid local-global modeling and explicit occlusion handling in real-world agricultural detection tasks (Wu et al., 28 Nov 2025). A plausible implication is the generalizability of such architectural principles to other small-object and occlusion-dominated detection domains—subject to further empirical validation.

References

Wu et al. (28 Nov 2025). DAONet-YOLOv8: Occlusion-Aware Object Detection.
