TCLeaf-Net: Hybrid Detector for Leaf Diseases
- TCLeaf-Net is a transformer-convolution hybrid model designed for robust, lesion-level detection in plant leaves under complex field conditions.
- It integrates a Transformer-Convolution Module (TCM), Raw-Scale Feature Recalling and Sampling (RSFRS), and a Deformable Feature-Alignment Pyramid (DFPN) to overcome challenges such as background clutter and spatial misalignment.
- Empirical results on the Daylily-Leaf dataset demonstrate improved accuracy and efficiency over baseline models, facilitating early and precise plant disease monitoring.
TCLeaf-Net is a transformer-convolution hybrid network designed for robust lesion-level detection of plant leaf diseases in real-field agricultural environments, where cluttered backgrounds, domain shifts, and small annotated datasets pose substantial challenges for conventional object detectors. The framework specifically targets the problem of early, spot-level monitoring to enable prompt intervention, leveraging a newly introduced paired dataset—Daylily-Leaf—with 1,746 RGB images and 7,839 fine-grained lesion annotations captured under both ideal and typical in-field conditions. The architecture integrates global and local attention mechanisms with deformable multi-scale alignment, optimized for efficiency and generalization across different plant disease datasets (Song et al., 13 Dec 2025).
1. Problem Setting and Motivations
TCLeaf-Net aims to address three core technical obstacles: (1) suppression of background clutter that frequently mimics lesion patterns (e.g., soil cracks, dead vegetation), (2) preservation of fine spatial details often lost during aggressive downsampling, and (3) reliable multi-scale fusion under varied lesion sizes and spatial misalignments. These issues are prominent in field scenarios, where variability in lighting, occlusion, and background complexity profoundly degrade detector robustness. The release of Daylily-Leaf, a dual-collection dataset spanning both controlled and in-field image conditions, is central, providing a testbed for lesion-level model development and benchmarking.
2. TCLeaf-Net Architecture
TCLeaf-Net comprises three principal components: the transformer-convolution hybrid backbone (TC backbone), raw-scale feature recalling and sampling (RSFRS) block, and a deformable feature-alignment pyramid (DFPN). The data flow is organized as follows:
2.1 Transformer-Convolution Module (TCM)
The TCM executes four consecutive Transformer-Convolution Layers (TCL), each splitting the input tensor into three parallel branches:
- Global Attention Module (GAM) utilizes Efficient Attention (EA), which approximates softmax self-attention via random feature maps, reducing complexity from $\mathcal{O}(N^2)$ (where $N = H \times W$ is the token count) to $\mathcal{O}(Nd)$ with $d \ll N$.
- Local Attention Module (LAM) extracts local features through stacked Conv–BN–ReLU blocks.
- Residual Branch (RB) facilitates propagation of the original input for robust feature representation.
Fusion is performed by channel-wise concatenation of the three branch outputs, $Y = \mathrm{Concat}\big[\mathrm{GAM}(X),\ \mathrm{LAM}(X),\ \mathrm{RB}(X)\big]$.
This hybrid mechanism enables focused attention on lesion patterns while maintaining global context, with Efficient Attention supporting global modeling on mid-range GPUs.
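A minimal PyTorch sketch of one TCL is given below. The channel dimensions, the 1×1 fusion projection, and the softmax-factorized linear attention used for the GAM branch are illustrative assumptions; the paper's exact feature map and layer hyperparameters may differ.

```python
import torch
import torch.nn as nn


class EfficientAttention(nn.Module):
    """Linear-complexity global attention (GAM branch).

    One common realization of kernelized attention: the softmax over the full
    N x N similarity matrix is replaced by per-branch feature maps, so K^T V is
    computed first and the cost is linear in the token count N = H * W.
    """

    def __init__(self, dim: int, key_dim: int = 32):
        super().__init__()
        self.to_q = nn.Linear(dim, key_dim, bias=False)
        self.to_k = nn.Linear(dim, key_dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, N, C), N = H*W
        q = self.to_q(tokens).softmax(dim=-1)        # normalize over key dim
        k = self.to_k(tokens).softmax(dim=1)         # normalize over tokens
        v = self.to_v(tokens)
        context = k.transpose(1, 2) @ v              # (B, key_dim, C): O(N) in tokens
        out = q @ context                            # (B, N, C)
        return out.transpose(1, 2).reshape(b, c, h, w)


class TransformerConvLayer(nn.Module):
    """One TCL: GAM (global), LAM (local Conv-BN-ReLU), and a residual branch,
    fused by channel-wise concatenation and an assumed 1x1 projection."""

    def __init__(self, dim: int):
        super().__init__()
        self.gam = EfficientAttention(dim)
        self.lam = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.gam(x), self.lam(x), x], dim=1)  # RB = identity
        return self.fuse(y)


if __name__ == "__main__":
    layer = TransformerConvLayer(dim=64)
    print(layer(torch.randn(2, 64, 80, 80)).shape)  # torch.Size([2, 64, 80, 80])
```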
2.2 Raw-Scale Feature Recalling and Sampling (RSFRS)
RSFRS is devised to minimize information loss during resolution reduction by merging learnable convolutional downsampling with bilinear resampling of the raw-scale features.
This dual-path operation keeps feature maps compact while preserving spatial cues, which is crucial for small-lesion discrimination. RSFRS is preceded by Small-Step Overlapping Patch Embedding (SSOPE), a small-kernel, stride-2 convolution whose overlapping windows emphasize boundary continuity, in contrast to standard non-overlapping patch partitioning.
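The sketch below illustrates the SSOPE + RSFRS pipeline under stated assumptions: the kernel sizes, channel splits, and the 1×1 alignment convolution are not specified in the source and are chosen here only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SSOPE(nn.Module):
    """Small-Step Overlapping Patch Embedding: a small-kernel, stride-2
    convolution whose overlapping windows preserve boundary continuity
    (the kernel size is an assumption; the source does not print it)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=2, padding=kernel_size // 2)

    def forward(self, x):
        return self.proj(x)


class RSFRS(nn.Module):
    """Raw-Scale Feature Recalling and Sampling: learnable strided-conv
    downsampling in parallel with bilinear resampling of the raw-scale
    input, concatenated so spatial cues survive the resolution drop."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3,
                              stride=2, padding=1)        # learnable path
        self.align = nn.Conv2d(in_ch, out_ch // 2, kernel_size=1)

    def forward(self, x):
        learned = self.down(x)
        recalled = F.interpolate(self.align(x), size=learned.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return torch.cat([learned, recalled], dim=1)


if __name__ == "__main__":
    x = torch.randn(1, 3, 640, 640)
    feats = RSFRS(64, 128)(SSOPE(3, 64)(x))
    print(feats.shape)  # torch.Size([1, 128, 160, 160])
```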
2.3 Deformable Feature-Alignment Pyramid (DFPN)
DFPN substitutes standard FPN top-down fusion with a Deformable Alignment Block (DAB) integrating:
- Multi-Receptive-Field Perception (MRFP) via parallel depthwise separable convolutions (DWConv) of different kernel sizes,
- Feature Selection Module (FSM) for channel-wise weighting,
- Deformable Convolution (DConv) guided by learned offsets.
Concretely, the top-down features pass through MRFP for multi-receptive-field context aggregation, FSM for channel-wise selection, and an offset-guided DConv that spatially aligns them before fusion; a schematic sketch follows below. This approach strengthens multi-scale fusion and spatial alignment, especially for small and irregular lesion regions.
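The following PyTorch sketch of a DAB-style block combines parallel depthwise separable convolutions (MRFP), squeeze-and-excitation-style channel weighting (FSM), and torchvision's DeformConv2d driven by predicted offsets. Kernel sizes and the exact wiring are illustrative assumptions, not the published specification.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableAlignmentBlock(nn.Module):
    """Sketch of a DAB: MRFP context, FSM channel selection, offset-guided DConv."""

    def __init__(self, ch: int, kernels=(3, 5, 7)):
        super().__init__()
        # MRFP: depthwise separable convolutions with different kernel sizes.
        self.mrfp = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch, bias=False),  # depthwise
                nn.Conv2d(ch, ch, 1, bias=False),                             # pointwise
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            )
            for k in kernels
        )
        self.reduce = nn.Conv2d(len(kernels) * ch, ch, 1)
        # FSM: channel-wise re-weighting (squeeze-and-excitation style).
        self.fsm = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )
        # Deformable conv guided by learned offsets (2 * 3 * 3 = 18 channels).
        self.offset = nn.Conv2d(ch, 18, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, top_down: torch.Tensor) -> torch.Tensor:
        ctx = self.reduce(torch.cat([b(top_down) for b in self.mrfp], dim=1))
        ctx = ctx * self.fsm(ctx)                    # channel-wise selection
        return self.dconv(ctx, self.offset(ctx))     # offset-guided alignment


if __name__ == "__main__":
    dab = DeformableAlignmentBlock(ch=128)
    print(dab(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```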
3. Training Protocols and Dataset Properties
TCLeaf-Net optimization uses a composite loss: Complete IoU (CIoU) regression and Distribution Focal Loss (DFL) for bounding-box precision, plus binary cross-entropy (BCE) for objectness and class scores. The total loss takes the weighted form

$$\mathcal{L} = \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{CIoU}} + \lambda_{\mathrm{dfl}}\,\mathcal{L}_{\mathrm{DFL}} + \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{BCE}},$$

where the $\lambda$ coefficients weight the individual terms.
Stochastic Gradient Descent (SGD) is applied with an initial learning rate of 0.001 and momentum 0.937 over 200 epochs, at a fixed input resolution and a batch size of 16.
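A composite loss of this kind can be assembled from standard components, as in the sketch below (torchvision's complete_box_iou_loss, a simple DFL, and BCE with logits). The per-term weights and the bin handling are illustrative defaults, not values reported in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss


def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Distribution Focal Loss: cross-entropy against the two integer bins
    bracketing each continuous regression target.
    pred_dist: (N, n_bins) logits; target: (N,) values in [0, n_bins - 1]."""
    left = target.floor().long().clamp(0, pred_dist.size(1) - 2)
    right = left + 1
    w_right = target - left.float()
    w_left = 1.0 - w_right
    return (F.cross_entropy(pred_dist, left, reduction="none") * w_left
            + F.cross_entropy(pred_dist, right, reduction="none") * w_right).mean()


def detection_loss(pred_boxes, target_boxes, pred_dist, target_dist,
                   pred_logits, target_labels,
                   w_box=7.5, w_dfl=1.5, w_cls=0.5):
    """Composite loss: CIoU box regression + DFL + BCE classification.
    Boxes are (x1, y1, x2, y2); the weights are illustrative defaults."""
    l_box = complete_box_iou_loss(pred_boxes, target_boxes, reduction="mean")
    l_dfl = dfl_loss(pred_dist, target_dist)
    l_cls = F.binary_cross_entropy_with_logits(pred_logits, target_labels)
    return w_box * l_box + w_dfl * l_dfl + w_cls * l_cls
```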
The Daylily-Leaf dataset comprises 1,746 images—569 train/244 val for ideal split (3,877/1,295 boxes) and 653 train/280 val for in-field split (1,788/879 boxes), covering three lesion classes: Rust, Others, Mid-Late. Data augmentation includes geometric and photometric transforms and weather simulation techniques.
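For illustration only, an augmentation pipeline of this kind could be expressed with Albumentations as below; the specific transforms and probabilities are assumptions rather than the authors' recipe.

```python
import albumentations as A

# Illustrative augmentation for lesion-level boxes in YOLO format.
train_aug = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                                  # geometric
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                           rotate_limit=10, p=0.5),
        A.RandomBrightnessContrast(p=0.5),                        # photometric
        A.HueSaturationValue(p=0.3),
        A.RandomFog(p=0.1),                                       # weather simulation
        A.RandomRain(p=0.1),
        A.RandomSunFlare(p=0.05),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: train_aug(image=img, bboxes=boxes, class_labels=labels)
```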
4. Empirical Performance and Comparative Analysis
Performance overview, including results on the in-field split (all metrics are from Song et al., 13 Dec 2025):
| Model | mAP@50 (%) | GFLOPs | Params (M) |
|---|---|---|---|
| TCLeaf-Net | 78.2 | 157.9 | 46.1 |
| YOLOv8L | 72.8 | 165.4 | 43.6 |
| YOLOv3 | 74.2 | — | — |
| RTDETR-R50 | 52.6 | — | — |
| YOLO12X | 69.2 | — | — |
TCLeaf-Net achieves a 5.4 percentage point improvement in mAP@50 over the YOLOv8L baseline, reduces computational cost by 7.5 GFLOPs, and lowers GPU memory usage by ≈8.7%. Its robustness drop from the ideal to the in-field scenario is 33.8 percentage points, the smallest among the fourteen detectors tested. Cross-dataset generalization is likewise strong: PlantDoc (mAP@50 = 65.7%, F1 = 60.8%), Tomato-Leaf (mAP@50 = 94.6%, F1 = 90.3%), and Rice-Leaf (mAP@50 = 57.3%, F1 = 59.9%).
5. Module-Wise Ablations and Design Insights
Ablation studies reveal the incremental effects of the main modules relative to the YOLOv8L baseline on Daylily-Leaf (in-field):
| Modules Added | mAP@50 (%) | GFLOPs Change | Params Change (M) |
|---|---|---|---|
| TCM only | 75.4 | ↓31.6 | ↓5.4 |
| RSFRS only | 74.1 | ↑8.4 | ↑2.7 |
| DFPN only | 73.4 | ↑14.8 | ↑4.5 |
| TCM+RSFRS | 75.9 | — | — |
| TCM+DFPN | 76.0 | — | — |
| Full TCLeaf-Net | 78.2 | ↓7.5 | ↑2.5 |
Further ablations:
- SSOPE combined with RSFRS lifts mAP@50 from 77.1% to 78.2%.
- DFPN improves over conventional necks: PANet (76.3%), FPN/BiFPN (75.9% each), DFPN (78.2%).
- TCL branch analysis: LAM-only yields mAP@50 = 74.1%, GAM-only = 75.2%, GAM+LAM = 78.2%.
- Efficient Attention in GAM matches MHSA and cross-attention in accuracy but drastically reduces GPU requirements (9.4 GB vs. 25–26 GB).
Observed design principles include the dominant contribution of TCM to both accuracy and efficiency, RSFRS's role in recall and fine-detail preservation, and DFPN's alignment benefit for small-lesion detection. The combined global–local attention provided by GAM+LAM is central to lesion-detection robustness.
6. Generalization and Limitations
TCLeaf-Net demonstrates robustness to domain shifts and competitive cross-dataset generalization, with consistent improvements over recent YOLO and RT-DETR architectures. Key limitations include moderate drops in performance on noisier external datasets (e.g., Rice-Leaf), indicating room for further adaptation to variable quality and unseen species. Efficient Attention is particularly suited for moderate-resource environments, a notable practical benefit.
A plausible implication is that TCLeaf-Net’s architecture—synergistically assembling TCM, RSFRS, and DFPN—may guide future designs in field-deployable agricultural computer vision, focusing on efficient attention mechanisms, fine-scale feature preservation, and adaptive multi-scale alignment.