- The paper introduces a cross-architecture distillation framework that transfers global reasoning from a ViT teacher to a compact MobileNetV2 student.
- It demonstrates that, with multi-level loss objectives, the distilled CNN can match or exceed teacher performance while reducing parameters by over 170×.
- The framework achieves real-time, edge-ready leaf disease classification with sub-10ms latency and robust interpretability through Grad-CAM analyses.
Cross-Architecture Knowledge Distillation for Edge-Efficient Leaf Disease Recognition: An Analysis of AgriKD
Introduction
Automated disease classification in crops is vital for precision agriculture, where timely identification curbs crop loss and mitigates excessive agrochemical use. Deep learning, especially CNNs and more recently ViTs, has established high accuracy on benchmark datasets. However, the deployment of high-capacity models like ViTs is substantially limited in edge environments, primarily due to computational and memory constraints inherent to field devices. Efficient, lightweight models such as MobileNetV2 offer operational viability but generally underperform in scenarios requiring nuanced, global contextual feature modeling. AgriKD introduces a cross-architecture knowledge distillation (KD) framework, aiming to bridge the representational gap between ViTs and CNNs, bringing transformer-derived global reasoning capabilities to edge-efficient convolutional backbones for leaf disease classification (2605.01355).
Methodology
Cross-Architecture Distillation with Multi-Level Objectives
AgriKD implements a comprehensive distillation objective designed to transfer multi-faceted knowledge from a ViT-Base teacher (12 layers, 86M parameters, rich global dependencies modeled via self-attention) to a truncated MobileNetV2 student (a compact 2.6M-parameter subset of the backbone). Truncating the student after the 5th inverted residual (IR) block yields a 14×14 feature map that matches the spatial layout of the ViT patch tokens, facilitating spatially consistent knowledge transfer.
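The spatial alignment described above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a 224×224 input, 16×16 ViT patches, and a cumulative student stride of 16 at the truncation point, none of which is stated explicitly in this summary. The function names are hypothetical.

```python
from typing import List, Tuple


def vit_token_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (grid side, number of patch tokens), excluding any [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


def cnn_feature_side(image_size: int, strides: List[int]) -> int:
    """Spatial side length after a chain of strided stages (ceil division,
    i.e. assuming 'same'-style padding)."""
    side = image_size
    for stride in strides:
        side = -(-side // stride)  # ceiling division
    return side


# Assumed settings: 224x224 input, 16x16 patches, four stride-2 stages in the student.
side, tokens = vit_token_grid(224, 16)
student_side = cnn_feature_side(224, [2, 2, 2, 2])
print(side, tokens, student_side)  # 14 196 14
```

Under these assumptions both networks expose a 14×14 grid, so each student feature location can be paired with one teacher patch token.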
Distillation Components
The total loss combines five terms:
- Cross-entropy loss: Standard classification supervision with hard labels.
- Logit-based distillation: KL divergence between teacher and student softened predictions, leveraging teacher "dark knowledge."
- Relation-based distillation: Correlation structure (inter-instance, inter-class) preservation aligns teacher and student predictive confusion patterns, enforcing higher-level relational consistency.
- Projection 1 (Partially Cross-Attention): The student's features are mapped into pseudo-attention via convolutional Q/K/V projections, with partial injection of teacher Q/K/V information, encouraging the student to mimic transformer attention in convolutional feature space.
- Projection 2 (Group-wise Linear Projection): Student feature groups are linearly projected and aligned with teacher token embeddings, spatially partitioned, ensuring fine-grained matching of localized representations.
Weights for these losses are heuristically initialized based on per-component validation F1 contribution.
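The prediction-level terms and their weighted combination can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the temperature value, the T² scaling of the KL term, the cosine-similarity form of the inter-instance relation loss, and the unit weights are all assumptions, and the two projection losses are omitted because they require the models' internal features.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one logit vector."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Hard-label classification loss for one sample."""
    return -math.log(softmax(logits)[label])


def kd_kl(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened predictions, scaled by T^2
    (a common convention, assumed here)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s))
    return temperature ** 2 * kl


def instance_similarity(prob_rows):
    """Cosine similarity between every pair of samples' predicted distributions."""
    norms = [math.sqrt(sum(p * p for p in row)) for row in prob_rows]
    return [[sum(a * b for a, b in zip(ri, rj)) / (ni * nj)
             for rj, nj in zip(prob_rows, norms)]
            for ri, ni in zip(prob_rows, norms)]


def relation_loss(student_batch, teacher_batch):
    """Mean squared gap between student and teacher inter-instance similarity
    matrices (one possible relational objective, assumed for illustration)."""
    sim_s = instance_similarity([softmax(z) for z in student_batch])
    sim_t = instance_similarity([softmax(z) for z in teacher_batch])
    n = len(sim_s)
    return sum((sim_s[i][j] - sim_t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def weighted_total(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Toy batch of two samples and three classes (made-up logits).
student = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
teacher = [[3.0, 0.2, -2.0], [-0.5, 2.0, 0.1]]
labels = [0, 1]
n = len(labels)
terms = {
    "ce": sum(cross_entropy(s, y) for s, y in zip(student, labels)) / n,
    "logit_kd": sum(kd_kl(s, t) for s, t in zip(student, teacher)) / n,
    "relation": relation_loss(student, teacher),
}
weights = {"ce": 1.0, "logit_kd": 1.0, "relation": 1.0}  # placeholder weights
print(weighted_total(terms, weights))
```

In a full setup the two projection losses would enter `terms` as additional named entries, with `weights` set from the per-component validation results rather than fixed at 1.0.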
Imbalance Handling
Data-level (Weighted Random Sampling) and loss-level (Focal Loss) imbalance mitigations are selectively applied. Significant impact is observed only for highly imbalanced datasets (Potato), consistent with literature reporting the limited necessity of re-balancing in moderately imbalanced settings.
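The two imbalance mitigations can be illustrated with a short stdlib sketch. It assumes inverse-class-frequency sample weights and the standard focal loss form with focusing parameter `gamma`; the specific values of `gamma` and `alpha`, and the toy 11:1 label list, are illustrative assumptions rather than settings reported for AgriKD.

```python
import math
import random
from collections import Counter


def sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so every class carries the same total sampling mass."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def focal_loss(probs, label, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: down-weights examples the model already
    classifies confidently. With gamma=0 and alpha=1 it reduces to cross-entropy."""
    p_true = probs[label]
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# Toy label list with an 11:1 imbalance between class 0 and class 1.
labels = [0] * 11 + [1]
weights = sample_weights(labels)

# Data-level mitigation: draw a rebalanced batch of indices with replacement.
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=weights, k=8)
print([labels[i] for i in batch])

# Loss-level mitigation: a confident prediction contributes far less than a hard one.
print(focal_loss([0.1, 0.9], 1), focal_loss([0.9, 0.1], 1))
```

In a PyTorch pipeline, weights of this kind are what one would typically pass to `torch.utils.data.WeightedRandomSampler`; the stdlib `random.choices` call stands in for it here.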
Training Protocol
Teacher models are fine-tuned and then kept frozen; student models are optimized solely via the distillation objectives. Cross-validation and ablation studies ascertain robustness and isolate the effect of each loss term.
Experimental Evaluation
Datasets
Three public datasets spanning varying acquisition conditions and class structures are evaluated: Tomato (10 classes, 1.6K images, ~4:1 imbalance), Burmese Grape (5 classes, 3.1K images, ~3.4:1 imbalance), and Potato (7 classes, 3.1K images, extreme 11:1 imbalance).
Baseline and Ablation Results
ViT-Base consistently surpasses CNN backbones in both accuracy and F1 across datasets. The distilled MobileNetV2 student, using the full AgriKD objective, matches or exceeds teacher performance on Tomato (+1.25% F1) and Burmese Grape (+1.30% F1), and closely approaches the teacher on Potato (-1.03% F1) while offering a 172× compression in parameter count and >19× inference acceleration.
Ablations confirm:
- Logit-based KD: Principal single-component gain, especially as class granularity increases.
- Relation-based KD: Critical for transferring high-level confusion structure absent in simple logit matching.
- Projection losses: Necessary to align heterogeneous internal representations, but their contribution is auxiliary to prediction-level supervision.
Visual Interpretability
Grad-CAM analyses show that the distilled student achieves more coherent, spatially localized activations than both the baseline MobileNetV2 and the ViT teacher, indicating effective transfer of both global and spatial disease cues.
Edge Deployment
The distilled model is validated in ONNX, TFLite FP16, and TensorRT FP16 formats. Accuracy and F1 are stable across conversions (Δ<0.2% absolute), while latency drops below 2.2 ms on NVIDIA Jetson Orin Nano and model size falls under 1 MB (TFLite), ensuring true real-time viability in edge settings.
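Latency figures of this kind are usually obtained with a warm-up phase followed by repeated timed calls. The harness below is a generic sketch, not the measurement procedure used for AgriKD: `benchmark_latency_ms` is a hypothetical helper, and the `predict` callable is a stand-in for whatever runtime session (ONNX, TFLite, TensorRT) is being measured.

```python
import statistics
import time


def benchmark_latency_ms(predict, sample, warmup=10, runs=100):
    """Time repeated single-sample calls and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "p95": timings[p95_index],
    }


# Stand-in workload; replace with a real inference call on the target device.
def dummy_predict(x):
    return sum(x)


stats = benchmark_latency_ms(dummy_predict, [0.0] * 1000, warmup=5, runs=50)
print(stats)
```

On GPU-backed runtimes the timed call should block until the result is available; otherwise asynchronous execution can make the measured latency look lower than it is.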
Comparative Analysis
On the Potato dataset, AgriKD outperforms previous hybrid and CNN baselines in accuracy and matches or slightly trails larger, less efficient models in F1-score. The superior accuracy-efficiency trade-off is maintained vis-à-vis both recent CNN-Transformer hybrids and feature-engineered shallow classifiers.
Implications and Future Developments
AgriKD advances practical, edge-deployable AI for agriculture by unifying global transformer representations with highly efficient CNN inference. The systematic use of multi-level distillation, especially the architectural alignment modules, suggests a generalizable framework across other cross-architecture KD settings, not only for agriculture but for broader vision tasks constrained by compute/resource limitations.
Practical implications include:
- Real-world agricultural deployment: Consistent on-device inference at sub-10 ms latency with minimal resource footprint.
- Generalizable methodology: The cross-architecture distillation protocol can be adapted for other visual application domains requiring deployment efficiency.
- Framework extensibility: Future extensions could incorporate ViT [CLS] tokens, richer spatial token alignment strategies, or alternative student backbones.
Theoretically, the results suggest that student CNNs can internalize the global contextual reasoning characteristic of transformers if appropriately guided via multi-level distillation, challenging the dichotomy between convolutional locality and transformer globality. They also invite further exploration of distillation objectives that optimally preserve both fine-grained and holistic visual information.
Conclusion
AgriKD demonstrates that multi-component, cross-architecture knowledge distillation is highly effective for compressing ViT-grade performance into edge-efficient CNNs for fine-grained plant disease recognition. The distilled models offer strong accuracy-efficiency characteristics, robust interpretability, and seamless multi-platform deployment, narrowing the gap between state-of-the-art visual reasoning and practical, field-ready AI solutions. Future work will evaluate the transferability of this paradigm to other tasks, architectures, and real-world agricultural conditions, and further refine representation alignment at both the global and local levels.