AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification

Published 2 May 2026 in cs.CV and cs.AI | (2605.01355v2)

Abstract: Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.


Authors (4)

Summary

The paper introduces a cross-architecture distillation framework that transfers global reasoning from a ViT teacher to a compact MobileNetV2 student.
It demonstrates that, with multi-level loss objectives, the distilled CNN can match or exceed teacher performance while reducing parameters by over 170×.
The framework achieves real-time, edge-ready leaf disease classification with sub-10 ms latency and robust interpretability through Grad-CAM analyses.

Cross-Architecture Knowledge Distillation for Edge-Efficient Leaf Disease Recognition: An Analysis of AgriKD

Introduction

Automated disease classification in crops is vital for precision agriculture, where timely identification curbs crop loss and mitigates excessive agrochemical use. Deep learning, especially CNNs and more recently ViTs, has established high accuracy on benchmark datasets. However, the deployment of high-capacity models like ViTs is substantially limited in edge environments, primarily due to computational and memory constraints inherent to field devices. Efficient, lightweight models such as MobileNetV2 offer operational viability but generally underperform in scenarios requiring nuanced, global contextual feature modeling. AgriKD introduces a cross-architecture knowledge distillation (KD) framework, aiming to bridge the representational gap between ViTs and CNNs, bringing transformer-derived global reasoning capabilities to edge-efficient convolutional backbones for leaf disease classification (2605.01355).

Methodology

Cross-Architecture Distillation with Multi-Level Objectives

AgriKD implements a comprehensive distillation objective designed to transfer multi-faceted knowledge from a ViT-Base teacher (12 layers, 86M parameters, rich global dependencies via self-attention) to a truncated MobileNetV2 student (a compact, 2.6M-parameter subset of the full network). Truncating the student after the fifth inverted-residual (IR) block yields feature maps with $14\times14$ spatial granularity, which matches the grid of ViT patch tokens and facilitates spatially consistent knowledge transfer.
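The grid match can be checked with simple shape arithmetic. The sketch below is illustrative only: the 224-pixel input, the 16-pixel patch size, and the cumulative stride of 16 at the truncation point are assumptions that are consistent with the $14\times14$ figure above but are not stated in this summary, and the function names are hypothetical.

```python
def vit_token_grid(image_size: int, patch_size: int) -> int:
    """Side length of the ViT patch-token grid for a square image (CLS token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def cnn_feature_grid(image_size: int, total_stride: int) -> int:
    """Side length of a CNN feature map after downsampling by total_stride."""
    return image_size // total_stride


# Assumed values (not taken from the paper): 224x224 input, 16x16 patches,
# and a cumulative stride of 16 where the student is truncated.
IMAGE_SIZE = 224
teacher_side = vit_token_grid(IMAGE_SIZE, patch_size=16)
student_side = cnn_feature_grid(IMAGE_SIZE, total_stride=16)

print(teacher_side, student_side)   # 14 14
print(teacher_side * teacher_side)  # 196 patch tokens
```

Under these assumptions each student feature location corresponds to exactly one teacher patch token, which is what makes location-wise feature alignment possible without resampling.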

Distillation Components

The total loss combines five terms:

Cross-entropy loss: Standard classification supervision with hard labels.
Logit-based distillation: KL divergence between teacher and student softened predictions, leveraging teacher "dark knowledge."
Relation-based distillation: Preserving the correlation structure (inter-instance and inter-class) aligns the teacher's and student's predictive confusion patterns, enforcing higher-level relational consistency.
Projection 1 (Partially Cross-Attention): Student’s features are mapped into pseudo-attention via convolutional Q/K/V projections, with partial injection of teacher Q/K/V information, encouraging the student to mimic transformer attention in convolutional feature space.
Projection 2 (Group-wise Linear Projection): Student feature groups are linearly projected and aligned with teacher token embeddings, spatially partitioned, ensuring fine-grained matching of localized representations.
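As an illustration of the logit-based term in the list above, the following is a minimal sketch of temperature-softened KL distillation for a single sample. The temperature value of 4.0 and the $T^2$ scaling (a common convention in the distillation literature) are assumptions, not settings confirmed by this summary; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature default and the T^2 factor are assumed conventions,
    not values taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.5, 0.1]

print(kd_kl_loss(teacher, teacher))  # 0.0 (identical distributions)
print(kd_kl_loss(teacher, student) > 0.0)  # True
```

A higher temperature flattens both distributions, so the relative probabilities the teacher assigns to incorrect classes contribute more to the loss.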

Weights for these losses are heuristically initialized based on per-component validation F1 contribution.
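One plausible way to write the combined objective is as a weighted sum of the five terms. The symbols below are notation introduced here for illustration, and the assumption that every term (including cross-entropy) carries its own weight is not confirmed by this summary.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{CE}}\,\mathcal{L}_{\text{CE}}
  + \lambda_{\text{KD}}\,\mathcal{L}_{\text{KD}}
  + \lambda_{\text{rel}}\,\mathcal{L}_{\text{rel}}
  + \lambda_{\text{P1}}\,\mathcal{L}_{\text{P1}}
  + \lambda_{\text{P2}}\,\mathcal{L}_{\text{P2}}
```

Here $\mathcal{L}_{\text{CE}}$ is the hard-label cross-entropy, $\mathcal{L}_{\text{KD}}$ the logit-based term, $\mathcal{L}_{\text{rel}}$ the relation-based term, and $\mathcal{L}_{\text{P1}}$, $\mathcal{L}_{\text{P2}}$ the two projection terms, with each $\lambda$ being the corresponding heuristically initialized weight.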

Imbalance Handling

Data-level (Weighted Random Sampling) and loss-level (Focal Loss) imbalance mitigations are selectively applied. Significant impact is observed only for highly imbalanced datasets (Potato), consistent with literature reporting the limited necessity of re-balancing in moderately imbalanced settings.
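The two mitigation levels can be sketched in a few lines. This is an illustrative example under stated assumptions: the toy label list is hypothetical, the inverse-frequency weighting is one common way to drive weighted random sampling, and the focusing parameter `gamma=2.0` is a widely used default rather than a value reported here.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Per-sample weights proportional to 1 / class count (data-level mitigation)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def focal_loss(prob_true_class, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t) (loss-level mitigation)."""
    return -((1.0 - prob_true_class) ** gamma) * math.log(prob_true_class)


# Hypothetical toy labels with an 11:1 imbalance.
labels = ["healthy"] * 11 + ["blight"] * 1
weights = inverse_frequency_weights(labels)

print(imbalance_ratio(labels))  # 11.0
# With gamma = 0 the focal loss reduces to ordinary cross-entropy;
# with gamma > 0 confidently classified samples are down-weighted.
print(focal_loss(0.9) < focal_loss(0.9, gamma=0.0))  # True
```

The weights could be passed to a sampler such as `random.choices(labels, weights=weights, k=...)` so that each class is drawn with roughly equal probability regardless of its size.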

Training Protocol

Teacher models are fine-tuned and then frozen; student models are optimized solely via the combined distillation objective. Cross-validation and ablation studies assess robustness and isolate the effect of each loss term.

Experimental Evaluation

Datasets

Three public datasets spanning varying acquisition conditions and class structures are evaluated: Tomato (10 classes, 1.6K images, ~4:1 imbalance), Burmese Grape (5 classes, 3.1K, ~3.4:1 imbalance), and Potato (7 classes, 3.1K, extreme 11:1 imbalance).

Baseline and Ablation Results

ViT-Base consistently surpasses CNN backbones in both accuracy and F1 across datasets. The distilled MobileNetV2 student, using the full AgriKD objective, matches or exceeds teacher performance on Tomato (+1.25% F1) and Burmese Grape (+1.30% F1), and closely approaches the teacher on Potato (-1.03% F1) while offering a 172$\times$ compression in parameter count and $>$19$\times$ inference acceleration.

Ablations confirm:

Logit-based KD: Principal single-component gain, especially as class granularity increases.
Relation-based KD: Critical for transferring high-level confusion structure absent in simple logit matching.
Projection losses: Necessary to align heterogeneous internal representations, but their contribution is auxiliary to prediction-level supervision.
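To make the difference between logit matching and relational matching concrete, the sketch below compares pairwise similarity structure rather than individual predictions. The specific form (Gram matrices over softened predictions, compared by mean squared difference) is an assumption chosen for illustration; the paper's exact relation loss is not given in this summary, and all names are hypothetical.

```python
def gram(rows):
    """Pairwise dot products between rows: result[i][j] = <rows[i], rows[j]>."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def mean_squared_diff(a, b):
    """Mean squared element-wise difference between two equally shaped matrices."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def relation_loss(teacher_probs, student_probs):
    """Sum of inter-instance (N x N) and inter-class (C x C) relation mismatches."""
    inter_instance = mean_squared_diff(gram(teacher_probs), gram(student_probs))
    inter_class = mean_squared_diff(
        gram(transpose(teacher_probs)), gram(transpose(student_probs))
    )
    return inter_instance + inter_class


# Hypothetical batch: 2 samples, 3 classes (rows are probability vectors).
teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_probs = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]]

print(relation_loss(teacher_probs, teacher_probs))  # 0.0
print(relation_loss(teacher_probs, student_probs) > 0.0)  # True
```

The inter-class matrix captures which classes tend to receive probability mass together across a batch, which is one way to represent the confusion structure that per-sample logit matching does not constrain directly.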

Visual Interpretability

Grad-CAM analyses show that the distilled student produces more coherent, spatially localized activations than both the baseline MobileNetV2 and the ViT teacher, indicating effective transfer of both global and spatial disease cues.

Deployment and Cross-Format Validation

The distilled model is validated on ONNX, TFLite FP16, and TensorRT FP16 formats. Accuracy and F1 are stable across conversions ($\Delta < 0.2\%$ absolute), while latency drops below 2.2 ms on NVIDIA Jetson Orin Nano and model size falls under 1 MB (TFLite), ensuring true real-time viability in edge settings.
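A cross-format check of this kind can be sketched as a comparison of predictions from the reference model and a converted runtime on the same test set. The example below is a minimal illustration: the prediction lists are hypothetical toy data rather than results from the paper, the helper names are invented here, and only the 0.2% tolerance comes from the text above.

```python
def accuracy(predictions, labels):
    """Fraction of predictions equal to the corresponding label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def conversion_report(reference_preds, converted_preds, labels):
    """Compare a converted runtime's predictions against the reference model."""
    ref_acc = accuracy(reference_preds, labels)
    conv_acc = accuracy(converted_preds, labels)
    return {
        "reference_acc": ref_acc,
        "converted_acc": conv_acc,
        "abs_delta_pct": abs(ref_acc - conv_acc) * 100.0,
        # Share of samples where both runtimes predict the same class.
        "agreement": accuracy(converted_preds, reference_preds),
    }


def within_tolerance(report, max_delta_pct=0.2):
    """True if the absolute accuracy change stays below the tolerance (in percent)."""
    return report["abs_delta_pct"] < max_delta_pct


# Hypothetical toy data: class indices for 10 test images.
labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
reference_preds = list(labels)
converted_preds = list(labels)
converted_preds[9] = 3  # one prediction flips after conversion

report = conversion_report(reference_preds, converted_preds, labels)
print(report["agreement"])       # 0.9
print(within_tolerance(report))  # False
```

Reporting prediction agreement alongside the accuracy delta is a useful design choice, because two runtimes can have equal accuracy while disagreeing on individual samples.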

Comparative Analysis

On the Potato dataset, AgriKD outperforms previous hybrid and CNN baselines in accuracy and matches or slightly trails larger, less efficient models in F1-score. The superior accuracy-efficiency trade-off is maintained vis-à-vis both recent CNN-Transformer hybrids and feature-engineered shallow classifiers.

Implications and Future Developments

AgriKD advances practical, edge-deployable AI for agriculture by unifying global transformer representations with highly efficient CNN inference. The systematic use of multi-level distillation, especially the architectural alignment modules, suggests a generalizable framework across other cross-architecture KD settings, not only for agriculture but for broader vision tasks constrained by compute/resource limitations.

Practical implications include:

Real-world agricultural deployment: Consistent on-device inference at sub-10 ms latency with minimal resource footprint.
Generalizable methodology: The cross-architecture distillation protocol can be adapted for other visual application domains requiring deployment efficiency.
Framework extensibility: Future extensions could incorporate ViT [CLS] tokens, richer spatial token alignment strategies, or alternative student backbones.

Theoretically, the results suggest that student CNNs can internalize the global contextual reasoning characteristic of transformers if appropriately guided via multi-level distillation, challenging the dichotomy between convolutional locality and transformer globality. They also invite further exploration of distillation objectives that optimally preserve both fine-grained and holistic visual information.

Conclusion

AgriKD demonstrates that multi-component, cross-architecture knowledge distillation is highly effective for compressing ViT-grade performance into edge-efficient CNNs for fine-grained plant disease recognition. The distilled models offer strong accuracy-efficiency characteristics, robust interpretability, and seamless multi-platform deployment, narrowing the gap between state-of-the-art visual reasoning and practical, field-ready AI solutions. Future work will evaluate the transferability of this paradigm to other tasks, architectures, and real-world agricultural conditions, and further refine representation alignment at both the global and local levels.