SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis

Published 10 May 2026 in cs.MA | (2605.09768v1)

Abstract: Plant disease diagnosis is critical for food security, yet training disease-recognition models that generalize across crops, pathogens, and field conditions remains challenging because labeled disease images are far less abundant and standardized than data for other biotic stresses such as insects or weeds. Frontier vision-LLMs offer new opportunities through improved visual reasoning, but they still struggle with fine-grained disease identification due to the lack of structured, crop-specific symptom knowledge. To address this gap, we curate the largest plant disease image–symptom dataset to date, covering 335 crops, 1,251 disease classes, and approximately 839K images, designed to support training-free, agentic disease prediction. A scalable automated pipeline generates source-grounded symptom descriptions in which each claim is linked to a verbatim web quote; domain experts validate sampled crops and reconcile disease-name variants across sources. As a baseline, we introduce an autonomous visual reasoning agent that identifies anatomical context, narrows candidate diseases using symptom knowledge, sequentially compares reference images, and produces a fully explainable reasoning trace. Incorporating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with consistent gains across all four evaluation crops. Because the framework only requires crop-specific reference images and symptom knowledge, it can be extended to new crops without retraining, while the agentic baseline can directly benefit from future improvements in foundation model capabilities. Dataset and code are available at: https://sage-dataset.github.io/.

Authors (10)

Summary

The paper presents a training-free, agentic system that utilizes anatomical indexing and source-verifiable knowledge for accurate crop disease diagnosis.
It details a scalable pipeline that curates the largest multi-organ crop disease image dataset with structured symptom metadata and expert audits.
Experimental results show that integrating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with gains on all four evaluation crops.

Motivation and Contributions

Crop disease recognition is difficult because labeled images are scarce and heterogeneous, classes are imbalanced, and structured, source-verifiable symptom knowledge is missing. Classical deep learning architectures and vision-language models (VLMs) perform well on narrow, controlled benchmarks, yet they generalize poorly across crop species, pathogens, and variable field conditions. Limited explainability and the inability to ground predictions in authoritative domain knowledge further restrict practical adoption.

The paper "SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis" (2605.09768) addresses these gaps with three contributions:

Curation of the largest plant disease image–symptom dataset to date (839K images, 335 crops, 1,251 disease classes), integrating multi-organ coverage and source-grounded symptom metadata.
Automated pipeline for constructing disease registries with per-claim provenance, linking structured symptom descriptions to verbatim web quotes with field-level expert audit.
An agentic, training-free diagnostic system that operates via sequential visual reasoning, utilizing anatomical indexing and symptom knowledge to guide structured hypothesis narrowing and reference-based comparison with auditable reasoning traces.

The framework extends to new crops and diseases without retraining and delivers consistent accuracy improvements through knowledge base (KB) integration.

Dataset Construction and Knowledge Base

The SAGE dataset is assembled from diverse source categories, including:

Standard benchmarks (PlantVillage, PlantDoc, LeafNet, PlantWild).
Recent large-scale vision-language datasets.
Expert-curated datasets for critical crops with multi-organ labeling.
Community-contributed sets addressing underrepresented crops.

Images are deduplicated and tagged by anatomy. Domain experts resolve mislabeling, noise, and inconsistent disease nomenclature, after which VLMs filter the images against KB-grounded symptom descriptions. Each disease entry combines a canonical class label with organ tags (leaf, stem, root, seed, pod, etc.), which supports an anatomical index used to narrow candidates at inference.
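
The organ-tagged entries described above suggest a simple lookup structure. The following is a minimal sketch of how such an anatomical index might be built; the `DiseaseEntry` class, the `build_anatomical_index` function, and the example entries are hypothetical illustrations, not the paper's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseEntry:
    """Hypothetical registry entry: canonical label plus affected organs."""
    crop: str
    disease: str
    organs: tuple


def build_anatomical_index(entries):
    """Map (crop, organ) to the sorted disease labels tagged with that organ."""
    index = defaultdict(set)
    for entry in entries:
        for organ in entry.organs:
            index[(entry.crop, organ)].add(entry.disease)
    return {key: sorted(names) for key, names in index.items()}


# Illustrative entries only; not taken from the SAGE registry.
entries = [
    DiseaseEntry("soybean", "frogeye leaf spot", ("leaf",)),
    DiseaseEntry("soybean", "sudden death syndrome", ("leaf", "root")),
    DiseaseEntry("soybean", "pod and stem blight", ("stem", "pod")),
]
index = build_anatomical_index(entries)
print(index[("soybean", "leaf")])
```

Keying the index by crop and organ means that identifying the organ in a test image immediately removes every disease that is not tagged with that organ.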

Figure 1: Modular pipeline converting web documents into source-cited knowledge bases, filtering images with domain guidance and splitting into reference/test sets. Agentic evaluation is performed via a reasoning loop informed by anatomy and symptoms.

The disease registry construction pipeline issues targeted web queries, extracts structured claims (pathogen, organ, symptoms), and attaches a verbatim source quote to each claim. Extract-only prompts are used to avoid hallucinated knowledge. Because provenance is recorded per field, experts can audit and verify individual claims directly; in these audits, expert agreement on symptom claims ranged from 70% to 90% across crops.
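
Per-claim provenance of this kind can be checked mechanically. The sketch below assumes a hypothetical `Claim` record and a whitespace-insensitive substring test (`is_grounded`); the paper's actual schema and verification procedure are not specified here, and the page text and URL are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record with field-level provenance."""
    disease: str
    attribute: str   # e.g. "pathogen", "organ", "symptoms"
    value: str       # structured value extracted from the page
    quote: str       # verbatim supporting text
    source_url: str


def normalize(text):
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return " ".join(text.split()).lower()


def is_grounded(claim, source_text):
    """True only if the quote appears verbatim (modulo whitespace and case)."""
    quote = normalize(claim.quote)
    return bool(quote) and quote in normalize(source_text)


# Invented page text and URL, for illustration only.
page = "Lesions begin as small, water-soaked spots\non the lower leaves and later turn brown."
good = Claim("example disease", "symptoms", "water-soaked leaf spots",
             "small, water-soaked spots on the lower leaves",
             "https://example.org/factsheet")
bad = Claim("example disease", "symptoms", "powdery growth",
            "white powdery growth on stems",
            "https://example.org/factsheet")
print(is_grounded(good, page), is_grounded(bad, page))
```

A claim whose quote cannot be found in its source would be dropped or flagged for expert review, rather than kept on the strength of the model's own paraphrase.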

Figure 2: Source distribution for KB claims per crop, with dominant contributions from US extension publications, complemented by international datasheets and multi-university networks.

Agentic Diagnostic Pipeline

The diagnostic agent operates in a multi-turn loop, receiving:

Test image.
Reference images (organized by disease class and organ).
Symptom KB and anatomical index (mapping organs to diseases).
Candidate disease list and reference viewing budget $k$.

The agent first deduces the anatomical context and extracts visible symptoms from the test image. It then narrows candidates, first with the anatomical index and then by ranking against the symptom descriptions. Reference images are viewed sequentially (not in parallel); at each comparison the agent assesses visual similarity and rejects incompatible candidates. The final prediction is based on visual similarity, supported by KB guidance, and is accompanied by a reasoning trace that lists the references viewed, the elimination steps, and the justification.
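
The control flow described above can be sketched as follows. This is a rough illustration under several assumptions, not the authors' implementation: `similarity` is a stand-in callable for the VLM's visual comparison, symptom ranking uses plain word overlap, and $k$ is treated as the total number of reference views (the text does not say whether the budget is per class or overall). All names are hypothetical.

```python
def diagnose(organ, observed_symptoms, anatomical_index, symptom_kb,
             references, similarity, k):
    """Return (prediction, trace) for one test image; all inputs are stand-ins."""
    trace = []

    # Step 1: keep only diseases recorded for the observed organ.
    candidates = list(anatomical_index.get(organ, []))
    trace.append(("anatomy", organ, tuple(candidates)))

    # Step 2: rank candidates by word overlap with KB symptom text (assumption).
    def overlap(disease):
        kb_words = set(symptom_kb.get(disease, "").lower().split())
        return len(kb_words & set(observed_symptoms))

    ranked = sorted(candidates, key=overlap, reverse=True)
    trace.append(("symptom_rank", tuple(ranked)))

    # Step 3: view references one at a time until the budget k is spent.
    best = ranked[0] if ranked else None   # KB-only answer when k == 0
    best_score, views = float("-inf"), 0
    for disease in ranked:
        for ref in references.get(disease, []):
            if views >= k:
                break
            score = similarity(ref)
            views += 1
            trace.append(("compare", disease, ref, score))
            if score > best_score:
                best, best_score = disease, score
        if views >= k:
            break

    trace.append(("predict", best))
    return best, trace


# Toy inputs, for illustration only.
anatomical_index = {"leaf": ["A", "B"], "root": ["C"]}
symptom_kb = {"A": "brown spots with gray centers",
              "B": "yellow interveinal chlorosis"}
references = {"A": ["a1.jpg"], "B": ["b1.jpg"]}
scores = {"a1.jpg": 0.2, "b1.jpg": 0.9}

pred, trace = diagnose("leaf", {"yellow", "chlorosis"}, anatomical_index,
                       symptom_kb, references, scores.get, k=2)
print(pred, len(trace))
```

Because every step appends to `trace`, the returned list doubles as the audit trail: it records which candidates survived anatomical narrowing, how they were ranked, and which references were actually viewed.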

This chain-of-thought inference, with its audit trail, supports full explainability, in contrast to opaque single-pass classification workflows.

Experimental Results

Evaluation was conducted on four crops (Soybean, Corn, Tomato, Mango), spanning 25–30 disease classes (except Mango, with four). Reference budgets $k$ were varied (0, 1, 4, 8), and KB integration was toggled. Model tiers included Claude Haiku, Sonnet, and Opus.

Key results:

KB integration (no references, $k=0$) yielded a +14.1 pp accuracy gain on Soybean and +9.1 pp on Corn relative to the baseline agent.
With reference budget $k=8$ and KB, Tomato reached 76.1% accuracy vs 67.0% without KB.
Mean improvement from KB integration across the four crops at $k=8$ was +16.2 pp over the agent without the KB.
KB impact was most pronounced in high-class-count crops; accuracy saturated on low-class-count crops where visual cues sufficed.

Figure 3: Cost-accuracy tradeoff across model tiers and reference budgets; accuracy increases in tandem with reference budget and model quality, with diminishing returns beyond moderate $k$.
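
The gains above are reported in percentage points (pp), which are absolute differences between two accuracies rather than relative changes. As a small worked example using only the Tomato figures quoted in the results (76.1% with KB, 67.0% without, at $k=8$), the helper names below being illustrative:

```python
def pp_gain(acc_with_kb, acc_without_kb):
    """Absolute gain in percentage points between two accuracies given in %."""
    return round(acc_with_kb - acc_without_kb, 1)


def relative_gain(acc_with_kb, acc_without_kb):
    """Relative improvement in %, for contrast with the pp figure."""
    return round(100.0 * (acc_with_kb - acc_without_kb) / acc_without_kb, 1)


tomato_pp = pp_gain(76.1, 67.0)
tomato_rel = relative_gain(76.1, 67.0)
print(tomato_pp, tomato_rel)
```

The same 9.1 pp difference corresponds to a relative improvement of roughly 13.6%, which is why the two units should not be mixed when comparing crops.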

Figure 4: Confusion matrices for Soybean. KB and reference-guided agent (right) achieves better class separation, reducing systematic misclassifications.

Explainability and Expert Validation

Each diagnostic trace provides an audit trail: anatomical narrowing, symptom-guided candidate ranking, sequential reference comparison, and justification for rejection or acceptance. The KB claims that these traces rely on are themselves checked through field-level expert audits (Figure 5).

Figure 5: Expert audit agreement rates for KB-sourced symptom claims across crops. Disagreement is concentrated in fine-grained symptom distinctions.

Theoretical and Practical Implications

The SAGE framework demonstrates that:

Training-free, agentic chain-of-thought reasoning, grounded in source-verifiable structured knowledge, enables scalable and extensible crop disease diagnosis with explainability and accuracy gains.
The pipeline benefits directly as foundation models improve and requires no retraining for new crops, only crop-specific reference images and symptom knowledge.
Accuracy improvements are contingent on both knowledge integration and reference-based comparison; optimal gains are realized in challenging, fine-grained class regimes.

The practical utility spans breeding pipelines, extension scouting, and deployment in field scenarios where explainability and rapid extensibility are critical.

Figure 6: Diagnostic accuracy curves as reference budget $k$ increases, with and without internet KB. KB effects compound with increasing $k$.

Future Directions

Areas for further research include:

Expanding registry coverage to crops documented in non-English web pages, improving global applicability.
Reducing inference costs (API and compute) for smartphone-level field deployment.
Advanced agent designs: handling multiple co-occurring diseases, stage-aware symptom differentiation, and batch inference protocols.
Integrating broader domain benchmarks and challenging reasoning datasets (AgMMU, AgroBench, etc.).

Conclusion

The SAGE framework delivers agentic, source-grounded diagnosis at scale, outperforming single-pass baselines in accuracy and offering full traceability for predictions. The modular pipeline and open dataset lay the groundwork for extensible reasoning architectures in agricultural AI, enabling robust, explainable disease classification without per-crop supervised training. The methodology augments existing vision-language paradigms, prioritizing transparency, extensibility, and practical deployment.
