Agri-Pest Visual Question Answering
- Agri-Pest VQA is a specialized subfield of multimodal AI that integrates computer vision, NLP, and agricultural expertise to diagnose crop pests and diseases via visual queries.
- It employs structured caption generation, dual VQA pipelines, and LLM-based evaluators to provide transparent, evidence-grounded answers for both pest identification and management.
- Evaluation strategies focus on accuracy, chain-of-thought traceability, and domain-specific performance across curated datasets, ensuring reliable and interpretable results.
Agri-Pest Visual Question Answering (VQA) is a subfield of multimodal artificial intelligence focused on the automated interpretation and reasoning over agricultural imagery, particularly for the diagnosis, identification, and management of crop pests and diseases via natural-language queries. These systems integrate computer vision, natural language processing, and agricultural domain knowledge to answer visual questions that require fine-grained understanding of complex, high-variation biological phenomena. State-of-the-art frameworks unify objective morphological analysis, evidence-grounded reasoning, and transparent answer justification, enabling both robust pest recognition and interpretability aligned with expert decision-making protocols (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
1. Core Problem Definition and Task Taxonomy
Agri-Pest VQA systems are presented with field images of crops—often with ambiguous, confounded, or subtle pest and disease symptoms—and must return natural-language answers to domain-specific questions. Key task categories include:
- Pest/Disease Identification: Determination of the species or disease affecting a plant or field based on image evidence. This encompasses both direct pest identification (e.g., “Which insect is on the leaf?”) and disease diagnostics (“What is the cause of these leaf spots?”) (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026, Gauba et al., 14 Apr 2025, Wen et al., 28 Nov 2025).
- Symptom/Visual Attribute Grounding: Description and interpretation of morphological traits or visible symptoms, often requiring reasoning about lesion shapes, color gradients, distribution patterns, and plant phenology (Li et al., 7 May 2026).
- Management and Etiology QA: Recommendation of evidence-based mitigation or prevention strategies, and explanation of disease cycles or pest behavior inferred from observed symptoms.
- Higher-Order Reasoning: Chain-of-Thought (CoT) rationale tracing, including causal reasoning, counterfactual inference, and justification of management advice or identification validity (Wen et al., 28 Nov 2025, Li et al., 7 May 2026).
This taxonomy echoes the cognitive structure underlying agricultural diagnostics and supports both closed-set (classification) and open-ended (generative) QA formulations.
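To make the taxonomy concrete, the following is a minimal sketch of how a task-typed QA record might be represented during dataset construction; every field name and value here is illustrative rather than drawn from any cited dataset's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class TaskType(Enum):
    """Task categories from the taxonomy above."""
    IDENTIFICATION = "pest_disease_identification"
    SYMPTOM_GROUNDING = "symptom_attribute_grounding"
    MANAGEMENT_ETIOLOGY = "management_etiology_qa"
    HIGHER_ORDER_REASONING = "higher_order_reasoning"

@dataclass
class AgriPestVQAItem:
    """One image-grounded QA item; closed-set items carry answer options."""
    image_path: str
    question: str
    answer: str
    task_type: TaskType
    options: Optional[List[str]] = None   # present only for closed-set (MCQ) items
    rationale: Optional[str] = None       # chain-of-thought trace, if annotated
    expert_verified: bool = False

# Illustrative open-ended identification item (hypothetical values).
item = AgriPestVQAItem(
    image_path="images/leaf_0001.jpg",
    question="What is the cause of these leaf spots?",
    answer="Fungal leaf spot",
    task_type=TaskType.IDENTIFICATION,
    rationale="Circular tan lesions with dark margins scattered across the lamina.",
)
```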
2. Primary Datasets and Benchmarks
Progress in Agri-Pest VQA has been driven by several expert-verified, large-scale multimodal datasets:
| Resource | Images (n) | QAs (n) | Focus | Notable Features |
|---|---|---|---|---|
| CDDMBench | 3,000 | 20 (+diagn.) | Crop-pest/disease; QA | Strict disease classification, factored QA |
| AgMMU | 12,481 | 10,920 | MCQs & OEQs (all crops) | Real-world user-expert dialogues |
| PlantVillageVQA | 55,448 | 193,609 | Fine-grained plant health | QA cognitive levels, expert correction |
| Agri-3M-VL | 1,060,000 | 2,000,000+ | Pest/disease/all crops | Multi-agent VQA generation and refinement |
| QFSD / AgriInsect | 7,054 / 9,452 | — | Fine-grained pests (MCQ, CoT) | Morphology-annotated, chain-of-thought |
| AgriCoT | 4,535 | 4,535 | Reasoning/CoT, all crops | Explicit CoT, multi-step diagnosis |
These corpora span image-grounded QA (identification, symptom description, management), expert-generated or refined answers, and structured image–caption–QA triplets, supporting rigorous evaluation across identification, reasoning, and practical recommendation tasks (Zhang et al., 26 Apr 2026, Gauba et al., 14 Apr 2025, Yang et al., 5 Oct 2025, Li et al., 7 May 2026, Wen et al., 28 Nov 2025, Sakib et al., 23 Aug 2025).
3. Model Architectures and Explainable Reasoning Pipelines
Agri-Pest VQA pipelines integrate vision-language models (VLMs), structured grounding, and LLM-based evaluators. A prominent paradigm is the Caption–Prompt–Judge (CPJ) framework (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026), which modularizes the process into:
- Structured Caption Generation: A VLM produces an initial, objective morphological caption, conditioned on few-shot exemplars and constrained to neutrality (no premature disease or pest naming). Captions highlight features such as lesion type, distribution, color gradients, and growth stage, and are refined iteratively by an LLM-as-judge until a quality threshold (e.g., 8.0/10) is met.
- Caption-Prompted VQA: The refined caption is concatenated with the image and query, grounding the VQA prompt in explicit, audit-traceable evidence. The model then generates two complementary answers:
  - a primary answer providing the identification/classification or management plan;
  - an evidence-focused answer grounding that conclusion in explicit image evidence or etiology.
- LLM-as-Judge Selection: An LLM (typically stronger than the VQA generator) scores both answers against multi-dimensional rubrics (accuracy, factual grounding, completeness, scientific validity), selects the better one, and emits a rationale.
This architecture emphasizes human-inspectability, bias mitigation (by suppressing premature species/disease labels in captions), and iterative refinement, with all reasoning steps exposed for validation (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
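A minimal sketch of this CPJ control flow follows, assuming hypothetical `vlm_generate(image, prompt)` and `llm_judge(prompt)` callables (the latter returning a parsed dict); the prompt wording, threshold handling, and return structure are illustrative stand-ins, not the authors' released implementation.

```python
def run_cpj(image, question, vlm_generate, llm_judge, exemplars,
            quality_threshold=8.0, max_rounds=3):
    """Caption-Prompt-Judge sketch: neutral caption -> dual answers -> judged selection."""
    # 1) Structured caption generation: objective morphology only, no pest/disease names.
    caption_prompt = (
        f"{exemplars}\n"
        "Describe lesion type, distribution, color gradients, and growth stage. "
        "Do NOT name any pest or disease."
    )
    caption = vlm_generate(image, caption_prompt)

    # Iterative refinement until the LLM judge's quality score clears the threshold.
    for _ in range(max_rounds):
        review = llm_judge(
            "Score this caption 0-10 for objectivity and completeness, then list fixes.\n"
            f"Caption: {caption}"
        )
        if review["score"] >= quality_threshold:
            break
        caption = vlm_generate(image, caption_prompt + f"\nRevise per feedback: {review['feedback']}")

    # 2) Caption-prompted VQA: two complementary candidate answers.
    grounded = f"Observed evidence: {caption}\nQuestion: {question}\n"
    answer_main = vlm_generate(image, grounded + "Give the identification or management plan.")
    answer_evidence = vlm_generate(image, grounded + "Justify the conclusion from the visible evidence.")

    # 3) LLM-as-judge selection against a multi-dimensional rubric, with a rationale.
    verdict = llm_judge(
        "Rubric: accuracy, factual grounding, completeness, scientific validity.\n"
        f"{grounded}Candidate A: {answer_main}\nCandidate B: {answer_evidence}\n"
        "Select the better candidate and explain why."
    )
    return {"caption": caption, "answer": verdict["choice"], "rationale": verdict["rationale"]}
```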
Parallel lines of work utilize self-consistency prompting, chain-of-thought answer rationales, and reinforcement learning (e.g., Pest-Thinker), where a model learns to maximize a reward based on morphological feature coverage as adjudicated by a frozen LLM judge (Li et al., 7 May 2026).
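The following is a sketch of how a morphological feature-coverage reward could be computed with a frozen LLM judge, in the spirit of this reward design; the `frozen_judge` callable and the reference feature list format are assumptions for illustration, not the Pest-Thinker implementation.

```python
from typing import Callable, List

def feature_coverage_reward(rationale: str,
                            reference_features: List[str],
                            frozen_judge: Callable[[str], str]) -> float:
    """Reward in [0, 1]: fraction of expert-annotated morphological features that a
    frozen LLM judge confirms are correctly described in the generated rationale."""
    if not reference_features:
        return 0.0
    covered = 0
    for feature in reference_features:
        verdict = frozen_judge(
            f"Rationale: {rationale}\n"
            f"Does the rationale correctly describe this feature: '{feature}'? Answer yes or no."
        )
        covered += verdict.strip().lower().startswith("yes")
    return covered / len(reference_features)
```

Such a reward can then be combined with an answer-correctness term and optimized with a policy-gradient method such as GRPO (Section 4).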
4. Training Methodologies and Domain Adaptation
Robust Agri-Pest VQA models leverage curriculum training and reward-driven refinement tailored to the challenges of agricultural domains:
- Progressive Multimodal Curriculum: Initial model pretraining on domain corpora (agricultural texts, manuals), followed by shallow and deep alignment on large-scale image–caption and image–QA pairs. In AgriGPT-VL, this includes cross-modal contrastive learning, a supervised QA loss, and reinforcement via Group Relative Policy Optimization (GRPO), which aligns model outputs with expert-validated preferences (Yang et al., 5 Oct 2025); a sketch of GRPO's group-normalized advantage follows this list.
- Chain-of-Thought Supervision: Injection of stepwise diagnostic rationales—mirroring expert reasoning—directly into supervision targets. Pest-Thinker demonstrates that fine-tuning with chain-of-thought (CoT) examples and LLM-judged feature rewards significantly improves both in-domain and out-of-domain pest recognition accuracy (SFT + GRPO yields +7.7 pp in-domain and +8.5 pp out-of-domain over SFT only) (Li et al., 7 May 2026).
- Retrieval-Augmented Generation: Incorporation of external knowledge (AgBase-200K, curated symptom/treatment corpora) into the answer generation process via memory retrieval or prompt augmentation, mitigating hallucination and boosting factual accuracy (Gauba et al., 14 Apr 2025).
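The group-normalized advantage at the core of GRPO can be sketched in a few lines: each completion sampled for a prompt is scored, and its reward is standardized against the group's mean and standard deviation before entering the clipped policy-gradient update (sampling and update machinery are omitted here as a simplifying assumption).

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each completion's reward against
    the group of completions sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four completions for one prompt, scored with a combined
# feature-coverage + answer-correctness reward.
print(grpo_group_advantages([0.2, 0.6, 0.9, 0.4]))  # higher-reward samples get positive advantage
```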
Notably, key datasets such as Agri-3M-VL benefit from multi-agent sample refinement and expert curation, and benchmark results underscore significant performance gains (AgriGPT-VL achieves pest VQA accuracy of 85.8% versus baselines of 83.1%; zero-shot open-source VLMs lag proprietary models by 10–15 points on complex tasks) (Yang et al., 5 Oct 2025, Wen et al., 28 Nov 2025).
5. Evaluation Strategies and Metrics
Rigorous evaluation is conducted across closed-set accuracy, generative answer quality, multi-step reasoning, and interpretability:
- Accuracy and Consistency: Disease and pest classification accuracy, multi-answer consistency, and cross-modal coherence (strict keyword matching, Acc⁺ for correct multi-part answers) (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026, Yang et al., 5 Oct 2025, Gauba et al., 14 Apr 2025).
- Open-ended QA Metrics: F1, BLEU, METEOR, and ROUGE-L for text answers, alongside normalized human-evaluation scoring (Zhang et al., 31 Dec 2025).
- LLM-as-a-Judge and Auditability: Adoption of LLM-based evaluation to adjudicate factuality, completeness, and practicality. Reports itemize which criteria tipped the answer's selection, providing both numerical scores and qualitative rationales (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025).
- Chain-of-Thought Quality: ROUGE-L F1 and BERTScore F1 are used to quantify semantic and sequential fidelity of generated reasoning chains relative to expert-annotated CoT references (Wen et al., 28 Nov 2025).
- Expert Agreement: Direct comparison against human judgments; e.g., CPJ frameworks report 94.2% agreement (Cohen’s κ = 0.88) with human experts (Zhang et al., 31 Dec 2025).
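As a concrete illustration of the closed-set accuracy and agreement statistics above, the sketch below implements strict-match accuracy and Cohen's κ from their standard definitions; in practice a library implementation (e.g., scikit-learn's cohen_kappa_score) would typically be used.

```python
from collections import Counter
from typing import List

def strict_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Closed-set accuracy under strict (case-insensitive, exact) label matching."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# e.g., agreement between LLM-judge selections and expert selections on the same items.
```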
Error analysis reveals dominant failure modes: knowledge gap (misidentification of visually similar species), perceptual errors under challenging imaging conditions, incomplete CoT reasoning, and over-reliance on generic recommendations. These findings guide further refinement of both data and models (Gauba et al., 14 Apr 2025, Wen et al., 28 Nov 2025).
6. Interpretability, Transparency, and Best Practices
Interpretability is central to Agri-Pest VQA due to the high stakes of agricultural decision-making. Frameworks such as CPJ and Pest-Thinker expose:
- Evidence Trails: Neutral, objective captions and explicit CoT rationales trace the line from symptom observation through reasoning to final answer, supporting auditable, corrigible output (Zhang et al., 31 Dec 2025, Li et al., 7 May 2026).
- Audit Trails: Judge rationales, caption content, and candidate answers are surfaced in structured outputs. Practitioners can pinpoint error origins—caption observation, answer logic, or judge selection—for targeted correction or override (Zhang et al., 26 Apr 2026).
- Bias Mitigation: Suppression of premature species/disease references in early captioning stages avoids confirmation bias and model-induced hallucination (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
- Chain-of-Thought Exposure: Stepwise reasoning is made available for expert review, supporting both diagnosis validation and rapid improvement through iterative feedback (Wen et al., 28 Nov 2025).
Best practice recommendations include two-stage template and expert-informed dataset generation, few-shot CoT prompt engineering, knowledge base integration via retriever mechanisms, group-normalized RL for stability, and the use of expert-in-the-loop workflows for ongoing dataset and model validation (Sakib et al., 23 Aug 2025, Yang et al., 5 Oct 2025, Li et al., 7 May 2026, Zhang et al., 26 Apr 2026).
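One of these practices, few-shot CoT prompting over a neutral caption with optional retrieved knowledge, can be sketched as a simple prompt builder; the exemplar format and field names are illustrative assumptions rather than any cited system's template.

```python
from typing import Dict, List, Optional

def build_cot_prompt(caption: str, question: str,
                     exemplars: List[Dict[str, str]],
                     retrieved_facts: Optional[List[str]] = None) -> str:
    """Assemble a few-shot chain-of-thought prompt grounded in a neutral caption,
    optionally augmented with retrieved knowledge-base snippets."""
    parts = []
    for ex in exemplars:  # each exemplar: {"caption", "question", "reasoning", "answer"}
        parts.append(
            f"Observation: {ex['caption']}\nQuestion: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}\n"
        )
    if retrieved_facts:
        parts.append("Reference knowledge:\n" +
                     "\n".join(f"- {fact}" for fact in retrieved_facts) + "\n")
    parts.append(f"Observation: {caption}\nQuestion: {question}\nReasoning:")
    return "\n".join(parts)
```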
7. Open Challenges and Future Directions
Despite empirical advances, several fundamental challenges persist:
- Domain Shift Robustness: Models trained on canonical datasets may degrade under unseen field conditions, rare pests, or novel symptom co-occurrences, necessitating continual domain adaptation and expanded phenotype coverage (Zhang et al., 31 Dec 2025).
- Visual Ambiguity and Confusion: Fine-grained pest/disease differentiation remains difficult, especially under image occlusions or poor lighting. Integrating targeted detectors or segmentation modules is proposed to address this (Wen et al., 28 Nov 2025, Gauba et al., 14 Apr 2025).
- Reasoning Gaps: Even state-of-the-art VLMs trail expert-level reasoning by 10–15 points in answers and more in CoT explication, highlighting the need for larger, more diverse CoT datasets and advanced reasoning architectures (Wen et al., 28 Nov 2025).
- Auditability in Deployment: Ensuring that reasoning chains and evidence trails are both accessible and meaningful to practitioners, especially for high-throughput or real-time field applications, remains an implementation hurdle.
A plausible implication is that further integration of domain-specific detectors, retrieval-augmented LLMs, and large-scale CoT supervision can close the expert-model reasoning gap while maintaining interpretability. Advances in modular pipeline composition (e.g., CPJ, Pest-Thinker) and cross-disciplinary evaluation protocols are likely to drive progress toward trustworthy, field-ready Agri-Pest VQA systems (Zhang et al., 31 Dec 2025, Wen et al., 28 Nov 2025, Li et al., 7 May 2026).