
CPJ Framework for Pest & Disease Diagnosis

Updated 12 May 2026
  • The CPJ framework is a modular system that decomposes diagnosis into captioning, prompting, and judging stages, keeping each reasoning step explicit and auditable.
  • It leverages few-shot prompts and iterative caption refinement to extract unbiased, morphology-focused descriptions from agricultural images.
  • The streamlined pipeline significantly improves diagnostic accuracy, expert audit alignment, and overall explainability on agricultural VQA benchmarks.

The Caption–Prompt–Judge (CPJ) framework is a training-free, few-shot architecture for interpretable agricultural pest and disease diagnosis from images. CPJ decomposes the diagnostic workflow into three sequential and auditable stages: morphology-driven captioning, prompt-based answer generation, and LLM-based answer selection. Unlike traditional end-to-end vision-language pipelines, CPJ externalizes all intermediate reasoning steps, enabling domain experts to inspect, correct, or override system components without updating model parameters. This approach yields substantial improvements in both accuracy and explainability on challenging agricultural VQA benchmarks (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).

1. Pipeline Structure and Theoretical Foundation

CPJ operates as a modular, stateless API pipeline comprising three stages:

  1. Caption (Generative Explanational Captioning): A large vision-language model (VLM) extracts structured, multi-angle, morphology-focused captions from input images. A lightweight LLM judge scores each caption along multiple quality dimensions, and the caption is refined iteratively until a threshold score is met.
  2. Prompt (Caption-Guided VQA): The refined caption, the original image, and the diagnostic question are input to a VQA model using a few-shot, dual-view prompt. The model generates two complementary candidate answers, targeting distinct reasoning perspectives (e.g., disease recognition vs. crop focus, or management vs. etiology).
  3. Judge (LLM Answer Selection): A more capable LLM evaluates candidate answers using a domain-specific rubric and selects the final output, providing detailed per-criterion scores and a textual justification.

This separation of perceptual observation, domain-specific reasoning, and decision-level quality control addresses both the accuracy–interpretability trade-off and the susceptibility of prior methods to hallucination and opacity.
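
The three-stage separation described above can be pictured as a plain function composition. The following is a minimal sketch, not the authors' implementation: `run_cpj`, `CPJTrace`, and the `demo_*` stand-in stages are hypothetical names, and real stages would call a VLM or LLM API instead of returning fixed strings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CPJTrace:
    """Everything one pipeline call exposes for auditing."""
    caption: str
    candidates: Tuple[str, str]
    selected_index: int
    rationale: str

    @property
    def answer(self) -> str:
        return self.candidates[self.selected_index]


def run_cpj(image, question, captioner, answerer, judge) -> CPJTrace:
    caption = captioner(image)                               # Stage 1: Caption
    candidates = answerer(image, caption, question)          # Stage 2: Prompt
    selected_index, rationale = judge(question, candidates)  # Stage 3: Judge
    return CPJTrace(caption, candidates, selected_index, rationale)


# Hypothetical stand-in stages with fixed outputs, for illustration only.
def demo_captioner(image):
    return "Oval brown lesions with yellow halos on a lobed leaf."


def demo_answerer(image, caption, question):
    return ("Answer focused on the disease.", "Answer focused on the crop.")


def demo_judge(question, candidates):
    return 0, "Answer 1 addresses the question more directly."


trace = run_cpj(b"raw image bytes", "What is wrong with this leaf?",
                demo_captioner, demo_answerer, demo_judge)
print(trace.answer)
```

Because each stage is passed in as a callable and the trace keeps every intermediate artifact, a reviewer could swap or override one stage without touching the others.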

2. Stage 1: Generative and Refined Morphological Captioning

CPJ’s captioning module emphasizes unbiased, explicit observation. Given an image $I$ and a few-shot prompt $P_{\text{few}}$ containing 2–3 carefully composed (image, caption) exemplars, the model produces an initial caption: $C_0 = \mathcal{VLM}(I, P_{\text{few}})$

Captions are constrained to exclude crop or disease names, focusing instead on descriptors such as leaf morphology, lesion geometry and color, and observed ambiguity. Multi-dimensional quality assessment follows, scoring each draft caption along axes such as:

  • Accuracy: Correctness of described visual features
  • Completeness: Breadth of symptom angles covered
  • Specificity: Level of detail and quantitative precision
  • Relevance and Clarity

The LLM judge calculates an aggregate quality score: $s(C) = \frac{1}{k} \sum_{i=1}^{k} w_i d_i(C),\quad 0 \leq s(C) \leq 1$, where $d_i(C)$ are normalized dimension scores and $w_i$ are their weights.
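
As a small worked example of the aggregate score, the sketch below averages weighted dimension scores over $k$ dimensions. The dimension names follow the list above; the numeric scores and the unit weights are hypothetical values chosen for illustration, not figures from the papers.

```python
from typing import Dict


def aggregate_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of normalized dimension scores, divided by the number of dimensions k."""
    k = len(dim_scores)
    return sum(weights[name] * d for name, d in dim_scores.items()) / k


# Hypothetical judge output for one draft caption (each score in [0, 1]).
dim_scores = {
    "accuracy": 0.9,
    "completeness": 0.8,
    "specificity": 0.7,
    "relevance_clarity": 0.8,
}
# Unit weights are an assumption; they reduce the score to a plain mean.
weights = {name: 1.0 for name in dim_scores}

score = aggregate_score(dim_scores, weights)
print(round(score, 3))
```

With unit weights and scores in [0, 1], the result stays in [0, 1]; other weightings would need $w_i \leq 1$ (or a renormalization) to preserve that bound.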

If $s(C_0) < \tau$ (empirically, $\tau \approx 0.8$), a model-generated, dimension-targeted critique $R(C_0)$ guides refinement. The loop runs until an optimized caption $C^*$ is obtained or a maximum number of iterations is reached (Zhang et al., 26 Apr 2026).
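
The score-critique-revise loop can be sketched as follows. This is an illustrative outline under stated assumptions: `refine_caption` and the toy `score`, `critique`, and `revise` functions are hypothetical, the threshold default mirrors the reported value of about 0.8, and the iteration cap `max_iters=3` is a placeholder chosen for the example rather than a value taken from the source.

```python
from typing import Callable, Tuple


def refine_caption(
    initial: str,
    score: Callable[[str], float],
    critique: Callable[[str], str],
    revise: Callable[[str, str], str],
    tau: float = 0.8,
    max_iters: int = 3,  # assumed cap, for illustration only
) -> Tuple[str, float, int]:
    """Revise the caption until its score reaches tau or the cap is hit."""
    caption = initial
    s = score(caption)
    iters = 0
    while s < tau and iters < max_iters:
        feedback = critique(caption)         # dimension-targeted critique
        caption = revise(caption, feedback)  # new draft guided by the critique
        s = score(caption)
        iters += 1
    return caption, s, iters


# Toy stand-ins: each revision appends a marker and raises the score by 0.2.
def toy_score(caption: str) -> float:
    return min(1.0, 0.5 + 0.2 * caption.count("[refined]"))


def toy_critique(caption: str) -> str:
    return "Describe lesion geometry in more detail."


def toy_revise(caption: str, feedback: str) -> str:
    return caption + " [refined]"


final_caption, final_score, n_iters = refine_caption(
    "Brown spots on a green leaf.", toy_score, toy_critique, toy_revise
)
print(n_iters, round(final_score, 2))
```

Returning the iteration count alongside the caption makes it easy to log how often refinement was actually needed.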

3. Stage 2: Dual-Answer, Few-Shot Prompted VQA

This stage generates two candidate answers conditioned on distinct interpretative viewpoints. The background prompt concatenates the optimized caption $C^*$, the question, and 2–3 few-shot exemplars demonstrating the enforced dual-answer structure.

For recognition tasks:

  • Answer 1: Disease/pest identification (symptom features, pathogen class, severity)
  • Answer 2: Crop identification (morphological markers, species/variety)

For management/knowledge QA:

  • Viewpoint 1: Treatment protocols and practical actions
  • Viewpoint 2: Etiology or scientific basis

This construction encourages the VQA model (e.g., GPT-5-Nano, Qwen-VL-Chat) to surface complementary rationales rather than collapsing both facets into a singular response. The prompt format and structure ensure alignment with the rubric used in subsequent evaluation (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
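
A dual-view prompt of this kind might be assembled as in the sketch below. The viewpoint labels come from the lists above; the function name `build_dual_prompt`, the instruction wording, and the placeholder exemplar are all hypothetical, since the source does not give the exact prompt text.

```python
from typing import List, Tuple

# Viewpoint pairs per task type, taken from the descriptions above.
VIEWPOINTS = {
    "recognition": (
        "Disease/pest identification (symptom features, pathogen class, severity)",
        "Crop identification (morphological markers, species/variety)",
    ),
    "knowledge": (
        "Treatment protocols and practical actions",
        "Etiology or scientific basis",
    ),
}


def build_dual_prompt(
    caption: str,
    question: str,
    task: str,
    exemplars: List[Tuple[str, str, str, str]],
) -> str:
    """Concatenate instructions, few-shot exemplars, the caption, and the question."""
    view1, view2 = VIEWPOINTS[task]
    lines = [
        "Use the image and the morphology caption to answer the question.",
        "Give exactly two answers, one per viewpoint.",
        f"Answer 1 viewpoint: {view1}",
        f"Answer 2 viewpoint: {view2}",
        "",
    ]
    for ex_caption, ex_question, ex_a1, ex_a2 in exemplars:
        lines += [
            f"Caption: {ex_caption}",
            f"Question: {ex_question}",
            f"Answer 1: {ex_a1}",
            f"Answer 2: {ex_a2}",
            "",
        ]
    lines += [f"Caption: {caption}", f"Question: {question}", "Answer 1:"]
    return "\n".join(lines)


# Placeholder exemplar text; a real prompt would use 2-3 curated examples.
exemplars = [("<exemplar caption>", "<exemplar question>",
              "<exemplar answer 1>", "<exemplar answer 2>")]
prompt = build_dual_prompt(
    "Oval brown lesions with yellow halos on a lobed leaf.",
    "What is affecting this plant?",
    "recognition",
    exemplars,
)
print(prompt)
```

Ending the prompt on `Answer 1:` nudges the model to continue in the same two-answer layout shown by the exemplars.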

4. Stage 3: LLM-Based Answer Selection and Auditing

CPJ’s final stage implements domain-specific, pairwise answer evaluation via a powerful LLM judge. Each candidate answer is scored on a vector of criteria, where the criterion set is contingent on task type (e.g., plant identity accuracy, disease class accuracy, symptom description precision, adherence to format, completeness, specificity, actionability). Each criterion score is typically valued at 0 (absent), 0.5 (partial), or 1 (fully correct). Ties are broken using task-dependent preferences (e.g., favoring biological specificity in recognition or actionable advice in knowledge QA).

The judge outputs a JSON object reporting:

  • The selected answer index
  • Scores for all criteria
  • A concise, human-readable rationale

This explicit externalization facilitates downstream auditing, transparent error localization, and cross-verifiability with human experts (Zhang et al., 26 Apr 2026).
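
One way an auditor might consume such a verdict is to parse the JSON, check that every score is 0, 0.5, or 1, and recompute the winner independently. The JSON field names (`selected`, `scores`, `rationale`), the criterion names, and the sum-then-tie-break rule below are assumptions made for this sketch; the source does not specify the exact schema or aggregation.

```python
import json
from typing import Dict, List

ALLOWED_SCORES = {0, 0.5, 1}


def select_answer(scores: List[Dict[str, float]], tie_break: str) -> int:
    """Pick the answer with the higher total; on a tie, prefer the tie-break criterion."""
    totals = [sum(per_answer.values()) for per_answer in scores]
    if totals[0] != totals[1]:
        return 0 if totals[0] > totals[1] else 1
    return 0 if scores[0][tie_break] >= scores[1][tie_break] else 1


def audit_verdict(raw: str, tie_break: str) -> dict:
    """Validate a judge verdict and compare it with a recomputed selection."""
    verdict = json.loads(raw)
    for per_answer in verdict["scores"]:
        for name, value in per_answer.items():
            if value not in ALLOWED_SCORES:
                raise ValueError(f"unexpected score {value!r} for {name}")
    recomputed = select_answer(verdict["scores"], tie_break)
    return {
        "selected": verdict["selected"],
        "recomputed": recomputed,
        "consistent": verdict["selected"] == recomputed,
        "rationale": verdict["rationale"],
    }


# Hypothetical verdict: both answers total 2.0, so the tie-break decides.
raw_verdict = json.dumps({
    "selected": 0,
    "scores": [
        {"disease_class_accuracy": 1, "symptom_precision": 0.5, "specificity": 0.5},
        {"disease_class_accuracy": 0.5, "symptom_precision": 1, "specificity": 0.5},
    ],
    "rationale": "Answer 1 names the disease class more precisely.",
})
report = audit_verdict(raw_verdict, tie_break="disease_class_accuracy")
print(report["consistent"])
```

A mismatch between `selected` and `recomputed` would flag a verdict for human review instead of silently overriding the judge.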

5. Quantitative Evaluation on Agricultural Benchmarks

CPJ is validated on CDDMBench (open-ended crop/disease classification + knowledge QA) and AgMMU-MCQs (multiple-choice, five sub-tasks), with the following core findings:

| Model | Configuration | Crop % | Disease % | QA Score |
|---|---|---|---|---|
| GPT-5-Nano | Zero-shot, no caption | 47.00 | 11.00 | 65.0 |
| | + Optimized caption (GPT-5-mini) | 60.30 | 31.60 | 84.0 |
| | + Few-shot prompting | 58.90 | 29.80 | 76.0 |
| | + LLM-Judge | 63.38 | 33.70 | 84.5 |

The cumulative effect of CPJ components yields a +22.7 percentage point increment in disease classification and +19.5 in QA score relative to the no-caption baseline when GPT-5-Nano is paired with GPT-5-mini-generated captions. Similar qualitative gains occur for Qwen-VL-Chat. Ablation studies confirm that skipping caption refinement consistently degrades accuracy. Notably, the jump in species recognition with optimized captions (e.g., reaching 92.55% in AgMMU) highlights their essential role for high-morphology tasks (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
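
The quoted gains follow directly from the first and last rows of the table. The short check below recomputes them; the dictionary names are arbitrary, and the numbers are copied from the table above.

```python
# First row (zero-shot, no caption) and last row (+ LLM-Judge) of the table.
baseline = {"crop": 47.00, "disease": 11.00, "qa": 65.0}
full_cpj = {"crop": 63.38, "disease": 33.70, "qa": 84.5}

# Absolute gains, rounded to two decimals to hide floating-point noise.
gains = {key: round(full_cpj[key] - baseline[key], 2) for key in baseline}
print(gains)
```

The disease and QA entries reproduce the +22.7 and +19.5 figures; the same subtraction gives the crop-classification gain, which the text does not quote.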

6. Interpretability, Auditability, and Domain Robustness

CPJ’s outputs are fully auditable—each pipeline invocation emits the quality-scored caption, two candidate answers with rubric-level granularity, and the LLM judge’s textual justification. Practitioners can immediately validate whether the perceptual extraction matches their field experience; errors can be traced to individual caption sentences or rubric categories, rather than opaque neural activations.

Under domain shift (e.g., different illumination, camera type, unfamiliar crop regions), the training-free, criteria-driven architecture maintains caption quality and answer accuracy with <5% relative loss, owing in part to the external explicitness and reusability of the observation log. The LLM judge reduces hallucination, and human expert audits report 94.2% agreement with the judge’s selection (with Cohen’s $\kappa$ as the accompanying chance-corrected statistic), indicating strong alignment (Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).
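
For readers unfamiliar with the two agreement statistics, the sketch below computes raw percent agreement and Cohen's kappa for two raters. The ten judge and expert selections are toy data invented for illustration; they are not the audit data behind the reported figures.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of items on which the two raters chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:  # both raters always use one identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data: index of the answer picked by the judge and by a human expert.
judge_picks = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
expert_picks = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]

agreement = percent_agreement(judge_picks, expert_picks)
kappa = cohens_kappa(judge_picks, expert_picks)
print(round(agreement, 2), round(kappa, 2))
```

Kappa is lower than raw agreement whenever some matches would be expected by chance, which is why the two numbers are usually reported together.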

7. Implementation and Practical Considerations

All CPJ modules operate via stateless API calls using off-the-shelf VLMs and LLMs (Qwen2.5-VL-72B-Instruct, GPT-5-mini, Qwen-VL-Chat, GPT-5-Nano, GPT-5 LLM-Judge). Hyperparameters such as temperature (0.5), top_p (0.8), and max_tokens (400) are fixed empirically. On average, 1.8 caption–refinement iterations are required to reach quality threshold per sample. Pipeline orchestration is managed by frameworks like LangChain. Since captions can be cached for multiple follow-up questions per image, amortized API utilization remains efficient for practical deployment. The design enables practitioners to run high-stakes diagnostic workflows without fine-tuning, GPU infrastructure, or exposure to model internals (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025).
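
Caption reuse across follow-up questions could be implemented with a simple content-addressed cache, sketched below. The class `CaptionCache` and the `fake_captioner` stand-in are hypothetical; only the three generation parameters are taken from the text, and a real captioner would wrap a VLM API call.

```python
import hashlib
from typing import Callable, Dict

# Decoding parameters as stated in the text.
GENERATION_PARAMS = {"temperature": 0.5, "top_p": 0.8, "max_tokens": 400}


class CaptionCache:
    """Caches one refined caption per distinct image, keyed by content hash."""

    def __init__(self, captioner: Callable[..., str]) -> None:
        self._captioner = captioner
        self._store: Dict[str, str] = {}
        self.api_calls = 0

    def get(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = self._captioner(image_bytes, **GENERATION_PARAMS)
            self.api_calls += 1
        return self._store[key]


# Hypothetical stand-in for the captioning API call.
def fake_captioner(image_bytes: bytes, temperature: float, top_p: float,
                   max_tokens: int) -> str:
    return f"caption for {len(image_bytes)} bytes"


cache = CaptionCache(fake_captioner)
first = cache.get(b"leaf-image")
second = cache.get(b"leaf-image")   # served from the cache, no new call
other = cache.get(b"another-leaf")  # different image, one more call
print(cache.api_calls)
```

Keying on a hash of the image bytes means the same photo asked about several times costs one captioning call, which is the amortization the paragraph describes.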
