
Caption–Prompt–Judge (CPJ) Framework

Updated 5 January 2026
  • CPJ is a multimodal reasoning paradigm that decomposes vision-language tasks into captioning, prompting, and judge evaluation for increased transparency and performance.
  • It employs an iterative process where the judge model refines captions and VQA responses, thereby reducing hallucinations and ensuring adherence to explicit criteria.
  • Originally applied in agricultural pest diagnosis, CPJ has demonstrated significant improvements in classification and QA scores under data-scarce conditions.

The Caption–Prompt–Judge (CPJ) framework is a structured paradigm for integrating vision-language modeling components—captioning, prompting, and evaluation by a judge model—to improve explainability, robustness, and performance in vision-language tasks, particularly under domain shift and in settings with scarce labeled data. Originally deployed for explainable agricultural pest diagnosis (Zhang et al., 31 Dec 2025), the CPJ paradigm generalizes to open-set visual question answering, image captioning, text-to-image evaluation, and beyond.

1. Definition and Architectural Overview

CPJ decomposes multimodal reasoning into three serial modules:

  1. Caption: A vision-language model (VLM) generates explicit, structured captions describing salient visual observations from the input image.
  2. Prompt: The caption, image, and task-specific natural-language prompt are concatenated to form the input for the downstream VQA or generation module.
  3. Judge: An LLM-as-Judge evaluates either generated captions or VQA answers (often both) via explicit, multi-dimensional criteria such as factuality, completeness, neutrality, and actionable value.

This architecture is typically training-free or uses few-shot prompting, leveraging powerful frozen backbones (e.g., GPT-5-mini, Qwen2.5-VL-72B) and isolating each module for control and interpretability. CPJ's workflow supports iterative refinement: the Judge can instruct the Captioner to revise outputs until quality thresholds are met.
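
A minimal sketch of this control flow is given below, assuming hypothetical caption_model, vqa_model, and judge wrappers around frozen, API-accessed backbones (the names and interfaces are illustrative, not taken from the paper):

def cpj_pipeline(image, question, caption_model, vqa_model, judge, tau=8.0):
    # 1. Caption: produce a structured, neutral description of the image,
    #    iterating until the judge's score meets the acceptance threshold tau.
    caption = caption_model.describe(image)
    while judge.score_caption(caption) < tau:
        feedback = judge.refine_instructions(caption)
        caption = caption_model.describe(image, feedback)

    # 2. Prompt: fuse image, accepted caption, and task instruction into the VQA input.
    vqa_input = {"image": image, "caption": caption, "question": question}

    # 3. Judge: score candidate answers on explicit criteria and keep the best one.
    candidates = vqa_model.answer(vqa_input, n_drafts=2)
    best = max(candidates, key=judge.score_answer)
    return best, caption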

2. Motivation and Design Principles

CPJ was introduced to address several challenges in domains requiring robust, explainable visual diagnosis:

  • Opacity of black-box models: Traditional CNN-based detectors and transformers (YOLO, ViT) output only class labels with no evidentiary support and are prone to unreliable inference under domain shift.
  • High annotation and fine-tuning costs: Domain transfer for specialized VQA (e.g., in agriculture) demands significant labeled data and supervised adaptation, yielding models that are still opaque.
  • Missed subtlety in VLMs: Generic open-domain VLMs, when queried for answers directly, fail to attend to domain-specific, fine-grained cues without explicit guidance.

CPJ inserts a structured explanation layer—forcing neutral, human-readable captions—between raw perception and VQA, and employs an LLM-Judge for self-consistent iterative improvement, annotation debiasing, and answer selection (Zhang et al., 31 Dec 2025, Hu et al., 2022).

3. Caption Module: Multi-Angle, Debiased Caption Generation

The Caption module in CPJ generates “explanational” captions intended to act as evidence chains. Key design elements (Zhang et al., 31 Dec 2025) include:

  • Multi-angle coverage: Captions must jointly describe plant morphology, symptom characteristics (e.g., lesion shapes, necrosis), severity, and uncertainties. Prompts explicitly forbid naming crops/diseases, ensuring observational neutrality.
  • Iterative LLM-as-Judge selection: A strong LLM scores each candidate on accuracy, completeness, and neutrality, producing a score $s(C)$, and accepts the caption iff $s(C) \geq \tau$ (e.g., $\tau = 8.0$). Otherwise, the LLM issues refinement instructions, and the Caption module regenerates until convergence:

$$C^* = \begin{cases} C_0, & \text{if } s(C_0) \geq \tau \\ M_{VLM}(I, R(C_0)), & \text{otherwise} \end{cases}$$

  • Pseudocode for this loop:

def GenerateCaption(I, P_few, tau):
    # Generate an initial caption from the image I and few-shot prompt P_few.
    C = VLM.generate(I, P_few)
    score = JudgeLLM.score_caption(C)
    # Iteratively refine until the judge's score meets the threshold tau.
    while score < tau:
        R = JudgeLLM.refine_instructions(C)   # judge-issued revision guidance
        C = VLM.generate(I, R)
        score = JudgeLLM.score_caption(C)
    return C

This procedure enforces consistency, minimizes hallucinations, and grounds downstream answers in observable evidence.

4. Prompt and Answer Modules: VQA Input Construction and Dual-Answer Generation

  • Prompt construction: The refined caption $C^*$ is concatenated with the image and a task instruction (e.g., “Identify crop and disease”, “Suggest actionable management”) to form the VQA input $X = (I, C^*, Q)$. Carefully templated prompts are used to elicit concise, fact-based recognition or farmer-oriented, pragmatic treatment recommendations.
  • Dual-answer VQA: For each $X$, the VQA model produces two complementary drafts:
    • $A^1$: Recognition (classification, diagnostic explanation)
    • $A^2$: Knowledge/management (treatment, prevention, life cycle)
    • Candidate answers are then scored by the Judge using a weighted average over explicit criteria $\Omega$ (plant accuracy, disease accuracy, completeness, format adherence, specificity, practicality):

$$\text{Score}(A) = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} g_\omega(A, A_{ref})$$

The best answer $A^*$ is selected as $\arg\max_{A \in \{A^1, A^2\}} \text{Score}(A)$, and a rationale report is returned.
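
A minimal sketch of the dual-answer stage is shown below, assuming hypothetical prompt templates and a vqa_model wrapper (neither taken from the paper); the two drafts it returns would then be passed to the Judge's selection routine in Section 5:

RECOGNITION_TEMPLATE = (
    "Observations: {caption}\n"
    "Task: Identify the crop and the disease, citing only the observations above."
)
MANAGEMENT_TEMPLATE = (
    "Observations: {caption}\n"
    "Task: Suggest actionable, farmer-oriented treatment and prevention steps."
)

def generate_dual_answers(image, caption, vqa_model):
    # A^1: recognition draft (classification plus diagnostic explanation).
    a1 = vqa_model.answer(image, RECOGNITION_TEMPLATE.format(caption=caption))
    # A^2: knowledge/management draft (treatment, prevention, disease life cycle).
    a2 = vqa_model.answer(image, MANAGEMENT_TEMPLATE.format(caption=caption))
    return [a1, a2]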

5. Judge Module: Multi-Criteria Evaluation and Iterative Refinement

The Judge serves two tightly coupled functions:

  • Caption refinement: Iteratively evaluates candidate captions, directing revisions until quality meets threshold $\tau$.
  • Answer selection: For dual VQA outputs, enforces rigorous selection via explicit criteria, filtering hallucinations, enforcing format, and issuing human-readable justifications for each selection.

The selection process is formalized:

def SelectBestAnswer(A_list, A_ref, Omega):
    # Score each candidate answer against the reference A_ref on every
    # criterion in Omega and return the highest-scoring answer with its score.
    best_score = float("-inf")
    best_answer = None
    for A in A_list:
        total = 0
        for omega in Omega:
            total += JudgeLLM.score_criterion(omega, A, A_ref)
        score = total / len(Omega)   # average over all criteria
        if score > best_score:
            best_score = score
            best_answer = A
    return best_answer, best_score

No gradient-based supervision is used: all models are accessed via API calls. Metrics are keyword-matching accuracy for classification and LLM-normalized scores for QA (10-point scale per criterion, aggregated to 100).
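
A small sketch of how these metrics could be computed, assuming simple substring keyword matching and a rubric of per-criterion 10-point scores (both details are assumptions, not the paper's exact protocol):

def classification_accuracy(predictions, references):
    # Keyword-matching accuracy: a prediction counts as correct if the reference
    # disease name appears (case-insensitively) in the model output.
    hits = sum(ref.lower() in pred.lower() for pred, ref in zip(predictions, references))
    return 100.0 * hits / len(references)

def qa_score(criterion_scores):
    # Average the judge's per-criterion 10-point scores and rescale to 100.
    return 10.0 * sum(criterion_scores) / len(criterion_scores)

# Example: six criteria each scored 0-10 by the judge.
print(qa_score([9, 8, 8, 9, 8, 9]))  # -> 85.0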

6. Empirical Results and Ablation Studies

On CDDMBench, CPJ with GPT-5-mini (captioner) and GPT-5-Nano (VQA, Judge) demonstrates substantial empirical gains (Zhang et al., 31 Dec 2025):

  • Disease classification: +22.7 percentage points (33.7% vs. 11.0% no-caption baseline)
  • QA score: +19.5 points (84.5 vs. 65 baseline)
  • Ablation: Unoptimized captions offer moderate boosts (+6–19 pp in classification); iterative refinement and judge-based answer selection each contribute +1–3 pp separately.

Method                 Disease Cls (%)   QA Score
No captions            11.0              65
+Captions (no judge)   31.6              84
+Few-shot+Judge        33.7              84.5

Iterative judge-guided caption refinement is essential for filtering hallucinations and enforcing evidentiary chains.

7. Explainability, Traceability, and Limitations

The CPJ design enforces a transparent evidence chain for every output:

  • Step 1: Image $I$ → structured caption $C^*$
  • Step 2: $(I, C^*, Q)$ → dual answer drafts ($A^1$, $A^2$)
  • Step 3: Judge LLM scores answers, selects $A^*$, and returns justification.

Worked example for bacterial leaf spot:

  • Caption: “Necrotic circular lesions ~3 mm diameter surrounded by diffuse yellow halos; slight leaf curling; background light glare.”
  • Candidates for recognition prompt:
    • $A^{1.1}$: “Pepper; bacterial leaf spot; symptoms as described.” (score: 0.93)
    • $A^{1.2}$: “Bell pepper; early bacterial blight; concentric ring pattern.” (score: 0.85)
    • Judgement and rationale are provided per answer; a possible record structure for this trace is sketched below.
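
One possible record structure for the per-query evidence chain, with field names that are illustrative rather than taken from the paper:

from dataclasses import dataclass
from typing import List

@dataclass
class CPJTrace:
    caption: str            # accepted structured caption C*
    candidates: List[str]   # dual answer drafts A^1, A^2
    scores: List[float]     # judge score per candidate
    selected: str           # best answer A*
    rationale: str          # human-readable justification from the Judge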

Limitations include susceptibility to image quality artifacts, judge verbosity bias, and inability to model disease progression or integrate multi-view/temporal evidence. Extensions to external knowledge grounding and uncertainty calibration are proposed as future work.

8. Broader Context, Relation to Other CPJ Pipelines, and Outlook

CPJ's structure has influenced general VQA pipelines (Hu et al., 2022), captioning for machine translation (Betala et al., 10 Nov 2025), and automated judgment calibration (Slyman et al., 10 Sep 2025), as well as prompt-based unpaired captioning (Zhu et al., 2022) and retrieval-based prompt construction (Gu et al., 10 Aug 2025).

  • PromptCap (Hu et al., 2022) applies CPJ to knowledge-based VQA, demonstrating that controlled caption prompts targeting entities needed for question answering drive higher end-task accuracy.
  • Vision-guided judge-corrector pipelines extend CPJ to multimodal machine translation error correction, routing ambiguous instances to LLM correctors or retranslation (Betala et al., 10 Nov 2025).
  • Calibration of multimodal LLM judges via Bayesian prompt ensembling is critical for trustworthy, content-aware evaluation within CPJ workflows (Slyman et al., 10 Sep 2025).
  • CPJ-style pipelines have been shown to deliver SOTA results under resource constraints and minimal supervision, validating their potential for deployment in domains beyond agriculture, including low-resource language translation and open-domain VQA (Zhang et al., 31 Dec 2025, Hu et al., 2022, Betala et al., 10 Nov 2025).

CPJ establishes a paradigm wherein explainability, debiasing, and performance are advanced through modular decomposition of multimodal reasoning and systematic LLM-based adjudication. Its abstractions—structured captioning, task-aware prompting, and judge-driven multi-criteria evaluation—are extensible to emerging VLM architectures and downstream multimodal tasks.
