Caption–Prompt–Judge Paradigm

Updated 12 May 2026

The CPJ paradigm is a modular, multi-stage framework that decomposes AI tasks into structured captioning, prompt generation, and LLM-based judging.
It operationalizes transparency and auditability by exposing explicit intermediate reasoning steps through standardized textual artifacts.
Empirical results in text-to-image and agricultural VQA tasks demonstrate significant gains in performance and evaluation robustness.

The Caption–Prompt–Judge (CPJ) paradigm is a modular, multi-stage framework designed to decompose generative, diagnostic, or evaluative AI tasks into three inspectable phases: structured captioning, prompt-based candidate generation, and automated answer selection via an LLM-based judge. Originating in the context of both text-to-image generation and interpretable agricultural visual question answering (VQA), CPJ operationalizes transparency, controllability, and auditability through explicit intermediate artifacts and a dataflow that exposes each reasoning step for human inspection. The cited work applies it in explainable AI pipelines to support rigorous evaluation, improved prompt adherence, and domain-specific audit trails (Merchant et al., 7 Jul 2025, Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026).

1. CPJ Framework Structure and Workflow

CPJ pipelines are defined by a sequential tripartite structure:

Caption: A large vision-language model (VLM) or LLM generates a structured, multi-faceted textual description (caption) of the input (e.g., image or scenario), typically using a canonical template to ensure coverage, comprehensiveness, and semantic disentanglement. For example, in Re-LAION-Caption 19M, the caption template is fixed as four bullet-pointed sentences detailing subjects, setting, aesthetics, and camera perspective; fixing the sentence order means the model need not learn invariance to meaning-preserving permutations (Merchant et al., 7 Jul 2025). In agricultural diagnosis, morphological captions are produced, describing architecture, lesion morphology, and uncertainty, while withholding diagnostic labels (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025).
Prompt: The structured caption forms the core content of a purpose-built prompt, which guides a candidate generative or VQA model. Prompt construction uses dual-viewpoint prompting (e.g., disease-focused and crop-focused answers) together with context-dependent rules that determine the type and focus of the response (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025).
Judge: Each candidate output (answer, generated sample) is evaluated by an LLM-based judge using explicit, multi-criteria rubrics relevant to the domain task (factual correctness, completeness, specificity, format adherence, etc.). The judge selects the best candidate (or, in some variants, scores alignment), emitting a machine-readable selection and a human-auditable rationale. In text-to-image pipelines, a VQA-based judge quantifies text-image alignment (Merchant et al., 7 Jul 2025), while in agricultural VQA, a text-only judge assesses factual merit and clarity (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025).

The following pseudocode encapsulates the CPJ process (see (Zhang et al., 26 Apr 2026)):

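# Caption stage: generate a structured caption, then refine it until it passes the threshold
C = VLM_Generate(I, prompt_template)
while CaptionScore(C) < threshold:
    feedback = Judge_Caption(C)      # targeted critique of the current caption
    C = VLM_Generate(I, feedback)    # regenerate the caption using that critique

# Prompt stage: two candidate answers to question Q from complementary viewpoints
A1 = VQA_Generate(I, C, Q, viewpoint1)
A2 = VQA_Generate(I, C, Q, viewpoint2)

# Judge stage: score both candidates, keep the higher-scoring one (ties go to A1)
s1 = JudgeScore(A1, Q, C)
s2 = JudgeScore(A2, Q, C)
Astar = A1 if s1 >= s2 else A2
R = JudgeRationale(A1, A2, Astar)    # human-auditable explanation of the selection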
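The same flow can be written as a small runnable Python sketch. Everything model-related here is an assumption made for illustration: the callables `generate_caption`, `score_caption`, `critique_caption`, `generate_answer`, and `judge_score` are hypothetical stand-ins for VLM and LLM calls that the source does not specify, the default `threshold` is arbitrary, and `max_rounds` is an added guard against an unbounded refinement loop rather than part of the published procedure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CPJResult:
    caption: str
    candidates: Tuple[str, str]
    scores: Tuple[float, float]
    selected: str


def run_cpj(
    image: object,
    question: str,
    generate_caption: Callable[[object, Optional[str]], str],  # hypothetical VLM call
    score_caption: Callable[[str], float],                     # hypothetical caption scorer
    critique_caption: Callable[[str], str],                    # hypothetical critique call
    generate_answer: Callable[[object, str, str, str], str],   # hypothetical VQA call
    judge_score: Callable[[str, str, str], float],             # hypothetical judge call
    viewpoints: Tuple[str, str] = ("disease-focused", "crop-focused"),
    threshold: float = 0.8,   # assumed value, not from the source
    max_rounds: int = 3,      # added safeguard, not from the source
) -> CPJResult:
    # Caption stage: first draft, then at most max_rounds refinements.
    caption = generate_caption(image, None)
    for _ in range(max_rounds):
        if score_caption(caption) >= threshold:
            break
        caption = generate_caption(image, critique_caption(caption))

    # Prompt stage: one candidate answer per viewpoint.
    a1 = generate_answer(image, caption, question, viewpoints[0])
    a2 = generate_answer(image, caption, question, viewpoints[1])

    # Judge stage: higher score wins, ties go to the first candidate.
    s1 = judge_score(a1, question, caption)
    s2 = judge_score(a2, question, caption)
    selected = a1 if s1 >= s2 else a2
    return CPJResult(caption, (a1, a2), (s1, s2), selected)
```

Passing the model calls in as arguments keeps each stage replaceable on its own, which mirrors the modularity the paradigm is built around.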

2. Structured Captioning: Formalisms and Methodology

Structured captioning plays a pivotal role in CPJ by constraining and explicating the information that underpins candidate generation. In the Re-LAION-Caption 19M pipeline (Merchant et al., 7 Jul 2025), LLaVA-Next (Mistral 7B Instruct backbone) generates captions with the following canonical structure:

Subjects or objects (including actions)
Location and setting
Image aesthetics
Camera perspective (angle, framing, focal point)

This slot-based template is encoded as:

$C_\mathrm{template}(u) = "1. \phi_1(u) \; 2. \phi_2(u) \; 3. \phi_3(u) \; 4. \phi_4(u)"$

where $\phi_i(u)$ is the sentence that fills slot $i$ for input $u$. Because the slot order is fixed, the model does not have to learn invariance to sentence ordering and can concentrate on the semantic mapping. In agricultural pipelines, captions are scored on multi-dimensional criteria (accuracy, completeness, detail, relevance, clarity), and refined iteratively via targeted LLM feedback until a quality threshold $\tau$ is met (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025):

$s(C) = \frac{1}{k}\sum_{i=1}^k w_i d_i(C) \quad \text{with} \quad w_i = 1$

Here $d_i(C)$ is the score on the $i$-th criterion and $k$ is the number of criteria. If $s(C) < \tau$, a critique is issued and the caption is regenerated, typically converging within 1–2 iterations.
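As a minimal sketch of the two formalisms in this section, the snippet below renders a four-slot caption in canonical order and computes the unweighted mean score that triggers refinement. The slot keys, the assumption that each dimension score lies in [0, 1], and the example value of `tau` are all illustrative choices that do not come from the cited papers.

```python
from statistics import mean
from typing import Mapping

# Canonical slot order; the key names are hypothetical labels for the four slots.
SLOTS = ("subjects", "setting", "aesthetics", "camera")


def render_caption(fields: Mapping[str, str]) -> str:
    """Emit the slots in the fixed canonical order, whatever order the mapping uses."""
    return " ".join(f"{i}. {fields[name]}" for i, name in enumerate(SLOTS, start=1))


def caption_score(dimension_scores: Mapping[str, float]) -> float:
    """Unweighted mean over rubric dimensions, i.e. s(C) with every w_i = 1."""
    return mean(dimension_scores.values())


def needs_refinement(dimension_scores: Mapping[str, float], tau: float) -> bool:
    """True when s(C) < tau, meaning a critique should be issued."""
    return caption_score(dimension_scores) < tau


example_scores = {
    "accuracy": 0.9,
    "completeness": 0.6,
    "detail": 0.7,
    "relevance": 0.9,
    "clarity": 0.9,
}
print(render_caption({
    "camera": "Low angle, centered framing.",
    "subjects": "A dog runs.",
    "setting": "On a beach.",
    "aesthetics": "Warm evening light.",
}))
print(needs_refinement(example_scores, tau=0.85))
```

With these example scores the mean is 0.8, so a threshold of 0.85 would trigger one more refinement round while a threshold of 0.75 would not.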

3. Prompt-Oriented Candidate Generation

Once captioning is complete, the structured text feeds into a prompt that elicits candidate responses from a generative or VQA model. In text-to-image pipelines, captions directly replace or augment user prompts as textual conditioning for diffusion models, with context-window adaptations depending on model-specific tokenizers (e.g., Flan-T5 vs. CLIP in PixArt-E and Stable Diffusion) (Merchant et al., 7 Jul 2025). For agricultural diagnosis, the prompt combines the refined caption, the user question, and perspective-diverse answer instructions:

Disease classification: answers focusing separately on disease (symptoms, severity) and crop (morphology, traits)
Knowledge QA: answers organized by treatment/management and etiology/lifecycle

The CPJ process constructs two complementary answer drafts, exposing model uncertainty and surfacing candidate evidence.
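The dual-viewpoint construction described above can be sketched as a function that returns one prompt per viewpoint. The viewpoint pairs follow the two task types listed in this section; the task keys and the prompt wording are hypothetical, since the source does not give the actual prompt text.

```python
from typing import Dict, List, Tuple

# Viewpoint pairs per task type, taken from the two cases listed above.
# The dictionary keys are illustrative names, not identifiers from the papers.
VIEWPOINTS: Dict[str, Tuple[str, str]] = {
    "disease_classification": ("disease (symptoms, severity)", "crop (morphology, traits)"),
    "knowledge_qa": ("treatment and management", "etiology and lifecycle"),
}


def build_prompts(caption: str, question: str, task: str) -> List[str]:
    """Combine caption, question, and a viewpoint instruction into two prompts."""
    prompts = []
    for focus in VIEWPOINTS[task]:
        prompts.append(
            f"Image description:\n{caption}\n\n"
            f"Question: {question}\n\n"
            f"Answer with a focus on {focus}."
        )
    return prompts


for p in build_prompts(
    "Brown concentric lesions on lower leaves; label uncertain.",
    "What is wrong with this plant?",
    "disease_classification",
):
    print(p, end="\n---\n")
```

Both prompts share the same caption and question, so any disagreement between the resulting answers comes from the viewpoint instruction alone.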

4. LLM-as-Judge Evaluation, Selection, and Stability

The Judge module applies domain-specific rubrics to score or select among candidates. Scoring functions in agricultural CPJ pipelines average over criteria relevant to the subtask (plant, disease, symptom accuracy; completeness; etc.):

$\mathrm{Score}(A) = \frac{1}{|\Omega|}\sum_{\omega \in \Omega} g_\omega(A, A_\mathrm{ref})$

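where $\Omega$ is the set of rubric criteria and $g_\omega$ scores answer $A$ on criterion $\omega$ against the reference $A_\mathrm{ref}$. The final selection is: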

$A^* = \arg\max_{A \in \{A^{(1)}, A^{(2)}\}} \mathrm{Score}(A)$

Accompanying textual rationales serve as a human-interpretable audit trail linking the selected answer to explicit observations in the caption.
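The averaging and argmax rules above can be illustrated with a short sketch. The two toy criteria (`token_recall` and `is_concise`) are invented for the example and are not the rubric used in the cited pipelines, where each criterion would be an LLM judgment.

```python
from typing import Callable, Mapping, Sequence, Tuple

Criterion = Callable[[str, str], float]  # g_omega(answer, reference) -> score


def score(answer: str, reference: str, criteria: Mapping[str, Criterion]) -> float:
    """Mean of g_omega(A, A_ref) over all criteria in Omega."""
    return sum(g(answer, reference) for g in criteria.values()) / len(criteria)


def select(
    candidates: Sequence[str], reference: str, criteria: Mapping[str, Criterion]
) -> Tuple[str, float]:
    """Argmax over candidates; max() keeps the earliest candidate on ties."""
    scored = [(score(a, reference, criteria), a) for a in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def token_recall(answer: str, reference: str) -> float:
    """Toy criterion: fraction of reference words that appear in the answer."""
    ref = set(reference.lower().split())
    return len(ref & set(answer.lower().split())) / len(ref)


def is_concise(answer: str, reference: str) -> float:
    """Toy criterion: 1.0 if the answer has at most 12 words."""
    return float(len(answer.split()) <= 12)


toy_criteria = {"token_recall": token_recall, "is_concise": is_concise}
best, best_score = select(
    ["This looks like early blight on tomato", "Potato leaf showing late blight lesions"],
    "potato late blight",
    toy_criteria,
)
print(best, best_score)
```

Breaking ties in favor of the first candidate matches the `s1 >= s2` rule in the pseudocode of Section 1.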

Stability and reliability of Judge modules are quantified using the Judge Sensitivity Score (JSS), which measures verdict consistency across semantically equivalent prompt paraphrases (Bellibatlu, 26 Apr 2026):

$\text{JSS}(j,t) = \frac{1}{|P|} \sum_{i=1}^{|P|} \delta(j(p_i), j(p_i'))$

where $j$ is the judge model, $P$ is the set of paraphrase pairs $(p_i, p_i')$, and $\delta$ equals 1 when the two verdicts agree and 0 otherwise; $t$ presumably denotes the evaluation task from which the pairs are drawn. A high JSS indicates that the judge's verdicts are robust to surface-level prompt changes, which is critical in CPJ pipelines to avoid evaluation noise. Empirically, coherence tasks show the largest JSS variance across current LLMs, and consistency does not track model scale (Bellibatlu, 26 Apr 2026).
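A minimal sketch of the JSS computation follows, assuming that $\delta$ is an agreement indicator (1 when the judge returns the same verdict for both paraphrases, 0 otherwise). The `toy_judge` function and the example prompts are hypothetical stand-ins for a real LLM judge and a real paraphrase set.

```python
from typing import Callable, Hashable, Sequence, Tuple


def judge_sensitivity_score(
    judge: Callable[[str], Hashable],
    paraphrase_pairs: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of paraphrase pairs on which the judge returns the same verdict."""
    if not paraphrase_pairs:
        raise ValueError("at least one paraphrase pair is required")
    agreements = sum(1 for p, p_alt in paraphrase_pairs if judge(p) == judge(p_alt))
    return agreements / len(paraphrase_pairs)


def toy_judge(prompt: str) -> str:
    """Deliberately brittle stand-in: the verdict depends on a surface keyword."""
    return "A" if "first" in prompt.lower() else "B"


pairs = [
    ("Is the first answer better?", "Does the first answer win?"),
    ("Pick the stronger reply.", "Choose the stronger reply."),
    ("Is the first one clearer?", "Which one is clearer?"),
    ("Rate coherence.", "Assess coherence."),
]
print(judge_sensitivity_score(toy_judge, pairs))
```

The toy judge flips its verdict on the third pair because the keyword it relies on disappears in the paraphrase, so it agrees on three of four pairs.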

5. Empirical Results and Quantitative Impact

The CPJ paradigm has delivered measurable gains across both image synthesis and domain VQA tasks.

Text-to-Image Generation: Fine-tuning diffusion models (PixArt-E and Stable Diffusion 2) with structured, four-part captions yields consistent improvements in text-image alignment, measured as the mean VQA “yes” probability, with increases of roughly 0.7–1.1 points across both LLaVA-based and InstructBLIP judges. Shuffling caption sentences degrades performance, confirming the utility of canonical structure (Merchant et al., 7 Jul 2025).

Model	Captions	VQA (LLaVA)	VQA (InstructBLIP)
PixArt-E	Structured	0.8630	0.8327
PixArt-E	Shuffled	0.8563	0.8303
Stable Diffusion	Structured	0.8120	0.8010
Stable Diffusion	Shuffled	0.8010	0.7905
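The structured-versus-shuffled gap in the table can be checked with a few lines of arithmetic. The scores are copied from the rows above; the dictionary layout and function name are illustrative only.

```python
from typing import Dict, Tuple

# Alignment scores from the table: (LLaVA judge, InstructBLIP judge).
SCORES: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("PixArt-E", "Structured"): (0.8630, 0.8327),
    ("PixArt-E", "Shuffled"): (0.8563, 0.8303),
    ("Stable Diffusion", "Structured"): (0.8120, 0.8010),
    ("Stable Diffusion", "Shuffled"): (0.8010, 0.7905),
}


def structured_minus_shuffled(model: str) -> Tuple[float, ...]:
    """Per-judge difference between structured and shuffled captions."""
    structured = SCORES[(model, "Structured")]
    shuffled = SCORES[(model, "Shuffled")]
    return tuple(round(s - h, 4) for s, h in zip(structured, shuffled))


for model in ("PixArt-E", "Stable Diffusion"):
    llava_gap, blip_gap = structured_minus_shuffled(model)
    print(f"{model}: LLaVA {llava_gap:+.4f}, InstructBLIP {blip_gap:+.4f}")
```

For these rows the differences are 0.0067 and 0.0024 for PixArt-E, and 0.0110 and 0.0105 for Stable Diffusion, so structured captions score higher under both judges in every case.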

Agricultural Diagnosis: Introducing caption refinement and judge selection boosts disease classification on CDDMBench from 11.0% to 33.7% (+22.7 pp) and QA scores from 65.0 to 84.5 (+19.5 points) when using GPT-5-Nano with GPT-5-mini captions (Zhang et al., 26 Apr 2026, Zhang et al., 31 Dec 2025). Ablation shows that structured captions are the dominant contributor to these gains.

6. Interpretability, Auditability, and Best Practices

A significant merit of CPJ pipelines is their support for explicit reasoning audit trails. At every stage (caption, candidate answer, judge rationale), human practitioners can inspect, critique, or update components without retraining or altering model weights. This modularity aids error localization (e.g., ambiguous symptoms, misclassified crops) and lets practitioners tailor model behavior to expert feedback.

Best practices emerging from the literature include:

Enforce high JSS (≥ 0.8) on chosen judges to avoid evaluation noise in CPJ pipelines (Bellibatlu, 26 Apr 2026).
Avoid prompt-polarity ambiguities and position bias in pairwise (A/B) tasks.
Prefer slot-based, canonical caption templates so that model capacity is not spent on learning sentence-order invariance (Merchant et al., 7 Jul 2025).
Leverage multi-prompt judge ensembles for robustness in scoring and verdicts.
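Two of these practices, controlling for position bias in pairwise judging and aggregating over several judge prompts, can be combined in one sketch. This is one possible realization under stated assumptions rather than a procedure from the cited papers: the judge is modeled as a hypothetical callable returning "first" or "second", verdicts that flip when the presentation order is swapped are discarded, and the remaining verdicts are combined by majority vote.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_verdict(judge: PairwiseJudge, prompt: str, a: str, b: str) -> Optional[str]:
    """Ask in both orders; return 'a' or 'b', or None if the verdict flips with position."""
    forward = judge(prompt, a, b)    # a is shown first
    backward = judge(prompt, b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


def ensemble_verdict(
    judge: PairwiseJudge, prompts: Sequence[str], a: str, b: str
) -> Optional[str]:
    """Majority vote over judge prompts, ignoring position-inconsistent verdicts."""
    votes = Counter(
        v for p in prompts if (v := debiased_verdict(judge, p, a, b)) is not None
    )
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between 'a' and 'b'
    return ranked[0][0]
```

A judge that always prefers whichever answer is shown first yields `None` from `debiased_verdict`, which makes position bias visible instead of letting it decide the outcome.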

7. Extensions and Future Directions

Published work suggests several avenues for advancing the CPJ paradigm:

Alternative canonical caption templates (e.g., “action–object–style–lighting”) and modality transfer (text-to-video).
Integrating paraphrase-based stability objectives into judge model instruction tuning.
Extending CPJ workflows beyond agriculture and image synthesis, e.g., document understanding and biomedical evaluation.
Systematic quantification of perceived controllability and human–machine alignment gains derived from audit-trail transparency.

By codifying and modularizing every decision and reasoning step, the CPJ paradigm provides a scalable template for constructing explainable, robust, and auditable AI systems across a diverse range of generative and evaluative modeling domains (Merchant et al., 7 Jul 2025, Zhang et al., 31 Dec 2025, Zhang et al., 26 Apr 2026, Bellibatlu, 26 Apr 2026).