Judge-and-Refine Pipeline
- Judge-and-refine pipelines are structured methods that separate generation from evaluation, using explicit rubrics to iteratively refine outputs.
- They employ clear judging criteria and scoring mechanisms to assess intermediate artifacts, ensuring improved selection quality and auditability.
- These pipelines are applied across domains such as visual diagnosis, regulatory extraction, educational marking, and workflow optimization to outperform unguided baselines.
A judge-and-refine pipeline is a structured inference or orchestration pattern in which a system first produces an artifact, then evaluates that artifact with an explicit judge, and finally revises, selects, routes, or filters subsequent computation on the basis of the judgment rather than by updating model parameters. Recent work instantiates this pattern in training-free visual diagnosis, hierarchical regulatory rule extraction, preference-chain construction for fine-tuning, block-level workflow optimization, and due-process manuscript revision. This range indicates that “judge-and-refine” is best understood as a family of pipelines whose central design variable is the object being judged: an intermediate representation, a final answer, a workflow component, a judgment, or an editable artifact (Zhang et al., 26 Apr 2026, Guliani et al., 2 Apr 2026, Cayir et al., 3 Aug 2025, Ma et al., 12 Jan 2026, Wang et al., 15 Jun 2026).
1. Core definition and architectural variants
The defining move in these systems is the explicit separation of generation from evaluation. Instead of letting a single decoding trajectory absorb perception, reasoning, and decision-making, the pipeline externalizes an intermediate or final artifact, subjects it to an explicit rubric, and only then permits downstream action. In Agri-CPJ, the artifact is first a structured morphological caption and then a pair of candidate diagnostic answers; in De Jure, it is a hierarchy of section_meta, definitions, and rule_units; in Refine-n-Judge, it is a sequence of progressively improved responses; in JudgeFlow, it is a workflow trace with block-level responsibility ranks; and in PaperJury, it is a ledger of manuscript issues, verdicts, and guarded patches (Zhang et al., 26 Apr 2026, Guliani et al., 2 Apr 2026, Cayir et al., 3 Aug 2025, Ma et al., 12 Jan 2026, Wang et al., 15 Jun 2026).
| System | Judged artifact | Refinement target |
|---|---|---|
| Agri-CPJ | Caption; dual candidate answers | Caption rewrite; answer selection |
| De Jure | Metadata, definitions, rule units | Stage-specific regeneration |
| Refine-n-Judge | Current answer vs refinement | Accepted next answer |
| JudgeFlow | Failed execution trace by block | Add/Remove/Modify block |
| PaperJury | Issues, verdicts, guarded patches | Anchor-bounded manuscript edits |
Taken together, these systems suggest two recurrent topologies. The first is intermediate-representation refinement, where the pipeline judges a latent-but-readable scaffold before task completion. The second is conclusion refinement, where the system generates multiple terminal candidates and uses a judge to select or reject them. Agri-CPJ makes this distinction explicit by separating “refinement of perception” from “refinement of conclusion” (Zhang et al., 26 Apr 2026).
A further distinction concerns where refinement acts. Some pipelines rewrite the same object iteratively; others retain multiple candidates and select the strongest; others do neither, instead modifying the orchestration substrate itself. JudgeFlow, for example, does not revise an answer but identifies the globally weakest logic block and updates workflow structure through “Add Block,” “Remove Block,” or “Modify Block” actions (Ma et al., 12 Jan 2026). PaperJury likewise refines not a judgment text but the manuscript, under deterministic control of a ledger and exact-once patch application (Wang et al., 15 Jun 2026).
2. Judging mechanisms, rubrics, and scoring formalisms
Judge-and-refine systems are typically defined more by their rubrics than by any single model family. Agri-CPJ’s caption judge scores captions on Accuracy, Completeness, Detail, Relevance, and Clarity, aggregates them into an overall quality score, and returns a JSON output containing rating, reasoning, and suggestions (Zhang et al., 26 Apr 2026). De Jure expands this principle into three judges and nineteen dimensions: six for section metadata, five for definitions, and eight for rule units, each scored on a 0–5 scale and summarized into a per-stage average used for acceptance or repair (Guliani et al., 2 Apr 2026).
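The per-stage averaging described for De Jure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimension counts and the 0–5 scale come from the text, while the stage keys, function names, and the `threshold` parameter are hypothetical, since the acceptance threshold value is not given here.

```python
from statistics import mean

# Dimension counts per stage as stated in the text (6 + 5 + 8 = 19).
# The stage keys themselves are hypothetical labels.
DIMENSIONS_PER_STAGE = {"section_meta": 6, "definitions": 5, "rule_units": 8}


def stage_average(stage: str, scores: list[int]) -> float:
    """Average a stage's per-dimension scores after validating count and range."""
    expected = DIMENSIONS_PER_STAGE[stage]
    if len(scores) != expected:
        raise ValueError(f"{stage} expects {expected} scores, got {len(scores)}")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("scores must lie on the 0-5 scale")
    return mean(scores)


def needs_repair(stage: str, scores: list[int], threshold: float) -> bool:
    """A stage is sent to repair when its average falls below the threshold.

    ASSUMPTION: the caller supplies the threshold; its value is not in the text.
    """
    return stage_average(stage, scores) < threshold
```

Validating the score count per stage keeps a malformed judge response from silently shifting the average.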
Other systems judge not task answers directly but judgments themselves. The meta-judge selection framework evaluates candidate judgments on seven weighted criteria—Accuracy of Judgment, Logical Soundness, Completeness of Evaluation, Fairness, Relevance to Context, Clarity of Explanation, and Impactfulness—with weights $0.20, 0.20, 0.15, 0.10, 0.15, 0.10, 0.10$, respectively (Li et al., 23 Apr 2025). This makes the judged object reflexive: the pipeline can refine the reliability of a judge by subjecting judgments to a second-order rubric.
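To make the weighting concrete, the sketch below combines per-criterion scores using the seven weights listed above. The weights are taken from the text; the weighted-sum aggregation, the criterion key names, and the `select_best` helper are assumptions made for illustration, and the scale of the per-criterion scores is left to the caller.

```python
# Criterion weights as listed in the text; they sum to 1.0.
# The snake_case key names are hypothetical.
WEIGHTS = {
    "accuracy_of_judgment": 0.20,
    "logical_soundness": 0.20,
    "completeness_of_evaluation": 0.15,
    "fairness": 0.10,
    "relevance_to_context": 0.15,
    "clarity_of_explanation": 0.10,
    "impactfulness": 0.10,
}


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores; every criterion must be present."""
    missing = WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


def select_best(judgments: dict[str, dict[str, float]]) -> str:
    """Return the id of the candidate judgment with the highest weighted score."""
    return max(judgments, key=lambda jid: weighted_score(judgments[jid]))
```

Because the weights sum to 1.0, the weighted score stays on the same scale as the individual criterion scores.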
In educational marking, rubric construction is itself staged. The curriculum-grounded marking pipeline first maps a question to authorised syllabus artifacts, then generates structured criteria, then calibrates performance bands using glossary definitions, performance band descriptors, and thirteen marking-guideline principles (Xu et al., 16 Jun 2026). Themis similarly relies on scenario-dependent evaluation prompts spanning ten scenarios, with a unified five-tier scale and structured explanations of strengths and shortcomings (Hu et al., 5 Feb 2025). These designs make explicit that judge-and-refine pipelines often depend on prompted institutional context—syllabi, glossaries, statutes, repo structure, or domain conventions—rather than on generic helpfulness criteria alone.
Judging can also be framed as reasoning rather than direct scoring. MR. Judge reformulates multimodal judgment as a multiple-choice task in which the model first produces deliberate reasoning in a delimited reasoning span and then selects a candidate answer in \boxed{A/B/C/D} (Pi et al., 19 May 2025). This suggests that some judge-and-refine pipelines are effectively reasoning-first evaluators, where the judge’s main role is not to emit a scalar but to create a structured comparative analysis from which a selection is derived.
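A reasoning-first judge still needs a deterministic step that turns free-form output into a selection. The sketch below shows one way to extract a boxed option letter from a judge response. The function name, the choice to take the last match, and the tolerance for a missing backslash are assumptions, not details from MR. Judge.

```python
import re
from typing import Optional

# Matches \boxed{A} through \boxed{D}. The backslash is treated as optional
# because it is sometimes lost in transcription (an assumption of this sketch).
BOXED_CHOICE = re.compile(r"\\?boxed\{\s*([A-D])\s*\}")


def extract_choice(judge_output: str) -> Optional[str]:
    """Return the last boxed option letter, or None if the judge gave none."""
    matches = BOXED_CHOICE.findall(judge_output)
    return matches[-1] if matches else None
```

Returning `None` rather than a default letter lets the orchestrator decide whether to retry the judge or discard the sample.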
3. Refinement operators and control logic
Refinement is usually budgeted, thresholded, and artifact-specific. In Agri-CPJ, an initial caption is rewritten until its judge score exceeds a quality threshold or the iteration cap is reached, typically taking no more than $2$ iterations (Zhang et al., 26 Apr 2026). In De Jure, each stage is repaired if and only if its average score falls below an acceptance threshold, with a bounded number of retries and best-scoring retention, yielding “monotonically non-decreasing” quality because worse repairs are discarded (Guliani et al., 2 Apr 2026). Refine-n-Judge instead accepts a refinement only if the judge prefers it over the previous answer and terminates when the new answer is not preferred, with a hard cap of 10 iterations; the accepted steps form a preference chain ordered from the initial answer to the final refinement (Cayir et al., 3 Aug 2025).
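The two control policies above can be sketched side by side. This is a simplified illustration under stated assumptions: the `score`, `repair`, `refine`, and `prefers_new` callables are hypothetical stand-ins for model calls, the threshold and retry budget are caller-supplied, and repairing from the best version so far (rather than the latest) is a choice made here, not one confirmed by the sources. The cap of 10 iterations comes from the text.

```python
from typing import Callable, List, Tuple


def refine_with_retention(
    artifact: str,
    score: Callable[[str], float],
    repair: Callable[[str], str],
    threshold: float,
    max_retries: int,
) -> Tuple[str, float]:
    """Repair while the best score is below threshold, keeping the best version."""
    best, best_score = artifact, score(artifact)
    for _ in range(max_retries):
        if best_score >= threshold:
            break
        candidate = repair(best)
        candidate_score = score(candidate)
        # Worse repairs are discarded, so best_score never decreases.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def preference_chain(
    answer: str,
    refine: Callable[[str], str],
    prefers_new: Callable[[str, str], bool],
    max_iters: int = 10,
) -> List[str]:
    """Accept refinements only while the judge prefers them; stop otherwise."""
    chain = [answer]
    for _ in range(max_iters):
        candidate = refine(chain[-1])
        if not prefers_new(chain[-1], candidate):
            break
        chain.append(candidate)
    return chain
```

In the first loop the judge supplies an absolute score and the loop guards against regressions. In the second the judge supplies only a pairwise preference, so the chain itself records the ordering that can later serve as preference data.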
A separate family uses refinement as structural optimization. JudgeFlow identifies the globally weakest block by aggregating, over failed executions, the responsibility rank assigned to each block on each failure; it then optimizes that block rather than the whole workflow (Ma et al., 12 Jan 2026). Probe-and-refine tuning applies the same diagnose-then-patch idea to repository guidance: synthetic probes expose missing operational knowledge, a judge suggests edits, and the guidance file is iteratively rewritten under a fixed character budget (Shepard et al., 18 Jun 2026).
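The block-selection step can be illustrated with a small sketch. It rests on loudly stated assumptions, because the exact aggregation formula is not reproduced here: rank 1 is taken to mean "most responsible for the failure", and the weakest block is taken to be the one with the lowest mean rank across failed traces. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def weakest_block(ranks_per_failure: List[Dict[str, int]]) -> str:
    """Pick the block with the lowest mean responsibility rank across failures.

    ASSUMPTION: rank 1 means "most responsible" and mean rank is the
    aggregation rule; JudgeFlow's actual formula may differ.
    """
    if not ranks_per_failure:
        raise ValueError("need at least one failed trace")
    collected: Dict[str, List[int]] = defaultdict(list)
    for ranks in ranks_per_failure:
        for block, rank in ranks.items():
            collected[block].append(rank)
    # Ties are broken by block name so the result is deterministic.
    return min(collected, key=lambda b: (mean(collected[b]), b))
```

Whatever the exact aggregation, the design point is the same: the judge's output is reduced to a single block identifier, so the optimizer edits one component per round instead of rewriting the whole workflow.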
Control logic can be model-centric or orchestration-centric. PaperJury explicitly locates “load-bearing safety and completion logic” in deterministic orchestration rather than in model discretion: decomposition, frozen claim spine, durable ledger, routing, stopping, exact-once patch application, and reverts are all handled in code, while semantic agents are restricted to bounded review, judgment, and repair (Wang et al., 15 Jun 2026). This is a materially different refinement philosophy from Refine-n-Judge, where the same LLM can act as both refiner and judge in different prompting modes (Cayir et al., 3 Aug 2025).
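The idea of keeping exact-once patch application and reverts in code rather than in model discretion can be sketched with a small ledger. Everything here is hypothetical and far simpler than PaperJury's system: the `PatchLedger` class, its method names, and the rule that an anchor must match exactly one span are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PatchLedger:
    """Hypothetical ledger that applies each anchor-bounded patch at most once."""

    applied: Set[str] = field(default_factory=set)
    history: Dict[str, str] = field(default_factory=dict)  # patch_id -> text before

    def apply(self, text: str, patch_id: str, anchor: str, replacement: str) -> str:
        if patch_id in self.applied:
            return text  # already applied: a repeated request is a no-op
        if text.count(anchor) != 1:
            raise ValueError("anchor must match exactly one span")
        self.history[patch_id] = text
        self.applied.add(patch_id)
        return text.replace(anchor, replacement, 1)

    def revert(self, patch_id: str) -> str:
        """Restore the snapshot taken just before the patch was applied.

        Simplification: this also discards any patches applied afterwards.
        """
        self.applied.discard(patch_id)
        return self.history.pop(patch_id)
```

A model can propose the `anchor` and `replacement`, but whether the edit happens, happens twice, or is rolled back is decided by deterministic code.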
A common misconception is that judge-and-refine is merely chain-of-thought with self-correction. Agri-CPJ explicitly rejects that equivalence: observation is externalized into a separate captioning phase, and downstream reasoning does not begin until the caption passes quality gating (Zhang et al., 26 Apr 2026). The distinction matters because many later gains are attributed not to better final reasoning alone but to better intermediate artifacts.
4. Domain-specific instantiations
In multimodal agricultural diagnosis, the pipeline centers on a structured morphological caption that excludes crop or disease names, followed by dual-view answer generation and LLM-as-a-judge answer selection. The readable caption and the judge rationale together form an audit trail, allowing practitioners to localize disagreement to the caption, the answer, or the judge (Zhang et al., 26 Apr 2026). A related multimodal evaluation framework uses a dedicated Judge Model over text, audio, image, and video, scoring both final answers and justifications and returning a 0–5 score, an error type, and diagnostic feedback (Shih et al., 3 Jan 2026).
In regulatory extraction, judge-and-refine operates over structured semantics rather than raw answers. De Jure normalizes PDFs or HTML into Markdown, decomposes sections into JSON, judges metadata, definitions, and rule units across nineteen criteria, and selectively repairs only deficient fields (Guliani et al., 2 Apr 2026). The result is a machine-readable rule substrate that the paper positions as a step toward regulation-grounded alignment.
Educational systems instantiate the pattern in two notably different ways. The curriculum-grounded marking pipeline first generates question-specific rubrics grounded in authorised curriculum artifacts and only then marks student responses, yielding justifications that are more traceable to syllabus outcomes and marking standards (Xu et al., 16 Jun 2026). REFINE, by contrast, uses a pedagogical feedback generator, an LLM judge that evaluates components of the feedback report against a rubric, and a tool-calling interactive feedback agent for follow-up questions (Fawzi et al., 31 Mar 2026).
Coding-agent variants move the judged object yet again. Probe-and-refine tuning judges synthetic bug-fix attempts in order to improve AGENTS.md-style repository guidance (Shepard et al., 18 Jun 2026). The architectural reasoning pipeline uses an Architecture Complexity Judge (ACJ) to estimate how much repository-specific understanding a task demands and an Architecture Quality Judge (AQJ) to assess patch conformance to source-grounded architectural rubrics, then filters examples for supervised fine-tuning (Vasilevski et al., 12 Jun 2026).
Judge-and-refine also appears in artifact-safe revision and safety defense. PaperJury applies due-process trials, verdict classes, risk-proportional guard chains, and anchor-bounded edits to LaTeX manuscripts (Wang et al., 15 Jun 2026). D-Judge intervenes inside an attacker’s multi-turn judge-driven jailbreak loop by rewriting the victim model’s outputs in a semantics-preserving way before the attacker’s judge sees them, thereby misaligning the feedback signal that drives prompt refinement (Gong et al., 31 May 2026).
5. Empirical evidence
The reported gains vary by domain, but the common empirical pattern is that explicit judgment criteria plus bounded refinement outperform direct generation or unguided baselines.
| System | Reported result | Citation |
|---|---|---|
| Agri-CPJ | +22.7 pp disease classification and +19.5 points QA over no-caption baseline | (Zhang et al., 26 Apr 2026) |
| De Jure | RAG responses preferred in 73.8% of cases at single-rule retrieval depth and 84.0% under broader retrieval | (Guliani et al., 2 Apr 2026) |
| Refine-n-Judge | Fine-tuned models preferred in over 74% of comparisons; +5% on AlpacaEval and AlpacaEval 2.0, +19% on MT-Bench | (Cayir et al., 3 Aug 2025) |
| JudgeFlow | Average 82.2 vs 80.8 for MermaidFlow | (Ma et al., 12 Jan 2026) |
| PaperJury | F1 = 0.656, Acc_v = 0.887, Acc_r = 0.913, ESVR = 0.025 | (Wang et al., 15 Jun 2026) |
| Probe-and-Refine | 33.0% mean resolve rate vs 28.3% static knowledge base and 25.5% unguided baseline | (Shepard et al., 18 Jun 2026) |
| Architectural reasoning pipeline | Up to 27.2% on SWE-bench Verified, up to 540% over base model and 256% over unfiltered fine-tuning | (Vasilevski et al., 12 Jun 2026) |
Additional evidence clarifies where improvements come from. Agri-CPJ’s ablations state that caption refinement is “the component with the largest individual impact,” and its human agreement study reports 94.2% preference match with two PhD plant pathologists, together with Cohen’s $\kappa$ and Pearson correlation between judge scores and human scores (Zhang et al., 26 Apr 2026). Probe-and-refine tuning reports that the gain comes from coverage rather than precision: refined guidance yields evaluable patches for 14.5 percentage points more instances while per-patch precision remains statistically constant (Shepard et al., 18 Jun 2026). The curriculum-grounded marking pipeline reports agreement between direct gpt-5 marking and human marks in terms of CCC and a weighted agreement coefficient, while the pipeline’s justifications win 80.4% of pairwise comparisons against human tutor justifications (Xu et al., 16 Jun 2026).
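For readers who want to reproduce this kind of judge-human agreement check, the sketch below computes Cohen's kappa from two label sequences using the standard definition, observed agreement corrected for chance agreement. It is a generic illustration on caller-supplied labels, not the evaluation code of any cited paper, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

Raw percentage agreement can look high simply because one label dominates, which is why agreement studies usually report a chance-corrected statistic alongside it.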
Taken together, these results suggest that judge-and-refine pipelines often gain less by making a single answer intrinsically “smarter” than by improving selection quality, coverage of relevant context, or auditability of intermediate artifacts.
6. Limitations, controversies, and open problems
The strongest limitation is that the judge is itself a modeling component with failure modes. RobustJudge shows that LLM-as-a-Judge systems remain vulnerable to adversarial attacks such as Combined Attack and PAIR, that robustness is highly sensitive to prompt template and judge-model selection, and that Re-tokenization and LLM-based detectors provide stronger protection than simpler prompt defenses (Li et al., 11 Jun 2025). D-Judge sharpens this point by demonstrating that semantics-preserving output rewrites can derail judge-driven jailbreak refinement loops while preserving benign-task performance, which implies that judge signals can be manipulated even when underlying response meaning is held constant (Gong et al., 31 May 2026).
A second controversy concerns whether the same model should act as both refiner and judge. Refine-n-Judge shows that a single LLM can fill both roles and still produce useful preference chains (Cayir et al., 3 Aug 2025). De Jure, by contrast, fixes a separate judge model and uses explicit criteria as a substitute for human annotation (Guliani et al., 2 Apr 2026). PaperJury pushes further, arguing that safety and completion should reside in deterministic orchestration rather than model discretion (Wang et al., 15 Jun 2026). This suggests a design tension between simplicity and modularity: single-model loops reduce system complexity, whereas separate judges and deterministic control reduce certain classes of feedback collapse and routing error.
A third limitation is domain and model dependence. Probe-and-refine tuning reports that Qwen-tuned guidance catastrophically harms NVIDIA-Nemotron-3-Nano-30B-A3B, even though per-patch precision remains constant when the agent loop reaches evaluation (Shepard et al., 18 Jun 2026). Themis reports that reference-guided judging helps close-ended scenarios such as Close QA, Math-related QA, Translation, and Reading comprehension & extraction, but has negligible or negative effect in several open-ended scenarios (Hu et al., 5 Feb 2025). Agri-CPJ notes failure modes in early-stage infections, co-infections, low image quality, and residual verbosity bias, and it explicitly removes response length from judge criteria after observing verbosity bias, which reduced misselections (Zhang et al., 26 Apr 2026).
A final misconception is that any increase in judge complexity automatically yields robustness. The empirical record is more qualified. Panel discussion among meta-judges does not outperform simpler majority voting in the judgment-selection framework (Li et al., 23 Apr 2025); PaperJury’s strongest claim is not about larger judges but about deterministic routing, durable issue identity, and guard chains (Wang et al., 15 Jun 2026). This suggests that future progress will likely depend as much on rubric calibration, uncertainty handling, routing policy, and adversarial robustness as on stronger backbone models.