ConJudge: Contextual LLM Evaluation

Updated 27 November 2025
  • ConJudge is an LLM-based evaluation framework that assesses candidate outputs using explicit grounding context and calibrated ensemble techniques.
  • It employs an optimal minority-veto rule and regression-based bias correction to mitigate biases, achieving error rates as low as 1.2% in rigorous evaluations.
  • The framework applies a conditional hierarchy—evaluating refusal, faithfulness, completeness, and conciseness—to enhance consistency in diverse, context-dependent tasks.

ConJudge refers to a class of LLM-based evaluation frameworks and judge models focused on automated, high-fidelity assessment of candidate model outputs using grounded and context-aware criteria. ConJudge systems explicitly address the limitations of conventional LLM-as-a-judge paradigms by mitigating systematic biases (e.g., agreeableness bias) and enabling rigorous evaluation in both vanilla and context-dependent settings, particularly for Retrieval-Augmented Generation (RAG), code feedback, controlled summarization, and related applications (Jain et al., 13 Oct 2025, Xu et al., 19 Mar 2025, Liu et al., 26 Feb 2025).

1. Definition and Motivations

A ConJudge is an LLM-based "judge" model purpose-built to assess candidate responses with respect to both a user query and an explicit grounding context (e.g., retrieved documents, long passages, or structured data). Unlike traditional non-contextual judges that rely on "knowledge-free" comparison, a ConJudge must detect hallucinations and assess refusal handling, completeness, and conciseness in a context-sensitive manner (Xu et al., 19 Mar 2025).

The motivation for ConJudge centers on the need for reliable, nuanced, and scalable evaluation workflows as LLMs enter rapid deployment cycles and application developers require mechanisms to decide on model switching and outcome validation. Human evaluation, while precise, is costly and unscalable. Early LLM-based evaluators suffer from positive ("agreeableness") bias, prompt sensitivity, contextual brittleness, and inadequate recognition of negative or incomplete outputs (Jain et al., 13 Oct 2025, Liu et al., 26 Feb 2025).

2. Key Biases and Failure Modes in LLM Judging

Quantitative analysis of validator LLMs on held-out human-annotated benchmarks reveals stark disparities between affirmative and negative recognition:

  • True Positive Rate (TPR): State-of-the-art LLM validators exhibit a high average TPR (≥ 96%); they nearly always affirm truly valid outputs.
  • True Negative Rate (TNR): Average TNR is markedly low (< 25%), demonstrating the LLMs' inability to reliably reject invalid outputs.
  • Class Imbalance: When invalid items comprise a small minority (e.g., 7.5% of all feedback), overall accuracy appears misleadingly high, masking the underlying leniency and miscalibration (Jain et al., 13 Oct 2025).
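As an illustration of how this imbalance masks leniency, take representative values consistent with the reported averages (92.5% valid items, TPR of 96%, TNR of 25%); the standard accuracy decomposition then gives

$$\text{Accuracy} = 0.925 \times 0.96 + 0.075 \times 0.25 \approx 0.91,$$

so a validator that rejects only a quarter of invalid outputs still reports roughly 91% overall accuracy.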

In contextual scenarios, additional failures include:

  • Length Bias: Tendency to prefer longer responses, resulting in poor performance on conciseness.
  • Prompt and Positional Sensitivity: Swapping candidate order or varying the evaluation prompt structure significantly alters decisions; inter-run consistency is often below 60%.
  • Failure to Execute Conditional Criteria: Judges frequently apply the wrong evaluation criterion in chain-of-thought explanations, further degrading reliability (Xu et al., 19 Mar 2025).

3. Ensemble and Regression-Based Debiasing in ConJudge

3.1 Optimal Minority-Veto Rule

To address agreeableness bias, ConJudge employs an optimal minority-veto ensemble policy:

Given $M$ validators, a veto threshold $n_{\mathrm{veto}}$ is set (e.g., 4 out of 14). For each item, the system:

  • Labels the item "invalid" if at least $n_{\mathrm{veto}}$ validators give an "invalid" vote ($v_j = 0$), and "valid" otherwise.
  • Ignores missing votes, making the rule robust to up to $M - n_{\mathrm{veto}}$ missing judgments.

Empirically, this approach (with $n_{\mathrm{veto}} = 4$) yields a maximum absolute error (MaxAE) of ~2.8%, substantially outperforming majority voting and remaining robust to incomplete ensembles (Jain et al., 13 Oct 2025).
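The veto rule itself is simple to state in code. Below is a minimal sketch, assuming votes are encoded as 1 (valid), 0 (invalid), and None (missing); the function and variable names are illustrative, not taken from the paper.

```python
from typing import Optional, Sequence

def minority_veto(votes: Sequence[Optional[int]], n_veto: int = 4) -> int:
    """Label an item invalid (0) if at least `n_veto` validators vote invalid.

    Missing judgments (None) are ignored, so the rule tolerates up to
    M - n_veto absent votes without changing its behavior.
    """
    invalid_votes = sum(1 for v in votes if v == 0)
    return 0 if invalid_votes >= n_veto else 1

# Example: 14 validators, two abstain, three flag the item as invalid.
votes = [1, 1, 0, 1, None, 1, 0, 1, 1, None, 1, 0, 1, 1]
print(minority_veto(votes, n_veto=4))  # -> 1 (3 invalid votes, below the veto threshold)
```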

3.2 Regression-Based Bias Correction

ConJudge introduces a principled regression framework that models both validator- and generator-specific properties:

  • Each validator $V_j$ has a calibrated TPR $\alpha_j$ and TNR $\beta_j$.
  • The probability $\rho_{ij}$ that $V_j$ affirms a candidate from generator $G_i$ is modeled as

    $$\rho_{ij} = g_i\,\alpha_j + (1 - g_i)(1 - \beta_j)$$

    where $g_i$ is the true underlying generator precision.

  • Training minimizes a combined prediction-calibration loss anchored to a small set of human annotations, with careful weighting to correct for TNR underestimation.

With five annotated generators used in calibration, this approach attains a MaxAE of 1.2%, more than halving the error of the best ensemble method (Jain et al., 13 Oct 2025).
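The paper's estimator couples a prediction and a calibration loss; the sketch below instead shows a simplified closed-form least-squares inversion of the affirmation model above, assuming the TPRs $\alpha_j$ and TNRs $\beta_j$ have already been calibrated on the annotated set. All names are illustrative.

```python
import numpy as np

def estimate_generator_precision(rho_hat: np.ndarray,
                                 alpha: np.ndarray,
                                 beta: np.ndarray) -> np.ndarray:
    """Least-squares estimate of generator precision g_i from observed
    affirmation rates rho_hat[i, j] and calibrated validator TPR/TNR.

    rho_ij = g_i * alpha_j + (1 - g_i) * (1 - beta_j) rearranges to
    rho_ij - (1 - beta_j) = g_i * (alpha_j + beta_j - 1), linear in g_i.
    """
    slope = alpha + beta - 1.0                   # shape (M,)
    residual = rho_hat - (1.0 - beta)            # shape (N, M)
    g = residual @ slope / np.dot(slope, slope)  # per-generator least squares
    return np.clip(g, 0.0, 1.0)

# Toy example: 3 generators, 4 validators with high TPR and low TNR.
alpha = np.array([0.97, 0.96, 0.98, 0.95])
beta = np.array([0.30, 0.20, 0.25, 0.35])
g_true = np.array([0.90, 0.75, 0.60])
rho = g_true[:, None] * alpha + (1 - g_true[:, None]) * (1 - beta)
print(estimate_generator_precision(rho, alpha, beta))  # ~ [0.90, 0.75, 0.60]
```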

4. Contextual Evaluation: Hierarchies and Protocols

Modern ConJudge frameworks for contextual tasks (e.g., RAG, summarization) follow a structured, conditional evaluation hierarchy:

  1. Refusal Validity: Judge whether a response correctly refuses to answer when context does not support the query.
  2. Faithfulness: Determine which response is more anchored in the provided context.
  3. Completeness: Among faithful responses, select the more comprehensive one.
  4. Conciseness: If all else is equal, prefer the shorter, more succinct answer (Xu et al., 19 Mar 2025).
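The hierarchy above can be encoded as short-circuiting decision logic. The sketch below assumes each response already carries per-criterion judgments produced by the judge's prompts; the field names and helper structure are hypothetical, not part of ContextualJudgeBench.

```python
def judge_pair(resp_a: dict, resp_b: dict, context_answerable: bool) -> str:
    """Apply the conditional hierarchy to a pair of candidate responses.

    Each response dict is assumed to hold pre-computed per-criterion
    judgments, e.g. {"refuses": bool, "faithful": bool,
    "completeness": float, "length": int} (illustrative fields).
    """
    # 1. Refusal validity: refusing is correct only when the context
    #    cannot support the query.
    a_ok = resp_a["refuses"] == (not context_answerable)
    b_ok = resp_b["refuses"] == (not context_answerable)
    if a_ok != b_ok:
        return "A" if a_ok else "B"

    # 2. Faithfulness: prefer the response grounded in the provided context.
    if resp_a["faithful"] != resp_b["faithful"]:
        return "A" if resp_a["faithful"] else "B"

    # 3. Completeness: among equally faithful responses, prefer the more
    #    comprehensive one.
    if resp_a["completeness"] != resp_b["completeness"]:
        return "A" if resp_a["completeness"] > resp_b["completeness"] else "B"

    # 4. Conciseness: all else equal, prefer the shorter answer.
    return "A" if resp_a["length"] <= resp_b["length"] else "B"
```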

These steps are encoded in prompt templates and enforced via conditional prompting. The ContextualJudgeBench benchmark evaluates 2,000 response pairs across eight splits, each focused on one aspect of the hierarchy and spanning domains such as news, medical, meetings, and Wikipedia.

Consistent accuracy in this protocol requires a judge to produce correct, stable decisions with order-invariance. Even top models achieve only ∼55% consistent accuracy overall, with pronounced brittleness in handling refusal and conciseness criteria (Xu et al., 19 Mar 2025).

5. Judge-Consistency (ConsJudge) in Prompt-Stable Evaluation

An advanced methodology, ConsJudge, exploits cross-prompt stability to mitigate judgment variance and teach LLM-based judges self-consistency (Liu et al., 26 Feb 2025):

  • Multiple Hybrid Prompts: For $k$ distinct aspect combinations (e.g., Hallucination, Completeness, Coherence, Semantic Consistency), $k$ prompts are generated per evaluation.
  • Judge-Consistency Score: For each judgment $r_i$, the average pairwise cosine similarity (via a fixed embedding model) to all other judgments $r_j$ yields a consistency score $S_i$.
  • DPO Training: The model is explicitly optimized to prefer high-consistency judgments ($r^+$, with maximal $S_i$) over low-consistency ones ($r^-$) via Direct Preference Optimization.
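A minimal sketch of the consistency scoring and preference-pair construction is shown below, assuming judgments have already been embedded into unit-normalized vectors by a fixed embedding model; the DPO update itself is omitted and all names are illustrative.

```python
import numpy as np

def consistency_scores(judgment_embeddings: np.ndarray) -> np.ndarray:
    """S_i = average cosine similarity of judgment i to all other judgments.

    `judgment_embeddings` has shape (k, d), one L2-normalized row per
    hybrid evaluation prompt.
    """
    sims = judgment_embeddings @ judgment_embeddings.T  # (k, k) cosine matrix
    k = sims.shape[0]
    return (sims.sum(axis=1) - 1.0) / (k - 1)           # exclude self-similarity

def build_dpo_pair(judgments: list, embeddings: np.ndarray):
    """Pick the most and least consistent judgments as (chosen, rejected)."""
    scores = consistency_scores(embeddings)
    return judgments[int(scores.argmax())], judgments[int(scores.argmin())]
```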

This approach increases agreement with superior LLM baselines (e.g., GLM-4-plus), improves RAG end-task performance (up to +2.06 points in accuracy vs. standard rewards), and reduces sensitivity to prompt choices.

| Reward Model | MiniCPM-2.4B Avg. Acc | Llama3-8B Avg. Acc |
|---|---|---|
| RawMetric | 59.99 | 63.63 |
| Vanilla LLM | 60.84 | 63.63 |
| SFT | 60.64 | 64.20 |
| ConsJudge | 61.04 | 65.69 |

6. Empirical Performance and Open Challenges

  • On code-feedback tasks (366 programs, 14 validators), the regression-corrected ConJudge achieves 1.2% MaxAE with 5 annotated generators (Jain et al., 13 Oct 2025).
  • On ContextualJudgeBench, state-of-the-art models cap out at ∼55% consistent accuracy; conciseness and refusal (unanswerable QA) splits remain most challenging (Xu et al., 19 Mar 2025).

Notable bias/failure sources include:

  • Length bias favoring verbose responses.
  • Prompt and positional sensitivity (up to 15-point variance depending on candidate order).
  • Insufficient contextual training data, especially for refusal cases.
  • Embedded chain-of-thought explanations often cite incorrect criteria.
  • Ensemble scaling (e.g., LLM-as-jury or self-consistency at inference) has negligible impact unless contextual finetuning is employed (Xu et al., 19 Mar 2025, Liu et al., 26 Feb 2025).

7. Integration Protocols and Best Practices

  • Pipeline Steps:
  1. Fix validator pool and collect candidate outputs.
  2. Gather validator judgments and compute minority-veto-based assessments.
  3. Periodically update calibrated regression model with small, targeted human-annotated sets using active learning on cases of inter-validator disagreement.
  4. Report generator precision using the bias-corrected regression estimate.
  • Best Practices:
    • Select calibration samples from points of high validator disagreement.
    • Calibrate regression-model hyperparameters using cross-validation on the annotated set, placing sufficient emphasis on TNR correction.
    • Choose minority-veto thresholds mindful of the TPR vs. TNR trade-off (see the sketch after this list): lower thresholds offer better invalid rejection but may incur more false negatives.
    • For nuanced applications, use the regression-smoothed precision score over binary accept/reject decisions to allow for risk-contingent evaluation (Jain et al., 13 Oct 2025).
    • In contextual evaluations, embed full conditional hierarchies in both training and inference prompts for robust, criterion-aware judgments (Xu et al., 19 Mar 2025, Liu et al., 26 Feb 2025).
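As a concrete illustration of the veto-threshold trade-off flagged in the best-practices list, the hedged sketch below sweeps candidate thresholds on an annotated calibration set and reports the resulting TPR/TNR; the array layout and function name are illustrative.

```python
import numpy as np

def sweep_veto_thresholds(votes: np.ndarray, labels: np.ndarray, max_threshold: int) -> None:
    """Report minority-veto TPR/TNR for each candidate threshold.

    `votes` has shape (N_items, M_validators) with entries 1 (valid),
    0 (invalid), or -1 (missing); `labels` holds human ground truth (1/0).
    """
    for n_veto in range(1, max_threshold + 1):
        invalid_counts = (votes == 0).sum(axis=1)
        preds = np.where(invalid_counts >= n_veto, 0, 1)
        tpr = (preds[labels == 1] == 1).mean()  # valid items correctly accepted
        tnr = (preds[labels == 0] == 0).mean()  # invalid items correctly rejected
        print(f"n_veto={n_veto}: TPR={tpr:.2f}, TNR={tnr:.2f}")
```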

Persistent challenges include extending reward models to low-resource languages, architecting training corpora that evenly span all conditional criteria (notably refusal handling), refining cross-prompt consistency measurement (e.g., moving beyond embedding similarity), and balancing inference-time computational costs (Liu et al., 26 Feb 2025).
