DualJudge Hybrid Fusion for LLM Evaluation
- The paper demonstrates that fusing direct holistic scoring with structured AHP significantly improves LLM evaluation reliability and human alignment on benchmarks.
- DualJudge Hybrid Fusion employs a dynamic, consistency-aware weighting scheme that balances efficient scoring with detailed criterion analysis for robust performance.
- Empirical results reveal that integrating uncertainty-aware fuzzy AHP yields higher accuracy gains, particularly benefiting mid-tier and weaker LLM judges.
DualJudge Hybrid Fusion is a hybrid evaluation framework that integrates direct holistic scoring and structured Analytic Hierarchy Process (AHP) judgments to improve the reliability and interpretability of LLM evaluation. Inspired by dual-process cognitive theory, DualJudge adaptively fuses “System 1” (intuitive, direct scoring) and “System 2” (deliberative, criterion-structured) outputs through a dynamic, consistency-aware weighting scheme. This hybrid is motivated by the complementary error profiles of the two evaluation modes: direct scoring offers efficiency but lacks dimensional transparency, while structured decomposition affords rigorous criteria-level calibration at higher computational cost. DualJudge achieves state-of-the-art human alignment on the JudgeBench benchmark, especially for mid-to-weaker LLM judges, establishing the utility of uncertainty-aware structured reasoning in LLM assessment (He et al., 4 Apr 2026).
1. Motivation for Hybrid Fusion
Conventional LLM evaluation protocols generally adopt one of two paradigms. Direct scoring (System 1) prompts the LLM to issue a single holistic score or preference over the candidate answers. This is highly efficient, with minimal prompting overhead, but has several limitations: it conflates multiple evaluation dimensions in a single step, is highly sensitive to prompt format and scoring scale, and often yields inconsistent judgments on logic-intensive or multi-aspect tasks.
In contrast, the Analytic Hierarchy Process (System 2) decomposes the evaluation into explicit criteria: the LLM performs pairwise comparisons across all criteria, and the results are aggregated according to the canonical AHP framework. This affords explicit, interpretable criterion-level weights, enables consistency checking (via the Consistency Ratio, CR), and yields superior performance in domains with substantial reasoning demands. The trade-off is an increased annotation burden, with $n(n-1)/2$ comparisons per round for $n$ criteria, and potential instability from noisy or inconsistent pairwise judgments.
Empirical results reveal marked complementarity between these two regimes: weaker models benefit substantially from the structured scaffold of AHP, while stronger LLMs inherently internalize many necessary trade-offs across criteria. The differential error profiles of direct and structured judgments motivate their hybrid fusion in a unified, reliability-weighted aggregation.
2. DualJudge Architecture
DualJudge comprises two branches executed in parallel:
- Absolute (Direct) Scoring Branch: The LLM is prompted for a single holistic score $S_{\text{direct}}$ or, in some scenarios, a direct pairwise pick. Scores are normalized to a common range when required.
- Structured AHP Branch: A predefined set of $n$ criteria is designated for the evaluation category. For each criterion pair $(i, j)$, the LLM outputs a Saaty-scale score $s_{ij}$ (reflecting relative importance or performance) and a per-comparison meta-confidence $c_{ij}$. These are assembled into a reciprocal comparison matrix (either crisp or fuzzy), subjected to consistency repair as needed, and processed per AHP (or Fuzzy AHP) to yield an aggregate criterion-weighted score $S_{\text{AHP}}$.
- Consistency-Aware Fusion: The matrix’s Consistency Ratio (CR) quantifies the internal regularity of the structured ratings. This yields a reliability score $\alpha$ for the structured branch, with $\alpha \in [0, 1]$. The final output is a convex combination:
$$S_{\text{final}} = \alpha\, S_{\text{AHP}} + (1 - \alpha)\, S_{\text{direct}}$$
Fusion thus privileges the more consistent branch on an adaptive, per-instance basis, balancing directness with structure.
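As a minimal sketch of the fusion step, the Python below combines a direct score and an AHP score that are assumed to share the same range. The mapping from CR to the reliability weight is not given in this summary, so `reliability_from_cr` is a hypothetical linear ramp, and the threshold `tau = 0.1` is Saaty's conventional value rather than a setting confirmed by the source. Assigning the weight to the structured branch is likewise an assumption.

```python
def reliability_from_cr(cr: float, tau: float = 0.1) -> float:
    """Hypothetical map from Consistency Ratio to a weight in [0, 1].

    Assumption: reliability is 1 for a perfectly consistent matrix (CR = 0)
    and falls linearly to 0 once CR reaches the threshold tau.
    """
    return min(1.0, max(0.0, 1.0 - cr / tau))


def fuse(direct_score: float, ahp_score: float, cr: float, tau: float = 0.1) -> float:
    """Convex combination of the two branch scores.

    Assumption: alpha weights the structured (AHP) branch, so an
    inconsistent matrix shifts the result toward the direct score.
    """
    alpha = reliability_from_cr(cr, tau)
    return alpha * ahp_score + (1.0 - alpha) * direct_score


if __name__ == "__main__":
    # A consistent matrix trusts AHP; an inconsistent one falls back to direct.
    for cr in (0.0, 0.05, 0.3):
        print(cr, fuse(direct_score=0.8, ahp_score=0.6, cr=cr))
```

Because the weight is clamped to $[0, 1]$, the fused score always lies between the two branch scores, which is the defining property of a convex combination.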
3. Mathematical Formulation
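The AHP comparison matrix $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ collects lower-triangular pairwise judgments according to: $a_{ij} = s_{ij}$ for $i > j$, $a_{ji} = 1 / a_{ij}$, and $a_{ii} = 1$.
- Crisp AHP Weights: The principal eigenvector $w$ of $A$, normalized to sum to unity, is computed. Consistency is measured as
$$\mathrm{CR} = \frac{\mathrm{CI}}{\mathrm{RI}}, \qquad \mathrm{CI} = \frac{\lambda_{\max} - n}{n - 1},$$
where $\lambda_{\max}$ is the principal eigenvalue of $A$, $n$ is the number of criteria, and $\mathrm{RI}$ is the standard random-consistency index. If $\mathrm{CR} > \tau$ (with $\tau$ the acceptance threshold, conventionally $0.1$ in AHP), local repair is applied; if irreparable, comparisons are re-generated.
- Fuzzy AHP with Confidence: Each discrete comparison is mapped to a triangular fuzzy number (TFN) $(l_{ij}, m_{ij}, u_{ij})$. The interval is contracted toward the mode $m_{ij}$ according to the LLM’s meta-confidence $c_{ij}$, so that higher confidence gives a narrower interval. This produces $\tilde{a}_{ij} = (l'_{ij}, m_{ij}, u'_{ij})$; reciprocity holds via $\tilde{a}_{ji} = (1/u'_{ij},\, 1/m_{ij},\, 1/l'_{ij})$. Fuzzy geometric means $\tilde{r}_i$ and normalized TFN criterion weights $\tilde{w}_i$ are calculated, then defuzzified through centroids.
- Fusion Weighting: The final output is regulated by $\alpha$ ($\alpha \in [0, 1]$), weighting the relative contributions of $S_{\text{direct}}$ and $S_{\text{AHP}}$.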
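The crisp AHP computation can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the helper names are hypothetical, the principal eigenvector is approximated by power iteration, and the random-index table holds Saaty's commonly cited values for up to ten criteria.

```python
# Saaty's commonly cited random-consistency index, keyed by matrix size n.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32,
                8: 1.41, 9: 1.45, 10: 1.49}


def build_matrix(lower: dict, n: int) -> list:
    """Fill a reciprocal matrix from judgments {(i, j): score} with i > j."""
    a = [[1.0] * n for _ in range(n)]
    for (i, j), score in lower.items():
        a[i][j] = score
        a[j][i] = 1.0 / score
    return a


def principal_eigen(a: list, iters: int = 200) -> tuple:
    """Power iteration: returns (weights summing to 1, estimate of lambda_max)."""
    n = len(a)
    w = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(a[i][k] * w[k] for k in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # Since w sums to 1 and A w = lambda_max * w, the entries of A w sum to lambda_max.
    lam = sum(sum(a[i][k] * w[k] for k in range(n)) for i in range(n))
    return w, lam


def consistency_ratio(lam: float, n: int) -> float:
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1); 0 for n < 3."""
    if n < 3:
        return 0.0
    return ((lam - n) / (n - 1)) / RANDOM_INDEX[n]


if __name__ == "__main__":
    # Criterion 0 is twice as important as 1 and four times as important as 2.
    consistent = build_matrix({(1, 0): 0.5, (2, 0): 0.25, (2, 1): 0.5}, 3)
    w, lam = principal_eigen(consistent)
    print(w, consistency_ratio(lam, 3))
```

A perfectly transitive matrix gives $\lambda_{\max} = n$ and therefore $\mathrm{CR} = 0$; flipping one judgment (for example, making criterion 2 beat criterion 1) pushes CR above the conventional 0.1 threshold and would trigger repair.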
4. Pseudocode and Workflow
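The DualJudge workflow follows the stages described above: direct scoring, pairwise elicitation with meta-confidence, matrix construction and consistency repair, crisp or fuzzy weight derivation, and consistency-aware fusion.
Both crisp and fuzzy AHP variants share the same repaired matrix structure, differing only in the propagation of epistemic uncertainty via the TFN’s width, modulated by the meta-confidence $c_{ij}$. High-confidence comparisons contract toward the mode, while low confidence preserves a wider interval, reflecting higher uncertainty.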
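The confidence-modulated fuzzy variant can be illustrated with the sketch below. Several choices here are assumptions rather than details from the source: the base TFN is taken as the Saaty score plus or minus one step (clipped to the 1 to 9 scale), the contraction is linear in `1 - conf`, and weights follow the usual geometric-mean fuzzy AHP recipe with centroid defuzzification. All function names are hypothetical.

```python
import math


def contracted_tfn(score: float, conf: float, spread: float = 1.0) -> tuple:
    """TFN (l, m, u) around a Saaty score, narrowed as confidence rises.

    Assumption: the bounds move toward the mode by a factor (1 - conf),
    so conf = 1 collapses the TFN to a crisp value.
    """
    if score < 1.0:
        # Handle reciprocal judgments by reciprocating the TFN of 1/score.
        l, m, u = contracted_tfn(1.0 / score, conf, spread)
        return (1.0 / u, 1.0 / m, 1.0 / l)
    lower = max(1.0, score - spread)
    upper = min(9.0, score + spread)
    shrink = 1.0 - conf
    return (score - shrink * (score - lower), score, score + shrink * (upper - score))


def fuzzy_matrix(judgments: dict, n: int) -> list:
    """Build a reciprocal TFN matrix from {(i, j): (score, conf)}."""
    one = (1.0, 1.0, 1.0)
    a = [[one] * n for _ in range(n)]
    for (i, j), (score, conf) in judgments.items():
        l, m, u = contracted_tfn(score, conf)
        a[i][j] = (l, m, u)
        a[j][i] = (1.0 / u, 1.0 / m, 1.0 / l)
    return a


def fuzzy_weights(a: list) -> list:
    """Fuzzy geometric means, normalized TFN weights, centroid defuzzification."""
    n = len(a)
    geo = [tuple(math.prod(a[i][j][k] for j in range(n)) ** (1.0 / n)
                 for k in range(3)) for i in range(n)]
    sum_l, sum_m, sum_u = (sum(g[k] for g in geo) for k in range(3))
    # Lower bounds divide by the largest total, upper bounds by the smallest.
    fuzzy = [(g[0] / sum_u, g[1] / sum_m, g[2] / sum_l) for g in geo]
    crisp = [(l + m + u) / 3.0 for l, m, u in fuzzy]
    total = sum(crisp)
    return [c / total for c in crisp]


if __name__ == "__main__":
    judgments = {(0, 1): (2.0, 0.9), (0, 2): (4.0, 0.4), (1, 2): (2.0, 0.7)}
    print(fuzzy_weights(fuzzy_matrix(judgments, 3)))
```

With every confidence set to 1, each TFN degenerates to its mode and the result coincides with crisp geometric-mean AHP weights, which matches the statement that the two variants differ only in how uncertainty is propagated.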
5. Experimental Results and Performance Analysis
Extensive evaluation on the JudgeBench dataset confirms the empirical benefits of DualJudge for a range of LLM evaluators and task granularities. Summary metrics indicate the following trends:
| Model | Scale | Count | Direct | Crisp AHP | Fuzzy AHP | D+C (DualJudge) | D+F (DualJudge) |
|---|---|---|---|---|---|---|---|
| gpt-oss-20b | 1–10 | 620 | 69.83% | 74.52% | 75.35% | 76.95% (+7.1) | 77.60% (+7.8) |
| gpt-oss-120b | 1–10 | 620 | 75.81% | 80.49% | 80.97% | 82.10% (+6.3) | 82.10% (+6.3) |
| qwen3.5-9b | 1–10 | 620 | 82.70% | 81.44% | 83.55% | 84.07% | 84.03% |
| qwen3.5-35b | 1–10 | 620 | 87.38% | 83.65% | 85.47% | 87.19% | 87.19% |
“+X” in parentheses denotes improvement over the direct scoring baseline. Fuzzy AHP generally outperforms its crisp counterpart, particularly for mid-tier judges. DualJudge (in both D+C and D+F variants) produces the highest overall accuracy in 7 out of 8 configurations.
- Weaker judges (e.g., gpt-oss-20b) realize the greatest increases from structured and hybrid scoring, with gains up to +7.77 percentage points relative to direct scoring.
- Mid-tier models benefit specifically from Fuzzy AHP’s explicit uncertainty modeling (up to +1.0 point over Crisp AHP).
- The strongest judges (e.g., qwen3.5-35b) show near-parity between baselines and hybrid methods, with gains narrowing to +0.2 points.
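The gains quoted above can be recomputed from the four table rows reproduced in this section. The short script below is only a convenience check on those rows; the dictionary layout and function names are illustrative, and the numbers are copied directly from the table.

```python
# Accuracy (%) on the 1-10 scale, copied from the table rows shown above.
ACCURACY = {
    "gpt-oss-20b": {"direct": 69.83, "crisp": 74.52, "fuzzy": 75.35, "d+c": 76.95, "d+f": 77.60},
    "gpt-oss-120b": {"direct": 75.81, "crisp": 80.49, "fuzzy": 80.97, "d+c": 82.10, "d+f": 82.10},
    "qwen3.5-9b": {"direct": 82.70, "crisp": 81.44, "fuzzy": 83.55, "d+c": 84.07, "d+f": 84.03},
    "qwen3.5-35b": {"direct": 87.38, "crisp": 83.65, "fuzzy": 85.47, "d+c": 87.19, "d+f": 87.19},
}


def gain_over_direct(model: str, method: str) -> float:
    """Percentage-point difference between a method and direct scoring."""
    row = ACCURACY[model]
    return row[method] - row["direct"]


def best_methods(model: str) -> list:
    """All methods tied for the highest accuracy for a given judge model."""
    row = ACCURACY[model]
    top = max(row.values())
    return sorted(name for name, value in row.items() if value == top)


if __name__ == "__main__":
    for model in ACCURACY:
        gains = {m: round(gain_over_direct(model, m), 2) for m in ("d+c", "d+f")}
        print(model, gains, best_methods(model))
```

For these four rows, a hybrid variant is the top method for every judge except qwen3.5-35b, where direct scoring is marginally ahead.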
Appendix breakdowns indicate domains requiring intensive reasoning (mathematics, code) gain most from AHP decomposition, while more factoid-oriented tasks exhibit smaller differentials.
6. Complementary Strengths and Ablation Insights
Ablation studies demonstrate that evaluation mode selection should be informed by the underlying LLM evaluator’s calibration:
- Mid-range models favor DualJudge’s Fuzzy variant (D+F), gaining from explicit uncertainty integration.
- Strong models attain equivalent accuracy under DualJudge’s Crisp hybrid (D+C), as their direct scores are already well-calibrated.
- Structured and hybrid scoring yield maximal improvements on logic-demanding task classes, supporting the core intuition of dual-process complementarity.
A plausible implication is that task adaptivity—modulating fusion strategy based on model capacity and domain demand—could further optimize calibration and human-alignment in future assessment frameworks.
7. Implications and Outlook
DualJudge establishes a theoretically grounded, lightweight, and empirically validated blueprint for LLM evaluation that systematically harnesses both intuitive and deliberative adjudication paradigms. The introduction of LLM-generated meta-confidence into Fuzzy AHP enables principled management of epistemic uncertainty at the comparison level.
Findings from JudgeBench experiments strongly support the claim that uncertainty-aware, consistency-modulated hybrid fusion achieves superior agreement with human pairwise preferences, especially where the evaluator LLM is less calibrated or the target domain is arduous. These results suggest DualJudge-type fusion architectures may be instrumental in forthcoming LLM evaluation criteria, particularly for emergent open-ended or complex reasoning tasks (He et al., 4 Apr 2026).