DualJudge Hybrid Fusion for LLM Evaluation

Updated 9 May 2026
  • The paper demonstrates that fusing direct holistic scoring with structured AHP significantly improves LLM evaluation reliability and human alignment on benchmarks.
  • DualJudge Hybrid Fusion employs a dynamic, consistency-aware weighting scheme that balances efficient scoring with detailed criterion analysis for robust performance.
  • Empirical results reveal that integrating uncertainty-aware fuzzy AHP yields higher accuracy gains, particularly benefiting mid-tier and weaker LLM judges.

DualJudge Hybrid Fusion is a hybrid evaluation framework that integrates direct holistic scoring and structured Analytic Hierarchy Process (AHP) judgments to improve the reliability and interpretability of LLM evaluation. Inspired by dual-process cognitive theory, DualJudge adaptively fuses “System 1” (intuitive, direct scoring) and “System 2” (deliberative, criterion-structured) outputs through a dynamic, consistency-aware weighting scheme. This hybrid is motivated by the complementary error profiles of the two evaluation modes: direct scoring offers efficiency but lacks dimensional transparency, while structured decomposition affords rigorous criteria-level calibration at higher computational cost. DualJudge achieves state-of-the-art human alignment on the JudgeBench benchmark, especially for mid-to-weaker LLM judges, establishing the utility of uncertainty-aware structured reasoning in LLM assessment (He et al., 4 Apr 2026).

1. Motivation for Hybrid Fusion

Conventional LLM evaluation protocols generally adopt one of two paradigms. Direct scoring (System 1) prompts the LLM to issue a single holistic score or preference in response to candidate answers. This mechanism is highly efficient, with minimal prompting overhead, but it has several critical limitations: it conflates multiple evaluation dimensions in a single judgment, it is highly sensitive to prompt format and scoring scale, and its judgments are often inconsistent on logic-intensive or multi-aspect tasks.

In contrast, the Analytic Hierarchy Process (System 2) decomposes the evaluation process into $K$ explicit criteria, during which the LLM performs pairwise comparisons across all criteria and aggregates the results according to the canonical AHP framework. This affords explicit, interpretable criterion-level weights, enables robust consistency checking (via the Consistency Ratio, CR), and yields superior performance for domains with substantial reasoning demands. The trade-off is an increased annotation burden, with $K(K-1)/2$ comparisons per round, and potential instability from noisy or inconsistent pairwise judgments.
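To make the comparison burden concrete, the short sketch below enumerates the unordered criterion pairs a judge would have to compare in one round. The criterion names are hypothetical and chosen only for illustration; the source does not list the criteria it uses.

```python
from itertools import combinations


def criterion_pairs(criteria):
    """Return every unordered pair of criteria that needs one comparison."""
    return list(combinations(criteria, 2))


# Hypothetical criteria (an assumption, not taken from the source).
criteria = ["correctness", "completeness", "clarity", "reasoning"]
pairs = criterion_pairs(criteria)

K = len(criteria)
# The pair count matches the closed form K(K-1)/2.
assert len(pairs) == K * (K - 1) // 2
print(len(pairs))  # 6
```

With four criteria the judge issues six comparisons; the count grows quadratically in the number of criteria, which is the cost the paragraph above refers to.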

Empirical results reveal marked complementarity between these two regimes: weaker models benefit substantially from the structured scaffold of AHP, while stronger LLMs inherently internalize many necessary trade-offs across criteria. The differential error profiles of direct and structured judgments motivate their hybrid fusion in a unified, reliability-weighted aggregation.

2. DualJudge Architecture

DualJudge comprises two branches executed in parallel:

  • Absolute (Direct) Scoring Branch: The LLM is prompted for a single holistic score $S_{\mathrm{abs}} \in [0,1]$ or, in some scenarios, a direct pairwise pick. Scores are normalized to $[0,1]$ when required.
  • Structured AHP Branch: A predefined set of $K$ criteria $C = \{c_1, ..., c_K\}$ is designated for the evaluation category. For each criterion pair $(c_i, c_j)$, the LLM outputs a Saaty-scale score $s_{ij} \in \{1, ..., 9\}$ (reflecting relative importance or performance) and a per-comparison meta-confidence $\gamma_{ij} \in [0,1]$. These are assembled into a reciprocal comparison matrix (either crisp or fuzzy), subjected to consistency repair as needed, and processed per AHP (or Fuzzy AHP) to yield an aggregate criterion-weighted score $S_{\mathrm{ahp}} \in [0,1]$.
  • Consistency-Aware Fusion: The matrix’s Consistency Ratio (CR) quantifies the internal regularity of the structured ratings. This yields a reliability score $\alpha$ as a function of CR, with $\alpha \in [0,1]$. The final output is a convex combination:

$$S_{\mathrm{final}} = \alpha\, S_{\mathrm{ahp}} + (1 - \alpha)\, S_{\mathrm{abs}}$$

Fusion thus privileges the more consistent branch on an adaptive, per-instance basis, balancing directness with structure.
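The consistency-aware fusion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not give a recoverable formula for mapping CR to a reliability score, so the linear clamp in `reliability_from_cr`, its `cr_max` parameter, and the choice to place the reliability weight on the AHP branch are all hypothetical.

```python
def reliability_from_cr(cr: float, cr_max: float = 0.2) -> float:
    """Map a Consistency Ratio to a weight in [0, 1].

    ASSUMPTION: a linear clamp that gives full weight at CR = 0 and
    zero weight at CR >= cr_max. The source's actual mapping is unknown.
    """
    if cr_max <= 0:
        raise ValueError("cr_max must be positive")
    return min(1.0, max(0.0, 1.0 - cr / cr_max))


def fuse(s_abs: float, s_ahp: float, alpha: float) -> float:
    """Convex combination of the direct and structured scores.

    ASSUMPTION: alpha weights the AHP branch, (1 - alpha) the direct branch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s_ahp + (1.0 - alpha) * s_abs


# A fairly consistent matrix (CR = 0.05) leans on the structured score.
alpha = reliability_from_cr(0.05)
print(round(fuse(s_abs=0.60, s_ahp=0.80, alpha=alpha), 3))
```

Because the combination is convex, the fused score always lies between the two branch scores, so a badly inconsistent matrix can at worst fall back to the direct score.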

3. Mathematical Formulation

The AHP comparison matrix $A = [a_{ij}] \in \mathbb{R}^{K \times K}$ collects the lower-triangular pairwise judgments, with the remaining entries fixed by reciprocity: $a_{ii} = 1$ and $a_{ji} = 1/a_{ij}$.

  • Crisp AHP Weights: The principal eigenvector $\mathbf{w}$ of $A$, normalized to sum to unity, is computed. Consistency is measured as

$$\mathrm{CR} = \frac{\mathrm{CI}}{\mathrm{RI}}, \qquad \mathrm{CI} = \frac{\lambda_{\max} - K}{K - 1},$$

where $\lambda_{\max}$ is the principal eigenvalue of $A$ and $\mathrm{RI}$ is the standard random-consistency index. If $\mathrm{CR}$ exceeds the acceptance threshold (conventionally $0.1$ in AHP), local repair is applied by enforcing consistency on the offending entries; if irreparable, comparisons are re-generated.
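The crisp AHP step can be illustrated with a small dependency-free sketch. It assumes the standard AHP definitions (CI as the normalized excess of the principal eigenvalue over the matrix size, CR as CI divided by Saaty's commonly tabulated random index) and uses power iteration as a stand-in eigen-solver; the source does not say how the eigenvector is computed, and the function names here are hypothetical.

```python
# Saaty's commonly tabulated random-consistency index, keyed by matrix size.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def principal_eigen(matrix, iters=500, tol=1e-12):
    """Power iteration: returns (weights summing to 1, lambda_max estimate)."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iters):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)  # w sums to 1, so sum(A w) estimates lambda_max
        new_w = [x / lam for x in v]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w, lam


def crisp_ahp(matrix):
    """Return (criterion weights, Consistency Ratio) for a reciprocal matrix."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("random index is tabulated here only for n <= 10")
    w, lam = principal_eigen(matrix)
    if n <= 2:
        return w, 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    ci = (lam - n) / (n - 1)
    return w, ci / RANDOM_INDEX[n]


# A cyclic preference (c1 > c2 > c3 > c1) is strongly inconsistent.
cyclic = [[1, 3, 1 / 3], [1 / 3, 1, 3], [3, 1 / 3, 1]]
weights, cr = crisp_ahp(cyclic)
print(cr > 0.1)  # True
```

A matrix built from exact weight ratios yields a CR of zero, whereas the cyclic example lands far above the conventional 0.1 threshold and would trigger repair or regeneration.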

  • Fuzzy AHP with Confidence: Each discrete comparison is mapped to a triangular fuzzy number (TFN) $(l_{ij}, m_{ij}, u_{ij})$. The interval is contracted toward the mode $m_{ij}$ in proportion to the LLM’s meta-confidence $\gamma_{ij}$. This produces $\tilde{a}_{ij} = (l'_{ij}, m_{ij}, u'_{ij})$; reciprocity holds via $\tilde{a}_{ji} = (1/u'_{ij}, 1/m_{ij}, 1/l'_{ij})$. Fuzzy geometric means $\tilde{r}_i$ and normalized TFN criterion weights $\tilde{w}_i$ are calculated, then defuzzified through centroids.
  • Fusion Weighting: The final output is regulated by $\alpha$ ($\alpha \in [0,1]$), weighting the relative contributions of $S_{\mathrm{abs}}$ and $S_{\mathrm{ahp}}$.
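The fuzzy branch can be sketched as follows. Several details are assumptions rather than the source's specification: the base TFN for a Saaty score `s` is taken as `(s-1, s, s+1)` clipped to `[1, 9]`, the contraction shrinks each side of the interval by a factor `(1 - gamma)`, weights follow the usual geometric-mean construction for fuzzy AHP, and the `i_dominates` flag is a hypothetical way to encode the direction of a preference.

```python
from math import prod


def saaty_tfn(score, gamma):
    """TFN (l, m, u) for a Saaty score, contracted toward the mode by gamma.

    ASSUMPTION: base spread of one scale step, shrunk by (1 - gamma).
    """
    if not 1 <= score <= 9:
        raise ValueError("Saaty scores lie in 1..9")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    m = float(score)
    l, u = max(m - 1.0, 1.0), min(m + 1.0, 9.0)
    shrink = 1.0 - gamma
    return (m - shrink * (m - l), m, m + shrink * (u - m))


def reciprocal(tfn):
    """Reciprocal of a TFN: bounds swap so that l <= m <= u still holds."""
    l, m, u = tfn
    return (1.0 / u, 1.0 / m, 1.0 / l)


def fuzzy_ahp_weights(k, judgments):
    """judgments maps (i, j), i < j, to (score, gamma, i_dominates).

    Unjudged pairs default to the neutral TFN (1, 1, 1).
    """
    one = (1.0, 1.0, 1.0)
    matrix = [[one] * k for _ in range(k)]
    for (i, j), (score, gamma, i_dominates) in judgments.items():
        tfn = saaty_tfn(score, gamma)
        if not i_dominates:
            tfn = reciprocal(tfn)
        matrix[i][j] = tfn
        matrix[j][i] = reciprocal(tfn)
    # Component-wise fuzzy geometric mean of each row.
    geo = [tuple(prod(cell[c] for cell in row) ** (1.0 / k) for c in range(3))
           for row in matrix]
    sum_l, sum_m, sum_u = (sum(g[c] for g in geo) for c in range(3))
    # Normalized TFN weights: lower bound divides by the largest total.
    fuzzy_w = [(g[0] / sum_u, g[1] / sum_m, g[2] / sum_l) for g in geo]
    # Centroid defuzzification, then renormalize to sum to one.
    crisp = [sum(w) / 3.0 for w in fuzzy_w]
    total = sum(crisp)
    return [c / total for c in crisp]


judgments = {(0, 1): (3, 0.8, True), (0, 2): (5, 0.6, True), (1, 2): (2, 0.9, True)}
print([round(w, 3) for w in fuzzy_ahp_weights(3, judgments)])
```

With `gamma = 1` the TFN collapses to a crisp value, so under these assumptions the fuzzy branch reduces to a geometric-mean variant of crisp AHP when the judge is fully confident.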

4. Pseudocode and Workflow

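The DualJudge workflow follows the steps laid out in Sections 2 and 3: obtain the direct score $S_{\mathrm{abs}}$; collect Saaty-scale pairwise judgments and meta-confidences for the $K$ criteria; assemble the reciprocal comparison matrix; compute CR and apply local repair or regenerate comparisons if the matrix is too inconsistent; derive $S_{\mathrm{ahp}}$ via crisp or fuzzy AHP; and fuse the two scores with the consistency-derived weight.

Both crisp and fuzzy AHP variants share the same repaired matrix structure, differing only in the propagation of epistemic uncertainty via the TFN’s width, modulated by $\gamma_{ij}$. High-confidence comparisons contract toward the mode, while low confidence preserves a wider interval, reflecting higher uncertainty.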
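As a rough illustration of the control flow only, the sketch below wires the two branches together with a regenerate-on-inconsistency loop. It is not the source's pseudocode: the callables `direct_score`, `structured_score`, and `reliability`, the `max_regenerations` cap, and the default threshold of 0.1 are all hypothetical, and local matrix repair is assumed to happen inside `structured_score`.

```python
from typing import Callable, Tuple


def dual_judge(
    direct_score: Callable[[], float],
    structured_score: Callable[[], Tuple[float, float]],
    reliability: Callable[[float], float],
    cr_threshold: float = 0.1,
    max_regenerations: int = 2,
) -> float:
    """Fuse a direct score with a structured (AHP) score.

    direct_score() returns S_abs in [0, 1].
    structured_score() returns (S_ahp, CR) and is assumed to attempt
    local repair internally before reporting CR.
    reliability(CR) returns the fusion weight in [0, 1] (ASSUMED form).
    """
    s_abs = direct_score()
    s_ahp, cr = structured_score()

    # Regenerate the comparisons while the matrix stays too inconsistent.
    attempts = 0
    while cr > cr_threshold and attempts < max_regenerations:
        s_ahp, cr = structured_score()
        attempts += 1

    alpha = reliability(cr)
    return alpha * s_ahp + (1.0 - alpha) * s_abs


# Stubbed judges: the first structured attempt is inconsistent, the second is not.
attempts_iter = iter([(0.90, 0.30), (0.80, 0.05)])
fused = dual_judge(
    direct_score=lambda: 0.60,
    structured_score=lambda: next(attempts_iter),
    reliability=lambda cr: min(1.0, max(0.0, 1.0 - cr / 0.2)),
)
print(round(fused, 3))
```

Passing the branches in as callables keeps the orchestration independent of any particular LLM client, which also makes the loop easy to test with stubs as shown.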

5. Experimental Results and Performance Analysis

Extensive evaluation on the JudgeBench dataset confirms the empirical benefits of DualJudge for a range of LLM evaluators and task granularities. Summary metrics indicate the following trends:

| Model | Scale | Count | Direct | Crisp AHP | Fuzzy AHP | D+C (DualJudge) | D+F (DualJudge) |
|---|---|---|---|---|---|---|---|
| gpt-oss-20b | 1–10 | 620 | 69.83% | 74.52% | 75.35% | 76.95% (+7.1) | 77.60% (+7.8) |
| gpt-oss-120b | 1–10 | 620 | 75.81% | 80.49% | 80.97% | 82.10% (+6.3) | 82.10% (+6.3) |
| qwen3.5-9b | 1–10 | 620 | 82.70% | 81.44% | 83.55% | 84.07% | 84.03% |
| qwen3.5-35b | 1–10 | 620 | 87.38% | 83.65% | 85.47% | 87.19% | 87.19% |

“+X” in parentheses denotes improvement over the direct scoring baseline. Fuzzy AHP generally outperforms its crisp counterpart, particularly for mid-tier judges. DualJudge (in both D+C and D+F variants) produces the highest overall accuracy in 7 out of 8 configurations.

  • Weaker judges (e.g., gpt-oss-20b) realize the greatest increases from structured and hybrid scoring, with gains up to +7.77 percentage points relative to direct scoring.
  • Mid-tier models benefit specifically from Fuzzy AHP’s explicit uncertainty modeling (up to +1.0 point over Crisp AHP).
  • The strongest judges (e.g., qwen3.5-35b) show near-parity between baselines and hybrid methods: in the table above, both hybrid variants sit about 0.2 points below direct scoring (87.19% vs. 87.38%).
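The percentage-point differences quoted above can be rechecked directly from the table. The snippet below is a small arithmetic check using only the accuracies reported in the table; the dictionary layout and key names are illustrative.

```python
# Accuracies (%) copied from the JudgeBench summary table above.
results = {
    "gpt-oss-20b": {"direct": 69.83, "d_plus_c": 76.95, "d_plus_f": 77.60},
    "gpt-oss-120b": {"direct": 75.81, "d_plus_c": 82.10, "d_plus_f": 82.10},
    "qwen3.5-9b": {"direct": 82.70, "d_plus_c": 84.07, "d_plus_f": 84.03},
    "qwen3.5-35b": {"direct": 87.38, "d_plus_c": 87.19, "d_plus_f": 87.19},
}


def gain(model: str, variant: str) -> float:
    """Percentage-point difference of a hybrid variant over direct scoring."""
    row = results[model]
    return round(row[variant] - row["direct"], 2)


for model in results:
    print(model, gain(model, "d_plus_c"), gain(model, "d_plus_f"))
```

The differences shrink as the direct baseline gets stronger, and for qwen3.5-35b the sign turns slightly negative, which matches the near-parity reading for the strongest judge.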

Appendix breakdowns indicate domains requiring intensive reasoning (mathematics, code) gain most from AHP decomposition, while more factoid-oriented tasks exhibit smaller differentials.

6. Complementary Strengths and Ablation Insights

Ablation studies demonstrate that evaluation mode selection should be informed by the underlying LLM evaluator’s calibration:

  • Mid-range models favor DualJudge’s Fuzzy variant (D+F), gaining from explicit uncertainty integration.
  • Strong models attain equivalent accuracy under DualJudge’s Crisp hybrid (D+C), as their direct scores are already well-calibrated.
  • Structured and hybrid scoring yield maximal improvements on logic-demanding task classes, supporting the core intuition of dual-process complementarity.

A plausible implication is that task adaptivity (modulating fusion strategy based on model capacity and domain demand) could further optimize calibration and human alignment in future assessment frameworks.

7. Implications and Outlook

DualJudge establishes a theoretically grounded, lightweight, and empirically validated blueprint for LLM evaluation that systematically harnesses both intuitive and deliberative adjudication paradigms. The introduction of LLM-generated meta-confidence into Fuzzy AHP enables principled management of epistemic uncertainty at the comparison level.

Findings from JudgeBench experiments strongly support the claim that uncertainty-aware, consistency-modulated hybrid fusion achieves superior agreement with human pairwise preferences, especially where the evaluator LLM is less calibrated or the target domain is arduous. These results suggest DualJudge-type fusion architectures may be instrumental in forthcoming LLM evaluation criteria, particularly for emergent open-ended or complex reasoning tasks (He et al., 4 Apr 2026).

References (1)
