
The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge (2509.26072v1)

Published 30 Sep 2025 in cs.CL

Abstract: LLMs are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.

Summary

  • The paper demonstrates that LLM judges rely on superficial cues, showing recency and provenance biases in evaluations.
  • Experiments using ELI5 and LitBench reveal significant verdict shifts, with GPT-4o showing higher cue sensitivity than Gemini-2.5-Flash.
  • Findings stress the need to mitigate shortcut biases to improve AI judge reliability and ensure more faithful content assessments.

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Introduction

LLMs are increasingly deployed as judges across applications, driven by their scalability and their correlation with human preferences. They serve as evaluators of system outputs in domains such as summarization, dialogue, and creative writing. Despite these advances, it is critical to understand the vulnerabilities of LLM judges, particularly their propensity to rely on superficial shortcuts that introduce systematic biases.

Evaluation Dataset and Model Setup

This investigation uses two datasets, ELI5 (long-form question answering) and LitBench (creative writing), to construct pairwise judgment tasks. The evaluator models, GPT-4o and Gemini-2.5-Flash, were given tasks in which superficial cues, namely provenance (Human, Expert, LLM, or Unknown) and recency (Old, 1950 vs. New, 2025), were attached to the responses while the rest of the prompt was held fixed. These cues were designed to test whether the models judge response quality alone or are swayed by the labels.
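
As a rough illustration of this setup, the sketch below shows how such cues might be attached to an otherwise fixed judging prompt. The function name, prompt wording, and placeholder responses are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of the cue-injection setup; all names are illustrative.
PROVENANCE_CUES = ["Expert", "Human", "LLM", "Unknown"]
RECENCY_CUES = ["New (2025)", "Old (1950)"]

def build_prompt(question: str, resp_a: str, resp_b: str,
                 cue_a: str, cue_b: str) -> str:
    """Attach superficial cues to each response while keeping the
    rest of the judging prompt fixed."""
    return (
        f"Question: {question}\n\n"
        f"Response A [{cue_a}]:\n{resp_a}\n\n"
        f"Response B [{cue_b}]:\n{resp_b}\n\n"
        "Which response is better, A or B? Explain your reasoning."
    )

# Probe recency bias with both cue assignments on the same pair,
# so any verdict shift can only come from the swapped labels.
prompt_new_old = build_prompt("Why is the sky blue?", "<response 1>",
                              "<response 2>", RECENCY_CUES[0], RECENCY_CUES[1])
prompt_old_new = build_prompt("Why is the sky blue?", "<response 1>",
                              "<response 2>", RECENCY_CUES[1], RECENCY_CUES[0])
```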

Experimental Findings

The paper reveals a consistent pattern of bias when superficial cues are introduced: a notable inclination toward responses labeled "New" over "Old" (recency bias), and a preference hierarchy based on source labels, with a clear ordering of Expert > Human > LLM > Unknown. These biases are markedly more pronounced in the more subjective LitBench domain than in the more factual ELI5 (see Figure 1).

Figure 1: Verdict Shift Rate (VSR) for recency cues. VSR is computed as the difference in selection rates between the New–Old and Old–New cue assignments. Positive values indicate a preference for responses labeled as New (2025) over those labeled as Old (1950).
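
Under one plausible reading of this definition (an assumption, since the paper's exact formula is not reproduced here), VSR can be computed as the change in a fixed response's selection rate when its cue is swapped:

```python
def verdict_shift_rate(picks_new_old, picks_old_new):
    """Sketch of one interpretation of VSR, not the authors' code.

    picks_new_old: judge picks ('A' or 'B') when response A is labeled
    New and B is labeled Old; picks_old_new: picks for the same pairs
    with the labels swapped. Returns the shift in A's selection rate
    caused purely by swapping the cues."""
    rate_a_as_new = sum(p == "A" for p in picks_new_old) / len(picks_new_old)
    rate_a_as_old = sum(p == "A" for p in picks_old_new) / len(picks_old_new)
    # Positive values: the same response wins more often when labeled New (2025).
    return rate_a_as_new - rate_a_as_old

# Toy example: A wins 3/4 times when labeled New, 1/4 when labeled Old.
print(verdict_shift_rate(["A", "A", "B", "A"], ["B", "B", "A", "B"]))  # 0.5
```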

Bias Insights and Implications

The observed biases indicate a reliance on temporal and authority cues as shortcuts in decision making. GPT-4o showed higher cue sensitivity than Gemini-2.5-Flash, most noticeably in provenance-based evaluations: GPT-4o responded strongly to authoritative presentation (the Expert label), while Gemini-2.5-Flash showed only a milder preference ordering among the provenance cues.

Evaluation Faithfulness

Surprisingly, despite significant verdict shifts driven by superficial cues, the rationales the models provided almost never acknowledged those cues. Instead, the models justified their decisions in terms of supposed content qualities, marking a clear lack of faithfulness in their explanations.
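
A minimal sketch of how such acknowledgment could be measured, assuming a simple lexical scan over the justification text (the paper's actual detection method may differ):

```python
# Illustrative sketch, not the authors' code: flag justifications that
# explicitly mention any injected cue term. A keyword scan is a rough
# proxy and will miss paraphrased acknowledgments.
CUE_TERMS = ["expert", "human-written", "llm", "unknown source",
             "1950", "2025", "older", "newer", "more recent"]

def acknowledges_cue(justification: str) -> bool:
    """Return True if the justification mentions an injected cue term."""
    text = justification.lower()
    return any(term in text for term in CUE_TERMS)

justifications = [
    "Response A is clearer, better organized, and more complete.",
    "Response B, being from 2025, reflects more current knowledge.",
]
ack_rate = sum(map(acknowledges_cue, justifications)) / len(justifications)
print(f"Cue acknowledgment rate: {ack_rate:.0%}")  # -> 50%
```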

Conclusion

The paper underscores the critical need to address shortcut-driven biases in LLM judges. Current LLM evaluators exhibit a marked vulnerability to artificial cues, undermining their reliability. Addressing these biases and improving cue acknowledgment should be prioritized to strengthen the credibility of LLM judgments in practical applications. Continued work is needed on techniques that ensure AI judges attend strictly to content quality, without undue influence from non-intrinsic factors. The investigation offers pivotal insights into the realistic deployment of LLMs as unbiased evaluators across NLP tasks.
