Coherence JSS Metric Analysis

Updated 9 May 2026

Coherence JSS is a quantitative metric that measures the stability of LLM judges by comparing consistency in their Likert-scale ratings across semantically equivalent prompts.
It uses paired paraphrase prompts to evaluate how consistently different models rate text summaries, revealing significant variability in prompt sensitivity.
Results indicate that models with rigorous instruction tuning, rather than sheer scale, tend to achieve higher JSS, emphasizing the role of prompt design in evaluation reliability.

The Coherence Judge Sensitivity Score (Coherence JSS) is a quantitative metric for assessing the stability of LLM judges when their evaluation prompts are reworded in semantically equivalent forms. Coherence JSS, defined within the JudgeSense framework, measures the fraction of paraphrase pairs for which an LLM judge produces identical judgments on the coherence of a text summary. The score varies widely across state-of-the-art models, revealing substantial differences in prompt sensitivity that model scale alone does not predict. Coherence JSS therefore serves as a reliability measure for LLMs used as automated evaluation agents in natural language processing tasks such as summary assessment, where Likert-scale outputs and prompt rewording are common practice (Bellibatlu, 26 Apr 2026).

1. Formal Definition of Coherence JSS

In the JudgeSense benchmark, the Judge Sensitivity Score (JSS) for a judge model $j$, evaluation task $t$, and a validated set $P$ of paraphrase pairs $\{(p_i, p_i')\}_{i=1}^{|P|}$ is

$$\mathrm{JSS}(j,t) = \frac{1}{|P|} \sum_{i=1}^{|P|} \delta\bigl(j(p_i),\,j(p_i')\bigr),$$

where $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise. By design, $0 \leq \mathrm{JSS} \leq 1$, with a value of $1$ indicating perfect verdict stability under paraphrase, and lower values reflecting greater instability, i.e., a higher fraction of flips due to rewording. For the coherence task, JSS captures how consistently an LLM rates the coherence of a summary on a 1–5 Likert scale across semantically equivalent prompt templates.
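The definition translates directly into a few lines of Python. The following is a minimal sketch, not the JudgeSense implementation: the function name `judge_sensitivity_score` and the example verdicts are hypothetical, and the flip rate is taken to be the complement of JSS, as the results table suggests.

```python
from typing import Hashable, Sequence, Tuple


def judge_sensitivity_score(
    verdict_pairs: Sequence[Tuple[Hashable, Hashable]],
) -> float:
    """Fraction of paraphrase pairs whose two verdicts are identical."""
    if not verdict_pairs:
        raise ValueError("need at least one paraphrase pair")
    # delta(a, b) is 1 when the two verdicts match exactly, else 0.
    matches = sum(1 for a, b in verdict_pairs if a == b)
    return matches / len(verdict_pairs)


# Hypothetical Likert verdicts (1-5) for four paraphrase pairs.
example_pairs = [(4, 4), (3, 4), (5, 5), (2, 2)]
jss = judge_sensitivity_score(example_pairs)
flip_rate = 1.0 - jss  # share of pairs whose verdict changed under rewording
print(jss, flip_rate)  # 0.75 0.25
```

Only exact equality counts as agreement, so a pair such as (3, 4) is a flip even though the two ratings are adjacent.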

2. Coherence as a Judgment Task

Coherence evaluation in JudgeSense utilizes 125 summaries from the SummEval benchmark. Each summary is rated via five minimalist instruction templates, forming 125 paraphrase pairs, all of which are independently validated for semantic equivalence using a GPT-4o-mini classifier. The five exact Likert-scale coherence templates are:

“Rate the coherence of this summary from 1 to 5.”
“On a scale from 1 to 5, how coherent is this summary?”
“Score this summary’s coherence (1 = incoherent, 5 = highly coherent). Reply with the digit only.”
“How well does this summary hang together? Rate 1–5.”
“Coherence rating for this summary, 1 to 5. One number only.”

No chain-of-thought reasoning, role priming, or explicit schema is used—only direct instruction and a short system prompt (“You are an evaluation assistant. Give only the requested answer with no explanation”). This methodology isolates the effect of rewording on LLM judge decision consistency (Bellibatlu, 26 Apr 2026).

3. Experimental Protocol

Nine LLM judge models are evaluated using greedy decoding (temperature = 0.0/0.01, max_tokens = 20, top_p default). System prompts are provided via the standard system role where possible, otherwise prepended to the user message. For particular platforms (e.g., Gemini), chain-of-thought suppression is enforced by setting “thinking_budget” to zero. The nine judges are: OpenAI GPT-4o, GPT-4o-mini, Anthropic Claude-sonnet-4-5, Claude-haiku-4-5, Google Gemini-2.5-flash, Meta Llama-3.1-70B-Instruct, Mistral-7B, Qwen-2.5-72B-Instruct, and DeepSeek-R1. Each paraphrase is issued in three replicate runs per template per model, yielding an effective sample size of $n = 375$ ($125 \times 3$) for most models; Llama-3.1-70B achieves $n = 260$ due to parser rejection (Bellibatlu, 26 Apr 2026).
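Parser rejection is what shrinks the effective sample size for some judges. The source does not describe the parser itself, so the sketch below is an assumption: a hypothetical `parse_likert` accepts a reply only when it contains exactly one standalone digit from 1 to 5, and a pair counts toward the sample only when both replies parse.

```python
import re
from typing import List, Optional, Sequence, Tuple

# Assumption: a rating is a digit 1-5 that is not part of a longer number.
_RATING = re.compile(r"(?<!\d)[1-5](?!\d)")


def parse_likert(reply: str) -> Optional[int]:
    """Return the rating if the reply holds exactly one standalone digit 1-5."""
    found = _RATING.findall(reply)
    if len(found) != 1:
        return None  # ambiguous or missing rating: reject the reply
    return int(found[0])


def valid_pairs(raw_pairs: Sequence[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Keep only the pairs in which both replies parse to a rating."""
    parsed = [(parse_likert(a), parse_likert(b)) for a, b in raw_pairs]
    return [(a, b) for a, b in parsed if a is not None and b is not None]


# Hypothetical raw replies for four paraphrase pairs.
raw = [("4", "4"), ("Rating: 3", "2"), ("5", "five"), ("4/5", "4")]
kept = valid_pairs(raw)
print(kept)       # [(4, 4), (3, 2)]
print(len(kept))  # 2
```

Under this rule a reply such as "4/5" is rejected because it contains two candidate digits; a more permissive parser would raise the effective sample size at the cost of more guesswork.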

4. Coherence JSS Outcomes Across Models

The observed Coherence JSS values vary significantly across models:

Model	JSS	Flip rate	Cohen’s $\kappa$	95% CI	$n$
Claude-sonnet-4-5	0.992	0.008	0.986	[0.981, 1.000]	375
Qwen-2.5-72B-Instruct	0.920	0.080	0.846	[0.892, 0.946]	351
GPT-4o	0.915	0.085	0.828	[0.888, 0.941]	375
GPT-4o-mini	0.784	0.216	0.627	[0.744, 0.824]	375
Claude-haiku-4-5	0.731	0.269	0.583	[0.688, 0.776]	375
DeepSeek-R1	0.653	0.347	0.461	[0.605, 0.707]	375
Llama-3.1-70B-Instruct	0.554	0.446	0.338	[0.488, 0.615]	260
Mistral-7B	0.480	0.520	–0.082	[0.429, 0.536]	375
Gemini-2.5-flash	0.389	0.611	–0.053	[0.338, 0.441]	370
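The 95% CI column gives an interval around each JSS estimate. How those intervals were computed is not stated here; a percentile bootstrap over per-pair match indicators is one common choice and is sketched below purely as an assumption. The function name `bootstrap_ci`, the resample count, and the seed are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    matches: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 match indicators."""
    rng = random.Random(seed)
    n = len(matches)
    # Resample the pairs with replacement and recompute JSS each time.
    stats = sorted(sum(rng.choices(matches, k=n)) / n for _ in range(n_boot))
    lower_idx = int(n_boot * alpha / 2)
    upper_idx = min(n_boot - 1, int(n_boot * (1 - alpha / 2)))
    return stats[lower_idx], stats[upper_idx]


# Illustrative indicators: 294 matches out of 375 pairs gives JSS = 0.784.
indicators = [1] * 294 + [0] * 81
point = sum(indicators) / len(indicators)
ci_low, ci_high = bootstrap_ci(indicators)
print(point, ci_low, ci_high)
```

The interval narrows as the number of valid pairs grows, which is one reason a judge with many rejected replies ends up with a wider interval.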

The mean coherence JSS is approximately $0.713$ with a median of $0.731$. The score range spans $0.603$, from a minimum of $0.389$ (Gemini-2.5-flash) to a maximum of $0.992$ (Claude-sonnet-4-5).
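These summary statistics can be recomputed from the JSS column of the table. The snippet below is a small check using only Python's standard library; the dictionary simply restates the tabulated values.

```python
from statistics import mean, median

# JSS values copied from the results table.
coherence_jss = {
    "Claude-sonnet-4-5": 0.992,
    "Qwen-2.5-72B-Instruct": 0.920,
    "GPT-4o": 0.915,
    "GPT-4o-mini": 0.784,
    "Claude-haiku-4-5": 0.731,
    "DeepSeek-R1": 0.653,
    "Llama-3.1-70B-Instruct": 0.554,
    "Mistral-7B": 0.480,
    "Gemini-2.5-flash": 0.389,
}

values = list(coherence_jss.values())
mean_jss = mean(values)
median_jss = median(values)
spread = max(values) - min(values)
least_stable = min(coherence_jss, key=coherence_jss.get)
most_stable = max(coherence_jss, key=coherence_jss.get)

print(round(mean_jss, 3), median_jss, round(spread, 3))  # 0.713 0.731 0.603
print(least_stable, most_stable)  # Gemini-2.5-flash Claude-sonnet-4-5
```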

Models such as Claude-sonnet-4-5, Qwen-2.5-72B, and GPT-4o achieve JSS $> 0.9$ (flip rates $< 0.1$ and Cohen’s $\kappa$ in the range $0.83$–$0.99$), reflecting high robustness. In contrast, Gemini-2.5-flash and Mistral-7B not only exhibit low JSS ($< 0.5$) but also negative $\kappa$ values, indicating that their decisions under paraphrase are anti-correlated, i.e., worse than random stability (Bellibatlu, 26 Apr 2026).
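Cohen's kappa corrects raw agreement for the agreement expected by chance, which is why it can fall below zero. The sketch below assumes that the two prompts of each paraphrase pair play the role of two raters; the source does not spell out this setup, and the function name and example ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(first: Sequence[Hashable], second: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two aligned lists of verdicts."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: identical to JSS on the same pairs.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Expected agreement if the two sides rated independently.
    freq_first, freq_second = Counter(first), Counter(second)
    p_expected = sum(
        freq_first[label] * freq_second[label] for label in freq_first
    ) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both sides use one label only")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings under prompt A and prompt B for four summaries.
ratings_a = [4, 3, 4, 3]
ratings_b = [3, 4, 3, 3]
kappa = cohens_kappa(ratings_a, ratings_b)
print(kappa)  # -0.5
```

In this toy case the observed agreement is 0.25 while chance agreement is 0.5, so the judge agrees with itself less often than independent rating would.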

5. Factors Affecting Coherence JSS

Coherence is distinct from other tasks in JudgeSense due to its Likert-scale (multi-class) rating, in contrast to the binary outputs used for factuality and pairwise judgments. This granularity amplifies decision drift: a shift between neighboring Likert classes (for example, “4” to “3”) is recorded as a flip, whereas a comparably small shift in a binary judgment often leaves the verdict unchanged. As a result, coherence JSS spans a wider range, while factuality JSS is compressed into a narrow band both before and after template correction, and pairwise tasks cluster either near perfect agreement (“always-A” degeneration) or near chance-level agreement (random flipping).
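The effect of granularity can be seen by comparing exact-match agreement with a tolerance-based variant that forgives one-step shifts. The tolerance variant is not part of JudgeSense; it is a hypothetical diagnostic used here only to show how much of the instability comes from adjacent-class drift. The example ratings are invented.

```python
from typing import Sequence, Tuple


def agreement(pairs: Sequence[Tuple[int, int]], tolerance: int = 0) -> float:
    """Share of pairs whose ratings differ by at most `tolerance` steps."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)


# Hypothetical Likert ratings for five paraphrase pairs.
likert_pairs = [(4, 3), (4, 4), (5, 4), (2, 2), (1, 3)]

exact = agreement(likert_pairs)          # JSS-style exact match
within_one = agreement(likert_pairs, 1)  # adjacent shifts forgiven
print(exact, within_one)  # 0.4 0.8
```

Here two of the three flips are one-step shifts, so most of the measured instability would disappear under a looser criterion, while the (1, 3) pair remains a disagreement either way.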

No simple monotonic relationship exists between model scale and JSS. High-parameter models do not guarantee higher consistency. Instead, coherence JSS appears to track instruction tuning rigor and possibly the diversity of paraphrase-style instruction exemplars observed during model training. Notably, Claude-sonnet-4-5 (tight RLHF regime) sets the stability benchmark for coherence, while Gemini-2.5-flash, despite its flagship status, underperforms dramatically (Bellibatlu, 26 Apr 2026).

Negative $\kappa$ values for some models provide an anti-reliability signal: their outputs are less consistent under paraphrase than random agreement would predict.

6. Recommendations and Best Practices

For high-stakes or scientific settings where prompt variation is unavoidable, a judge model with coherence JSS $> 0.9$ (e.g., Claude-sonnet-4-5, Qwen-2.5-72B, GPT-4o) is recommended. Moderate-consistency cases may tolerate GPT-4o-mini or Claude-haiku-4-5. Models with JSS $< 0.6$ (Llama-3.1-70B, Mistral-7B, Gemini-2.5-flash) should be avoided for coherence unless prompts are tightly controlled or additional post-processing normalizes outputs.
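A simple way to apply this guidance is to bucket judges by their measured coherence JSS. In the sketch below the cut-offs 0.9 and 0.6 are assumptions, chosen only so that the resulting groups match the models named above; the function name and tier labels are hypothetical.

```python
from typing import Dict


def reliability_tier(jss: float, high: float = 0.9, low: float = 0.6) -> str:
    """Bucket a judge by coherence JSS; both thresholds are assumed values."""
    if jss > high:
        return "recommended"
    if jss < low:
        return "avoid"
    return "intermediate"


# JSS values copied from the results table.
coherence_jss: Dict[str, float] = {
    "Claude-sonnet-4-5": 0.992,
    "Qwen-2.5-72B-Instruct": 0.920,
    "GPT-4o": 0.915,
    "GPT-4o-mini": 0.784,
    "Claude-haiku-4-5": 0.731,
    "DeepSeek-R1": 0.653,
    "Llama-3.1-70B-Instruct": 0.554,
    "Mistral-7B": 0.480,
    "Gemini-2.5-flash": 0.389,
}

tiers = {model: reliability_tier(score) for model, score in coherence_jss.items()}
for model, tier in tiers.items():
    print(f"{model}: {tier}")
```

DeepSeek-R1 (0.653) lands in the intermediate bucket under these cut-offs, although the text does not explicitly place it in any group, so that assignment should be treated with caution.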

Template authoring should use a small, vetted set of prompts with equivalence validated by an independent classifier. Reporting of coherence JSS is advised in all evaluation papers alongside human-agreement or correlation metrics; consistency under paraphrase (high JSS) is orthogonal to agreement with human gold labels (Bellibatlu, 26 Apr 2026).

A plausible implication is that wider adoption of JSS-based reporting, specifically on multi-class (Likert-scale) tasks, could lead to more robust LLM-based evaluation pipelines and drive improvements in LLM training regimes to directly target paraphrase-invariant reasoning.

7. Implications for Automated Judging in NLP Evaluation

Coherence JSS exposes prompt sensitivity as a critical but previously overlooked variable in LLM-as-a-judge systems. Substantial instability across models for the same evaluation item under minimally varied instructions forecloses interchangeability among LLM judges for coherence tasks. Reliance on reputation, size, or manufacturer specifications cannot substitute for direct JSS measurement. Coherence JSS establishes a standardized axis of reliability, informing the comparable robustness of evaluation schemes and providing a benchmark for future improvements in LLM alignment with human-like judgment invariance (Bellibatlu, 26 Apr 2026).

References (1)

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems (2026)
