Judge Sensitivity Score (JSS)

Updated 9 May 2026

Judge Sensitivity Score (JSS) is a metric that quantifies the consistency and robustness of LLM-based judges by evaluating output stability under prompt, configuration, and parameter variations.
Multiple formulations of JSS assess aspects such as paraphrase agreement, score-range variability, input corruption effects, and sensitivity parameters in multi-judge ranking frameworks.
Empirical studies reveal that JSS values vary widely across tasks and settings, informing best practices for prompt design, calibration, and transparent evaluation methodologies.

A Judge Sensitivity Score (JSS) quantifies the degree to which an automated judge, typically an LLM acting as an evaluator, produces consistent or robust scoring under superficial changes to its prompts, configuration, and evaluation context. Multiple independent research threads have developed distinct operationalizations of JSS: (1) as an empirical stability metric over paraphrased or perturbed prompts, (2) as a score-range variance measure, (3) as a semantic change-detection statistic, and (4) as an explicit model parameter in multi-judge ranking frameworks. These variants share a common goal: to expose and quantify the sensitivity or variability of LLM-based judges beyond raw accuracy. This article synthesizes technical definitions, empirical practices, and inferential interpretations of JSS as documented in the literature.

1. Formal Definitions of Judge Sensitivity Score

1.1. Paraphrase Agreement Metric

The most direct definition, from "JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems" (Bellibatlu, 26 Apr 2026), formalizes JSS for a judge $j$ on a set of $|P|$ paraphrase pairs as: $\mathrm{JSS}(j, t) = \frac{1}{|P|} \sum_{i=1}^{|P|} \delta(j(p_{i}), j(p_{i}'))$ where $p_i, p_i'$ are semantically equivalent prompts (paraphrases) for the same task $t$, and $\delta(a, b) = 1$ if $a = b$, else $0$. Thus, JSS ranges from $0$ (fully inconsistent) to $1$ (perfectly stable under paraphrase), directly reporting the fraction of pairs for which the judge yields matching outputs.
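The agreement fraction above can be sketched in a few lines of Python. This is a minimal illustration, not the JudgeSense implementation: the `paraphrase_jss` helper, the keyword-based `toy_judge`, and the example prompts are all hypothetical stand-ins for a real LLM judge and a validated paraphrase set.

```python
from typing import Callable, Hashable, Sequence, Tuple


def paraphrase_jss(
    judge: Callable[[str], Hashable],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of paraphrase pairs on which the judge returns identical outputs."""
    if not pairs:
        raise ValueError("need at least one paraphrase pair")
    # delta(a, b) = 1 if a == b else 0, averaged over all pairs.
    matches = sum(1 for p, p_prime in pairs if judge(p) == judge(p_prime))
    return matches / len(pairs)


# Hypothetical toy judge: says "yes" whenever the prompt contains "accurate".
def toy_judge(prompt: str) -> str:
    return "yes" if "accurate" in prompt.lower() else "no"


# Hypothetical paraphrase pairs; the second pair trips the keyword judge.
pairs = [
    ("Is the summary accurate?", "Is this summary accurate or not?"),
    ("Is the summary accurate?", "Does the summary get the facts right?"),
    ("Rate whether the text is accurate.", "Is the text accurate?"),
    ("Is the answer correct?", "Is the answer right?"),
]

score = paraphrase_jss(toy_judge, pairs)
```

Here three of the four pairs receive matching verdicts, so the toy judge scores 0.75. In practice `judge` would wrap a model call with fixed decoding settings.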

1.2. Score-Range Sensitivity

In direct-assessment tasks, JSS quantifies the amplitude of performance drift when varying the allowed numeric range for scores. Let $\mathcal{R}$ denote the family of candidate score ranges, each specified by its lower and upper bound. If $c_R$ is the correlation (typically Spearman's $\rho$ or Pearson's $r$) between LLM scores and human reference scores under range $R \in \mathcal{R}$, JSS is defined as:

Range-Span Form: $\mathrm{JSS}_{\mathrm{span}} = \max_{R \in \mathcal{R}} c_R - \min_{R \in \mathcal{R}} c_R$
Std-Dev Form: $\mathrm{JSS}_{\mathrm{std}} = \sqrt{\frac{1}{|\mathcal{R}|} \sum_{R \in \mathcal{R}} (c_R - \bar{c})^2}$, with $\bar{c}$ the mean correlation.

A lower JSS indicates more stable agreement across range choices (Fujinuma, 21 Oct 2025).
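As a rough sketch, both forms reduce to summary statistics over the per-range correlations with human scores. The function names, the range labels, and the correlation values below are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics
from typing import Mapping


def range_span_jss(corr_by_range: Mapping[str, float]) -> float:
    """Max minus min of the judge-human correlation across score ranges."""
    values = list(corr_by_range.values())
    return max(values) - min(values)


def std_dev_jss(corr_by_range: Mapping[str, float]) -> float:
    """Population standard deviation of the correlations (an assumption)."""
    return statistics.pstdev(list(corr_by_range.values()))


# Hypothetical correlations for four score ranges (not values from any paper).
corr_by_range = {
    "range_a": 0.40,
    "range_b": 0.45,
    "range_c": 0.50,
    "range_d": 0.45,
}

span = range_span_jss(corr_by_range)
spread = std_dev_jss(corr_by_range)
```

The span reacts only to the two extreme ranges, while the standard deviation uses every range, so the two forms can rank judges differently when one range is an outlier.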

1.3. Score-Drop Sensitivity to Perturbations

For perturbative robustness analysis, JSS is computed as the mean drop in judge score following deliberate input corruptions (as in LLM judgment of image segmentation quality (Hossain et al., 7 Apr 2026)): $\mathrm{JSS}_{c,\ell} = \frac{1}{N} \sum_{i=1}^{N} \left( s_i^{\mathrm{clean}} - s_i^{(c,\ell)} \right)$ where $N$ is the number of evaluated instances, $s_i^{\mathrm{clean}}$ is the judge's score on the uncorrupted instance $i$, and $s_i^{(c,\ell)}$ is the score after corruption family $c$ at severity $\ell$.
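A minimal sketch of the mean score drop for one corruption family at one severity level follows. The `score_drop_jss` name and the toy scores are illustrative assumptions; real inputs would be paired judge scores on clean and corrupted versions of the same instances.

```python
from statistics import mean
from typing import Sequence


def score_drop_jss(
    clean_scores: Sequence[float],
    corrupted_scores: Sequence[float],
) -> float:
    """Mean of (clean score - corrupted score) over paired instances."""
    if not clean_scores or len(clean_scores) != len(corrupted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(c - k for c, k in zip(clean_scores, corrupted_scores))


# Hypothetical 5-point judge scores for four images, before and after a
# single corruption (e.g. fog) at one severity level.
clean = [5.0, 4.0, 5.0, 4.0]
foggy = [2.0, 1.0, 3.0, 2.0]

drop = score_drop_jss(clean, foggy)
```

With these toy numbers the per-image drops are 3, 3, 2, and 2, giving a mean drop of 2.5 points. A positive value means the judge penalizes the corruption; a value near zero means the judge barely reacts to it.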

1.4. Structural Sensitivity in Multi-Judge Ranking

Within the heterogeneous judge-aware (HJA) ranking framework (Yu et al., 6 May 2026), each judge $j$ is assigned a scalar sensitivity parameter $\gamma_j$ in the decomposition

$s_{ij} = \gamma_j \theta_i + u_{ij}$

where $s_{ij}$ is the latent score of item $i$ by judge $j$, $\theta_i$ the consensus score for item $i$, and $u_{ij}$ the low-rank residual disagreement. Here, $\gamma_j$ explicitly quantifies how closely a judge's preferences align in strength to the consensus.
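To build intuition for a sensitivity parameter of this kind, the sketch below assumes a decomposition of the form "judge score = sensitivity × consensus score + residual" and further assumes the consensus scores are already known. Under those assumptions the sensitivity is a least-squares slope through the origin. This is only an illustration, not the HJA estimator, which is fitted to pairwise comparison data; `fit_sensitivity` and all numbers are hypothetical.

```python
from typing import Sequence


def fit_sensitivity(consensus: Sequence[float], judge_scores: Sequence[float]) -> float:
    """Least-squares slope through the origin of judge scores on consensus scores."""
    if len(consensus) != len(judge_scores):
        raise ValueError("inputs must have equal length")
    denom = sum(t * t for t in consensus)
    if denom == 0:
        raise ValueError("consensus scores must not all be zero")
    return sum(t * s for t, s in zip(consensus, judge_scores)) / denom


# Hypothetical consensus scores for four items.
theta = [-1.0, 0.0, 1.0, 2.0]

# Judge A amplifies the consensus by 1.5 with no residual.
judge_a = [1.5 * t for t in theta]

# Judge B follows the consensus at 0.8 strength plus a small residual that is
# orthogonal to theta (chosen so the slope is recovered exactly).
residual = [0.1, -0.1, -0.1, 0.1]
judge_b = [0.8 * t + r for t, r in zip(theta, residual)]

gamma_a = fit_sensitivity(theta, judge_a)
gamma_b = fit_sensitivity(theta, judge_b)
```

In this toy setting judge A recovers a sensitivity of 1.5 and judge B recovers 0.8, matching the reading that larger values indicate stronger tracking of the consensus ordering.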

2. Experimental Methodologies for Measuring JSS

2.1. Paraphrase Construction and Validation

JudgeSense (Bellibatlu, 26 Apr 2026) built a controlled suite of 494 paraphrase pairs across factuality, coherence, relevance, and preference tasks, ensuring semantic equivalence via an independent LLM validator (GPT-4o-mini). For each judge, JSS was evaluated on deterministic model outputs at temperature $T = 0$, under minimal instruction templates to isolate prompt paraphrasing effects.

2.2. Score-Range Perturbations

Contrastive decoding experiments (Fujinuma, 21 Oct 2025) assessed judge performance across multiple absolute numeric score ranges, using SummEval for coherence and varying both judge and assistant model families. The principal metric was the span in correlation with human scores over all tested ranges.

2.3. Prompt Configuration Sensitivity

Safety-benchmark experiments, exemplified by HarmBench (Zhang, 27 Apr 2026), systematically varied judge prompt structure, framing, and surface rewording in a factorial design. JSS was equated with the maximum swing (range) in outcome rates across all prompt variants for a given model, for example the percentage-point swing in harmful-response rates.
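The maximum-swing statistic can be sketched as follows. The helper names, the variant labels, and the verdict lists are hypothetical; each list stands for one judge prompt variant's harmful/not-harmful verdicts on the same set of model responses.

```python
from typing import Mapping, Sequence


def harmful_rate(verdicts: Sequence[bool]) -> float:
    """Share of responses the judge labeled harmful under one prompt variant."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def max_swing_pp(verdicts_by_variant: Mapping[str, Sequence[bool]]) -> float:
    """Range of the harmful rate across prompt variants, in percentage points."""
    rates = [harmful_rate(v) for v in verdicts_by_variant.values()]
    return 100.0 * (max(rates) - min(rates))


# Hypothetical verdicts from one judge under three prompt variants.
verdicts_by_variant = {
    "variant_1": [True, False, False, False],
    "variant_2": [True, True, False, False],
    "variant_3": [True, True, True, False],
}

swing = max_swing_pp(verdicts_by_variant)
```

The toy rates are 25%, 50%, and 75%, so the swing is 50 percentage points. Because the statistic depends only on the two extreme variants, it grows (or stays flat) as more variants are added, which matters when comparing studies with different numbers of variants.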

2.4. Controlled Input Corruption

For segmentation quality assessment (Hossain et al., 7 Apr 2026), controlled visual perturbations (e.g., fog, rain, snow, shadow, sunflare) at multiple severity levels were applied. JSS was operationalized as the mean score drop or confidence decline under each perturbation, with paired statistical tests used to check monotonicity and significance.

2.5. Multi-Judge Ranking Frameworks

The HJA model (Yu et al., 6 May 2026) was fitted to both synthetic and real-world multi-judge pairwise comparison data. The inferential pipeline estimated judge-specific sensitivity parameters ($\gamma_j$), consensus rankings, and structured residual disagreement using a constrained maximum likelihood estimator with alternating block updates and subspace anchoring.

3. Empirical Results and Observations

3.1. Magnitude and Interpretation of JSS Across Tasks

On prompt paraphrase sets (Bellibatlu, 26 Apr 2026), coherence tasks exhibited JSS values from 0.389 (Gemini-2.5-flash) up to 0.992 (Claude-Sonnet-4-5), while factuality JSS clustered near 0.63 for all mainstream models, with most of the residual flip rate attributable to a single polarity-inverted template. Pairwise preference and relevance tasks were degenerate: judges reached JSS = 1.0, but only by always choosing position A.

3.2. Score-Range Drift

Substantial variation in judge-human agreement (up to 0.115 in Spearman's $\rho$) was found as a function of score range (Fujinuma, 21 Oct 2025), with contrastive decoding reducing this span by up to 33%.

3.3. Prompt-Induced Instability

For safety benchmarking (Zhang, 27 Apr 2026), prompt wording alone induced harmful-rate swings up to 24.2 percentage points for a fixed judge. Within-condition swings from surface rewording were only modestly smaller on average, dwarfing the effects of deeper framing or structure.

3.4. Sensitivity in Physical-World Monitor Tasks

In semantic image judgment (Hossain et al., 7 Apr 2026), the judge was highly sensitive to the most severe corruptions ($\mathrm{JSS}_{\mathrm{fog}} \approx 3.13$ on a 5-point scale), with clear statistical monotonicity. Less severe corruptions produced correspondingly lower, but still significant, JSS responses.

3.5. Consensus Sensitivities in Ranking

Within HJA (Yu et al., 6 May 2026), estimated judge sensitivities $\gamma_j$ spanned from 0.8 to 1.5 across real panels, with higher values marking strong consensus-tracking models and lower values indicating idiosyncratic (or less informative) judges.

4. Design Choices and Artifacts Affecting JSS

4.1. Prompt Template Conventions

Empirical studies found that rubric ordering and score identifier format can induce or suppress scoring bias (Li et al., 27 Jun 2025). For advanced judges, descending rubric order occasionally improved correlation relative to standard (ascending) order or randomization; reference-answer anchoring was the most powerful bias, predictably shifting the score distribution.

4.2. Model Hyperparameters and Decoding

Parameter choices at inference, such as temperature, max tokens, and prompt context, can introduce additional stochasticity or truncation effects, notably affecting observed JSS in resource-constrained or API-limited settings (Bellibatlu, 26 Apr 2026).

4.3. Position and Polarity Biases

Position bias in pairwise preference/relevance tasks can render JSS artificially high but degenerate, hiding systematic bias toward a fixed output (Bellibatlu, 26 Apr 2026). Polarity-inverted templates drive uniform flips in factuality, confounding genuine model sensitivity with template artifacts.
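One way to keep a degenerate judge from passing as a stable one is to report, next to the agreement rate, how concentrated the judge's outputs are on a single label. The sketch below is an illustrative check, not a procedure from the cited paper; the function name and the 0.95 concentration threshold are assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple


def jss_with_degeneracy_check(
    pair_outputs: Sequence[Tuple[str, str]],
    share_threshold: float = 0.95,  # hypothetical cut-off, tune per task
) -> Tuple[float, float, bool]:
    """Return (agreement rate, share of the most common label, degenerate flag)."""
    if not pair_outputs:
        raise ValueError("need at least one pair of judge outputs")
    jss = sum(a == b for a, b in pair_outputs) / len(pair_outputs)
    flat = [label for pair in pair_outputs for label in pair]
    modal_share = Counter(flat).most_common(1)[0][1] / len(flat)
    return jss, modal_share, modal_share >= share_threshold


# Hypothetical judge outputs on paraphrase pairs of a pairwise-preference task.
always_a = [("A", "A")] * 6                                   # always picks A
balanced = [("A", "A"), ("B", "B"), ("A", "A"), ("B", "B")]   # uses both labels

degenerate_result = jss_with_degeneracy_check(always_a)
healthy_result = jss_with_degeneracy_check(balanced)
```

Both toy judges reach an agreement rate of 1.0, but only the first is flagged, because every one of its outputs is the same label. A high modal share can also be legitimate when the ground truth itself is imbalanced, so the flag is a prompt for inspection rather than a verdict.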

4.4. Family-Specific Architectural Effects

Idiosyncratic sensitivities can persist even within a model family, with no reliable correlation between parameter count and JSS (Bellibatlu, 26 Apr 2026).

5. Mitigation Strategies and Best Practices

5.1. Prompt Ensembling and Averaging

Averaging results over multiple semantically equivalent prompts reduces prompt-specific noise, leading to more stable aggregate scores (Li et al., 27 Jun 2025, Zhang, 27 Apr 2026).
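A minimal sketch of prompt ensembling, assuming numeric scores are combined by their mean and categorical verdicts by majority vote. The helper names, the keyword-based toy judges, and the paraphrases are hypothetical; the cited papers may aggregate differently.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence


def ensemble_score(judge: Callable[[str], float], prompts: Sequence[str]) -> float:
    """Average a numeric judge score over semantically equivalent prompts."""
    return mean(judge(p) for p in prompts)


def ensemble_label(judge: Callable[[str], str], prompts: Sequence[str]) -> str:
    """Majority vote over prompts; ties go to the label seen first."""
    votes = Counter(judge(p) for p in prompts)
    return votes.most_common(1)[0][0]


# Hypothetical toy judges that react to the literal word "coherence".
def toy_scorer(prompt: str) -> float:
    return 4.0 if "coherence" in prompt.lower() else 3.0


def toy_labeler(prompt: str) -> str:
    return "pass" if "coherence" in prompt.lower() else "fail"


paraphrases = [
    "Rate the coherence of the summary from 1 to 5.",
    "On a 1-5 scale, how would you score the summary's coherence?",
    "From 1 to 5, how well does the summary hang together?",
]

avg_score = ensemble_score(toy_scorer, paraphrases)
majority = ensemble_label(toy_labeler, paraphrases)
```

The third paraphrase flips the toy judges, but the ensemble still returns "pass" and a mean score between 3 and 4, so a single unlucky wording no longer determines the outcome. The cost is one judge call per paraphrase.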

5.2. Contrastive Decoding

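Subtracting an assistant model's logits from the judge's logits at each decoding step attenuates score-range bias, reducing the spread of judge-human agreement across score ranges (a lower range-span JSS) and yielding higher mean judge-human agreement (Fujinuma, 21 Oct 2025).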
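The logit subtraction can be illustrated for a single decoding step in which the judge emits a score token. This is a sketch under stated assumptions, not the paper's implementation: the weight `alpha`, the restriction to score tokens, and all logit values are hypothetical.

```python
from typing import Dict, Mapping


def contrastive_logits(
    judge_logits: Mapping[str, float],
    assistant_logits: Mapping[str, float],
    alpha: float = 1.0,  # hypothetical weight on the assistant model
) -> Dict[str, float]:
    """Subtract the (scaled) assistant logits from the judge logits per token."""
    return {
        token: judge_logits[token] - alpha * assistant_logits[token]
        for token in judge_logits
    }


def pick_token(logits: Mapping[str, float]) -> str:
    """Greedy choice: the token with the highest logit."""
    return max(logits, key=logits.get)


# Hypothetical logits over the score tokens "1".."5" at the scoring step.
judge = {"1": 0.5, "2": 1.0, "3": 2.0, "4": 2.5, "5": 1.0}
assistant = {"1": 0.0, "2": 0.5, "3": 0.5, "4": 2.0, "5": 0.0}

plain_choice = pick_token(judge)
contrastive_choice = pick_token(contrastive_logits(judge, assistant))
```

In this toy case the assistant model strongly prefers "4" regardless of content, so subtracting its logits moves the greedy choice from "4" to "3". The intuition is that preferences shared by both models, such as a habitual favorite number within a given range, cancel out.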

5.3. Direct Reporting and Transparency

Papers recommend routine reporting of JSS or its empirical analogues—prompt sensitivity ranges, flip rates, confidence intervals—alongside standard accuracy, and public release of all judge prompt variants and configurations (Bellibatlu, 26 Apr 2026, Zhang, 27 Apr 2026).
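When a JSS value is an agreement fraction over paraphrase pairs, one simple way to attach an interval is the Wilson score interval for a binomial proportion. The choice of Wilson (rather than, say, a bootstrap) is an assumption made here for illustration, as are the function name and the counts; it also treats the pairs as independent.

```python
import math
from typing import Tuple


def wilson_interval(matches: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= matches <= n:
        raise ValueError("require n > 0 and 0 <= matches <= n")
    p = matches / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical result: the judge agreed with itself on 45 of 50 pairs.
low, high = wilson_interval(matches=45, n=50)
```

For a point estimate of 0.90 on 50 pairs the interval spans roughly 0.79 to 0.96, which shows how wide the uncertainty remains on small paraphrase sets.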

5.4. Calibration and Model Selection

JSS should inform judge model choice: a minimum JSS threshold is suggested for productionized pipelines expected to encounter cross-team prompt diversity (Bellibatlu, 26 Apr 2026).

5.5. Statistical Inference in Multi-Judge Settings

For HJA-type frameworks, estimates of the judge sensitivities $\gamma_j$ should include valid confidence intervals, with high-leverage diagnostics used to flag systematic disagreement or outlier judges (Yu et al., 6 May 2026).

6. Limitations and Open Challenges

6.1. Confounded Task Setups

In factuality and preference tasks, JSS is frequently confounded by template artifacts (polarity, position) rather than reflecting isolated model sensitivity. This limits the interpretability of stability metrics in current benchmark configurations (Bellibatlu, 26 Apr 2026).

6.2. Language and Generality

Current JSS metrics and paraphrase sets are primarily English-only and templated; extension to multilingual and less-structured prompts remains a major gap (Bellibatlu, 26 Apr 2026).

6.3. Higher-Order and Cross-Judge Effects

Most reported JSS metrics are single-model, single-task. Systematic multi-judge analysis, especially cross-family or hybrid-panel settings as in HJA (Yu et al., 6 May 2026), is in its early stages.

6.4. Stochasticity Beyond T=0

JSS as reported is typically under deterministic (temperature-zero) decoding. Evaluating stability under sampling, real-world API settings, or in the presence of adversarial paraphrases is an open frontier (Bellibatlu, 26 Apr 2026).

7. Summary Table: JSS Formulations and Reporting Contexts

JSS Variant	Mathematical Definition	Primary Reference
Paraphrase JSS	$\frac{1}{|P|} \sum_{i=1}^{|P|} \delta(j(p_{i}), j(p_{i}'))$	(Bellibatlu, 26 Apr 2026)
Range-Span JSS	$\max_{R} c_R - \min_{R} c_R$ over score ranges $R$	(Fujinuma, 21 Oct 2025)
Score-Drop JSS	$\frac{1}{N} \sum_{i=1}^{N} (s_i^{\mathrm{clean}} - s_i^{(c,\ell)})$	(Hossain et al., 7 Apr 2026)
HJA Sensitivity	$\gamma_j$ in $s_{ij} = \gamma_j \theta_i + u_{ij}$	(Yu et al., 6 May 2026)

These operationalizations measure related but distinct sensitivities: to prompt wording, score-range, input corruption, and consensus strength, respectively. Each is appropriate in its canonical setting and responds differently to model architecture, prompt design, and downstream aggregation methodology.


References

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems (2026)

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge (2025)

LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection (2026)

Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence (2026)

How Sensitive Are Safety Benchmarks to Judge Configuration Choices? (2026)

Evaluating Scoring Bias in LLM-as-a-Judge (2025)
