Judge Sensitivity Score (JSS)
- Judge Sensitivity Score (JSS) is a metric that quantifies the consistency and robustness of LLM-based judges by evaluating output stability under prompt, configuration, and parameter variations.
- Multiple formulations of JSS assess aspects such as paraphrase agreement, score-range variability, input corruption effects, and sensitivity parameters in multi-judge ranking frameworks.
- Empirical studies reveal that JSS values vary widely across tasks and settings, informing best practices for prompt design, calibration, and transparent evaluation methodologies.
A Judge Sensitivity Score (JSS) quantifies the degree to which an automated judge (typically an LLM acting as an evaluator) produces consistent or robust scores under superficial changes to its prompts, configuration, and evaluation context. Multiple independent research threads have developed distinct operationalizations of JSS: (1) as an empirical stability metric over paraphrased or perturbed prompts, (2) as a score-range variance measure, (3) as a semantic change-detection statistic, and (4) as an explicit model parameter in multi-judge ranking frameworks. These variants share a common goal: to expose and quantify the sensitivity or variability of LLM-based judges beyond raw accuracy. This article synthesizes technical definitions, empirical practices, and inferential interpretations of JSS as documented in the literature.
1. Formal Definitions of Judge Sensitivity Score
1.1. Paraphrase Agreement Metric
The most direct definition, from "JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems" (Bellibatlu, 26 Apr 2026), formalizes JSS for a judge $J$ on a set of $N$ paraphrase pairs as:
$$\mathrm{JSS}(J) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[J(p_i^{a}) = J(p_i^{b})\big]$$
where $p_i^{a}$ and $p_i^{b}$ are semantically equivalent prompts (paraphrases) for the same task $t$, and $\mathbb{1}[\cdot] = 1$ if $J(p_i^{a}) = J(p_i^{b})$, else $0$. Thus, JSS ranges from $0$ (fully inconsistent) to $1$ (perfectly stable under paraphrase), directly reporting the fraction of pairs for which the judge yields matching outputs.
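The fraction-of-matching-pairs definition can be sketched in a few lines of Python. This is a minimal illustration, not the JudgeSense implementation: `paraphrase_jss`, `toy_judge`, and the example prompts are hypothetical names and data introduced here, and the judge is assumed to be any callable that maps a prompt to a comparable verdict.

```python
from typing import Callable, Hashable, Sequence, Tuple


def paraphrase_jss(
    judge: Callable[[str], Hashable],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of paraphrase pairs on which the judge returns the same verdict."""
    if not pairs:
        raise ValueError("need at least one paraphrase pair")
    matches = sum(1 for a, b in pairs if judge(a) == judge(b))
    return matches / len(pairs)


def toy_judge(prompt: str) -> str:
    # Hypothetical stand-in for an LLM judge: it keys on a surface word,
    # so rewording the prompt can flip its verdict.
    return "A" if "accurate" in prompt.lower() else "B"


pairs = [
    ("Is the summary accurate?", "Is the summary accurate or not?"),
    ("Is the summary accurate?", "Is the summary factually correct?"),
    ("Rate correctness.", "Rate how correct this is."),
    ("Is it accurate?", "Is it right?"),
]
jss = paraphrase_jss(toy_judge, pairs)
print(jss)  # 0.5: the judge agrees with itself on 2 of the 4 pairs
```

In practice the verdicts would come from deterministic calls to the judge model, one per prompt in each pair.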
1.2. Score-Range Sensitivity
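In direct-assessment tasks, JSS quantifies the amplitude of performance drift when varying the allowed numeric range for scores. Let $\mathcal{R}$ denote the family of candidate score ranges, each given by a lower and an upper bound on the allowed score. If $c_R$ is the correlation (typically Spearman's $\rho$ or Pearson's $r$) between LLM scores and human reference under range $R \in \mathcal{R}$, JSS is defined as:
- Range-Span Form: $\mathrm{JSS}_{\mathrm{span}} = \max_{R \in \mathcal{R}} c_R - \min_{R \in \mathcal{R}} c_R$
- Std-Dev Form: $\mathrm{JSS}_{\mathrm{std}} = \sqrt{\frac{1}{|\mathcal{R}|} \sum_{R \in \mathcal{R}} (c_R - \bar{c})^2}$, with $\bar{c}$ the mean correlation.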
A lower JSS indicates more stable agreement across range choices (Fujinuma, 21 Oct 2025).
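Both forms reduce to simple statistics over the per-range correlations. The sketch below assumes the correlations have already been computed; the range labels and values are hypothetical, and the population standard deviation is an assumption (the source does not say whether a sample or population estimator is used).

```python
import statistics
from typing import Mapping


def range_span_jss(corr_by_range: Mapping[str, float]) -> float:
    """Largest minus smallest judge-human correlation across score ranges."""
    values = list(corr_by_range.values())
    return max(values) - min(values)


def std_dev_jss(corr_by_range: Mapping[str, float]) -> float:
    """Population standard deviation of the correlations (an assumption)."""
    return statistics.pstdev(list(corr_by_range.values()))


# Hypothetical correlations with human scores under three score ranges.
corr_by_range = {"range_a": 0.40, "range_b": 0.45, "range_c": 0.35}

span = range_span_jss(corr_by_range)
spread = std_dev_jss(corr_by_range)
print(round(span, 3), round(spread, 3))
```

The span form is driven entirely by the two extreme ranges, while the standard-deviation form uses every range, so the two can rank judges differently when many ranges are tested.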
1.3. Score-Drop Sensitivity to Perturbations
For perturbative robustness analysis, JSS is computed as the mean drop in judge score following deliberate input corruptions (as in LLM judgment of image segmentation quality (Hossain et al., 7 Apr 2026)):
$$\mathrm{JSS}_{c,\ell} = \frac{1}{N} \sum_{i=1}^{N} \big(S_i^{\mathrm{clean}} - S_i^{(c,\ell)}\big)$$
where $S_i^{\mathrm{clean}}$ is the judge's score on the uncorrupted instance $i$, and $S_i^{(c,\ell)}$ is the score after corruption family $c$ at severity $\ell$.
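A mean score drop of this kind can be computed per severity level as follows. The sketch assumes paired scores (the same instances scored before and after corruption) and averages over instances; the scores themselves are hypothetical, not values from the cited study.

```python
from statistics import mean
from typing import Dict, Sequence


def score_drop_jss(clean: Sequence[float], corrupted: Sequence[float]) -> float:
    """Mean of (clean score - corrupted score) over paired instances."""
    if not clean or len(clean) != len(corrupted):
        raise ValueError("clean and corrupted scores must be paired and non-empty")
    return mean(c - k for c, k in zip(clean, corrupted))


# Hypothetical judge scores on a 5-point scale for four images.
clean_scores = [5, 4, 5, 4]
fog_scores_by_severity: Dict[int, Sequence[float]] = {
    1: [4, 4, 4, 3],
    2: [3, 3, 3, 2],
    3: [2, 1, 1, 1],
}

drops = {
    severity: score_drop_jss(clean_scores, scores)
    for severity, scores in fog_scores_by_severity.items()
}
print(drops)  # {1: 0.75, 2: 1.75, 3: 3.25}
```

A well-behaved judge should show drops that increase with severity, as in this toy data.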
1.4. Structural Sensitivity in Multi-Judge Ranking
Within the heterogeneous judge-aware (HJA) ranking framework (Yu et al., 6 May 2026), each judge $j$ is assigned a scalar sensitivity parameter $\alpha_j$ in the decomposition
$$s_{ij} = \alpha_j \theta_i + r_{ij}$$
where $s_{ij}$ is the latent score of item $i$ by judge $j$, $\theta_i$ the consensus score for item $i$, and $r_{ij}$ the low-rank residual disagreement. Here, $\alpha_j$ explicitly quantifies how closely a judge's preferences align in strength to the consensus.
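To make the roles of the terms concrete, the sketch below assumes the decomposition is multiplicative in the sensitivity (latent score = sensitivity times consensus score, plus residual) and builds latent scores from known parameters. It then recovers each sensitivity by a least-squares slope through the origin. This is an illustration only: the HJA paper fits pairwise comparison data with a constrained maximum likelihood estimator, which this toy example does not reproduce, and all names and numbers here are hypothetical.

```python
from typing import List, Sequence


def latent_scores(
    theta: Sequence[float],
    alpha: Sequence[float],
    residual: Sequence[Sequence[float]],
) -> List[List[float]]:
    """scores[i][j] = alpha[j] * theta[i] + residual[i][j]."""
    return [
        [alpha[j] * theta[i] + residual[i][j] for j in range(len(alpha))]
        for i in range(len(theta))
    ]


def ls_sensitivity(scores: Sequence[Sequence[float]], theta: Sequence[float], j: int) -> float:
    """Least-squares slope (through the origin) of judge j's scores on the consensus."""
    numerator = sum(scores[i][j] * theta[i] for i in range(len(theta)))
    denominator = sum(t * t for t in theta)
    return numerator / denominator


theta = [-1.0, 0.0, 1.0, 2.0]   # consensus scores for four items
alpha = [0.8, 1.0, 1.5]         # sensitivities for three judges
u = [1.0, -2.0, 1.0, 0.0]       # item factor, chosen orthogonal to theta
v = [0.3, -0.1, 0.2]            # judge factor
residual = [[ui * vj for vj in v] for ui in u]  # rank-one disagreement

scores = latent_scores(theta, alpha, residual)
recovered = [ls_sensitivity(scores, theta, j) for j in range(len(alpha))]
print([round(a, 6) for a in recovered])
```

The recovery is exact here only because the residual factor was chosen orthogonal to the consensus scores; with real data the consensus is unknown and must be estimated jointly, which is why identifiability constraints are needed.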
2. Experimental Methodologies for Measuring JSS
2.1. Paraphrase Construction and Validation
JudgeSense (Bellibatlu, 26 Apr 2026) built a controlled suite of 494 paraphrase pairs across factuality, coherence, relevance, and preference tasks, ensuring semantic equivalence via an independent LLM validator (GPT-4o-mini). For each judge, JSS was evaluated on deterministic model outputs at temperature $T = 0$, under minimal instruction templates to isolate prompt paraphrasing effects.
2.2. Score-Range Perturbations
Contrastive decoding experiments (Fujinuma, 21 Oct 2025) assessed judge performance across multiple absolute numeric score ranges, using SummEval for coherence and varying both judge and assistant model families. The principal metric was the span in correlation with human scores over all tested ranges.
2.3. Prompt Configuration Sensitivity
Safety benchmarks, exemplified by HarmBench (Zhang, 27 Apr 2026), systematically varied judge prompt structure, framing, and surface rewording in a factorial design. JSS was equated with the maximum swing (range) in outcome rates, for example the percentage-point swing in harmful-response rates, across all prompt variants for a given model.
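The maximum-swing reading of JSS is a range over per-variant outcome rates. The sketch below assumes rates are already expressed in percent; the variant names and rates are hypothetical.

```python
from typing import Mapping, Tuple


def max_swing_pp(rate_by_variant: Mapping[str, float]) -> Tuple[float, str, str]:
    """Return (swing in percentage points, lowest-rate variant, highest-rate variant)."""
    if not rate_by_variant:
        raise ValueError("need at least one prompt variant")
    lowest = min(rate_by_variant, key=rate_by_variant.get)
    highest = max(rate_by_variant, key=rate_by_variant.get)
    return rate_by_variant[highest] - rate_by_variant[lowest], lowest, highest


# Hypothetical harmful-response rates (percent) for one model under
# three judge prompt variants.
harmful_rate_pct = {"baseline": 10.0, "reworded": 18.5, "strict_framing": 30.0}

swing, low_variant, high_variant = max_swing_pp(harmful_rate_pct)
print(swing, low_variant, high_variant)  # 20.0 baseline strict_framing
```

Reporting which variants produce the extremes, not only the swing, makes it easier to see whether the instability comes from framing, structure, or surface wording.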
2.4. Controlled Input Corruption
For segmentation quality assessment (Hossain et al., 7 Apr 2026), controlled visual perturbations (e.g., fog, rain, snow, shadow, sunflare) at multiple severity levels were applied. JSS was operationalized as the mean score drop or confidence decline under each perturbation, with paired statistical tests to ensure monotonicity and significance.
2.5. Multi-Judge Ranking Frameworks
The HJA model (Yu et al., 6 May 2026) was fitted to both synthetic and real-world multi-judge pairwise comparison data. The inferential pipeline estimated judge-specific sensitivity parameters ($\alpha_j$), consensus rankings, and structured residual disagreement using a constrained maximum likelihood estimator with alternating block updates and subspace anchoring.
3. Empirical Results and Observations
3.1. Magnitude and Interpretation of JSS Across Tasks
On prompt paraphrase sets (Bellibatlu, 26 Apr 2026), coherence tasks exhibited JSS values from 0.389 (Gemini-2.5-flash) up to 0.992 (Claude-Sonnet-4-5), while factuality JSS clustered near 0.63 for all mainstream models, with most of the residual flip rate attributable to a single polarity-inverted template. Pairwise preference and relevance tasks were degenerate: JSS reached 1.0, but only because the judges always chose position A.
3.2. Score-Range Drift
Substantial variation in judge-human agreement (up to 0.115 in Spearman's $\rho$) was found as a function of score range (Fujinuma, 21 Oct 2025), with contrastive decoding reducing this span by up to 33%.
3.3. Prompt-Induced Instability
For safety benchmarking (Zhang, 27 Apr 2026), prompt wording alone induced harmful-rate swings of up to 24.2 percentage points for a fixed judge. Within-condition swings from surface rewording were only modestly smaller on average, and they dwarfed the effects of deeper framing or structure.
3.4. Sensitivity in Physical-World Monitor Tasks
In semantic image judgment (Hossain et al., 7 Apr 2026), the judge was highly sensitive to the most severe corruptions ($\mathrm{JSS}_{\mathrm{fog}}$ of 3.13 on a 5-point scale), with clear statistical monotonicity. Lesser corruptions produced correspondingly lower, but still significant, JSS responses.
3.5. Consensus Sensitivities in Ranking
Within HJA (Yu et al., 6 May 2026), estimated judge sensitivities $\alpha_j$ ranged from 0.8 to 1.5 across real panels, with higher values marking strong consensus-tracking models and lower values indicating idiosyncratic (or less informative) judges.
4. Design Choices and Artifacts Affecting JSS
4.1. Prompt Template Conventions
Empirical studies found that rubric ordering and score identifier format can induce or suppress scoring bias (Li et al., 27 Jun 2025). For advanced judges, descending rubric order occasionally improved correlation relative to standard (ascending) order or randomization; reference-answer anchoring was the most powerful bias, predictably shifting the score distribution.
4.2. Model Hyperparameters and Decoding
Parameter choices at inference, such as temperature, max tokens, and prompt context, can introduce additional stochasticity or truncation effects, notably affecting observed JSS in resource-constrained or API-limited settings (Bellibatlu, 26 Apr 2026).
4.3. Position and Polarity Biases
Position bias in pairwise preference/relevance tasks can render JSS artificially high but degenerate, hiding systematic bias toward a fixed output (Bellibatlu, 26 Apr 2026). Polarity-inverted templates drive uniform flips in factuality, confounding genuine model sensitivity with template artifacts.
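One simple guard against this failure mode is to report, next to JSS, how concentrated the judge's verdicts are on a single label. The sketch below is a hypothetical diagnostic, not one taken from the cited paper; the 0.95 threshold is an arbitrary assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def modal_share(verdicts: Sequence[Hashable]) -> float:
    """Share of verdicts equal to the most frequent verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return Counter(verdicts).most_common(1)[0][1] / len(verdicts)


def looks_degenerate(verdicts: Sequence[Hashable], threshold: float = 0.95) -> bool:
    """Flag a judge whose verdicts collapse onto one label (assumed threshold)."""
    return modal_share(verdicts) >= threshold


always_a = ["A"] * 20        # judge that always picks position A
balanced = ["A", "B"] * 10   # judge whose verdicts vary with the input

print(looks_degenerate(always_a), looks_degenerate(balanced))  # True False
```

A judge that always answers "A" agrees with itself on every paraphrase pair, so a high JSS combined with a modal share near 1 signals bias and not stability.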
4.4. Family-Specific Architectural Effects
Idiosyncratic sensitivities can persist even within a model family, with no reliable correlation between parameter count and JSS (Bellibatlu, 26 Apr 2026).
5. Mitigation Strategies and Best Practices
5.1. Prompt Ensembling and Averaging
Averaging results over multiple semantically equivalent prompts minimizes prompt-specific noise, leading to more robust aggregate JSS (Li et al., 27 Jun 2025, Zhang, 27 Apr 2026).
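Prompt ensembling amounts to scoring the same item under several equivalent prompts and aggregating. The sketch below averages numeric scores; the prompt wordings and the canned scores standing in for a judge are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def ensemble_score(judge: Callable[[str], float], prompt_variants: Sequence[str]) -> float:
    """Average the judge's score over semantically equivalent prompts."""
    if not prompt_variants:
        raise ValueError("need at least one prompt variant")
    return mean(judge(prompt) for prompt in prompt_variants)


# Hypothetical canned scores standing in for calls to an LLM judge.
canned_scores = {
    "Rate the coherence of the text from 1 to 5.": 3,
    "How coherent is this text, on a scale of 1 to 5?": 4,
    "Score coherence on a 1-5 scale.": 5,
}

variants = list(canned_scores)
averaged = ensemble_score(canned_scores.__getitem__, variants)
single_prompt_spread = max(canned_scores.values()) - min(canned_scores.values())
print(averaged, single_prompt_spread)  # 4 2
```

For categorical verdicts, a majority vote over the variants plays the same role as the mean does here.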
5.2. Contrastive Decoding
Subtracting assistant model logits at each decoding step attenuates score-range bias, flattening out JSS and yielding higher mean judge-human agreement (Fujinuma, 21 Oct 2025).
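The logit subtraction can be illustrated for a single decoding step over the score tokens. This is a simplified sketch: the `weight` parameter, the token set, and the logit values are assumptions introduced here, and practical contrastive decoding setups often add further constraints that are omitted.

```python
from typing import Dict


def contrastive_choice(
    judge_logits: Dict[str, float],
    assistant_logits: Dict[str, float],
    weight: float = 1.0,
) -> str:
    """Pick the token maximizing judge logit minus weighted assistant logit."""
    adjusted = {
        token: judge_logits[token] - weight * assistant_logits[token]
        for token in judge_logits
    }
    return max(adjusted, key=adjusted.get)


# Hypothetical logits over the score tokens "1".."5" at one decoding step.
judge_logits = {"1": 0.2, "2": 0.5, "3": 2.0, "4": 1.8, "5": 0.1}
assistant_logits = {"1": 0.1, "2": 0.2, "3": 1.5, "4": 0.4, "5": 0.0}

plain = max(judge_logits, key=judge_logits.get)
contrastive = contrastive_choice(judge_logits, assistant_logits)
print(plain, contrastive)  # 3 4
```

In this toy case both models favor the token "3", so subtracting the assistant's logits removes that shared preference and the judge's distinctive preference for "4" wins.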
5.3. Direct Reporting and Transparency
The cited papers recommend routine reporting of JSS or its empirical analogues (prompt sensitivity ranges, flip rates, confidence intervals) alongside standard accuracy, and public release of all judge prompt variants and configurations (Bellibatlu, 26 Apr 2026, Zhang, 27 Apr 2026).
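Because paraphrase JSS is a proportion, one standard way to attach a confidence interval is the Wilson score interval. The sketch below is an assumption about how such reporting could be done, not a procedure prescribed by the cited papers; the match count of 450 is hypothetical, while 494 is the size of the JudgeSense pair set.

```python
import math
from typing import Tuple


def wilson_interval(matches: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= matches <= n:
        raise ValueError("require n > 0 and 0 <= matches <= n")
    p = matches / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


matches, n_pairs = 450, 494  # hypothetical match count over 494 pairs
low, high = wilson_interval(matches, n_pairs)
print(f"JSS = {matches / n_pairs:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside the unit interval and remains informative when JSS is close to 1.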
5.4. Calibration and Model Selection
JSS should inform judge model choice: a minimum JSS threshold is suggested for production pipelines expected to encounter cross-team prompt diversity (Bellibatlu, 26 Apr 2026).
5.5. Statistical Inference in Multi-Judge Settings
For HJA-type frameworks, estimates of the judge sensitivities $\alpha_j$ should include valid confidence intervals, with high-leverage diagnostics used to flag systematic disagreement or outlier judges (Yu et al., 6 May 2026).
6. Limitations and Open Challenges
6.1. Confounded Task Setups
In factuality and preference tasks, JSS is frequently confounded by template artifacts (polarity, position) rather than reflecting isolated model sensitivity. This limits the interpretability of stability metrics in current benchmark configurations (Bellibatlu, 26 Apr 2026).
6.2. Language and Generality
Current JSS metrics and paraphrase sets are primarily English-only and templated; extension to multilingual and less-structured prompts remains a major gap (Bellibatlu, 26 Apr 2026).
6.3. Higher-Order and Cross-Judge Effects
Most reported JSS metrics are single-model, single-task. Systematic multi-judge analysis, especially cross-family or hybrid-panel settings as in HJA (Yu et al., 6 May 2026), is in its early stages.
6.4. Stochasticity Beyond T=0
JSS as reported is typically under deterministic (temperature-zero) decoding. Evaluating stability under sampling, real-world API settings, or in the presence of adversarial paraphrases is an open frontier (Bellibatlu, 26 Apr 2026).
7. Summary Table: JSS Formulations and Reporting Contexts
| JSS Variant | Mathematical Definition | Primary Reference |
|---|---|---|
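| Paraphrase JSS | $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[J(p_i^{a}) = J(p_i^{b})]$ | (Bellibatlu, 26 Apr 2026) |
| Range-Span JSS | $\max_{R} c_R - \min_{R} c_R$ | (Fujinuma, 21 Oct 2025) |
| Score-Drop JSS | $\frac{1}{N} \sum_{i=1}^{N} (S_i^{\mathrm{clean}} - S_i^{(c,\ell)})$ | (Hossain et al., 7 Apr 2026) |
| HJA Sensitivity | $\alpha_j$ in $s_{ij} = \alpha_j \theta_i + r_{ij}$ | (Yu et al., 6 May 2026) |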
These operationalizations measure related but distinct sensitivities: to prompt wording, score-range, input corruption, and consensus strength, respectively. Each is appropriate in its canonical setting and responds differently to model architecture, prompt design, and downstream aggregation methodology.