Effect of alignment on non-numeric LLM-as-a-judge evaluations

Determine the effect of alignment (e.g., instruction tuning and preference tuning) on LLM-as-a-judge evaluations that use natural-language labels or ranking outputs rather than numerical scores.

Background

The paper analyzes numerical bias in LLM-as-a-judge settings and shows that post-alignment models exhibit increased score concentration, harming evaluation accuracy on regression-style tasks such as machine translation quality estimation (MTQE), grammatical error correction quality estimation (GECQE), and lexical complexity prediction (LCP).

All experiments focus on numerical scoring outputs. The authors explicitly note that they did not study evaluations with natural-language labels or rankings, and that how alignment affects such non-numeric evaluation formats remains an unresolved question.

References

This study is limited to target tasks that use only numerical scores; the effect of alignment on evaluations with natural-language labels and rankings remains unresolved.

Exploring the Effects of Alignment on Numerical Bias in Large Language Models (2601.16444 - Sato et al., 23 Jan 2026), Limitations, item (ii)