Metrics Prioritizing Factual Correctness in LLM-based Evaluation

Develop evaluation metrics for LLM-as-a-Judge that prioritize factual correctness over stylistic qualities such as fluency or rhetorical structure.

Background

LLM judges may overvalue stylistic features and verbosity, granting higher scores to well-written but factually incorrect responses. Adversaries can exploit this bias by polishing or padding a response to raise its score without making it any more accurate.
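One way to make this bias measurable is to score paired responses to the same prompt, one correct but plainly written and one incorrect but polished, and count how often the judge prefers the incorrect one. The sketch below is illustrative only: the function name `style_bias_rate` and the pairing protocol are assumptions, not something the paper specifies.

```python
from typing import Sequence, Tuple


def style_bias_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the judge scored the incorrect-but-polished
    response strictly higher than the correct-but-plain one.

    Each pair is (score_correct_plain, score_incorrect_polished).
    Hypothetical helper: the pairing protocol is an assumption.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    flipped = sum(
        1
        for correct_plain, incorrect_polished in pairs
        if incorrect_polished > correct_plain
    )
    return flipped / len(pairs)


# Example judge scores for four prompt pairs (made-up numbers).
example_pairs = [(0.6, 0.9), (0.8, 0.5), (0.7, 0.7), (0.4, 0.8)]
bias = style_bias_rate(example_pairs)
```

A rate near zero would suggest the judge rarely rewards style over correctness on this test set; a high rate would indicate the kind of exploitable preference described above.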

The paper urges the design of metrics that explicitly reward factual accuracy to improve the reliability of automated judgments.
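The paper does not prescribe a specific metric. As a minimal sketch of what "explicitly rewarding factual accuracy" could look like, the example below gates a style score by a factuality score, so that style can only scale down a score that factual accuracy has already earned. The names `JudgeScores`, `factuality_first_score`, and `style_weight`, and the assumption that both sub-scores lie in [0, 1], are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScores:
    factual: float  # assumed: share of claims verified as correct, in [0, 1]
    style: float    # assumed: fluency/structure rating, in [0, 1]


def factuality_first_score(scores: JudgeScores, style_weight: float = 0.2) -> float:
    """Combine sub-scores so that style cannot compensate for factual errors.

    The result never exceeds scores.factual, and style can move it by at most
    a style_weight fraction of the factual score.
    """
    for name, value in (("factual", scores.factual), ("style", scores.style)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if not 0.0 <= style_weight <= 1.0:
        raise ValueError("style_weight must be in [0, 1]")
    # Multiplicative gate: a response with factual == 0 always scores 0.
    return scores.factual * ((1.0 - style_weight) + style_weight * scores.style)


accurate_plain = factuality_first_score(JudgeScores(factual=1.0, style=0.0))
polished_wrong = factuality_first_score(JudgeScores(factual=0.5, style=1.0))
```

With the default weight, the fully accurate but plain response outscores the polished, half-correct one. This gate is not a strict lexicographic ordering: two responses with similar factual scores can still be reordered by style, and `style_weight` controls how far that can go.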

References

The open research problems in this context are: Develop evaluation metrics that give priority to factual correctness.

Security in LLM-as-a-Judge: A Comprehensive SoK (2603.29403 - Masoud et al., 31 Mar 2026) in Section 7.3, Length and Style Bias Exploitation (Challenges and Open Problems)