Robustness of synthetically generated rubric criteria as scoring anchors

Determine whether the synthetically generated, non–expert-verified rubric criteria used in DeepResearch-Bench RACE are robust anchors for scoring large language model responses.

Background

The ProfBench paper critiques DeepResearch-Bench RACE for relying on synthetically generated criteria that were not verified by human experts. The authors argue that such criteria may introduce bias, in particular favoring Gemini-2.5-Pro, because that model produced both the evaluation criteria and the reference high-quality answer. They therefore question whether these criteria are stable anchors for fair scoring.

ProfBench is proposed as a remedy, with rubrics curated by domain experts across Physics, Chemistry, Finance, and Consulting. The authors emphasize that robust, expert-validated criteria are essential for fair and effective evaluation, highlighting uncertainty around the robustness of synthetic criteria as a key unresolved question in current benchmark design.

References

It also falls short due to synthetically generated criteria that were not verified by expert humans---meaning that it's unclear if such criteria are robust anchors for scoring.

— ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge (2510.18941 - Wang et al., 21 Oct 2025) in Introduction, Section 1
