Robustness of synthetically generated rubric criteria as scoring anchors
Determine whether the synthetically generated, non–expert-verified rubric criteria used in DeepResearch-Bench RACE are robust anchors for scoring large language model responses.
References
[DeepResearch-Bench RACE] also falls short due to synthetically generated criteria that were not verified by expert humans, meaning that it's unclear if such criteria are robust anchors for scoring.
— ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
(2510.18941 - Wang et al., 21 Oct 2025) in Introduction, Section 1