Trade-off Between Rubric Granularity and Automated Judge Reliance

Investigate the trade-off between fine-grained rubric task decomposition and reliance on an LLM-based automated judge for grading PaperBench submissions; specifically, determine how judge reliability affects the level of rubric granularity required to maintain accurate and trustworthy grading outcomes.

Background

PaperBench uses hierarchical rubrics with hundreds of leaf nodes per paper and an LLM-based judge (SimpleJudge) to grade complex replication submissions. Authoring such detailed rubrics is labor-intensive, but the fine decomposition can improve grading precision when judges are imperfect.

The paper notes that as automated judges become more reliable, less granular rubrics may suffice, potentially reducing rubric-writing effort. However, the relationship between judge reliability and the rubric granularity needed, and its impact on grading fidelity, has not been quantified or characterized.
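One way to begin quantifying this relationship is a toy Monte Carlo model in which each leaf verdict is independently flipped with probability 1 - judge_accuracy. Everything here is a simplifying assumption (independent errors, equal-weight leaves, uniformly random ground truth), not the paper's method; real judge errors are likely correlated across related leaves.

```python
import random

def grading_error(n_leaves: int, judge_accuracy: float, trials: int = 20000) -> float:
    """Mean absolute error between the true and judged fraction of satisfied
    leaves, when each leaf verdict is independently flipped with probability
    1 - judge_accuracy (toy model with equal-weight leaves)."""
    random.seed(0)  # deterministic for reproducibility
    total_err = 0.0
    for _ in range(trials):
        truth = [random.random() < 0.5 for _ in range(n_leaves)]
        judged = [t if random.random() < judge_accuracy else not t for t in truth]
        total_err += abs(sum(judged) - sum(truth)) / n_leaves
    return total_err / trials

for accuracy in (0.80, 0.95):
    for n in (4, 64):
        print(f"accuracy={accuracy}, leaves={n}: MAE={grading_error(n, accuracy):.3f}")
```

In this model the error of the aggregate score shrinks roughly with the square root of the number of leaves, so a less reliable judge can be partly compensated by a finer-grained rubric, and vice versa. That is exactly the trade-off curve the task asks to characterize; a real study would need to relax the independence and equal-weight assumptions.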

References

We further note that the more reliable your judge is, the less fine-grained your rubric's task decomposition needs to be, potentially reducing the effort necessary for task decomposition as judges become more capable. We leave it to future work to study the trade-off between careful specification and delegation to the judge.

PaperBench: Evaluating AI's Ability to Replicate AI Research (arXiv:2504.01848, Starace et al., 2 Apr 2025), Appendix A.2, Improving automated judges