
Demonstrating critic-assisted judge accuracy in competitive programming without execution

Determine whether presenting a single sampled large language model (LLM) critic output can reliably improve human judges' accuracy in selecting the passing solution from paired competitive-programming submissions without access to code execution, and characterize the conditions under which such assistance yields measurable gains.


Background

The paper investigates whether LLM critics can help human judges distinguish passing from failing solutions in a competitive-programming setting where ground truth comes from test suites but annotators do not have execution access. The authors report negative results in their setup, explicitly stating that they could not produce a critic that helped when judges were shown only a single sampled critique.

Establishing positive results (or clearly characterizing the conditions under which they can be achieved) would validate critic utility in settings with ground-truth rewards and operational constraints, informing the practical deployment of critic assistance for code review and evaluation.
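The paper does not spell out the scoring protocol for this experiment. As a minimal sketch only, assuming hypothetical SubmissionPair and judge_accuracy helpers, the comparison of unassisted versus critic-assisted judge accuracy on the same paired submissions (with no code execution) could be scored along these lines:

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SubmissionPair:
    """One paired comparison: exactly one submission passes the hidden test suite."""
    passing_code: str
    failing_code: str
    critique: Optional[str] = None  # a single sampled LLM critique for this pair

def judge_accuracy(
    pairs: List[SubmissionPair],
    judge: Callable[[str, str, Optional[str]], int],
    use_critique: bool,
) -> float:
    """Fraction of pairs on which the judge picks the passing submission.

    The judge sees two submissions in randomized order (no execution access)
    and, optionally, a single sampled critique; it returns the index (0 or 1)
    of the submission it believes passes the test suite.
    """
    correct = 0
    for pair in pairs:
        # Randomize presentation order so position carries no signal.
        if random.random() < 0.5:
            a, b, passing_index = pair.passing_code, pair.failing_code, 0
        else:
            a, b, passing_index = pair.failing_code, pair.passing_code, 1
        critique = pair.critique if use_critique else None
        choice = judge(a, b, critique)
        correct += int(choice == passing_index)
    return correct / len(pairs)

# Usage: compare unassisted vs. critic-assisted accuracy on the same pairs.
# baseline = judge_accuracy(pairs, human_judge, use_critique=False)
# assisted = judge_accuracy(pairs, human_judge, use_critique=True)
# print(f"unassisted={baseline:.3f} assisted={assisted:.3f}")
```

Positive results would show the assisted accuracy exceeding the unassisted baseline by a statistically meaningful margin on pairs that are genuinely hard for humans.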

References

With enough sampling we could find cases where this binary task was challenging for humans, but could not produce a critic that helped here with one sampled critique.

McAleese et al., "LLM Critics Help Catch LLM Bugs," arXiv:2407.00215, 28 Jun 2024; Appendix, "Lessons from Judge Accuracy and Ground Truth Reward Experiments," caption of the appendix_judge_accuracy figure.