Demonstrating critic-assisted judge accuracy in competitive programming without execution
Determine whether presenting a single sampled large language model critic output can reliably improve human judges' accuracy at selecting the passing solution from a pair of competitive-programming submissions, without access to code execution, and characterize the conditions required for such assistance to yield measurable gains.
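A minimal sketch of how such a paired-judgment evaluation could be scored, assuming per-trial records of whether the judge picked the passing submission with and without the single sampled critique. The `Trial` fields, the toy data, and the exact McNemar-style test on discordant pairs are illustrative assumptions, not the authors' protocol or analysis code.

```python
"""Sketch: score judge accuracy with vs. without one sampled critique,
and test whether the critique produced a measurable gain."""
from dataclasses import dataclass
from math import comb
from typing import List


@dataclass
class Trial:
    pair_id: str                # identifier for the (passing, failing) submission pair
    correct_unassisted: bool    # judge picked the passing solution without the critique
    correct_with_critic: bool   # judge picked the passing solution given one sampled critique


def accuracy(trials: List[Trial], assisted: bool) -> float:
    hits = sum(t.correct_with_critic if assisted else t.correct_unassisted for t in trials)
    return hits / len(trials)


def mcnemar_exact_p(trials: List[Trial]) -> float:
    """Two-sided exact test on discordant pairs: trials where the critique
    flipped the judge's outcome, compared against Binomial(n, 0.5)."""
    helped = sum(t.correct_with_critic and not t.correct_unassisted for t in trials)
    hurt = sum(t.correct_unassisted and not t.correct_with_critic for t in trials)
    n = helped + hurt
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any effect
    k = min(helped, hurt)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data purely for illustration.
    trials = [
        Trial("p1", False, True),
        Trial("p2", True, True),
        Trial("p3", False, False),
        Trial("p4", False, True),
        Trial("p5", True, False),
    ]
    print(f"unassisted accuracy:      {accuracy(trials, assisted=False):.2f}")
    print(f"critic-assisted accuracy: {accuracy(trials, assisted=True):.2f}")
    print(f"exact McNemar p-value:    {mcnemar_exact_p(trials):.3f}")
```

Because each judge sees the same pair with and without assistance, a paired test over discordant trials is the natural way to separate a real critique effect from pair difficulty.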
References
With enough sampling we could find cases where this binary task was challenging for humans, but could not produce a critic that helped here with one sampled critique.
— LLM Critics Help Catch LLM Bugs
(arXiv:2407.00215, McAleese et al., 28 Jun 2024), Appendix, "Lessons from Judge Accuracy and Ground Truth Reward Experiments," caption of Figure "appendix_judge_accuracy"