Advancing Reward Models for Code Generation

Develop reward models for code generation that approach human-level perception and reasoning, so they can reliably assess model-generated code and align it with human preferences, overcoming the limitations of current reward models.

Background

The paper introduces BigCodeArena, an open human-in-the-loop platform for evaluating code generation via execution, and releases two associated benchmarks: BigCodeReward, which measures alignment between reward models and human preferences, and AutoCodeArena, which automates pairwise comparisons using LLM-as-a-Judge.
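A minimal sketch of how such alignment measurement can be framed: given crowdsourced pairwise preferences over two candidate responses, query a judge for its verdict on each pair and report the fraction of verdicts that match the human vote. The dataset fields, the judge_fn interface, and the toy length_judge below are illustrative assumptions, not BigCodeReward's actual schema.

# Sketch: agreement between a judge (reward model or LLM) and human pairwise votes.
# Field names and the judge interface are assumptions for illustration.
from typing import Callable, Dict, List

def pairwise_agreement(
    examples: List[Dict],                      # each: {"prompt", "response_a", "response_b", "human_label"}
    judge_fn: Callable[[str, str, str], str],  # returns "A", "B", or "tie"
) -> float:
    """Fraction of examples where the judge's verdict matches the human vote."""
    if not examples:
        return 0.0
    matches = 0
    for ex in examples:
        verdict = judge_fn(ex["prompt"], ex["response_a"], ex["response_b"])
        matches += int(verdict == ex["human_label"])
    return matches / len(examples)

# Placeholder judge that always prefers the longer response, just to exercise the loop.
def length_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"

data = [
    {"prompt": "Write a FizzBuzz function.", "response_a": "def f(): ...",
     "response_b": "def fizzbuzz(n): ...", "human_label": "B"},
]
print(pairwise_agreement(data, length_judge))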

Empirical results show that while execution outputs often improve judging accuracy, many reward-model-based and LLM-as-a-Judge evaluators remain unstable or insufficiently robust, especially when they must incorporate multimodal signals such as screenshots and logs. This motivates the need for stronger, more perceptive reward models that better capture human judgments about correctness, functionality, and UI/UX in execution-grounded coding tasks.
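A minimal sketch of how execution evidence might be folded into a judge query: each candidate's code is paired with its runtime logs, and UI screenshots are attached as images. The prompt wording and the OpenAI-style multimodal message structure below are assumptions for illustration, not the paper's exact protocol.

# Sketch: building an execution-grounded, multimodal judging request.
# The message schema mimics common chat APIs; it is not the paper's protocol.
import base64
from typing import List, Optional

def build_judge_messages(task: str,
                         code_a: str, logs_a: str,
                         code_b: str, logs_b: str,
                         screenshot_a_png: Optional[bytes] = None,
                         screenshot_b_png: Optional[bytes] = None) -> List[dict]:
    text = (
        "Task:\n" + task + "\n\n"
        "Candidate A code:\n" + code_a + "\nCandidate A execution logs:\n" + logs_a + "\n\n"
        "Candidate B code:\n" + code_b + "\nCandidate B execution logs:\n" + logs_b + "\n\n"
        "Considering correctness, functionality, and UI/UX, answer with exactly one of: A, B, tie."
    )
    content = [{"type": "text", "text": text}]
    # Attach any available screenshots as base64-encoded image parts.
    for png in (screenshot_a_png, screenshot_b_png):
        if png is not None:
            content.append({
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64," + base64.b64encode(png).decode()},
            })
    return [{"role": "user", "content": content}]

In a BigCodeReward-style evaluation, verdicts produced from such execution-grounded queries would then be scored against the crowdsourced human votes, as in the agreement computation sketched above.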

References

Finally, advancing reward models for code generation remains an open challenge, as current systems still fall short of human-level perception and reasoning; better reward models will, in turn, support the development of more capable and aligned code LLMs.

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution (2510.08697 - Zhuo et al., 9 Oct 2025) in Future Work (Section)