Defining a Versatile Reward Model for SWE Across TTS and RL

Determine the defining properties of a versatile reward model for software engineering agents that remains effective across both test-time scaling (TTS) and reinforcement learning (RL), and ascertain whether high TTS performance implies effectiveness in RL or whether TTS and RL impose different requirements on reward models.

Background

The paper investigates execution-free reward models (verifiers) for software engineering agents and observes that two verifiers with nearly identical test-time scaling (TTS) performance can exhibit drastically different behavior during reinforcement learning (RL). This challenges the use of TTS alone as a proxy for verifier quality and motivates a deeper understanding of what properties make a reward model versatile across both TTS and RL.
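
One intuition for why the two uses can diverge: in best-of-n test-time scaling a verifier only needs to rank candidate patches correctly, whereas as an RL reward its absolute, calibrated score is what the policy optimizes. The following is a minimal sketch of these two usages, not the paper's implementation; `verifier_score` is a hypothetical execution-free reward model mapping a problem statement and a candidate patch to a score in [0, 1].

```python
from typing import Callable, List

def best_of_n(problem: str, candidates: List[str],
              verifier_score: Callable[[str, str], float]) -> str:
    """Test-time scaling: keep the candidate patch the verifier ranks highest.
    Only the relative ordering of scores matters here."""
    return max(candidates, key=lambda patch: verifier_score(problem, patch))

def rl_reward(problem: str, rollout_patch: str,
              verifier_score: Callable[[str, str], float]) -> float:
    """RL training: the same verifier score becomes the scalar reward for a
    rollout, so its absolute value and calibration now matter as well."""
    return verifier_score(problem, rollout_patch)
```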

To address this gap, the authors propose evaluating reward models not only by TTS but also by area under the ROC curve (AUC) to capture discriminative ability and expected calibration error (ECE) to measure calibration. The open question, posed at the outset of Section 3.1, highlights uncertainty about the characteristics that define such a versatile reward model and whether the requirements differ between TTS and RL.
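
Below is a minimal sketch of how these two metrics could be computed for a verifier, assuming binary resolved/unresolved labels and probability-like verifier scores; the equal-width 10-bin ECE estimator and the toy data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(labels: np.ndarray, scores: np.ndarray,
                               n_bins: int = 10) -> float:
    """Binned ECE: weighted mean of |empirical accuracy - mean confidence| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Include the right edge only for the final bin.
        mask = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if mask.any():
            confidence = scores[mask].mean()   # average verifier score in the bin
            accuracy = labels[mask].mean()     # fraction of patches actually resolved
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Toy example: 1 = patch resolved the issue, 0 = it did not (hypothetical values).
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.5])

print("AUC:", roc_auc_score(labels, scores))                # discriminative ability
print("ECE:", expected_calibration_error(labels, scores))   # calibration
```

In this framing, two verifiers could score similarly on ranking-based measures such as AUC while differing in calibration as measured by ECE, which is the kind of gap this evaluation is meant to surface.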

References

We aim to develop a versatile reward model that can be applied across different scenarios such as TTS and RL, but it is unknown what defines such a versatile reward model and whether TTS and RL impose different requirements.

SWE-RM: Execution-free Feedback For Software Engineering Agents (Shum et al., arXiv:2512.21919, 26 Dec 2025), Section 3.1