Execution-Free Feedback in SWE-RM Verifiers
- Execution-Free Feedback is a method that leverages learned reward models and formal symbolic tools to evaluate software agents without executing full test suites.
- It utilizes transformer-based architectures and mixture-of-experts models to deliver calibrated, continuous scores for candidate solution quality.
- By providing fine-grained discrimination and reducing computational costs, it enhances reinforcement learning performance and formal verification pipelines.
A SWE-RM Verifier is a model or software system designed to assess candidate solutions or execution traces in software engineering and systems contexts using outcome-supervised or execution-free reward modeling. The term encompasses deep neural models trained to classify software agent trajectories as success or failure, as well as formal tools that symbolically verify program properties under software or hardware semantics. SWE-RM verifiers have emerged as a critical component in test-time scaling for LLM-based software agents, RL feedback, and formal assurance for system software, enabling efficient, scalable, and fine-grained verification without full test-suite execution or simulation.
1. Core Principles and Motivation
Execution-based verification, such as unit test feedback, provides binary and sparse signals contingent on high-quality, comprehensive test suites. In practice, software engineering environments suffer from incomplete, overly specific, or noisy test suites, which may result in both a lack of signal for RL and insufficient discrimination between “almost correct” and genuinely correct solutions. Execution-free feedback, as implemented by SWE-RM verifiers, returns calibrated, continuous scores for candidate solutions by leveraging learned reward models or formal symbolic reasoning frameworks. This enables:
- Fine-grained discrimination between solution qualities.
- Dense reward signals that improve RL and data efficiency.
- Broad application even where reliable unit tests are unavailable.
- Drastic reduction of computational cost compared to executing heavy test suites or full program simulations (Pan et al., 2024, Shum et al., 26 Dec 2025).
2. Model Architectures and Implementations
LLM-Based Reward Models
SWE-RM verifiers instantiated as LLMs use decoder-only Transformer architectures. For example, in SWE-Gym, the verifier is based on Qwen-2.5-Coder-Instruct (32B parameters), mirroring the agent model’s stack of Transformer decoder layers, multi-head self-attention, feed-forward sublayers, layer-norm, residual connections, and RoPE positional embeddings. Both agent and verifier share the vocabulary and tokenizer. Candidate solution evaluation involves the following input format:
- Task prefix describing the problem.
- Interleaved trajectory $(o_1, a_1, o_2, a_2, \ldots, o_T, a_T)$, with $o_t$ as observations and $a_t$ as actions.
- Final query: ‘Did this patch fix the tests? Yes or No.’
The verifier autoregressively generates either “Yes” or “No” as its decision token, which is used to compute a success probability (Pan et al., 2024).
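A minimal sketch of this scoring step, assuming the Hugging Face transformers API; the checkpoint name and prompt skeleton are illustrative placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the SWE-Gym verifier builds on Qwen-2.5-Coder-Instruct (32B).
MODEL_NAME = "Qwen/Qwen2.5-Coder-32B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

def success_probability(prompt: str) -> float:
    """Score a trajectory via the next-token logits of the decision tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits over the vocab
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    # Two-way softmax over the decision tokens yields a continuous score.
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()

# Prompt format per the section above: task prefix, interleaved trajectory, final query.
prompt = "<task prefix>\n<interleaved (o_t, a_t) trajectory>\nDid this patch fix the tests? Yes or No.\n"
print(success_probability(prompt))
```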
Mixture-of-Experts (MoE) Models
SWE-RM, developed for execution-free feedback in TTS and RL, employs a MoE backbone (Qwen3-30B-A3B) with approximately 3B active parameters per token via a gating network that dynamically selects the relevant experts. Inputs comprise the full agent-environment multi-turn trajectory; outputs are logit scores $z_{\text{YES}}, z_{\text{NO}}$, yielding scalar rewards:

$$r(\tau) = \frac{\exp(z_{\text{YES}})}{\exp(z_{\text{YES}}) + \exp(z_{\text{NO}})}$$

The reward model is trained with binary cross-entropy on ground-truth labels (success/failure) (Shum et al., 26 Dec 2025).
Formal Software and Hardware Verifiers
For low-level systems such as firmware or concurrency control under weak memory, SWE-RM verifier workflows may integrate SMT-based model checking (e.g., ESBMC), operational semantics (e.g., Maude), or property-based formal analysis. Inputs are source code or abstract actions with properties (invariants, pre/post-conditions); verification proceeds by symbolic execution and constraint solving, generating proofs or counterexamples for safety/security properties (Wu et al., 2024, Colvin et al., 2018).
3. Training, Calibration, and Data Regimes
Supervised Training
SWE-RM verifiers are trained as binary classifiers, minimizing the binary cross-entropy loss

$$\mathcal{L}(\theta) = -\mathbb{E}_{(\tau, y)}\left[\, y \log p_\theta(\tau) + (1 - y) \log\bigl(1 - p_\theta(\tau)\bigr) \right],$$

where $p_\theta(\tau)$ is the predicted success probability for trajectory $\tau$ and $y \in \{0, 1\}$ is the ground-truth outcome.
Low-rank adaptation (LoRA) can be used (rank 64 in SWE-Gym), acting as both a regularizer and a memory/computation saver (Pan et al., 2024).
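A hedged sketch of attaching rank-64 LoRA adapters with the peft library; the target modules and alpha value are illustrative assumptions, not the paper's exact configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

# Rank-64 adapters as in SWE-Gym; alpha and target modules are placeholders.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated
```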
Data Curation
Effective SWE-RM verifiers depend on curated and balanced datasets. In SWE-Gym, positives/negatives are balanced per task. In SWE-RM, approximately 100,000 trajectories are included with mixed sources and a fixed 2:1 ratio of resolved to unresolved examples. Mixing on-policy and off-policy data, as well as a variety of sources, improves robustness, discrimination (AUC), and calibration (ECE) (Pan et al., 2024, Shum et al., 26 Dec 2025).
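A minimal sketch of enforcing a fixed resolved-to-unresolved ratio during curation; the function and argument names are hypothetical:

```python
import random

def balance_mix(resolved, unresolved, ratio=2, seed=0):
    """Subsample trajectories to a fixed resolved:unresolved ratio (2:1 in SWE-RM)."""
    rng = random.Random(seed)
    n_neg = min(len(unresolved), len(resolved) // ratio)
    n_pos = n_neg * ratio
    return rng.sample(resolved, n_pos) + rng.sample(unresolved, n_neg)
```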
Calibration
Calibration quality is essential for downstream RL and test-time scaling. Expected Calibration Error (ECE) and reliability diagrams measure the alignment between model scores and actual outcome rates. Larger datasets, appropriately tuned positive/negative ratios, and mixing policy/sources reduce ECE from poor (≈0.48 at 500 examples) to excellent (≈0.07 at 100k). SWE-RM demonstrates threefold better calibration than poorly tuned variants (Shum et al., 26 Dec 2025).
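For reference, a small sketch of the standard ECE computation over equal-width score bins, matching the reliability-diagram view described above:

```python
import numpy as np

def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted average gap between mean predicted score and empirical
    success rate, over equal-width bins of the verifier score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so a score of exactly 1.0 is counted.
        in_bin = (scores >= lo) & ((scores <= hi) if i == n_bins - 1 else (scores < hi))
        if in_bin.any():
            gap = abs(scores[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```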
4. Inference-Time Deployment and Integration
At inference, SWE-RM verifiers accept candidate trajectories $\{\tau_1, \ldots, \tau_K\}$ produced by a software agent. Each trajectory is scored $s_k = p_\theta(\text{YES} \mid \tau_k)$, and the candidate with the highest score is selected:

$$\hat{k} = \arg\max_{k \in \{1, \ldots, K\}} s_k$$
This “best-of-K” strategy provides substantial absolute gains in resolve rates (e.g., 10–12 points over a single pass), with performance increasing roughly logarithmically in K (Pan et al., 2024, Shum et al., 26 Dec 2025).
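A compact sketch of the best-of-K loop; `agent_sample` and `verifier_score` are hypothetical callables standing in for the agent rollout and the SWE-RM scoring call:

```python
def best_of_k(task, agent_sample, verifier_score, k=16):
    """Sample K candidate trajectories and keep the verifier's top pick."""
    candidates = [agent_sample(task) for _ in range(k)]
    scores = [verifier_score(task, traj) for traj in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best]
```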
For scalable RL, the verifier’s reward can directly supervise agent updates using the continuous output, making high calibration and classification accuracy essential; comparable TTS performance alone does not guarantee RL reliability (Shum et al., 26 Dec 2025).
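One plausible way to combine the two signals is a weighted hybrid reward of the kind evaluated in the RL experiments below; the 50/50 weighting here is an illustrative assumption, not the papers' exact formulation:

```python
from typing import Optional

def hybrid_reward(rm_score: float, exec_passed: Optional[bool], weight: float = 0.5) -> float:
    """Blend the verifier's continuous score with a binary execution signal.

    The weighting is a placeholder; when no reliable tests are available
    (exec_passed is None), fall back to the verifier score alone.
    """
    if exec_passed is None:
        return rm_score
    return weight * rm_score + (1.0 - weight) * float(exec_passed)
```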
5. Formal Model Checking for Critical Software
For firmware such as Arm CCA’s Realm Management Monitor, SWE-RM verifier workflows rely on advanced SMT-based model checking through tools like ESBMC. The verification pipeline consists of:
- Automated parsing (AST generation).
- Symbolic execution and SSA construction.
- SMT encoding of pre/post-conditions, invariants, and control/data-flow properties.
- Counterexample analysis.
Properties include memory isolation, control-flow integrity, privilege separation, and metadata invariants. Model checking uncovers deep bugs (e.g., pointer-to-integer conversion flaws, broken refcount invariants), often missed by other tools (e.g., CBMC). Loop unwinding parameters and multi-property groupings are used for scalability; results are incorporated into CI pipelines for regression guarding and coverage (Wu et al., 2024).
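A minimal sketch of a CI-style runner for this pipeline, assuming an `esbmc` binary on PATH; `--unwind` sets ESBMC's loop-unwinding bound, and any additional property-grouping or include flags used by a given pipeline would be appended to the command:

```python
import subprocess

def check_module(source_file: str, unwind: int = 4) -> bool:
    """Run ESBMC on a firmware module with a loop-unwinding bound."""
    result = subprocess.run(
        ["esbmc", source_file, "--unwind", str(unwind)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # ESBMC prints a counterexample trace when a property is violated.
        print(result.stdout)
    return result.returncode == 0
```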
Model checking of concurrent programs under weak memory semantics is also realized in wide-spectrum language frameworks encoded in Maude, systematically exploring all interleavings and reorderings relevant to the hardware model, and confirming program refinement to abstract specifications (Colvin et al., 2018).
6. Experimental Evaluation and Results
LLM-Based SWE-RM Verifiers
On SWE-Bench Verified with OpenHands prompting and K=16 candidate sampling:
- Pass@1 (single agent): 20.6%
- Oracle Pass@16: 42.8%
- Best@16 with the verifier: 32.0%
- Best@8: 29.8%
- A mix of on-policy and off-policy training data yields the best results at moderate K; imbalanced data and negative oversampling hurt at low K.
Scaling behavior indicates that increasing K continues to provide performance gains, nearly linear in log-K (Pan et al., 2024).
SWE-RM MoE Models
On SWE-Bench Verified, RM@32:
| Verifier (Type) | Qwen3-Flash RM@32 | Qwen3-Max RM@32 |
|---|---|---|
| SWE-Gym (exec-free) | 51.2% | 65.4% |
| DeepSWE-EB (exec) | 53.2% | 66.2% |
| SWE-RM (exec-free, MoE) | 62.0% | 74.6% |
SWE-RM achieves state-of-the-art RM@32 for open-weight 30B models. In RL, hybrid reward (SWE-RM + execution) attains the smoothest learning and highest pass@1 (Shum et al., 26 Dec 2025).
Formal Verification
SMT-based SWE-RM verification of RMM found 23 new specification violations, outperforming CBMC on selected modules and supporting integration in industrial CI flows. Multi-property ESBMC modes and per-loop bounds further enhance throughput and bug-finding capacity (Wu et al., 2024).
7. Practical Challenges and Best Practices
- Context Window: LLM-based SWE-RM verifiers require context windows of at least 32k tokens; for MoE approaches, up to 256k is supported to fully process complex multi-turn trajectories (Pan et al., 2024, Shum et al., 26 Dec 2025).
- Regularization: LoRA adapters provide regularization and resource efficiency, at times outperforming full-parameter fine-tuning.
- Bias Control: Limiting positives per task and subsampling negatives prevent overly conservative models.
- Compatibility: Any agent or tool generating textual action/observation logs can be integrated via the prompt template; the underlying model architecture does not require task-specific modifications.
- Verification Automation: For C/firmware, harness generation from specs, systematic application of symbolic execution, and rigorous CI integration with baseline failure regression are recommended for robust scaling (Wu et al., 2024).
- Calibration: Consistent monitoring of AUC/ECE, not just pass@k, is critical for models intended to supervise RL (Shum et al., 26 Dec 2025).
SWE-RM verifiers provide foundational infrastructure for scalable, reliable agentic software engineering, enabling rigorous, execution-free feedback in modern testing, RL, and systems verification pipelines.