Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering (2505.23604v1)

Published 29 May 2025 in cs.CL, cs.AI, and cs.SE

Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeatedly sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.

This paper introduces Satori-SWE (Zeng et al., 29 May 2025), a novel approach for sample-efficient test-time scaling on real-world software engineering (SWE) tasks, specifically resolving GitHub issues. The core problem addressed is that smaller LLMs (< 100B parameters) struggle with these tasks compared to larger models, and existing methods to improve smaller models are either data-intensive (supervised fine-tuning, SFT) or computationally expensive at inference (generating and verifying a large number of samples).

The proposed method, Evolutionary Test-Time Scaling (EvoScale), reframes test-time code patch generation as an evolutionary process. Instead of generating a single large batch of patches and selecting the best, EvoScale generates patches iteratively in smaller batches. In each iteration, a set of generated patches from the previous step serves as "conditioning examples" to guide the generation of the next batch. This iterative refinement process is designed to steer the model's output distribution towards higher-scoring regions in the patch space, thus improving sample efficiency compared to drawing many independent samples.

A key challenge in this evolutionary approach is defining the "mutation" operation. Unlike traditional evolutionary algorithms that add random noise, EvoScale leverages the LLM itself as the mutation operator. The model is trained to generate new patches conditioned on previously generated patches, effectively learning to refine or "mutate" existing candidates.
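
To make the loop concrete, the following is a minimal sketch of the selection-and-mutation cycle. The `generate_patches` and `score` callables are hypothetical placeholders rather than the paper's API: the former stands in for the fine-tuned code-editing LM sampling a batch conditioned on previous patches, the latter for whatever scoring signal is available.

```python
# Minimal sketch of EvoScale's selection-and-mutation loop.
# `generate_patches` and `score` are hypothetical placeholders; the batch
# sizes and iteration counts are illustrative defaults.

def evoscale(issue, code_context, generate_patches, score,
             n_per_iter=5, n_iters=4, n_keep=3):
    pool = []              # all candidate patches seen across iterations
    conditioning = []      # previous patches fed back as mutation context
    for _ in range(n_iters):
        # Mutation: sample a small batch conditioned on the issue, the
        # retrieved code context, and the previous iteration's patches.
        batch = generate_patches(issue, code_context,
                                 previous_patches=conditioning,
                                 num_samples=n_per_iter)
        pool.extend(batch)
        # Selection: keep the highest-scoring patches of this batch as
        # conditioning examples for the next iteration.
        conditioning = sorted(batch, key=score, reverse=True)[:n_keep]
    # Final answer: best patch over the aggregated pool; an external
    # verifier can replace `score` here, or the last iterate can be used.
    return max(pool, key=score)
```

With the RL-trained model described below, the selection step no longer needs an external `score` signal in the inner loop, since the model is trained to improve its own generations across iterations.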

The training of the code editing model (which performs the patch generation and mutation) involves a two-stage process:

  1. Small-scale Mutation Supervised Fine-Tuning (SFT):
    • Classical SFT: First, a base model is fine-tuned on pairs of (issue, code context) and (Chain-of-Thought, ground truth patch). A larger teacher model is used to generate the CoT traces.
    • Mutation SFT: A separate model (initialized from the same base) is then fine-tuned on (issue, code context, previous patches) and (CoT, ground truth patch). The previous patches are sampled from the classical SFT model's outputs. The teacher model generates CoT conditioned on these previous patches. This stage explicitly teaches the model to take previous attempts into account when generating new ones.
  2. Large-scale Reinforcement Learning (RL) for Self-Evolution:
    • The limitation of SFT is that it doesn't guarantee improvement when conditioning on arbitrary previous outputs. To enable the model to self-evolve (improve its outputs over iterations without an external reward model or unit tests guiding selection), the mutation SFT model is further trained using RL.
    • The RL objective uses a potential-based reward shaping approach. Instead of optimizing the final reward $R(x, y^T)$ directly, it optimizes the reward difference between successive iterations, $R(x, y^t) - R(x, y^{t-1})$, where $y^t$ is the patch generated at iteration $t$ and $y^{t-1}$ is a conditioning patch from the previous iteration. This local optimization signal encourages monotonic score improvement across iterations and is theoretically proven to be sufficient for maximizing the final reward.
    • The RL reward combines this potential-based difference with a bonus based on the current patch's score and a formatting penalty that enforces syntactically and semantically valid code (a hedged sketch of this shaped reward follows the list).
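
As a rough illustration of this training objective, the per-step shaped reward could be computed as in the sketch below. The weighting coefficients and the `well_formed` check are illustrative assumptions, not the paper's exact formulation.

```python
def shaped_reward(prev_score: float, curr_score: float, well_formed: bool,
                  alpha: float = 1.0, beta: float = 1.0,
                  format_penalty: float = 1.0) -> float:
    """Illustrative potential-based shaping reward for one mutation step.

    prev_score  -- R(x, y^{t-1}), score of the conditioning patch
    curr_score  -- R(x, y^t), score of the newly generated patch
    well_formed -- whether the new patch is syntactically/semantically valid
    """
    if not well_formed:
        # Formatting penalty: invalid patches are penalized regardless of score.
        return -format_penalty
    improvement = curr_score - prev_score   # potential-based difference term
    bonus = curr_score                      # bonus for the current patch's score
    return alpha * improvement + beta * bonus
```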

The practical implementation of Satori-SWE uses a pipeline-based scaffold consisting of:

  • A Retriever: An LLM that selects relevant files based on the issue and repository structure, optionally refined by a retrieval reward model.
  • A Code Editing Model: The core EvoScale model trained with the two-stage SFT and RL, responsible for generating and refining patches.
  • A Verifier: An optional component (either an LLM-based reward model or unit tests) used during evaluation to select the best patch from the pool of candidates generated across evolution iterations. Because the RL-trained model can self-evolve, this external verifier is not strictly required for the iterative generation process itself during inference, but it can be used for final selection (a minimal orchestration sketch of the full pipeline follows this list).
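
Putting the components together, a minimal orchestration of this scaffold might look like the sketch below; `retrieve`, `evolve`, and `verify` are hypothetical interfaces standing in for the Retriever, the EvoScale editing model, and the optional Verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SatoriSWEScaffold:
    # Hypothetical component interfaces; stand-ins for the paper's pipeline.
    retrieve: Callable[[str, str], str]        # (issue, repo) -> code context
    evolve: Callable[[str, str], List[str]]    # (issue, context) -> candidate patches
    verify: Optional[Callable[[str, str, str], float]] = None  # optional scorer

    def resolve(self, issue: str, repo: str) -> str:
        context = self.retrieve(issue, repo)       # Retriever
        candidates = self.evolve(issue, context)   # EvoScale code editing model
        if self.verify is None:
            # Self-evolution only: return the final (most refined) candidate.
            return candidates[-1]
        # Otherwise, use the external verifier (reward model or unit tests)
        # solely for final selection over the aggregated candidate pool.
        return max(candidates, key=lambda p: self.verify(issue, context, p))
```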

Experiments on SWE-bench Verified [jimenez2024swebench] with a 32B parameter model (Satori-SWE-32B) demonstrate the effectiveness and sample efficiency of EvoScale.

  • The RL-trained model achieves a greedy (pass@1) accuracy of 35.8%, outperforming other small models.
  • Using evolutionary test-time scaling with a total budget of 50 samples, generated in small batches across multiple evolution iterations and aggregated, Satori-SWE-32B achieves a Best@50 accuracy of 41.6%.
  • This Best@50 performance matches the state-of-the-art 70B model Llama3-SWE-RL-70B [swerl], which required Best@500 sampling, a 10x improvement in sample efficiency.
  • In terms of runtime, EvoScale's iterative generation is significantly faster per sample than unit-test-based selection.

The paper highlights that training the model to condition on previous patches (Mutation SFT) is crucial for the evolutionary process, and that RL is essential for achieving self-evolution without relying on external verifiers at inference time. Higher sampling temperature during the mutation step is shown to improve performance by increasing diversity.

Implementation Considerations:

  • Architecture: A pipeline with separate models for retrieval and editing. The editing model itself is capable of iterative refinement.
  • Training Data: Requires curated datasets for both classical SFT (issue, context -> CoT, ground truth) and mutation SFT (issue, context, previous patches -> CoT, ground truth); leveraging teacher models to generate CoT is key for scaling up data. Reward models require labeled (issue, context, patch) triples with correctness labels. One possible mutation-SFT example format is sketched after this list.
  • Training Stages: Sequential training: Base model -> Classical SFT -> Mutation SFT -> RL.
  • RL Reward Function: A carefully designed reward combining a base bonus, a potential difference (improvement over previous patches), and format/syntax penalties is necessary to guide self-evolution effectively.
  • Inference: Iterative sampling, conditioning on previous outputs. An optional external verifier can be used for final selection from the aggregated pool of samples.
  • Hardware: Training requires significant GPU resources (e.g., NVIDIA H100). Efficient serving frameworks are needed for inference.
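
For the mutation-SFT data format mentioned above, one plausible way to assemble a training example is sketched below; the prompt template and field names are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of assembling a mutation-SFT training example.
# The prompt template and field names are illustrative assumptions.

def build_mutation_sft_example(issue: str, code_context: str,
                               previous_patches: list[str],
                               teacher_cot: str,
                               ground_truth_patch: str) -> dict:
    # Previous attempts sampled from the classical-SFT model's outputs.
    prev_block = "\n\n".join(
        f"### Previous attempt {i + 1}\n{patch}"
        for i, patch in enumerate(previous_patches)
    )
    prompt = (
        f"### Issue\n{issue}\n\n"
        f"### Code context\n{code_context}\n\n"
        f"{prev_block}\n\n"
        "### Task\nRefine the previous attempts into a correct patch."
    )
    # Target: teacher-generated chain-of-thought followed by the gold patch.
    target = f"{teacher_cot}\n\n{ground_truth_patch}"
    return {"prompt": prompt, "target": target}
```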

The work primarily focuses on a pipeline-based (agentless) approach, where the model generates a complete patch without runtime interaction. Extending EvoScale to agentic settings, where models interact with the environment via tools and execute code/tests, is noted as a direction for future research. The RL objective currently optimizes local improvement; exploring optimization of cumulative rewards over longer trajectories is also a potential area for future work.

Authors (11)
  1. Guangtao Zeng
  2. Maohao Shen
  3. Delin Chen
  4. Zhenting Qi
  5. Subhro Das
  6. Dan Gutfreund
  7. David Cox
  8. Gregory Wornell
  9. Wei Lu
  10. Zhang-Wei Hong
  11. Chuang Gan