
S*: Test Time Scaling for Code Generation (2502.14382v1)

Published 20 Feb 2025 in cs.LG and cs.AI

Abstract: Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 LLMs and Large Reasoning Model and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.

Summary

  • The paper introduces a novel S* framework that integrates parallel generation with sequential debugging using execution feedback.
  • The S* method improves code generation coverage and boosts selection accuracy through adaptive input synthesis and pairwise comparisons.
  • Experimental results demonstrate that S* enables smaller and instruction-based models to outperform larger reasoning models on key benchmarks.

The paper "S*: Test Time Scaling for Code Generation" (2502.14382) introduces S*, a novel hybrid test-time scaling framework designed to enhance the performance of LLMs in code generation. The framework addresses the underexplored area of test-time compute in code generation, distinguishing itself from previous work primarily focused on mathematical reasoning. S* integrates parallel and sequential scaling strategies to improve code generation coverage and selection accuracy.

S* Framework Overview

S* operates in two key stages: Generation and Selection. The Generation stage enhances parallel sample generation with sequential scaling through iterative debugging, grounded by execution feedback. The Selection stage incorporates an adaptive input synthesis mechanism to accurately identify the best solution from the generated candidates.

Generation Stage

The Generation stage aims to improve the coverage of potential solutions. It extends parallel sampling with sequential scaling through iterative debugging, leveraging execution feedback. The process begins by generating N initial code samples independently using parallel sampling. Each sample then undergoes up to R rounds of sequential revision, guided by the results of executing the code on public test cases. Outputs and/or error messages from these tests are fed back into the model to iteratively refine the code. The revision process stops when a sample passes all public tests or reaches the maximum number of revision attempts. This stage is crucial for addressing the complexities inherent in code validation compared to other domains.
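The generate-then-revise loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `run_public_tests`, and `revise` are hypothetical stand-ins for the model and sandbox calls.

```python
def generate_stage(problem, generate, run_public_tests, revise, n=8, r=2):
    """Sketch of the S* Generation stage: n parallel samples, each revised
    up to r times using execution feedback from public tests.
    All callables are hypothetical stand-ins for LLM / sandbox calls."""
    candidates = [generate(problem) for _ in range(n)]  # parallel sampling
    finished = []
    for code in candidates:
        for _ in range(r):  # sequential scaling via iterative debugging
            ok, feedback = run_public_tests(code)  # outputs / error messages
            if ok:
                break
            code = revise(problem, code, feedback)  # refine using feedback
        finished.append(code)
    return finished
```

In practice the parallel sampling would be batched or issued concurrently; the loop here just keeps the control flow explicit.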

Selection Stage

The Selection stage focuses on accurately identifying the best solution from the generated candidates, overcoming limitations of existing selection methods like LLM-as-a-judge. The framework employs a novel adaptive input synthesis mechanism. Initially, the LLM synthesizes a set of test inputs. The N generated samples are executed and clustered based on their execution outputs. Pairwise comparisons are made across these clusters. For each comparison, the LLM is prompted to generate distinguishing inputs specifically designed to differentiate between the two samples. These adaptive inputs are executed, and the outputs inform the LLM in selecting the better sample. This execution-grounded approach aims to provide a more robust and accurate sample selection process.
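A minimal sketch of this selection procedure, assuming hypothetical callables for the LLM prompts (`synthesize_inputs`, `gen_distinguishing_input`, `judge`) and a sandbox `execute`:

```python
from collections import defaultdict

def selection_stage(problem, candidates, synthesize_inputs, execute,
                    gen_distinguishing_input, judge):
    """Sketch of S* adaptive-input-synthesis selection: cluster candidates
    by execution behaviour, then run a pairwise tournament over cluster
    representatives using LLM-generated distinguishing inputs."""
    inputs = synthesize_inputs(problem)          # LLM-proposed test inputs
    clusters = defaultdict(list)                 # group by execution outputs
    for code in candidates:
        key = tuple(execute(code, x) for x in inputs)
        clusters[key].append(code)
    # Pairwise tournament across one representative per cluster.
    reps = [codes[0] for codes in clusters.values()]
    best = reps[0]
    for challenger in reps[1:]:
        x = gen_distinguishing_input(problem, best, challenger)
        out_a, out_b = execute(best, x), execute(challenger, x)
        best = judge(problem, (best, out_a), (challenger, out_b))
    return best
```

Grounding the judge in concrete execution outputs, rather than asking it to compare two programs textually, is what the paper credits for the improved selection accuracy.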

Experimental Results

The paper presents extensive experimental results demonstrating the effectiveness of S*.

  • Consistent Performance Improvement: S* consistently improves performance across a wide range of models, including instruction-based and reasoning models, various model families (e.g., Qwen, DeepSeek), sizes (ranging from 0.5B to 32B parameters), and access types (open and closed source). Evaluations were conducted on LiveCodeBench and CodeContests benchmarks.
  • Smaller Models Outperforming Larger Models: S* enables smaller models to surpass larger models within the same family. For example, Qwen2.5-7B-Instruct + S* outperformed Qwen2.5-32B-Instruct without S* on LiveCodeBench.
  • Instruction Models Surpassing Reasoning Models: S* allows instruction-based models to outperform reasoning models. GPT-4o-mini + S* surpassed o1-preview.
  • Competitive Performance: S* helps open-source reasoning models achieve performance levels competitive with state-of-the-art closed models. DeepSeek-R1-Distill-Qwen-32B + S* achieved 85.7% on LiveCodeBench, approaching the performance of o1 (high) at 88.5%.
  • Superiority Over Existing Methods: S* demonstrably outperforms widely-used test-time scaling methods like self-debugging and majority voting by enhancing both coverage and selection accuracy.
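For contrast, the majority-voting baseline can be sketched as picking the candidate whose execution behaviour is most common across a fixed set of inputs; `execute` is again a hypothetical sandbox call:

```python
from collections import Counter

def majority_vote(candidates, execute, inputs):
    """Baseline: select the candidate whose outputs on the given inputs
    match the most frequent output pattern across all candidates."""
    keys = [tuple(execute(c, x) for x in inputs) for c in candidates]
    winner_key, _ = Counter(keys).most_common(1)[0]
    return candidates[keys.index(winner_key)]
```

Unlike S*, this baseline cannot adaptively generate new inputs to break ties between candidates that agree on the fixed input set, which is one of the failure modes the adaptive selection mechanism targets.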

These results collectively suggest that S* is a general and effective test-time scaling technique: it significantly improves the quality and reliability of LLM-generated code by intelligently leveraging additional compute at inference time.

Limitations and Future Directions

The paper identifies specific limitations and suggests potential future research directions.

Limitations

  • Focus on Competition-Level Code Generation: The research primarily targets competition-level code generation tasks and does not address broader software engineering tasks.
  • Computational Cost: The method prioritizes accuracy and does not explicitly attempt to minimize the additional test-time compute it consumes.

Future Directions

Based on the identified limitations, potential future directions include:

  • Application to Software Engineering Tasks: Investigating the applicability and effectiveness of S* in more complex software engineering scenarios, such as those evaluated by benchmarks like SWE-bench. This would likely require adapting the iterative debugging and selection mechanisms to handle more complex codebases and tasks.
  • Cost Optimization: Exploring ways to optimize the computational cost of S*, potentially by dynamically adjusting the number of parallel samples or the number of iterative debugging rounds based on the problem difficulty or the model's performance.
  • Improved Reward Models: Although the paper argues that developing a general reward model for code generation remains challenging, further research into reward models could potentially improve the selection stage and reduce the reliance on pairwise comparisons.
  • Integration with Other Parallel Scaling Techniques: The ablation study suggested that better in-context learning (ICL) example selection could further improve parallel scaling. Developing more robust ICL techniques and integrating them with S* is a promising area.
  • Adaptive Hyperparameter Tuning: Exploring adaptive strategies for tuning the hyperparameters of S*, such as the temperature, number of samples, and number of debugging rounds, based on the specific model and task.
  • Theoretical Analysis: Further theoretical analysis of the S* framework could provide a deeper understanding of its performance characteristics and guide future improvements.