
A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement (2409.05001v1)

Published 8 Sep 2024 in cs.SE and cs.AI

Abstract: LLMs have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%-162.43% compared to prompting LLMs directly.

The paper introduces PairCoder, a novel framework for code generation using LLM agents that mimics pair programming practices. The framework addresses limitations in existing code generation approaches, which often struggle with complex programming problems due to their reliance on single, potentially flawed solution plans. PairCoder employs two collaborative agents: a Navigator and a Driver. The Navigator focuses on high-level planning, generating multiple solution plans, selecting the optimal plan, and directing subsequent iterations based on execution feedback. The Driver focuses on specific code implementation, including initial code generation, testing, and refinement, guided by the Navigator.

The core idea behind PairCoder is to emulate the iterative and adaptive strategies used by human developers, specifically the "multi-plan exploration and practical feedback" cycle. The Navigator agent first reflects on the problem description $\mathcal{Q}$ to understand the requirements, constraints, and edge cases. Based on this reflection, the Navigator proposes $n$ potential solution plans $\{P_i\}_{i=1}^n$. These plans are then clustered into $k$ representative candidates $S$ using text embeddings and the k-means++ algorithm [kmeans]. In each iteration, the Navigator selects the best solution plan $plan$ from the remaining candidates in $S$ based on correctness, efficiency, and robustness. This plan is then passed to the Driver agent for code generation.
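The clustering step described above can be illustrated with a small sketch. It is not the paper's implementation: the two-dimensional "embeddings" are toy vectors standing in for real text embeddings, the function names are hypothetical, and picking the plan closest to each centroid as the representative is an assumption made here for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: first center uniform, later ones weighted by D^2."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point already coincides with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations on top of k-means++ seeding."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)

    def assign():
        return [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                for p in points]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign()

def representative_plans(plans, embeddings, k, seed=0):
    """Assumption: the representative is the plan nearest to its centroid."""
    centers, labels = kmeans(embeddings, k, seed=seed)
    reps = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            best = min(members, key=lambda i: squared_distance(embeddings[i], centers[j]))
            reps.append(plans[best])
    return reps

# Toy data: six hypothetical plans whose embeddings form three tight groups.
plans = ["brute-force A", "brute-force B", "dp A", "dp B", "greedy A", "greedy B"]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0), (0.1, 9.0)]
reps = representative_plans(plans, embeddings, k=3)
print(reps)
```

With real plans, the embeddings would come from a text-embedding model, and a library implementation of k-means++ would likely replace the hand-written loop.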

The Driver agent generates the initial code $C$ based on the selected plan. The generated code is tested against a set of public test cases $\mathcal{T}_v = \{(I_i, O_i)\}_{i=1}^{m_v}$, where $I_i$ is the input and $O_i$ is the desired output. The execution feedback $F$ is categorized into four types: Pass, Runtime Error, Wrong Answer, and Time Limit Exceeded. If the code passes all public test cases, the process terminates. Otherwise, the Driver sends the code and execution feedback back to the Navigator.
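A minimal sketch of how the four feedback categories might be produced is shown below. It assumes candidate programs read from standard input and write to standard output, runs each one in a fresh Python subprocess, and compares whitespace-stripped output. The function name `run_public_tests`, the timeout value, and the detail strings are illustrative choices, not details taken from the paper.

```python
import subprocess
import sys

def run_public_tests(code, tests, timeout=5.0):
    """Run `code` on each (input, expected_output) pair.

    Returns (verdict, detail) for the first failing test,
    or ("Pass", "") when every public test succeeds.
    """
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded", f"input={stdin_text!r}"
        if proc.returncode != 0:
            # Report the last line of the traceback, if any.
            stderr_lines = proc.stderr.strip().splitlines()
            detail = stderr_lines[-1] if stderr_lines else ""
            return "Runtime Error", detail
        if proc.stdout.strip() != expected.strip():
            return "Wrong Answer", f"expected={expected!r} got={proc.stdout!r}"
    return "Pass", ""

# Hypothetical problem: print the sum of two integers on one line.
public_tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
good = "print(sum(map(int, input().split())))"
wrong = "print(0)"
crash = "raise ValueError('boom')"
slow = "while True: pass"

for name, program in [("good", good), ("wrong", wrong), ("crash", crash)]:
    print(name, run_public_tests(program, public_tests))
print("slow", run_public_tests(slow, public_tests, timeout=1.0))
```

Returning a detail string alongside the verdict gives a planning agent something concrete to base a repair strategy on, rather than the category alone.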

To avoid getting stuck in a dead-end loop, PairCoder incorporates a long-term memory module that stores the coding history $\mathcal{H}_c = \{C^i\}_{i=1}^{r}$ and the execution feedback history $\mathcal{H}_f = \{F^i\}_{i=1}^{r}$. The Navigator uses this history to determine whether to change the solution plan or continue refining the current code. If the generated code or execution feedback has already occurred in the past, the Navigator selects a new solution plan from the remaining candidates. Otherwise, the Navigator proposes a repair strategy based on the execution feedback, and the Driver refines the code accordingly. The process continues until the code passes all public test cases or the maximum number of iterations $r$ is reached.
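The control flow described above can be sketched as follows. The LLM calls are replaced by plain callables (`generate`, `repair`, `run_tests`), which are hypothetical stand-ins, and plan selection is simplified to taking the next remaining candidate. The stubs at the bottom simulate a first plan that gets stuck and a second plan that succeeds.

```python
def pair_coder_loop(plans, generate, repair, run_tests, max_iters):
    """Iterate until the code passes or `max_iters` rounds are used.

    Returns the passing code, or None if no candidate passed in time.
    """
    code_history, feedback_history = [], []  # long-term memory
    remaining = list(plans)
    plan = remaining.pop(0)  # stand-in for the Navigator's plan selection
    code = generate(plan)
    for _ in range(max_iters):
        feedback = run_tests(code)
        if feedback == "Pass":
            return code
        seen_before = code in code_history or feedback in feedback_history
        code_history.append(code)
        feedback_history.append(feedback)
        if seen_before:
            # Dead end: abandon this plan and try the next candidate.
            if not remaining:
                return None
            plan = remaining.pop(0)
            code = generate(plan)
        else:
            # New situation: refine the current code using the feedback.
            code = repair(plan, code, feedback)
    return None

# Stubs (assumptions for illustration only; strings stand in for programs).
def generate(plan):
    return f"{plan}-v0"

def repair(plan, code, feedback):
    # The "greedy" plan keeps producing the same broken revision;
    # the "dp" plan is fixed after one repair.
    return "greedy-v1" if plan == "greedy" else "dp-v1"

def run_tests(code):
    return "Pass" if code == "dp-v1" else f"Wrong Answer on {code}"

result = pair_coder_loop(["greedy", "dp"], generate, repair, run_tests, max_iters=10)
print(result)
```

In this sketch the feedback strings carry enough detail to be distinguishable; if only the coarse category were stored, nearly every failed round would look like a repeat and trigger a plan switch.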

The time complexity of PairCoder is $O(r \times c)$, where $r$ is the number of iterations and $c$ is the cost of operations within each iteration. The space complexity is $O(r)$, due to the storage of historical coding and execution data.

The paper evaluates PairCoder on five code generation benchmarks: HumanEval, HumanEval+, MBPP, MBPP+, and CodeContest. The results demonstrate that PairCoder achieves superior accuracy compared to competitive baselines, including prompting techniques like CoT [cot] and SCoT, as well as refinement-based approaches like Self-repair, Self-debugging, INTERVENOR, and Reflexion. PairCoder achieves relative pass@1 improvements of 12.00% to 162.43% compared to direct prompting with LLMs.
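For readers unfamiliar with the term, a relative improvement is the gain divided by the baseline score, not the absolute difference in percentage points. The short sketch below shows the arithmetic. The pass@1 values in it are hypothetical numbers chosen for illustration and are not results from the paper.

```python
def relative_improvement(baseline_pass_at_1, method_pass_at_1):
    """Relative gain over the baseline, expressed in percent."""
    return 100.0 * (method_pass_at_1 - baseline_pass_at_1) / baseline_pass_at_1

# Hypothetical scores (in percent), not taken from the paper.
baseline = 50.0
method = 56.0
gain = relative_improvement(baseline, method)
print(f"absolute gain: {method - baseline:.2f} points, relative gain: {gain:.2f}%")
```

A low baseline inflates the relative figure: the same absolute gain of six points over a baseline of 10.0 would be a 60% relative improvement.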

The paper analyzes the impact of the maximum number of iterations $r$ and the number of clusters $k$ on PairCoder's accuracy. As the iteration count increases, PairCoder exhibits continuous accuracy improvement, outperforming refinement-based baselines, which tend to plateau after a certain number of iterations. A moderate cluster number of $k=3$ appears to be optimal for PairCoder with the maximum number of iterations $r=10$.

Ablation studies demonstrate that both multi-plan exploration and feedback-driven refinement contribute to PairCoder's performance, with feedback-driven refinement showing more significant improvements across the benchmarks. Error analysis reveals that Wrong Answer is the most common error type, highlighting the need to improve the functional correctness of code generation. The coverage of public test cases $\mathcal{T}_v$ is found to limit PairCoder's ability to facilitate code generation.

The paper also conducts a cost analysis, measuring the average number of API calls and token consumption per problem. PairCoder requires more API calls than most prompting techniques and some refinement-based approaches, but its token consumption remains moderate, a cost that the performance gains justify.

Authors (4)
  1. Huan Zhang (171 papers)
  2. Wei Cheng (175 papers)
  3. Yuhan Wu (32 papers)
  4. Wei Hu (308 papers)
Citations (2)