A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement (2409.05001v1)

Published 8 Sep 2024 in cs.SE and cs.AI

Abstract: LLMs have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%-162.43% compared to prompting LLMs directly.


The paper introduces PairCoder, a novel framework for code generation using LLM agents that mimics pair programming practices. The framework addresses limitations in existing code generation approaches, which often struggle with complex programming problems due to their reliance on single, potentially flawed solution plans. PairCoder employs two collaborative agents: a Navigator and a Driver. The Navigator focuses on high-level planning, generating multiple solution plans, selecting the optimal plan, and directing subsequent iterations based on execution feedback. The Driver focuses on specific code implementation, including initial code generation, testing, and refinement, guided by the Navigator.

The core idea behind PairCoder is to emulate the iterative and adaptive strategies used by human developers, specifically the "multi-plan exploration and practical feedback" cycle. The Navigator agent first reflects on the problem description $\mathcal{Q}$ to understand the requirements, constraints, and edge cases. Based on this reflection, the Navigator proposes $n$ potential solution plans $\{P_i\}_{i=1}^n$. These plans are then clustered into $k$ representative candidates $S$ using text embeddings and the k-means++ algorithm [kmeans]. In each iteration, the Navigator selects the best solution plan $plan$ from the remaining candidates $S$ based on correctness, efficiency, and robustness. This plan is then passed to the Driver agent for code generation.
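The clustering step can be illustrated with a minimal, standard-library-only sketch. It assumes the plan embeddings are already available as plain lists of floats (the toy 2-D vectors below stand in for real text embeddings), and it assumes that the representative of each cluster is the plan closest to the centroid; the summary does not say how representatives are picked, so that choice and the names `cluster_plans` and `kmeans_pp_init` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
            continue
        threshold = rng.uniform(0, total)
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc >= threshold:
                centers.append(list(p))
                break
        else:  # floating-point safety net
            centers.append(list(points[-1]))
    return centers


def cluster_plans(plans, embeddings, k, iters=20, seed=0):
    """Cluster plans by embedding and return one representative per cluster.

    Assumption: the representative is the plan nearest to its centroid.
    """
    rng = random.Random(seed)
    k = min(k, len(plans))
    centers = kmeans_pp_init(embeddings, k, rng)
    assign = [0] * len(plans)
    for _ in range(iters):
        # Assignment step: nearest center for every embedding.
        assign = [min(range(k), key=lambda j: sq_dist(e, centers[j]))
                  for e in embeddings]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    representatives = []
    for j in range(k):
        idxs = [i for i, a in enumerate(assign) if a == j]
        if idxs:
            best = min(idxs, key=lambda i: sq_dist(embeddings[i], centers[j]))
            representatives.append(plans[best])
    return representatives


# Toy data: six hypothetical plans in three well-separated groups.
plans = ["greedy A", "greedy B", "dp A", "dp B", "graph A", "graph B"]
embeddings = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10.1, 0]]
reps = cluster_plans(plans, embeddings, k=3)
```

With real embeddings, a library implementation of k-means++ would normally replace the hand-written loop; the sketch only shows how $n$ proposed plans reduce to $k$ candidates.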

The Driver agent generates the initial code $C$ based on the selected plan. The generated code is tested against a set of public test cases $\mathcal{T}_v = \big\{(I_i, O_i)\big\}_{i=1}^{m_v}$, where $I_i$ is the input and $O_i$ is the desired output. The execution feedback $F$ is categorized into four types: Pass, Runtime Error, Wrong Answer, and Time Limit Exceeded. If the code passes all public test cases, the process terminates. Otherwise, the Driver sends the code and execution feedback back to the Navigator.
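The four feedback categories can be sketched as a small test harness. This is an illustration under stated assumptions, not the paper's implementation: it assumes each candidate is a Python program that reads $I_i$ from standard input and prints its answer, that outputs are compared after stripping whitespace, and that a wall-clock limit decides Time Limit Exceeded. The function name `run_public_tests` and the default limit are hypothetical.

```python
import subprocess
import sys


def run_public_tests(code, tests, time_limit=5.0):
    """Run `code` on each (input, expected_output) pair.

    Returns (verdict, detail), where verdict is one of "Pass",
    "Runtime Error", "Wrong Answer", or "Time Limit Exceeded".
    Stops at the first failing test case.
    """
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded", f"input={stdin_text!r}"
        if proc.returncode != 0:
            # Keep only the last stderr line (usually the exception message).
            lines = proc.stderr.strip().splitlines() or ["non-zero exit code"]
            return "Runtime Error", lines[-1]
        if proc.stdout.strip() != expected.strip():
            detail = (f"input={stdin_text!r}, expected={expected.strip()!r}, "
                      f"got={proc.stdout.strip()!r}")
            return "Wrong Answer", detail
    return "Pass", ""


# Hypothetical public tests for "print the sum of two integers".
public_tests = [("1 2\n", "3"), ("10 -4\n", "6")]
verdict, detail = run_public_tests(
    "print(sum(map(int, input().split())))", public_tests
)
```

Returning a detail string alongside the verdict gives a repair step something concrete to act on, such as the failing input or the exception message.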

To avoid getting stuck in a dead-end loop, PairCoder incorporates a long-term memory module that stores the coding history $\mathcal{H}_c = \{C^i\}_{i=1}^{r}$ and the execution feedback history $\mathcal{H}_f = \{F^i\}_{i=1}^{r}$. The Navigator uses this history to determine whether to change the solution plan or continue refining the current code. If the generated code or execution feedback has already occurred in the past, the Navigator selects a new solution plan from the remaining candidates. Otherwise, the Navigator proposes a repair strategy based on the execution feedback, and the Driver refines the code accordingly. The process continues until the code passes all public test cases or the maximum number of iterations $r$ is reached.
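The switch-or-refine decision can be sketched as a plain control loop. The LLM calls are replaced by caller-supplied functions (`select_plan`, `generate`, `repair`, `run_tests`), all of which are hypothetical stand-ins, and the sketch assumes the loop simply stops when a repeat is detected and no candidate plans remain, a case the summary does not describe.

```python
def pair_coding_loop(plans, select_plan, generate, repair, run_tests, max_iters):
    """Iterate plan selection, code generation, testing, and refinement.

    Returns (code, passed). `run_tests` must return a (verdict, detail)
    tuple whose verdict is "Pass" on success.
    """
    candidates = list(plans)
    code_history = []      # H_c: every code version tried so far
    feedback_history = []  # H_f: every execution feedback seen so far
    plan, code, feedback = None, None, None

    for _ in range(max_iters):
        if plan is None:
            # First round, or the previous plan was abandoned.
            if not candidates:
                break  # assumption: stop when no plans remain
            plan = select_plan(candidates)
            candidates.remove(plan)
            code = generate(plan)
        else:
            # Same plan: refine the code using the latest feedback.
            code = repair(plan, code, feedback)

        feedback = run_tests(code)
        if feedback[0] == "Pass":
            return code, True

        # A repeated code version or repeated feedback signals a dead end.
        repeated = code in code_history or feedback in feedback_history
        code_history.append(code)
        feedback_history.append(feedback)
        if repeated:
            plan = None  # force a plan switch on the next iteration

    return code, False


# Stub agents: the "brute" plan never improves, the "dp" plan passes at once.
result = pair_coding_loop(
    plans=["brute", "dp"],
    select_plan=lambda cands: cands[0],
    generate=lambda plan: f"{plan}-v0",
    repair=lambda plan, code, fb: code,  # an ineffective repair
    run_tests=lambda code: ("Pass", "") if code.startswith("dp")
    else ("Wrong Answer", "case 1"),
    max_iters=5,
)
```

In the stubbed run, the second attempt on the "brute" plan reproduces both the code and the feedback from the first, so the loop abandons that plan and succeeds with "dp" on the third iteration.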

The time complexity of PairCoder is $O(r \times c)$, where $r$ is the number of iterations and $c$ is the cost of operations within each iteration. The space complexity is $O(r)$, due to the storage of historical coding and execution data.

The paper evaluates PairCoder on five code generation benchmarks: HumanEval, HumanEval+, MBPP, MBPP+, and CodeContest. The results demonstrate that PairCoder achieves superior accuracy compared to competitive baselines, including prompting techniques like CoT [cot] and SCoT, as well as refinement-based approaches like Self-repair, Self-debugging, INTERVENOR, and Reflexion. PairCoder achieves relative pass@1 improvements of 12.00%-162.43% compared to direct prompting with LLMs.
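For readers unfamiliar with the metric, a relative improvement is the gain divided by the baseline score, expressed as a percentage. The short sketch below shows the arithmetic; the function name is hypothetical and the scores are made-up illustrative numbers, not results from the paper.

```python
def relative_improvement(baseline_pass1, method_pass1):
    """Relative pass@1 improvement over a baseline, in percent."""
    if baseline_pass1 <= 0:
        raise ValueError("baseline pass@1 must be positive")
    return 100.0 * (method_pass1 - baseline_pass1) / baseline_pass1


# Illustrative only: a baseline at 40.0 pass@1 and a method at 50.0 pass@1.
gain = relative_improvement(40.0, 50.0)  # 10 points absolute, 25% relative
```

A relative figure above 100% therefore means the method more than doubles the baseline's pass@1, which is most easily reached when the baseline score is low.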

The paper analyzes the impact of the maximum number of iterations $r$ and the number of clusters $k$ on PairCoder's accuracy. As the iteration count increases, PairCoder exhibits continuous accuracy improvement, outperforming refinement-based baselines, which tend to plateau after a certain number of iterations. A moderate cluster number of $k=3$ appears to be optimal for PairCoder with the maximum number of iterations $r=10$.

Ablation studies demonstrate that both multi-plan exploration and feedback-driven refinement contribute to PairCoder's performance, with feedback-driven refinement showing more significant improvements across the benchmarks. Error analysis reveals that Wrong Answer is the most common error type, highlighting the need to improve the functional correctness of code generation. The coverage of the public test cases $\mathcal{T}_v$ is also found to be a limiting factor: PairCoder can only refine code against the behavior those tests exercise.

The paper also conducts a cost analysis, measuring the average number of API calls and token consumption per problem. PairCoder requires more API calls than most prompting techniques and some refinement-based approaches, but its token consumption remains moderate, an overhead that the paper argues is justified by the accuracy gains.


Authors

Huan Zhang
Wei Cheng
Yuhan Wu
Wei Hu
