Lever: Learning to Verify Language-to-Code Generation with Execution
The paper introduces "Lever," an approach that improves language-to-code generation with code-pretrained LLMs (code LLMs) by adding a verification step grounded in execution results. Prior language-to-code methods often rerank LLM-generated programs using execution over test cases or other heuristics, but test cases are not always available, and such heuristics can miss the semantic cues needed to judge correctness. Lever addresses this by training verifiers that assess whether a generated program is correct based on the natural language input, the program itself, and its execution result.
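As a rough illustration of this idea, the sketch below executes a candidate program and packages the three signals the verifier conditions on, the natural language query, the program text, and the execution result, into a single verifier input. The helper names and prompt format are hypothetical and are not taken from the paper's released code.

```python
import subprocess
import sys


def execute_program(code: str, timeout: float = 5.0) -> str:
    """Run a candidate Python program in a subprocess and capture its output.

    Returns the program's stdout, or an error marker if it crashes or times
    out. (Illustrative only; a real system needs a proper sandbox.)
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode != 0:
            stderr = result.stderr.strip()
            last_line = stderr.splitlines()[-1] if stderr else "nonzero exit"
            return f"ERROR: {last_line}"
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return "ERROR: timeout"


def make_verifier_input(question: str, program: str, exec_result: str) -> str:
    """Concatenate the natural language input, the generated code, and its
    execution result into one string for a learned verifier to score."""
    return (
        f"Question: {question}\n"
        f"Program:\n{program}\n"
        f"Execution result: {exec_result}\n"
        "Is the program correct?"
    )
```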
Key Contributions
- Verification with Execution: Lever proposes training verifiers that evaluate whether LLM-generated programs are correct, factoring the natural language input, code, and execution results into the decision-making process.
- Reranking Framework: The verification score is combined with the LLM's generation probability, and scores are marginalized over programs that yield the same execution result, prioritizing outputs whose execution results are most likely correct (see the sketch after this list).
- Performance: Lever demonstrated consistent improvements across four datasets spanning text-to-SQL semantic parsing, table question answering (QA), math reasoning, and basic Python programming, improving execution accuracy by 4.6% to 10.9% over the base code-davinci-002 model and setting new state-of-the-art results on these tasks.
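The reranking step described above can be sketched as follows: each candidate program carries a generator log-probability and a verifier probability of correctness, the two are multiplied, and the scores of programs that produce the same execution result are summed before selecting the best result group. The function below is a minimal illustration with made-up field names, not the authors' implementation.

```python
import math
from collections import defaultdict


def rerank(candidates):
    """Rerank candidate programs by combining generator and verifier signals.

    `candidates` is a list of dicts with (hypothetical) keys:
      - "program":     the generated code string
      - "logprob":     log p_LM(program | natural language input)
      - "verifier_p":  verifier probability that the program is correct
      - "exec_result": the (hashable) execution result of the program

    Each candidate is scored by p_LM * p_verifier; scores are aggregated over
    candidates sharing an execution result, and the best program from the
    top-scoring result group is returned.
    """
    group_score = defaultdict(float)  # execution result -> aggregated score
    group_best = {}                   # execution result -> (score, candidate)

    for cand in candidates:
        score = math.exp(cand["logprob"]) * cand["verifier_p"]
        res = cand["exec_result"]
        group_score[res] += score
        if res not in group_best or score > group_best[res][0]:
            group_best[res] = (score, cand)

    best_result = max(group_score, key=group_score.get)
    return group_best[best_result][1]["program"]
```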
Numerical Results
Lever achieved significant accuracy gains on Spider, WikiTableQuestions, GSM8k, and MBPP through execution-informed reranking. On Spider, Lever raised execution accuracy from 75.3% to 81.9%, surpassing both the previous few-shot and finetuned state of the art. Similar gains were observed on the other datasets, reinforcing the value of incorporating execution feedback into the language-to-code pipeline.
Implications
- Practical: Lever offers a robust framework for improving code generation without finetuning the underlying code LLM, which is valuable when test cases are unavailable or when computational resources for extensive finetuning are limited.
- Theoretical: The work shows how semantic signals gleaned from execution results can directly improve program synthesis accuracy, bridging gaps left by heuristic-based reranking.
Future Directions
Further investigation could explore how well Lever scales to more diverse programming languages and model architectures. Using the verifier's feedback to iteratively refine the LLM's generation, rather than only reranking its outputs, is another promising avenue.
Lever illustrates the promise of coupling LLM-based generation with execution-informed verification, pointing toward more reliable translation of natural language into executable code. The paper lays a foundation for future work on improving model-driven code generation through semantic verification.