
LEVER: Learning to Verify Language-to-Code Generation with Execution (2302.08468v3)

Published 16 Feb 2023 in cs.LG, cs.CL, cs.PL, and cs.SE

Abstract: The advent of LLMs trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture the semantic features of the execution results, such as data type and value range, which often indicates the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.

LEVER: Learning to Verify Language-to-Code Generation with Execution

The paper introduces LEVER, an approach that enhances language-to-code generation with code LLMs by adding a verification step grounded in execution results. Existing language-to-code pipelines typically prune or rerank LLM samples using test cases or execution-based heuristics, yet test cases are often unavailable in real-world applications, and heuristics can miss semantic features of the execution results (such as data type and value range) that signal code correctness. LEVER addresses this by training verifiers that judge the correctness of each generated program from the natural language input, the program itself, and its execution result.
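To make the verifier's inputs concrete, the sketch below shows one way the three signals it conditions on (natural language input, candidate program, execution result) could be linearized for a binary correctness classifier. The function name, field labels, and example values are illustrative assumptions, not the paper's exact format.

```python
def build_verifier_input(nl_input: str, program: str, result: str) -> str:
    """Linearize the three signals a LEVER-style verifier conditions on:
    the natural language input, the candidate program, and its execution
    result. The field labels and layout here are illustrative only."""
    return (
        f"question: {nl_input}\n"
        f"program: {program}\n"
        f"execution: {result}"
    )

# A binary classifier over this text would then estimate
# P(correct | natural language input, program, execution result).
example = build_verifier_input(
    "How many rows have a value greater than 3?",
    "len(df[df['value'] > 3])",
    "4",
)
print(example)
```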

Key Contributions

  1. Verification with Execution: LEVER trains verifiers that judge whether an LLM-generated program is correct, conditioning on the natural language input, the program itself, and its execution result.
  2. Reranking Framework: The approach combines the verification score with the LLM generation probability and marginalizes over programs that yield the same execution result, prioritizing candidates whose execution is most likely correct (see the sketch after this list).
  3. Performance: LEVER shows consistent improvements across four datasets spanning table question answering (QA), math QA, and basic Python programming, with execution accuracy gains of 4.6% to 10.9% over code-davinci-002 and new state-of-the-art results on all of them.
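The reranking step can be summarized as aggregating, for each distinct execution result, the product of the LM generation probability and the verifier's correctness probability, then selecting the highest-scoring result. Below is a minimal sketch of that aggregation, assuming a hypothetical `verifier` callable and a simple sample format; it illustrates the scoring logic rather than the paper's actual implementation.

```python
import math
from collections import defaultdict

def rerank_by_execution(nl_input, samples, verifier):
    """LEVER-style reranking sketch.

    samples : list of dicts with keys 'program', 'logprob' (LM log-probability)
              and 'result' (a hashable execution result).
    verifier: callable (nl_input, program, result) -> P(correct) in [0, 1];
              a stand-in for the trained verifier model.
    Returns the highest-scoring execution result and the programs producing it.
    """
    agg = defaultdict(float)
    for s in samples:
        p_lm = math.exp(s["logprob"])                         # LM generation probability
        p_ok = verifier(nl_input, s["program"], s["result"])  # learned verification score
        agg[s["result"]] += p_lm * p_ok                       # marginalize over equal results

    best = max(agg, key=agg.get)
    return best, [s["program"] for s in samples if s["result"] == best]
```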

Numerical Results

LEVER achieved significant accuracy gains on the Spider, WikiTableQuestions, GSM8k, and MBPP benchmarks by leveraging execution-informed reranking. On Spider, LEVER raised execution accuracy from 75.3% to 81.9%, surpassing both the previous few-shot and finetuned state-of-the-art models. Similar gains were observed on the other datasets, reinforcing the value of incorporating execution feedback into the language-to-code pipeline.

Implications

Practical: LEVER offers a robust framework for improving code generation without additional finetuning of the base LLM, which is valuable when test cases are infeasible to obtain or when computational resources for extensive finetuning are limited.

Theoretical: This research advances understanding of how semantic signals gleaned from execution results can directly improve program synthesis accuracy, addressing limitations of purely heuristic-based reranking.

Future Directions

Further investigation could examine how well LEVER scales across a broader range of programming languages and model architectures. Another promising avenue is using the verifier's feedback iteratively to refine the LLM's generation process.

LEVER's methodology illustrates the promise of coupling LLM-based generation with learned, execution-grounded verification, pointing toward more reliable translation of natural language instructions into executable code. The paper lays a foundation for future research on enhancing model-driven code generation through semantic verification.

Authors (7)
  1. Ansong Ni (17 papers)
  2. Srini Iyer (8 papers)
  3. Dragomir Radev (98 papers)
  4. Ves Stoyanov (15 papers)
  5. Wen-tau Yih (84 papers)
  6. Sida I. Wang (20 papers)
  7. Xi Victoria Lin (39 papers)
Citations (171)