
Natural Language to Code Translation with Execution (2204.11454v2)

Published 25 Apr 2022 in cs.CL and cs.SE

Abstract: Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly incorporate program semantics (i.e., execution results) during training, they are able to generate correct solutions for many problems. However, choosing a single correct program from a generated set for each problem remains challenging. In this work, we introduce execution result--based minimum Bayes risk decoding (MBR-EXEC) for program selection and show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks. We select output programs from a generated candidate set by marginalizing over program implementations that share the same semantics. Because exact equivalence is intractable, we execute each program on a small number of test inputs to approximate semantic equivalence. Across datasets, execution or simulated execution significantly outperforms the methods that do not involve program semantics. We find that MBR-EXEC consistently improves over all execution-unaware selection methods, suggesting it as an effective approach for natural language to code translation. We open-source our code at github.com/facebookresearch/mbr-exec and data at dl.fbaipublicfiles.com/mbr-exec/mbr-exec-release.zip

Natural Language to Code Translation with Execution

The manuscript presents a novel approach to translating natural language into executable code, focusing on leveraging execution results when selecting among generated code samples. The paper introduces a method termed execution result-based minimum Bayes risk decoding (MBR-exec), which selects among generated programs by assessing approximate semantic equivalence via execution results.

Main Contributions

  • Execution-Aware MBR Decoding: The core contribution is an execution-aware selection mechanism. Existing pretrained code models, though successful at generating code from natural language, struggle to select the correct program from multiple candidates. With execution-based MBR decoding, candidate programs are evaluated for approximate semantic equivalence using execution results rather than solely syntactic features or token sequences.
  • Empirical Evaluation: The paper evaluates MBR-exec on datasets spanning several programming languages, including Python, SQL, and Bash, using benchmarks such as MBPP and Spider. Results show significant gains over non-execution-based selection baselines, with MBR-exec consistently selecting correct programs of quality comparable to that of intermediate-level human coders.

Methodology

The approach comprises two phases:

  1. Sample Collection: Initial programs are generated using a few-shot prompting technique with a unified format. The prompts include text descriptions and optional additional information to orient the model toward the desired functionality.
  2. MBR-exec Decoding: Each candidate program is executed on a small set of test inputs. A program's Bayes risk is estimated from how often its outputs disagree with those of the other sampled candidates, and the candidate with the lowest cumulative risk is selected.
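The selection step above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' released implementation; the convention that each candidate defines a function `solve(x)` is an assumption of this example.

```python
def run(program: str, test_input):
    """Execute one candidate program on one input.

    Hypothetical convention for this sketch: each candidate defines a
    function `solve(x)`. Execution errors are mapped to None so a
    crashing program cannot accumulate agreement with working ones.
    """
    namespace = {}
    try:
        exec(program, namespace)
        return namespace["solve"](test_input)
    except Exception:
        return None


def mbr_exec_select(candidates, test_inputs):
    """Return the index of the candidate whose execution outputs agree
    most often with the other candidates' outputs, i.e. the
    minimum-Bayes-risk choice under an output-matching loss."""
    outputs = [[run(p, x) for x in test_inputs] for p in candidates]

    def agreement(i):
        return sum(
            a == b
            for j in range(len(candidates)) if j != i
            for a, b in zip(outputs[i], outputs[j])
        )

    return max(range(len(candidates)), key=agreement)


candidates = [
    "def solve(x):\n    return x * 2",   # correct
    "def solve(x):\n    return x + x",   # semantically equivalent
    "def solve(x):\n    return x ** 2",  # wrong implementation
]
best = mbr_exec_select(candidates, test_inputs=[1, 2, 3])
# The two doubling programs agree with each other on every input, so
# one of them is selected over the squaring program.
```

The paper handles test inputs and error cases more carefully; this sketch only conveys the idea of marginalizing over candidates that share the same observed behavior.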

Key Findings

  1. Performance Improvement: Across all datasets, MBR-exec yields notable gains in execution accuracy over execution-unaware baselines, including likelihood-based selection techniques.
  2. Flexibility and Robustness: While MBR-exec primarily uses execution results for selection, metrics such as BLEU score were also found to be effective alternatives when execution is impractical.
  3. Impact of Sample Size and Temperature: MBR-exec's efficacy increases with the number of samples and is robust across a range of sampling temperatures, with the best performance generally observed at temperatures below 0.5.
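When execution is unavailable, the same MBR scheme can be instantiated with a surface-similarity metric instead of an output-matching one. The paper's execution-free variant uses BLEU; the sketch below substitutes a simple token-overlap score to stay dependency-free, which is an assumption of this example rather than the paper's exact metric.

```python
def token_overlap(a: str, b: str) -> float:
    """Crude surface-similarity score between two programs
    (a stand-in for BLEU in this sketch)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))


def mbr_surface_select(candidates):
    """Pick the candidate most similar, on average, to all others."""
    def score(i):
        return sum(
            token_overlap(candidates[i], candidates[j])
            for j in range(len(candidates)) if j != i
        )
    return max(range(len(candidates)), key=score)


samples = [
    "def f(x): return x + 1",
    "def f(x): return x + 1",   # duplicated surface form
    "def g(y): return y - 2",   # outlier
]
chosen = mbr_surface_select(samples)
# The duplicated form dominates the similarity vote, so it is chosen.
```

Swapping the matching function is the only change needed relative to the execution-based sketch, which is what makes the framework flexible when sandboxed execution is impractical.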

Implications and Future Directions

The paper has significant implications for AI-assisted coding, suggesting that integrating execution semantics can substantially improve code generation and selection in software development. Moreover, incorporating execution awareness into model training could further improve performance, opening avenues for future work. Given the growing prominence of AI in software engineering, advances in semantics-aware models could lead to automated code generation tools that more accurately translate real-world problems described in natural language into high-quality code.

In summary, the proposed MBR-exec methodology addresses key limitations in program selection by utilizing a results-focused approach, demonstrating its potential to improve code generation models' real-world applicability and performance.

Authors (5)
  1. Freda Shi (16 papers)
  2. Daniel Fried (69 papers)
  3. Marjan Ghazvininejad (33 papers)
  4. Luke Zettlemoyer (225 papers)
  5. Sida I. Wang (20 papers)
Citations (109)