Structural Language Models of Code
The paper "Structural LLMs of Code," authored by Uri Alon et al., introduces an innovative approach to addressing the any-code completion problem through structural LLMs (SLM) that inherently understand programming syntax. Its primary focus is generating arbitrary pieces of code within a given context, without imposing restrictions on the generated code’s syntax, vocabulary, or structure. This research leverages the abstract syntax trees (ASTs) commonly used to represent code to allow for more flexible and extensive code generation capabilities compared to existing methodologies.
Methodology
The authors propose a structural language modeling (SLM) approach that models code as a tree rather than as a sequence, the standard representation in NLP. By representing source code as an AST, the model decomposes the probability of a program into conditional probabilities over its nodes. A neural model computes each conditional by considering all partial paths in the AST leading to the target node, enabling it to generate the tree one node at a time. Unlike previous approaches, this method is not tied to a particular programming language or to a restricted set of constructs.
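To make the decomposition concrete, here is a minimal sketch in Python. It uses Python's built-in ast module as a stand-in for the paper's Java and C# ASTs; the preorder generation order and the score_node callback are illustrative assumptions, not the authors' actual model.

```python
# Minimal sketch: factorizing p(tree) into per-node conditionals.
import ast
import math

def preorder(node):
    """Yield AST nodes in the left-to-right depth-first order in which
    a structural language model would generate them."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def tree_log_prob(source, score_node):
    """Compute log p(tree) = sum_i log p(node_i | nodes generated so far).

    `score_node(context, node)` is a hypothetical callback returning
    p(node | context); in the paper this conditional is computed from
    the set of partial AST paths leading to the node being predicted.
    """
    tree = ast.parse(source)
    context, logp = [], 0.0
    for node in preorder(tree):
        logp += math.log(score_node(context, node))
        context.append(node)  # each generated node becomes context for the rest
    return logp

# Toy usage with a uniform "model" over an assumed vocabulary of 100 node types:
print(tree_log_prob("x = y + 1", lambda ctx, n: 1.0 / 100))
```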
The model is trained to estimate the probability of the program's AST by leveraging information from all paths leading up to a target node, a significant departure from sequence-based models, which are confined to a linear view of the code and a bounded context. A copy mechanism lets the model reuse contextually relevant tokens, which is critical in programming, where identifiers are frequently repeated.
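The paper's exact copy mechanism is not reproduced here; the following is a minimal pointer-style sketch in the same spirit, where the p_vocab, attention, and p_gen inputs are assumed to come from a trained model and all names are illustrative.

```python
# Minimal sketch of a pointer-style copy mechanism: mix a generation
# distribution over the vocabulary with a copy distribution over tokens
# appearing in the context.
from collections import defaultdict

def predict_token(p_vocab, attention, context_tokens, p_gen):
    """Score each candidate token as
    p(token) = p_gen * p_vocab[token]
             + (1 - p_gen) * total attention mass on context positions
               holding that token.
    """
    scores = defaultdict(float)
    for tok, p in p_vocab.items():
        scores[tok] += p_gen * p
    for weight, tok in zip(attention, context_tokens):
        scores[tok] += (1.0 - p_gen) * weight  # repeated identifiers pool mass
    return max(scores, key=scores.get)

# Toy example: "userCount" is out-of-vocabulary but appears twice in the
# context, so the copy term lets the model reproduce it anyway.
p_vocab = {"i": 0.4, "count": 0.35, "<unk>": 0.25}
attention = [0.5, 0.3, 0.2]
context = ["userCount", "userCount", "i"]
print(predict_token(p_vocab, attention, context, p_gen=0.3))  # -> "userCount"
```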
Results
The SLM approach was benchmarked against state-of-the-art sequence-to-sequence (seq2seq) and structured language models. On the Java any-code completion task, the model achieved an exact-match accuracy@1 of 18.04% and accuracy@5 of 24.83%, surpassing previous models. On the more syntactically constrained C# completion task, it likewise outperformed the former state of the art by a considerable margin. An ablation study further illustrated the importance of jointly modeling the input and the output, showing that separating the encoder and decoder phases hurts performance on any-code completion.
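For clarity, exact-match accuracy@k here means a prediction counts as correct if any of the model's top-k candidates exactly matches the reference. A short, self-contained sketch follows; the candidate lists are made up for illustration.

```python
# Computing exact-match accuracy@k over a set of examples.
def accuracy_at_k(references, candidate_lists, k):
    hits = sum(ref in cands[:k] for ref, cands in zip(references, candidate_lists))
    return hits / len(references)

refs = ["return x;", "i += 1;"]
cands = [["return x;", "return y;"], ["i -= 1;", "i += 1;", "i = 1;"]]
print(accuracy_at_k(refs, cands, k=1))  # 0.5: only the first top-1 matches
print(accuracy_at_k(refs, cands, k=5))  # 1.0: both references appear in the top 5
```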
Implications and Future Directions
This research highlights the potential of structural language models for automating and supporting software development tools, particularly in tasks like code completion, syntax error detection, and documentation generation. Robust code generation reduces the need for manual specification, allowing for more efficient coding practices.
Theoretically, the approach could extend beyond code generation to automated debugging and code quality evaluation. The adaptability of SLMs to different programming languages is promising for building more generalizable solutions in AI-driven software engineering.
Future work could extend these models toward compiler integration, potentially employing feedback loops for immediate error correction, and toward more sophisticated semantic analysis for contextually accurate code generation.
Overall, the paper presents a significant advancement in leveraging machine learning for programming language tasks, offering a scalable solution for code completion and generation across a wide array of contexts and applications.