
StructCoder: Structure-Aware Transformer for Code Generation (2206.05239v3)

Published 10 Jun 2022 in cs.LG and cs.SE

Abstract: There has been a recent surge of interest in automating software engineering tasks using deep learning. This paper addresses the problem of code generation, where the goal is to generate target code given source code in a different language or a natural language description. Most state-of-the-art deep learning models for code generation use training strategies primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are explicitly trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also support the decoder in preserving the syntax and data flow of the target code by introducing two novel auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark, and improves over baselines of similar size on the APPS code generation benchmark. Our code is publicly available at https://github.com/reddy-lab-code-research/StructCoder/.

This paper presents a Transformer-based model designed specifically to generate code while understanding its inherent structure. Instead of treating code as an ordinary text sequence, the work leverages both the syntax of the code and its underlying data flow information to improve the quality of generated programs.

Background and Motivation

Software developers work with programming languages that have rigid rules and well-defined structure. Even a small error can break a program's behavior, so tools that generate code must understand not just the tokens but also how the code is organized. Traditional natural language generation models have been successful on text tasks but are not fully suitable for code because they ignore the structure captured by the code's abstract syntax tree (AST) and data flow graph (DFG). This paper addresses that shortcoming by designing a model that is aware of the structure inherent in both the source and the target code.

Main Approach

The proposed model uses an encoder-decoder Transformer architecture with several novel modifications:

  • Structure-Aware Encoder: The encoder's input sequence is formed by concatenating three kinds of elements:
    • The original code tokens
    • AST leaves (which capture the syntactic structure)
    • DFG variables (which capture how data moves through the code)
    • Instead of using only standard positional embeddings, the encoder computes representations for AST leaves by combining node-type embeddings with node-height embeddings along the path from the root of the AST to each leaf. Furthermore, self-attention is modified in a structure-aware manner. For example:
    • When computing attention between code tokens, standard query–key methods are enhanced with relative position information.
    • For AST leaves, the model calculates a similarity score based on the common nodes along two leaves’ paths. This similarity score is added to the attention scores.
    • For DFG variables, attention is computed only when there is a connection (an edge) between the corresponding variables in the DFG. (A minimal sketch of these attention modifications appears at the end of this section.)
  • Structure-Aware Decoder: Alongside generating target tokens, the decoder is trained on two auxiliary tasks:
    • Predicts the types of the nodes along the root-leaf path of the AST for the generated token (called AST Paths Prediction or APP).
    • Predicts whether there is a data flow relation between the generated token and the previous tokens (called Data Flow Prediction or DFP).
  • Pretraining Strategy:

The model is pretrained with a denoising autoencoding objective that not only corrupts spans of code but also randomly drops parts of the AST and DFG information. The goal is for the model to learn to reconstruct the original code along with its structural details. This pretraining helps the network understand both the syntax and the semantics of code even before it is fine-tuned on specific tasks.
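
To make the encoder's structure-aware components more concrete, the following is a minimal PyTorch sketch of three of them: leaf embeddings built from node-type and node-height embeddings, the leaf–leaf similarity bias based on shared path nodes, and the DFG edge mask. This is a hypothetical illustration, not the authors' implementation; all names, shapes, and padding conventions are assumptions, and the standard relative-position term used between code tokens is omitted.

```python
import torch
import torch.nn as nn

class StructureAwareEncoderBias(nn.Module):
    """Sketch of the structure-aware pieces of the encoder.
    Shapes, names, and padding conventions are assumptions, not the paper's code."""

    def __init__(self, d_model, num_node_types, max_height):
        super().__init__()
        self.type_emb = nn.Embedding(num_node_types, d_model)
        self.height_emb = nn.Embedding(max_height, d_model)

    def leaf_embedding(self, path_types, path_lens):
        # path_types: (num_leaves, max_path_len) node-type ids along each
        # root-to-leaf path, right-padded; path_lens: (num_leaves,) true lengths.
        _, max_len = path_types.shape
        pos = torch.arange(max_len).unsqueeze(0)                   # 0-based index i-1
        heights = (path_lens.unsqueeze(1) - 1 - pos).clamp(min=0)  # node height n-i
        valid = (pos < path_lens.unsqueeze(1)).unsqueeze(-1).float()
        # E(l) = sum_i E_type(r_i) ⊙ E_height(n - i)
        emb = self.type_emb(path_types) * self.height_emb(heights) * valid
        return emb.sum(dim=1)                                      # (num_leaves, d_model)

    @staticmethod
    def leaf_similarity_bias(path_node_ids, path_lens):
        # Two leaves' root paths share exactly their common-ancestor prefix,
        # so their similarity is the length of that matching prefix of node ids.
        eq = path_node_ids.unsqueeze(0) == path_node_ids.unsqueeze(1)  # (L, L, P)
        prefix = torch.cumprod(eq.long(), dim=-1)                      # 1 while still matching
        min_len = torch.minimum(path_lens.unsqueeze(0), path_lens.unsqueeze(1))
        return torch.minimum(prefix.sum(dim=-1), min_len).float()      # (L, L) bias term

    @staticmethod
    def dfg_attention_mask(dfg_edges, num_vars):
        # DFG variables may only attend to variables they are connected to.
        mask = torch.zeros(num_vars, num_vars, dtype=torch.bool)
        for src, dst in dfg_edges:
            mask[src, dst] = True
            mask[dst, src] = True
        return mask
```

In the full model, the leaf-similarity term would be added to the query–key attention scores between AST-leaf positions, the DFG mask would restrict attention between variable positions, and ordinary relative-position attention would be used among code tokens.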
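
As a concrete, hypothetical illustration of this corruption scheme, the sketch below masks random spans of code with T5-style sentinel tokens and randomly drops the AST or DFG inputs; the sentinel naming, probabilities, and span lengths are assumptions, not the paper's reported settings.

```python
import random

def corrupt_for_pretraining(code_tokens, ast_leaves, dfg_nodes,
                            mask_prob=0.15, max_span=5, struct_drop=0.5):
    """Sketch of structure-aware denoising corruption (assumed hyperparameters)."""
    corrupted, i, sentinel = [], 0, 0
    while i < len(code_tokens):
        if random.random() < mask_prob:
            # Replace a short span of tokens with a single sentinel.
            corrupted.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            i += random.randint(1, max_span)
        else:
            corrupted.append(code_tokens[i])
            i += 1
    # Randomly drop structural inputs so the model cannot always rely on them.
    if random.random() < struct_drop:
        ast_leaves = []
    if random.random() < struct_drop:
        dfg_nodes = []
    # The decoder's target is the original code; its structure is supervised
    # through the APP and DFP auxiliary tasks.
    return corrupted, ast_leaves, dfg_nodes, code_tokens
```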

Experimental Evaluation

The paper demonstrates that incorporating code structure leads to significant performance improvements over previous models on several benchmarks:

  • Code Translation:

For tasks translating code from one programming language to another (for example, Java to C# and vice versa), the structure-aware model achieves higher scores on standard metrics. These metrics include BLEU (which measures n-gram overlap) and specialized metrics that account for the correctness of syntax and data flow in the generated code.
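
As a point of reference, plain BLEU can be computed with any standard implementation; the snippet below uses the sacrebleu library purely for illustration (the benchmark ships its own evaluation scripts), and the example strings are made up.

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["public int add ( int a , int b ) { return a + b ; }"]
references = [["public int add ( int a , int b ) { return a + b ; }"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for an exact match; real systems score far lower
```

The syntax- and data-flow-aware metric referred to here is CodeBLEU, which augments n-gram overlap with AST subtree matching and data-flow matching.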

  • Text-to-Code Generation:

When generating code from natural language descriptions, the model outperforms other encoder-decoder systems on benchmarks such as CodeXGLUE. Although generating code from text is particularly challenging due to the ambiguity in natural language, the structure-aware approach helps the model produce syntactically correct and semantically coherent code.

  • Ablation Studies and Analysis:

The paper carefully tests the contribution of each component. It shows that adding either the DFG or AST components individually improves the generated code, while combining them leads to the best performance. The auxiliary tasks in the decoder help guide the generation process without adding extra tokens during inference.

Key Technical Details

  • The encoder rearranges the input by concatenating special symbols, code tokens, AST leaves, and DFG variables.
  • For AST leaf embeddings, the model sums, over the nodes on the root-to-leaf path, the element-wise product of each node's type embedding and its height embedding. If a leaf's path is represented by nodes r₁, r₂, ..., rₙ, its embedding E(l) is computed as

E(l) = \sum_{i=1}^{n} E_{type}(r_i) \odot E_{height}(n - i)

where E_type(·) captures each node's syntactic role and E_height(·) its height above the leaf (the leaf itself has height 0).

  • The decoder uses extra linear layers, one set for predicting node types along the AST paths and another for estimating the probability of data flow links between tokens, as sketched below. These predictions are integrated into the overall loss, along with the standard language modeling loss for code generation.
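
A rough sketch of how such auxiliary heads and losses could be wired up is shown below; the module layout, the per-depth classifiers, and the loss weights are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DecoderAuxiliaryHeads(nn.Module):
    """Sketch of the decoder's auxiliary objectives (assumed design, not the paper's code)."""

    def __init__(self, d_model, num_node_types, max_depth, alpha=0.1, beta=0.1):
        super().__init__()
        # AST Paths Prediction (APP): one classifier per depth predicts the node
        # type at that depth on the generated token's root-to-leaf path.
        self.ast_heads = nn.ModuleList(
            [nn.Linear(d_model, num_node_types) for _ in range(max_depth)])
        # Data Flow Prediction (DFP): pairwise score for a data-flow edge between tokens.
        self.dfp_proj = nn.Linear(d_model, d_model)
        self.alpha, self.beta = alpha, beta
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, hidden, lm_loss, path_types, dfg_labels):
        # hidden:     (batch, seq, d_model) decoder hidden states
        # path_types: (batch, seq, max_depth) gold node-type ids (-100 = ignore)
        # dfg_labels: (batch, seq, seq) 0/1 data-flow edges between target tokens
        app_loss = sum(
            self.ce(head(hidden).flatten(0, 1), path_types[..., d].flatten())
            for d, head in enumerate(self.ast_heads)) / len(self.ast_heads)
        pair_logits = torch.matmul(self.dfp_proj(hidden), hidden.transpose(1, 2))
        dfp_loss = self.bce(pair_logits, dfg_labels.float())
        # Total training loss: token-level cross-entropy plus weighted auxiliary terms.
        return lm_loss + self.alpha * app_loss + self.beta * dfp_loss
```

Because the auxiliary heads only contribute loss terms during training, generation at inference time proceeds token by token as usual, with no extra tokens produced.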

Results and Impact

The structure-aware model consistently outperforms baseline systems that treat code as plain text. By explicitly modeling syntax and data flow:

  • It produces code that is more likely to compile and execute correctly.
  • It better preserves the syntactic and data-flow structure of the target programming language.
  • It provides evidence that integrating domain-specific structures into Transformer architectures can lead to better performance in specialized generation tasks.

Conclusion

This work makes a strong case for incorporating structure into code generation. By modifying both the encoder and the decoder so that they take into account the abstract syntax tree and the data flow graph, the approach produces more syntactically and semantically precise code. The detailed experimental results show improvements in multiple metrics and across different tasks, suggesting that future research in automated code generation will benefit from a structure-aware perspective.

Authors (3)
  1. Sindhu Tipirneni
  2. Ming Zhu
  3. Chandan K. Reddy
Citations (41)