TreeBERT: Tree-Structured Code Model
- TreeBERT is a tree-based pre-trained model that incorporates abstract syntax trees via composition paths and node-position embeddings to capture code hierarchies.
- It employs a hybrid pre-training objective combining Tree Masked Language Modeling and Node Order Prediction to learn both semantic and syntactic code properties.
- Empirical evaluations show TreeBERT outperforms baselines in code summarization and documentation tasks, demonstrating strong cross-language generalization.
TreeBERT is a tree-based pre-trained model designed to enhance generation tasks for programming languages by incorporating syntactic structure via abstract syntax trees (ASTs) (Jiang et al., 2021). Unlike prior work focusing primarily on code tokens or sequential code representations, TreeBERT utilizes AST composition paths and node position embeddings to represent code in a manner reflecting tree hierarchy and sibling ordering. It employs a hybrid pre-training objective, combining Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP), and achieves improved performance across code summarization and documentation tasks, demonstrating strong transferability to previously unseen programming languages.
1. AST Representation: Composition Paths and Node Position Embedding
TreeBERT parses code into its AST and represents the tree as the set of root-to-terminal composition paths

$$A = \{p_1, p_2, \ldots, p_N\},$$

where each path $p_i = (n_{i,1}, n_{i,2}, \ldots, n_{i,L_i})$ consists of non-terminal type nodes followed by a terminal value node (corresponding to a code token).
Each path is vectorized by concatenating the embeddings of its nodes:

$$x_i = \big[E(n_{i,1});\, E(n_{i,2});\, \ldots;\, E(n_{i,L_i})\big].$$
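To make the path extraction concrete, here is a minimal sketch using Python's built-in `ast` module (an illustrative assumption: TreeBERT uses its own multi-language parser, and `composition_paths` is a hypothetical helper, not the paper's code):

```python
import ast

def composition_paths(tree):
    """Collect root-to-leaf paths of node-type names from a Python AST.

    Terminals are approximated by leaf ast nodes; when a leaf carries a
    concrete token (identifier, argument name, constant), it is appended
    as the terminal value node.
    """
    paths = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            value = getattr(node, "id", getattr(node, "arg", None))
            if value is None and isinstance(node, ast.Constant):
                value = repr(node.value)
            paths.append(prefix + ([value] if value is not None else []))
        else:
            for child in children:
                walk(child, prefix)

    walk(tree, [])
    return paths

tree = ast.parse("def add(a, b):\n    return a + b")
paths = composition_paths(tree)
```

For `add(a, b)`, this yields one path per leaf, e.g. `Module → FunctionDef → arguments → arg → a`, mirroring the "non-terminals then terminal value" structure described above.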
To recover tree structural information, each node receives a node-position embedding in addition to its type/value embedding. For a node at level $l$ whose parent has position embedding $x_p$ and $n$ children, the $i$-th child's position embedding is

$$x_i = x_p + \frac{i}{n}\,E_l,$$

where the $E_l$ are learnable level embeddings. This compact scheme (approximately 1/200th the parameters of learned linear positions) allows extrapolation to arbitrarily large trees.
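The recurrence can be sketched as follows (a hedged reconstruction: the update $x_{\text{child}} = x_p + (i/n)\,E_l$ and the toy dimensions are assumptions for illustration, with random vectors standing in for learned level embeddings):

```python
import random

DIM, MAX_LEVEL = 4, 8
random.seed(0)
# Stand-ins for the learnable per-level embeddings E_l.
level_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_LEVEL)]

def child_position(parent_pos, level, i, n):
    """Position embedding of the i-th of n children (1-indexed) at `level`:
    parent position plus a fraction i/n of the level embedding."""
    e = level_emb[level]
    return [p + (i / n) * c for p, c in zip(parent_pos, e)]

root = [0.0] * DIM
first = child_position(root, 1, 1, 3)    # 1st of 3 children at level 1
second = child_position(root, 1, 2, 3)   # sibling: same direction, larger scale
grandchild = child_position(first, 2, 1, 1)
```

Because each node only stores one vector per level plus a scalar ratio, the scheme needs far fewer parameters than one embedding per absolute tree position, which is what enables extrapolation to deep trees.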
2. Pre-training Objectives: Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP)
TreeBERT learns both semantic and syntactic properties of code through two objectives:
2.1 Tree Masked Language Modeling (TMLM)
TMLM exploits tree locality by masking a subset of AST nodes along each path. The masking strategy favors deeper (near-leaf) nodes, assigning each node a score that grows with its depth, e.g.

$$s(n_{i,j}) = \frac{\mathrm{level}(n_{i,j})}{\mathrm{depth}(p_i)}.$$

The top-$k$ scoring nodes in each path are replaced with [MASK], producing the masked AST $A_{\text{mask}}$. In the corresponding code token sequence $C$, complement masking is applied: only tokens *not* corresponding to masked AST nodes are masked, producing $C_{\text{mask}}$.
This strategy forces the decoder to rely on the encoder's tree-structured features rather than copying visible tokens. The TMLM loss is the cross-entropy of reconstructing the original code:

$$\mathcal{L}_{\text{TMLM}} = -\sum_{c_t \in C} \log P\big(c_t \mid A_{\text{mask}}, C_{\text{mask}}\big).$$
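A minimal sketch of the depth-biased masking, assuming a depth-proportional score and top-$k$ selection (the paper's exact score function may differ; `mask_path` and `mask_code` are illustrative helpers):

```python
def mask_path(path, k):
    """Replace the k deepest nodes of a root-to-leaf path with '[MASK]'.

    Scores grow linearly with depth, so near-leaf (more semantic) nodes
    are masked first.
    """
    depth = len(path)
    scores = [(i + 1) / depth for i in range(depth)]
    topk = set(sorted(range(depth), key=lambda i: scores[i], reverse=True)[:k])
    return ['[MASK]' if i in topk else n for i, n in enumerate(path)], topk

def mask_code(tokens, masked_terminal_values):
    """Complement masking on the decoder side: keep only tokens whose AST
    terminal WAS masked; mask everything else."""
    return [t if t in masked_terminal_values else '[MASK]' for t in tokens]

masked_path, idx = mask_path(['Module', 'FunctionDef', 'Return', 'BinOp', 'Name', 'a'], 2)
masked_code = mask_code(['def', 'add', 'a', 'b'], {'a'})
```

The complementarity is the key design choice: a token is visible on exactly one side (AST or code), so the decoder can only reconstruct the full sequence by attending to the encoder's tree representation.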
2.2 Node Order Prediction (NOP)
NOP captures the ordering constraints in ASTs by randomly swapping two nodes in a path with 50% probability. The [CLS] output is passed through a sigmoid to predict whether the path is out of order, $\hat{y} = \sigma(W h_{\text{[CLS]}})$, trained with binary cross-entropy:

$$\mathcal{L}_{\text{NOP}} = -\big(y \log \hat{y} + (1-y)\log(1-\hat{y})\big),$$

where $y = 1$ if the path was shuffled and $y = 0$ otherwise.
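The NOP example construction and loss can be sketched as follows (`make_nop_example` and `nop_loss` are illustrative helpers, not the paper's code):

```python
import math
import random

def make_nop_example(path, rng):
    """With probability 0.5, swap two random nodes and label the path 1
    (out of order); otherwise leave it intact with label 0."""
    path = list(path)
    label = 0
    if len(path) >= 2 and rng.random() < 0.5:
        i, j = rng.sample(range(len(path)), 2)
        path[i], path[j] = path[j], path[i]
        label = 1
    return path, label

def nop_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy on the sigmoid output of the [CLS] head."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

rng = random.Random(0)
example, label = make_nop_example(['Module', 'FunctionDef', 'Return'], rng)
```

An uninformative prediction ($\hat{y} = 0.5$) costs $\ln 2 \approx 0.693$ per path regardless of the label, so the model is pushed to actually encode parent-before-child ordering.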
2.3 Hybrid Pre-training Objective
TreeBERT is trained jointly via the weighted objective

$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{TMLM}} + (1-\alpha)\,\mathcal{L}_{\text{NOP}},$$

with $\alpha = 0.75$ reported as yielding the best performance.
3. Model Architecture
TreeBERT adopts a standard Transformer encoder-decoder configuration (Vaswani et al. 2017) with key adaptations:
- Encoder: 6 layers, hidden size $d_{\text{model}} = 1024$, 8 attention heads.
- Decoder: identical configuration.
- Composition path vectors projected via a fully connected layer to match hidden size.
- Node types: learned 1024-dimensional vectors.
- Value nodes: BPE subtoken sum representations (32K merge operations).
- Node-position embeddings added to encoder-side embeddings.
- Shared vocabulary and embeddings across Java and Python.
- No additional "structural" attention layers; all tree structure is encoded via path serialization and position embeddings.
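The path-vector projection mentioned above can be sketched as follows (toy dimensions, zero-padding to a fixed path length, and a random matrix standing in for the learned fully connected layer are all assumptions for illustration):

```python
import random

random.seed(1)
D_NODE, D_MODEL, MAX_NODES = 8, 16, 20  # toy sizes; MAX_NODES matches the 20-node path cap

def linear(x, w):
    """Plain matrix-vector product: w is D_MODEL rows of len(x) weights."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Stand-in for the learned projection of the fully connected layer.
proj = [[random.gauss(0, 0.02) for _ in range(D_NODE * MAX_NODES)]
        for _ in range(D_MODEL)]

def path_vector(node_embs):
    """Concatenate the node embeddings of one composition path, zero-pad
    to MAX_NODES, and project to the model width."""
    flat = [v for emb in node_embs for v in emb]
    flat += [0.0] * (D_NODE * MAX_NODES - len(flat))
    return linear(flat, proj)

vec = path_vector([[1.0] * D_NODE for _ in range(3)])  # a 3-node path
```

The node-position embeddings from Section 1 would then be added to these projected path vectors on the encoder side.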
4. Pre-training Corpora and Hyperparameters
TreeBERT leverages large-scale, tree-parsed code datasets:
- Python: 7.2M files, 2B tokens, 500M AST paths, 7B AST nodes.
- Java: 14.1M files, 4.5B tokens, 1.1B AST paths, 16.5B AST nodes.
Model limits:
- Max paths per example: 100.
- Max nodes per path: 20.
- Max code length: 200 tokens.
Training specifics:
- Adam optimizer with L2 weight decay, dropout, and GELU activations.
- Warmup over 10% of steps, followed by linear decay of the learning rate.
- Learning rate and batch size: swept over candidate grids, with the best-performing configuration retained.
- Masking rates: TMLM—15% of nodes per path; NOP—50% chance to shuffle a path.
- Hybrid loss coefficient $\alpha = 0.75$.
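The warmup-then-linear-decay schedule can be sketched as follows (`lr_at` is a hypothetical helper; the 10% warmup fraction comes from the text above):

```python
def lr_at(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate at `step`: linear warmup over the first warmup_frac
    of training, then linear decay to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    return peak_lr * (total_steps - step) / max(total_steps - warmup, 1)
```

For example, with 1,000 total steps the rate climbs to its peak at step 100 and reaches zero exactly at step 1,000.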
5. Empirical Evaluation on Code Generation Tasks
TreeBERT is evaluated on code summarization and code documentation tasks, with consistent performance gains over baselines.
5.1 Code Summarization
Task: predict function name subtokens from function body code.
- Datasets: ETH Py150 (Python), Java-small/med/large splits.
- Metrics: Precision, Recall, F1.
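The subtoken-level metrics can be sketched as follows (a set-based sketch; the exact tokenization and micro/macro averaging used in the papers may differ):

```python
def subtoken_f1(pred, gold):
    """Precision, recall, and F1 over predicted vs. reference function-name
    subtokens, in the style of code2seq-like evaluation."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For instance, predicting `get|file|name` against the reference `get|name` scores precision 2/3, recall 1.0, and F1 0.8.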
Table 1: Code Summarization Results
| Model | Py150 | Java-small | Java-med | Java-large |
|---|---|---|---|---|
| Transformer | 16.13 | 31.41 | 41.22 | 48.13 |
| Graph2Seq | 29.26 | 42.79 | 53.05 | 58.15 |
| Code2Seq | 30.07 | 43.02 | 53.23 | 59.18 |
| Code+Gnn+GRU | 34.45 | 48.16 | 58.81 | 65.64 |
| MASS | 32.35 | 45.71 | 54.83 | 60.79 |
| CuBERT | 26.99 | 38.22 | 45.99 | 50.55 |
| CodeBERT | 29.58 | 41.10 | 49.64 | 54.76 |
| TreeBERT | 39.04 | 51.99 | 61.22 | 67.25 |
Per Table 1, TreeBERT gains roughly 1.6–4.6 absolute F1 points over the best non-pretrained model (Code+Gnn+GRU) and roughly 6–7 points over the strongest pre-trained baseline (MASS), with larger margins over CuBERT and CodeBERT.
5.2 Code Documentation
Task: generate natural language comments for code methods.
- Dataset: DeepCom (Java).
- Metric: BLEU-4.
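For reference, a self-contained BLEU-4 sketch (uniform n-gram weights, brevity penalty, no smoothing; the evaluation scripts used in the literature may differ in smoothing and tokenization):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4: geometric mean of clipped 1-4 gram
    precisions times a brevity penalty. Returns 0.0 if any precision is 0."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

A perfect match scores 1.0; any hypothesis with no 4-gram overlap scores 0.0 under this unsmoothed variant.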
Table 2: Code Documentation (Java) Results
| Model | BLEU |
|---|---|
| Transformer | 13.32 |
| DeepCom | 14.81 |
| Graph2Seq | 18.39 |
| Code2Seq | 18.48 |
| Code+Gnn+GRU | 19.73 |
| MASS | 18.92 |
| CuBERT | 17.41 |
| CodeBERT | 17.87 |
| TreeBERT | 20.49 |
5.3 Ablation Study
BLEU drops compared to full TreeBERT (Java documentation):
- “No TMLM”: 20.49 → 14.12 (–6.37)
- “No NOP”: 20.49 → 16.71 (–3.78)
- “No Node-Position Embedding”: 20.49 → 20.25 (–0.24)
- “Random Masking in AST”: 20.49 → 14.81 (–5.68)
- “Only Mask Value Nodes”: 20.49 → 18.25 (–2.24)
This suggests the TMLM and NOP objectives and tree-specific masking are essential to TreeBERT’s performance gains.
6. Generalization to Unseen Programming Languages
TreeBERT exhibits language-agnostic generalization. When fine-tuned for code documentation on C# (CodeNN’s StackOverflow QA), TreeBERT outperforms all baselines, despite C# being unseen during pre-training.
Table 3: Generalization to C#
| Model | BLEU |
|---|---|
| CodeNN | 14.18 |
| MASS | 16.84 |
| CuBERT | 14.95 |
| CodeBERT | 15.31 |
| TreeBERT | 17.94 |
A plausible implication is that the root-to-leaf path representation and node-position embeddings confer a language-agnostic inductive bias via AST structure. TreeBERT's transfer strength substantiates the utility of tree-based learning for pre-trained code models.
7. Significance and Contributions
TreeBERT introduces a principled framework for incorporating tree structure into code representation and pre-training for programming language generation tasks. By linearizing ASTs as composition paths and embedding tree-hierarchical information, combined with syntactic and semantic objectives (TMLM and NOP), TreeBERT advances the state-of-the-art in code summarization and documentation, with empirical evidence of cross-language generalization. The model’s design supports scaling for large datasets and codebases, while ablation studies underscore the critical importance of tree-specific pre-training protocols (Jiang et al., 2021).