TreeBERT: Tree-Structured Code Model

Updated 18 January 2026

TreeBERT is a tree-based pre-trained model that incorporates abstract syntax trees via composition paths and node-position embeddings to capture code hierarchies.
It employs a hybrid pre-training objective combining Tree Masked Language Modeling and Node Order Prediction to learn both semantic and syntactic code properties.
Empirical evaluations show TreeBERT outperforms baselines in code summarization and documentation tasks, demonstrating strong cross-language generalization.

TreeBERT is a tree-based pre-trained model designed to enhance generation tasks for programming languages by incorporating syntactic structure via abstract syntax trees (ASTs) (Jiang et al., 2021). Unlike prior work focusing primarily on code tokens or sequential code representations, TreeBERT utilizes AST composition paths and node position embeddings to represent code in a manner reflecting tree hierarchy and sibling ordering. It employs a hybrid pre-training objective, combining Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP), and achieves improved performance across code summarization and documentation tasks, demonstrating strong transferability to previously unseen programming languages.

1. AST Representation: Composition Paths and Node Position Embedding

TreeBERT parses code into its AST, representing the tree as a set of root-to-leaf composition paths:

$A = \{p_1, p_2, \ldots, p_N\}$

where each path $p_i = v^i_1, v^i_2, \ldots, v^i_{L-1}, x^i_t$ consists of non-terminal type nodes $v^i_j$ and a terminal value node $x^i_t$ (corresponding to a code token).

Each path is vectorized as:

$p_i = \mathrm{Concat}\big[\mathrm{Emb}(v_1^i);\ldots;\mathrm{Emb}(v_{L-1}^i);\mathrm{Emb}(x_t^i)\big]$

To recover tree structural information, each node receives a node-position embedding in addition to the type/value embedding. For a node at level $j$ with parent position embedding $E^{parent}$ and $c$ children, the $i$ -th child’s position embedding is:

$E^{position}_i = \frac{c-i+1}{c+1}E^{parent} + \frac{i}{c+1}W^{level}_{j+1},\quad(1 \leq i \leq c)$

where $W^{level}_{j+1}$ are learnable level embeddings. The compact scheme (approximately 1/200th the parameters of learned linear positions) allows extrapolation to arbitrarily large trees.

2. Pre-training Objectives: Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP)

TreeBERT learns both semantic and syntactic properties of code through two objectives:

2.1 Tree Masked Language Modeling (TMLM)

TMLM exploits tree locality by masking a subset of AST nodes along each path. The masking strategy favors deeper (near-leaf) nodes, assigning scores:

$q^i_n = \frac{\exp(n-L)}{\sum_{j=1}^L \exp(j-L)}$

Top- $k$ nodes are masked: $m_i^A = \mathrm{TOPK}(p_i, k, \{q^i_n\})$ , producing $A^{masked}$ . In the corresponding code token sequence $C$ , only tokens not corresponding to masked AST nodes are masked:

$m^C = \{x \in C \mid x \notin m^A\},\quad C^{masked} = \mathrm{REPLACE}(C, m^C, [\mathrm{mask}])$

This strategy forces the decoder to utilize tree-structured features. The TMLM loss is:

$\mathcal{L}_{TMLM}(\theta) = -\frac{1}{M} \sum_{t=1}^M \log P(x_t \mid x_{<t},\,A^{masked})$

2.2 Node Order Prediction (NOP)

NOP captures the ordering constraints in ASTs by randomly swapping two nodes in a path with 50% probability. The [CLS] output is passed through a sigmoid to predict whether the path is out-of-order ( $\bar{y}$ ). Binary cross-entropy loss:

$\mathcal{L}_{NOP}(\theta) = -\left[y\log\bar{y} + (1-y)\log(1-\bar{y})\right]$

where $y=1$ if shuffled, $0$ otherwise.

2.3 Hybrid Pre-training Objective

TreeBERT is trained jointly via:

$\mathcal{L}(\theta) = \alpha\,\mathcal{L}_{TMLM}(\theta) + (1-\alpha)\,\mathcal{L}_{NOP}(\theta)$

with $\alpha = 0.75$ yielding optimal performance.

3. Model Architecture

TreeBERT adopts a standard Transformer encoder-decoder configuration (Vaswani et al. 2017) with key adaptations:

Encoder: 6 layers, hidden size $H = 1024$ , 8 attention heads.
Decoder: identical configuration.
Composition path vectors projected via a fully connected layer to match hidden size.
Node types: learned 1024-dimensional vectors.
Value nodes: BPE subtoken sum representations (32K merge operations).
Node-position embeddings added to encoder-side embeddings.
Shared vocabulary and embeddings across Java and Python.
No additional "structural" attention layers; all tree structure is encoded via path serialization and position embeddings.

4. Pre-training Corpora and Hyperparameters

TreeBERT leverages large-scale, tree-parsed code datasets:

Python: 7.2M files, 2B tokens, 500M AST paths, 7B AST nodes.
Java: 14.1M files, 4.5B tokens, 1.1B AST paths, 16.5B AST nodes.

Model limits:

Max paths per example: 100.
Max nodes per path: 20.
Max code length: 200 tokens.

Training specifics:

Adam optimizer ( $\beta_1 = 0.9$ , $\beta_2 = 0.999$ ), L2 decay $=0.01$ , dropout $=0.1$ , GELU activations.
Warmup over 10% of steps, followed by linear decay of the learning rate.
Learning rates: $\{1e{-3}, 1e{-4}, 1e{-5}\}$ ; best rate selected.
Batch sizes: $\{2048, 4096, 8192\}$ .
Masking rates: TMLM—15% of nodes per path; NOP—50% chance to shuffle a path.
Hybrid loss coefficient $\alpha = 0.75$ .

5. Empirical Evaluation on Code Generation Tasks

TreeBERT is evaluated on code summarization and code documentation tasks, with consistent performance gains over baselines.

5.1 Code Summarization

Task: predict function name subtokens from function body code.

Datasets: ETH Py150 (Python), Java-small/med/large splits.
Metrics: Precision, Recall, F1.

Table 1: Code Summarization Results

Model	Py150	Java-small	Java-med	Java-large
Transformer	16.13	31.41	41.22	48.13
Graph2Seq	29.26	42.79	53.05	58.15
Code2Seq	30.07	43.02	53.23	59.18
Code+Gnn+GRU	34.45	48.16	58.81	65.64
MASS	32.35	45.71	54.83	60.79
CuBERT	26.99	38.22	45.99	50.55
CodeBERT	29.58	41.10	49.64	54.76
TreeBERT	39.04	51.99	61.22	67.25

TreeBERT yields a 4–13 absolute F1 gain over the best non-pretrained model, and approximately 1–5 points over other pretrained baselines.

5.2 Code Documentation

Task: generate natural language comments for code methods.

Dataset: DeepCom (Java).
Metric: BLEU-4.

Table 2: Code Documentation (Java) Results

Model	BLEU
Transformer	13.32
DeepCom	14.81
Graph2Seq	18.39
Code2Seq	18.48
Code+Gnn+GRU	19.73
MASS	18.92
CuBERT	17.41
CodeBERT	17.87
TreeBERT	20.49

5.3 Ablation Study

BLEU drops compared to full TreeBERT (Java documentation):

“No TMLM”: 20.49 → 14.12 (–6.37)
“No NOP”: 20.49 → 16.71 (–3.78)
“No Node-Position Embedding”: 20.49 → 20.25 (–0.24)
“Random Masking in AST”: 20.49 → 14.81 (–5.68)
“Only Mask Value Nodes”: 20.49 → 18.25 (–2.24)

This suggests the TMLM and NOP objectives and tree-specific masking are essential to TreeBERT’s performance gains.

6. Generalization to Unseen Programming Languages

TreeBERT exhibits language-agnostic generalization. When fine-tuned for code documentation on C# (CodeNN’s StackOverflow QA), TreeBERT outperforms all baselines, despite C# being unseen during pre-training.

Table 3: Generalization to C

Model	BLEU
CodeNN	14.18
MASS	16.84
CuBERT	14.95
CodeBERT	15.31
TreeBERT	17.94

A plausible implication is that the root→leaf path representation and node-position embeddings confer strong language-agnostic inductive bias via AST structure. TreeBERT’s transfer strength substantiates the utility of tree-based learning for programming LLMs.

7. Significance and Contributions

TreeBERT introduces a principled framework for incorporating tree structure into code representation and pre-training for programming language generation tasks. By linearizing ASTs as composition paths and embedding tree-hierarchical information, combined with syntactic and semantic objectives (TMLM and NOP), TreeBERT advances the state-of-the-art in code summarization and documentation, with empirical evidence of cross-language generalization. The model’s design supports scaling for large datasets and codebases, while ablation studies underscore the critical importance of tree-specific pre-training protocols (Jiang et al., 2021).

Markdown Report Issue Upgrade to Chat

References (1)

TreeBERT: A Tree-Based Pre-Trained Model for Programming Language (2021)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to TreeBERT.

TreeBERT: Tree-Structured Code Model

1. AST Representation: Composition Paths and Node Position Embedding

2. Pre-training Objectives: Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP)

2.1 Tree Masked Language Modeling (TMLM)

2.2 Node Order Prediction (NOP)

2.3 Hybrid Pre-training Objective

3. Model Architecture

4. Pre-training Corpora and Hyperparameters

5. Empirical Evaluation on Code Generation Tasks

5.1 Code Summarization

Table 1: Code Summarization Results

5.2 Code Documentation

Table 2: Code Documentation (Java) Results

5.3 Ablation Study

6. Generalization to Unseen Programming Languages

Table 3: Generalization to C

7. Significance and Contributions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

TreeBERT: Tree-Structured Code Model

1. AST Representation: Composition Paths and Node Position Embedding

2. Pre-training Objectives: Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP)

2.1 Tree Masked Language Modeling (TMLM)

2.2 Node Order Prediction (NOP)

2.3 Hybrid Pre-training Objective

3. Model Architecture

4. Pre-training Corpora and Hyperparameters

5. Empirical Evaluation on Code Generation Tasks

5.1 Code Summarization

Table 1: Code Summarization Results

5.2 Code Documentation

Table 2: Code Documentation (Java) Results

5.3 Ablation Study

6. Generalization to Unseen Programming Languages

Table 3: Generalization to C

7. Significance and Contributions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research