
TreeBERT: Tree-Structured Code Model

Updated 18 January 2026
  • TreeBERT is a tree-based pre-trained model that incorporates abstract syntax trees via composition paths and node-position embeddings to capture code hierarchies.
  • It employs a hybrid pre-training objective combining Tree Masked Language Modeling and Node Order Prediction to learn both semantic and syntactic code properties.
  • Empirical evaluations show TreeBERT outperforms baselines in code summarization and documentation tasks, demonstrating strong cross-language generalization.

TreeBERT is a tree-based pre-trained model designed to enhance generation tasks for programming languages by incorporating syntactic structure via abstract syntax trees (ASTs) (Jiang et al., 2021). Unlike prior work focusing primarily on code tokens or sequential code representations, TreeBERT utilizes AST composition paths and node position embeddings to represent code in a manner reflecting tree hierarchy and sibling ordering. It employs a hybrid pre-training objective, combining Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP), and achieves improved performance across code summarization and documentation tasks, demonstrating strong transferability to previously unseen programming languages.

1. AST Representation: Composition Paths and Node Position Embedding

TreeBERT parses code into its AST, representing the tree as a set of root-to-leaf composition paths:

A = \{p_1, p_2, \ldots, p_N\}

where each path $p_i = v^i_1, v^i_2, \ldots, v^i_{L-1}, x^i_t$ consists of non-terminal type nodes $v^i_j$ and a terminal value node $x^i_t$ (corresponding to a code token).
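As a concrete illustration, root-to-leaf composition paths can be enumerated with Python's built-in `ast` module. This is a toy sketch, not the paper's parsing pipeline; the function name and the decision to drop Load/Store context nodes are assumptions made here for readability:

```python
import ast

def root_to_leaf_paths(tree):
    """Enumerate root-to-leaf composition paths v_1, ..., v_{L-1}, x_t.

    Non-terminal nodes contribute their AST type name; a leaf contributes
    its code-token value when it carries one.
    """
    paths = []

    def walk(node, prefix):
        label = type(node).__name__  # non-terminal type node v_j
        # drop Load/Store context nodes so identifiers become leaves
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            # terminal value node x_t (falls back to the type name)
            value = getattr(node, "id", getattr(node, "value", label))
            paths.append(prefix + [label, str(value)])
        else:
            for child in children:
                walk(child, prefix + [label])

    walk(tree, [])
    return paths

paths = root_to_leaf_paths(ast.parse("x = 1"))
# each path runs from the root type down to one code token
```

For `x = 1` this yields two paths, one ending in the identifier `x` and one ending in the literal `1`, matching the set-of-paths view of the AST above.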

Each path is vectorized as:

p_i = \mathrm{Concat}\big[\mathrm{Emb}(v_1^i);\ldots;\mathrm{Emb}(v_{L-1}^i);\mathrm{Emb}(x_t^i)\big]

To recover tree structural information, each node receives a node-position embedding in addition to the type/value embedding. For a node at level $j$ with parent position embedding $E^{parent}$ and $c$ children, the $i$-th child's position embedding is:

E^{position}_i = \frac{c-i+1}{c+1}E^{parent} + \frac{i}{c+1}W^{level}_{j+1}, \quad 1 \leq i \leq c

where $W^{level}_{j+1}$ are learnable level embeddings. The compact scheme (approximately 1/200th the parameters of learned linear positions) allows extrapolation to arbitrarily large trees.
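The interpolation above can be computed recursively down the tree. A minimal NumPy sketch, with our own function and variable names and stand-in vectors in place of the learned level embeddings:

```python
import numpy as np

def child_position_embeddings(parent_emb, num_children, next_level_emb):
    """Position embeddings for the c children of one node.

    E_i = (c - i + 1)/(c + 1) * E_parent + i/(c + 1) * W_{j+1},  1 <= i <= c
    Siblings get distinct embeddings that encode their left-to-right order,
    while each child stays on the line between its parent's embedding and
    the next level's embedding.
    """
    c = num_children
    return np.stack([(c - i + 1) / (c + 1) * parent_emb
                     + i / (c + 1) * next_level_emb
                     for i in range(1, c + 1)])

# toy check: 3 children interpolate between parent (zeros) and level vector (ones)
E = child_position_embeddings(np.zeros(4), 3, np.ones(4))
```

With a zero parent and an all-ones level vector, the three children land at 1/4, 1/2, and 3/4 of the way toward the level embedding, so sibling order is recoverable from the embedding alone.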

2. Pre-training Objectives: Tree Masked Language Modeling (TMLM) and Node Order Prediction (NOP)

TreeBERT learns both semantic and syntactic properties of code through two objectives:

2.1 Tree Masked Language Modeling (TMLM)

TMLM exploits tree locality by masking a subset of AST nodes along each path. The masking strategy favors deeper (near-leaf) nodes, assigning scores:

q^i_n = \frac{\exp(n-L)}{\sum_{j=1}^L \exp(j-L)}

Top-$k$ nodes are masked: $m_i^A = \mathrm{TOPK}(p_i, k, \{q^i_n\})$, producing $A^{masked}$. In the corresponding code token sequence $C$, only tokens not corresponding to masked AST nodes are masked:

m^C = \{x \in C \mid x \notin m^A\},\quad C^{masked} = \mathrm{REPLACE}(C, m^C, [\mathrm{mask}])
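The depth-biased scores and top-$k$ selection can be sketched in plain Python. This is our own deterministic reading of the TOPK step (the selection could also be sampled in proportion to the scores):

```python
import math

def depth_biased_topk(path, k):
    """Score node n of a length-L path with q_n = exp(n-L) / sum_j exp(j-L),
    then mask the k highest-scoring positions.

    Deeper (near-leaf) nodes get exponentially higher scores, so masking
    concentrates on value-bearing nodes near the leaves.
    """
    L = len(path)
    z = sum(math.exp(j - L) for j in range(1, L + 1))
    scores = [math.exp(n - L) / z for n in range(1, L + 1)]
    masked = sorted(sorted(range(L), key=scores.__getitem__, reverse=True)[:k])
    return masked, scores

masked, scores = depth_biased_topk(["Module", "FunctionDef", "Return", "Name", "x"], 2)
```

On this 5-node path the two deepest positions are selected, and the scores form a proper distribution over the path.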

This strategy forces the decoder to utilize tree-structured features. The TMLM loss is:

\mathcal{L}_{TMLM}(\theta) = -\frac{1}{M} \sum_{t=1}^M \log P(x_t \mid x_{<t},\, A^{masked})

2.2 Node Order Prediction (NOP)

NOP captures the ordering constraints in ASTs by randomly swapping two nodes in a path with 50% probability. The [CLS] output is passed through a sigmoid to predict whether the path is out-of-order ($\bar{y}$). The binary cross-entropy loss is:

\mathcal{L}_{NOP}(\theta) = -\left[y\log\bar{y} + (1-y)\log(1-\bar{y})\right]

where $y = 1$ if the path was shuffled and $y = 0$ otherwise.
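A training-example generator for NOP, plus the loss, might look like this sketch. The function names are ours, and in the real model $\bar{y}$ comes from the sigmoid over the [CLS] representation rather than being passed in directly:

```python
import math
import random

def make_nop_example(path, rng):
    """With 50% probability swap two random nodes; y = 1 iff out-of-order."""
    path = list(path)
    y = 0
    if len(path) >= 2 and rng.random() < 0.5:
        i, j = rng.sample(range(len(path)), 2)  # two distinct positions
        path[i], path[j] = path[j], path[i]
        y = 1
    return path, y

def nop_loss(y, y_bar, eps=1e-12):
    """Binary cross-entropy between label y and prediction y_bar."""
    return -(y * math.log(y_bar + eps) + (1 - y) * math.log(1 - y_bar + eps))

rng = random.Random(0)
example, label = make_nop_example(["Module", "Expr", "Call", "Name"], rng)
```

Swapping preserves the multiset of nodes, so the model can only solve the task by learning which orderings are syntactically valid.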

2.3 Hybrid Pre-training Objective

TreeBERT is trained jointly via:

\mathcal{L}(\theta) = \alpha\,\mathcal{L}_{TMLM}(\theta) + (1-\alpha)\,\mathcal{L}_{NOP}(\theta)

with $\alpha = 0.75$ yielding optimal performance.

3. Model Architecture

TreeBERT adopts a standard Transformer encoder-decoder configuration (Vaswani et al., 2017) with key adaptations:

  • Encoder: 6 layers, hidden size $H = 1024$, 8 attention heads.
  • Decoder: identical configuration.
  • Composition path vectors projected via a fully connected layer to match hidden size.
  • Node types: learned 1024-dimensional vectors.
  • Value nodes: BPE subtoken sum representations (32K merge operations).
  • Node-position embeddings added to encoder-side embeddings.
  • Shared vocabulary and embeddings across Java and Python.
  • No additional "structural" attention layers; all tree structure is encoded via path serialization and position embeddings.
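The value-node representation in the list above (sum of BPE subtoken embeddings) reduces to a single vector addition. A toy sketch with a stand-in embedding table; the table and the subtoken split are illustrative, not the paper's learned 1024-d vectors over the 32K-merge BPE vocabulary:

```python
import numpy as np

# stand-in embedding table; the real model learns 1024-d vectors over a
# 32K-merge BPE vocabulary shared across Java and Python
rng = np.random.default_rng(0)
emb = {tok: rng.standard_normal(8) for tok in ["get", "file", "name"]}

def value_node_embedding(subtokens):
    """A terminal value node is the sum of its subtoken embeddings."""
    return np.sum([emb[t] for t in subtokens], axis=0)

# e.g. an identifier like `getFileName` split into BPE-like subtokens
v = value_node_embedding(["get", "file", "name"])
```

Summing (rather than concatenating) keeps the value-node representation a fixed size regardless of how many subtokens an identifier splits into.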

4. Pre-training Corpora and Hyperparameters

TreeBERT leverages large-scale, tree-parsed code datasets:

  • Python: 7.2M files, 2B tokens, 500M AST paths, 7B AST nodes.
  • Java: 14.1M files, 4.5B tokens, 1.1B AST paths, 16.5B AST nodes.

Model limits:

  • Max paths per example: 100.
  • Max nodes per path: 20.
  • Max code length: 200 tokens.

Training specifics:

  • Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), L2 weight decay 0.01, dropout 0.1, GELU activations.
  • Warmup over 10% of steps, followed by linear decay of the learning rate.
  • Learning rates: {1e-3, 1e-4, 1e-5}; best rate selected.
  • Batch sizes: {2048, 4096, 8192}.
  • Masking rates: TMLM—15% of nodes per path; NOP—50% chance to shuffle a path.
  • Hybrid loss coefficient $\alpha = 0.75$.
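The warmup-then-linear-decay schedule from the list above can be written as a small function. The 10% warmup fraction follows the text; decaying exactly to zero at the final step is our assumption:

```python
def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then linear decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup            # ramp up to peak
    decay_steps = total_steps - warmup
    return peak_lr * max(0.0, (total_steps - 1 - step) / decay_steps)

# e.g. a 100-step run peaking at 1e-4 after the 10-step warmup
schedule = [lr_at_step(s, 100, 1e-4) for s in range(100)]
```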

5. Empirical Evaluation on Code Generation Tasks

TreeBERT is evaluated on code summarization and code documentation tasks, with consistent performance gains over baselines.

5.1 Code Summarization

Task: predict function name subtokens from function body code.

  • Datasets: ETH Py150 (Python), Java-small/med/large splits.
  • Metrics: Precision, Recall, F1.

Table 1: Code Summarization Results

Model          Py150   Java-small   Java-med   Java-large
Transformer    16.13   31.41        41.22      48.13
Graph2Seq      29.26   42.79        53.05      58.15
Code2Seq       30.07   43.02        53.23      59.18
Code+Gnn+GRU   34.45   48.16        58.81      65.64
MASS           32.35   45.71        54.83      60.79
CuBERT         26.99   38.22        45.99      50.55
CodeBERT       29.58   41.10        49.64      54.76
TreeBERT       39.04   51.99        61.22      67.25

TreeBERT yields roughly a 1.6–4.6 point absolute F1 gain over the strongest non-pretrained baseline (Code+Gnn+GRU) and about 6.3–6.7 points over the best other pretrained model (MASS).

5.2 Code Documentation

Task: generate natural language comments for code methods.

  • Dataset: DeepCom (Java).
  • Metric: BLEU-4.

Table 2: Code Documentation (Java) Results

Model          BLEU
Transformer    13.32
DeepCom        14.81
Graph2Seq      18.39
Code2Seq       18.48
Code+Gnn+GRU   19.73
MASS           18.92
CuBERT         17.41
CodeBERT       17.87
TreeBERT       20.49

5.3 Ablation Study

BLEU drops compared to full TreeBERT (Java documentation):

  • “No TMLM”: 20.49 → 14.12 (–6.37)
  • “No NOP”: 20.49 → 16.71 (–3.78)
  • “No Node-Position Embedding”: 20.49 → 20.25 (–0.24)
  • “Random Masking in AST”: 20.49 → 14.81 (–5.68)
  • “Only Mask Value Nodes”: 20.49 → 18.25 (–2.24)

This suggests the TMLM and NOP objectives and tree-specific masking are essential to TreeBERT’s performance gains.

6. Generalization to Unseen Programming Languages

TreeBERT exhibits language-agnostic generalization. When fine-tuned for code documentation on C# (CodeNN’s StackOverflow QA), TreeBERT outperforms all baselines, despite C# being unseen during pre-training.

Table 3: Generalization to C# (Code Documentation)

Model      BLEU
CodeNN     14.18
MASS       16.84
CuBERT     14.95
CodeBERT   15.31
TreeBERT   17.94

A plausible implication is that the root→leaf path representation and node-position embeddings confer strong language-agnostic inductive bias via AST structure. TreeBERT’s transfer strength substantiates the utility of tree-based learning for programming LLMs.

7. Significance and Contributions

TreeBERT introduces a principled framework for incorporating tree structure into code representation and pre-training for programming language generation tasks. By linearizing ASTs as composition paths and embedding tree-hierarchical information, combined with syntactic and semantic objectives (TMLM and NOP), TreeBERT advances the state-of-the-art in code summarization and documentation, with empirical evidence of cross-language generalization. The model’s design supports scaling for large datasets and codebases, while ablation studies underscore the critical importance of tree-specific pre-training protocols (Jiang et al., 2021).
