CodeBERT: Transformer for NL and Code
- CodeBERT is a Transformer-based encoder that learns joint representations for natural language and programming languages using a hybrid pre-training approach on large NL–PL corpora.
- It leverages masked language modeling and replaced token detection objectives to achieve state-of-the-art performance in code search, summarization, completion, and program repair.
- Empirical evaluations reveal that while CodeBERT excels at lexical analysis through self-attention, it underperforms in logic-level tasks without explicit structural cues.
CodeBERT is a Transformer-based, encoder-only pre-trained model designed to learn joint representations for both natural language (NL) and programming language (PL) corpora, enabling robust downstream tasks such as code search, code summarization, code completion, and automated program repair. Introduced by Feng et al. (Microsoft Research), CodeBERT applies hybrid pre-training objectives on extensive NL–PL paired and monolingual corpora, providing state-of-the-art performance across tasks and serving as the foundation for several derivative models and practical subsystems (Feng et al., 2020, Zhang et al., 10 Sep 2025).
1. Architectural Specification
CodeBERT adopts the BERT/RoBERTa backbone: a stack of 12 Transformer encoder layers, each with 12 self-attention heads (head size 64; hidden size 768), residual connections, and layer normalization. The input embedding sums token and position embeddings—typically without segment embeddings in recent implementations (Zhang et al., 10 Sep 2025). The model operates on byte-level BPE tokenizations, supporting a unified vocabulary of roughly 50,000 tokens for code and NL.
Key equations for layer $\ell$, with hidden states $H^{(\ell-1)}$ and head size $d_k = 64$:
- Self-attention for head $i$: $\mathrm{head}_i = \mathrm{softmax}\!\big(Q_i K_i^{\top}/\sqrt{d_k}\big)\,V_i$, where $Q_i = H^{(\ell-1)}W_i^{Q}$, $K_i = H^{(\ell-1)}W_i^{K}$, $V_i = H^{(\ell-1)}W_i^{V}$
- Multi-head output: $\mathrm{MHA}\big(H^{(\ell-1)}\big) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_{12})\,W^{O}$
- Feed-forward: $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$
CodeBERT’s sequence length is capped at 512 tokens due to quadratic attention complexity (Zhang et al., 2022).
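The per-head attention above can be sketched in a few lines. This is a toy illustration with made-up dimensions (4 tokens, hidden size 8, head size 4), not CodeBERT's actual weights or configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(H, Wq, Wk, Wv):
    # One attention head: project hidden states, score pairs of tokens,
    # scale by sqrt(d_k), and take an attention-weighted sum of values.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    dk = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(dk))
    return weights @ V

# Toy dimensions: 4 tokens, hidden size 8, head size 4.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention_head(H, Wq, Wk, Wv)
print(out.shape)  # (4, 4): one context vector per token
```

In the full model, 12 such heads run in parallel and their outputs are concatenated and projected by $W^O$, followed by the feed-forward sublayer.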
2. Pre-Training Data and Objectives
Pre-training leverages both bimodal (NL–PL) and unimodal (NL or PL only) datasets:
- 2.1M NL–PL pairs (GitHub; six languages: Python, Java, JavaScript, PHP, Ruby, Go)
- ~6M monolingual code files
- ~6M monolingual NL documents (Feng et al., 2020)
Training objectives:
- Masked Language Modeling (MLM): Predict masked tokens using context.
- Replaced Token Detection (RTD): Classify each token as original or generator-replaced (ELECTRA-style), with discriminator loss
  $\mathcal{L}_{\mathrm{RTD}} = -\sum_{t}\big[\,y_t \log p_\delta(x_t) + (1-y_t)\log\big(1-p_\delta(x_t)\big)\big]$
  where $y_t = 1$ if token $x_t$ is replaced and $p_\delta(x_t)$ is the discriminator's predicted probability (from its logit) that position $t$ holds a replaced token.
Bimodal objective design ensures that CodeBERT learns cross-modal NL–PL correspondence, while unimodal RTD enhances code-token discrimination (Feng et al., 2020).
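A minimal sketch of the two objectives' data preparation, using a toy tokenized snippet (the `<mask>` symbol, 15% rate, and the hand-picked corrupted sequence are illustrative assumptions, not the actual pipeline):

```python
import random

MASK = "<mask>"

def mlm_mask(tokens, rate=0.15, seed=0):
    # MLM: hide a fraction of tokens; the model must predict the originals.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok          # ground truth for the masked slot
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

def rtd_labels(original, corrupted):
    # RTD: label 1 wherever a small generator swapped in a substitute;
    # the discriminator is trained to recover these labels.
    return [int(o != c) for o, c in zip(original, corrupted)]

code = "def add ( a , b ) : return a + b".split()
masked, targets = mlm_mask(code)
corrupted = ["def", "sum", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
labels = rtd_labels(code, corrupted)
print(labels)  # 1 only at index 1, where "add" was replaced by "sum"
```

MLM sees only the masked sequence, while RTD sees every token of the corrupted sequence, which is why RTD extracts more training signal per example.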
3. Feature Representation and Limitations
CodeBERT representations rely heavily on lexical cues, especially programmer-defined variable and function names. Empirical investigation reveals that token-level self-attention and MLM training make the model highly sensitive to identifier names. If user-defined names are anonymized, code search accuracy suffers a sharp decline (from 70.36% to 17.42% on Java), demonstrating that CodeBERT does not robustly encode logic-level or structural features (Zhang et al., 2023).
Table: Identifier Impact on CodeBERT Accuracy (Java; code search task)
| Scenario | Accuracy (%) |
|---|---|
| Original | 70.36 |
| Method-def anonymized | 60.89 |
| All names anonymized | 17.42 |
This suggests that current BERT-style code models excel in lexical analysis, but underperform in logic analysis and general program understanding absent explicit syntactic or semantic signals.
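The anonymization probe behind these numbers can be sketched with a simple identifier rewrite. This regex-based version is an illustrative stand-in for the study's actual transformation (it keeps Python keywords and renames every other name consistently to `VAR1`, `VAR2`, ...):

```python
import keyword
import re

def anonymize_identifiers(code):
    # Replace each distinct user-defined name with VAR1, VAR2, ...
    # Language keywords are preserved so the code stays parseable.
    mapping = {}

    def repl(match):
        name = match.group(0)
        if keyword.iskeyword(name):
            return name
        if name not in mapping:
            mapping[name] = f"VAR{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", repl, code)

src = "def cosine_sim(a, b): return dot(a, b) / (norm(a) * norm(b))"
print(anonymize_identifiers(src))
```

After this rewrite the program's control and data flow are unchanged, so any accuracy drop isolates how much the model was leaning on names rather than logic.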
4. Downstream Tasks and Fine-Tuning
4.1. Code Completion and Summarization
Fine-tuned with AdamW (batch size ~32, 3–5 epochs, lr∈[1e−5,2e−5]), CodeBERT achieves notable performance on tasks such as code completion:
- Accuracy: 0.87 (CodeXGLUE Python; baseline: 0.82)
- Precision: 0.85, Recall: 0.86, F1: 0.85
- BLEU: 0.72 (baseline: 0.65)
- Code Executability: 0.85
- Semantic Consistency: 0.79 (Zhang et al., 10 Sep 2025)
In summarization tasks, CodeBERT achieves macro-avg BLEU-4 scores of 17.83, outperforming RoBERTa and RNN baselines (Feng et al., 2020).
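BLEU-4 scores like the 17.83 above compare n-gram overlap between generated and reference summaries. A self-contained sentence-level sketch with add-one smoothing and a brevity penalty (simplified relative to the corpus-level variant used in the benchmark):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    # Sentence-level BLEU-4: geometric mean of smoothed 1..4-gram
    # precisions, times a brevity penalty for short candidates.
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "returns the sum of two numbers".split()
hyp = "returns the sum of numbers".split()
score = bleu4(hyp, ref)
print(round(score, 3))
```

A perfect match scores 1.0; published scores are typically reported on a 0–100 scale, so 17.83 corresponds to 0.1783 here.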
4.2. Program Repair and Vulnerability Patching
Leveraging its encoder module and sequence-to-sequence configuration (6-layer decoder), CodeBERT demonstrates 19–72% exact-match patch accuracy on Java bugs (ManySStuBs4J) (Mashhadi et al., 2021), and CodeBLEU scores of 0.42 (MegaVul_C_2023) in vulnerability patching (Khan et al., 5 Jun 2025):
Table: Vulnerability Patch Task (CodeBLEU Score)
| Dataset | CodeBERT |
|---|---|
| MegaVul_C_2023 | 0.42 |
| Vul4J | 0.44 |
CodeBERT’s encoder-only design yields competitive robustness in fragmented or sparse contexts compared to more complex encoder–decoder architectures.
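The exact-match metric behind the 19–72% figures counts a generated patch as correct only if it is identical to the developer's patch. A minimal sketch (the whitespace normalization step is an assumption; evaluation protocols vary):

```python
def exact_match_accuracy(predictions, references):
    # Fraction of generated patches identical to the reference patch
    # after collapsing runs of whitespace.
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["if (x != null) return x;", "return a+b ;"]
refs  = ["if (x != null) return x;", "return a - b;"]
acc = exact_match_accuracy(preds, refs)
print(acc)  # 0.5: first patch matches, second does not
```

Exact match is a strict lower bound on repair quality, which is why softer metrics like CodeBLEU are also reported for vulnerability patching.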
5. Efficiency, Attention, and Model Extensions
CodeBERT’s quadratic attention cost motivates preprocessing approaches such as DietCode (Zhang et al., 2022):
- DietCode prunes input to high-attention tokens/statements by attention analysis.
- Reduces wall-clock fine-tuning by ~40% and FLOPs by ~40% (Java code search: 20.8h to 11.1h fine-tune time).
- At 60% relative length, attention-based DietCode preserves ~0.71 MRR (vs. 0.74 for full).
Table: DietCode Input Pruning (Java, code search)
| Model | Input Len | MRR | Fine-tune Time |
|---|---|---|---|
| CodeBERT | 200 | 0.74 | 20.8 h |
| DietCode | 120 | 0.71 | 11.1 h |
Additionally, model efficiency has been improved by adapter modules (CodeBERTER) (Saberi et al., 2023), inserting lightweight NER adapters for AST-derived syntactic information. This enables parameter-efficient fine-tuning (training only ~20% of model weights) with accuracy and BLEU improvements on code refinement and summarization.
6. Integration in Hybrid and Real-World Systems
CodeBERT has been successfully integrated into hybrid subsystems with generative models like GPT-3.5 for code completion (Zhang et al., 10 Sep 2025):
- Feature fusion layer: $h = \alpha \odot h_{\mathrm{CodeBERT}} + (1-\alpha) \odot h_{\mathrm{GPT}}$, with the gate $\alpha = \sigma\big(W[h_{\mathrm{CodeBERT}}; h_{\mathrm{GPT}}] + b\big)$ learned via a sigmoid.
- Score combination: the fused score rescores the top-k generative candidates.
Empirical system metrics:
- CodeBERT: 5.1 ms latency, 196 tokens/s, 5.1 GB RAM.
- Hybrid: 68 ms latency, 213 tokens/s, 6.2 GB RAM.
Pilot IDE trials (VS Code) demonstrate ≈15% keystroke reduction and 25% fewer fixups through high-precision local completions (CodeBERT) and generative coverage (GPT-3.5).
7. Semantic Evaluation and Naturalness
CodeBERT’s representations have limited semantic grounding without targeted fine-tuning (Naik et al., 2022). Representational Similarity Analysis (RSA) shows that pre-training alone yields low similarity between model representations and semantic reference structure, but fine-tuning on semantic tasks sharply increases this similarity in deeper layers. Bimodal input (NL+PL) further enhances semantic alignment by 200–500% and improves sample efficiency.
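RSA compares two representations by correlating their pairwise-similarity structure rather than the raw vectors. A minimal sketch over toy similarity matrices for four code snippets (the matrices are invented for illustration):

```python
import math

def pearson(x, y):
    # Pearson correlation between two equal-length vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa(sim_model, sim_semantic):
    # RSA: correlate the upper triangles of two pairwise-similarity
    # matrices; high correlation means the model's geometry tracks
    # the semantic reference geometry.
    n = len(sim_model)
    triu = lambda m: [m[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(triu(sim_model), triu(sim_semantic))

# Toy symmetric similarity matrices over 4 code snippets.
model    = [[1, .8, .2, .1], [.8, 1, .3, .2], [.2, .3, 1, .7], [.1, .2, .7, 1]]
semantic = [[1, .9, .1, .0], [.9, 1, .2, .1], [.1, .2, 1, .8], [.0, .1, .8, 1]]
r = rsa(model, semantic)
print(round(r, 3))
```

Because only relational structure is compared, RSA can relate a 768-dimensional embedding space to an arbitrary semantic reference without any alignment step.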
For code naturalness estimation, CodeBERT-nt (Khanfir et al., 2022) masks AST nodes and measures predictability via minimum-confidence aggregation, outperforming random and complexity baselines and matching n-gram entropy models in zero-shot buggy line identification.
Table: CodeBERT-nt (min-confidence) ranking vs. baselines, SmartShark
| Comparison | First-hit A₁₂ | Mean A₁₂ | p-value |
|---|---|---|---|
| vs. Random | 0.607 | 0.622 | <0.001 |
| vs. Complexity | 0.605 | 0.620 | <0.001 |
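The minimum-confidence aggregation can be sketched as follows: each line is scored by its least predictable token, and lines are ranked from least to most natural. The lines and confidence values below are invented for illustration:

```python
def line_naturalness(token_confidences):
    # CodeBERT-nt-style score: a line is only as natural as its least
    # predictable token, so aggregate by the minimum confidence.
    return min(token_confidences)

def rank_suspicious(lines):
    # Lower minimum confidence => less natural => ranked more suspicious.
    return sorted(lines, key=lambda item: line_naturalness(item[1]))

lines = [
    ("if (i <= len)",   [0.91, 0.40, 0.88, 0.95]),  # '<=' surprises the model
    ("for (int i = 0;", [0.97, 0.93, 0.90, 0.96]),
    ("return result;",  [0.99, 0.95, 0.97]),
]
order = rank_suspicious(lines)
print([name for name, _ in order])
```

This zero-shot ranking needs no bug labels: the model's own uncertainty about masked tokens is the entire signal.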
8. Broader Applicability, Limitations, and Future Work
CodeBERT is extensible via adapters to leverage AST, type, or data-flow information without re-pretraining (Saberi et al., 2023). It has shown robust performance across languages and tasks. However, limitations include reliance on identifier names and lack of explicit logic or program structure modeling (Zhang et al., 2023).
Potential model enhancements focus on:
- Integration of AST paths and flow graphs directly into representations.
- Sequence-to-sequence pretraining and denoising objectives for generation.
- Expanded language coverage, efficient parameterization, and plug-and-play fine-tuning strategies.
The current research trajectory suggests the merger of structural priors and contextual learning for superior code understanding and generation in future pre-trained program models.