
CodeBERT: Bimodal Pre-trained Model for Code

Updated 13 September 2025
  • CodeBERT is a bimodal pre-trained language model that integrates programming and natural language using a bidirectional Transformer architecture.
  • It employs masked language modeling and replaced token detection to learn robust, cross-modal representations for tasks like code search, summarization, and clone detection.
  • The model demonstrates high performance with clone detection F1 scores over 90% and bug repair accuracy up to 72%, though challenges remain in generalization and efficiency.

CodeBERT is a bimodal pre-trained language model that jointly encodes programming language (PL) and natural language (NL) for code understanding and generation tasks. Its architecture, training strategy, and downstream applications have established it as a pivotal model for a wide spectrum of software engineering methodologies, particularly those that leverage transfer learning for cross-modal code–NL problems.

1. Model Architecture and Input Design

CodeBERT is based on a bidirectional Transformer architecture equivalent in scale to RoBERTa-base: 12 layers, 768 hidden units, and 12 attention heads, totaling approximately 125 million parameters. The model is explicitly constructed to ingest paired NL and PL sequences, using a tokenization scheme that applies WordPiece to NL and “code token” segmentation to PL. The standard encoding format is:

[CLS], w₁, w₂, ..., wₙ, [SEP], c₁, c₂, ..., cₘ, [EOS]

  • [CLS]: Aggregate representation.
  • wᵢ: NL tokens (e.g., function description).
  • [SEP]: NL–PL separator.
  • cⱼ: PL tokens (e.g., code).
  • [EOS]: End-of-sequence token.

This design enables simultaneous intra-modal (NL or PL) and inter-modal (NL–PL) representation learning, facilitating tasks that require joint attention to code and its documentation or usage context (Feng et al., 2020, Lu et al., 2021).
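
A minimal sketch of this input layout, assuming the publicly released microsoft/codebert-base checkpoint and the Hugging Face Transformers API (downstream pipelines may preprocess differently):

import torch
from transformers import AutoModel, AutoTokenizer

# Load the released CodeBERT checkpoint (RoBERTa-style tokenizer and encoder).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# Build the [CLS] w1..wn [SEP] c1..cm [EOS] layout explicitly.
nl_tokens = tokenizer.tokenize("return the maximum of two values")
code_tokens = tokenizer.tokenize("def max(a, b): return a if a > b else b")
tokens = ([tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token]
          + code_tokens + [tokenizer.eos_token])
input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokens))[None, :]

with torch.no_grad():
    outputs = model(input_ids)

cls_vector = outputs.last_hidden_state[:, 0]   # aggregate [CLS] representation
print(cls_vector.shape)                        # torch.Size([1, 768])

Because the tokenizer is RoBERTa-style, [CLS], [SEP], and [EOS] correspond to the <s> and </s> special tokens.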

2. Pretraining Objectives and Learning Dynamics

CodeBERT’s learning paradigm is a hybrid objective combining:

  • Masked Language Modeling (MLM): 15% of NL and PL tokens are randomly masked and predicted from the surrounding context. The loss is:

L_\mathrm{MLM}(\theta) = \sum_{i \in m} -\log p^{(D_1)}\left(x_i \mid x_{\text{masked}}\right)

where m is the set of masked positions.

  • Replaced Token Detection (RTD): After masking, replacement candidates are sampled from n-gram generators for NL and PL, yielding “corrupted” inputs. CodeBERT (discriminator) is trained to classify each token as original/replaced:

L_\mathrm{RTD}(\theta) = \sum_{i=1}^{|x|} \left[ \delta(i) \log p^{(D_2)}(x_i) + \left(1-\delta(i)\right) \log\left(1-p^{(D_2)}(x_i)\right) \right]

where δ(i) indicates whether the token at position i is original (δ(i) = 1) or replaced (δ(i) = 0), and p^{(D_2)}(x_i) is the discriminator's probability that token i is original.

Both paired (NL–PL) and large-scale unimodal PL corpora are exploited to ensure robust semantic encoding, context sensitivity, and cross-modal alignment. This training framework enhances CodeBERT’s capacity for complex embedding and discrimination beyond what unimodal or pure-MLM objectives achieve (Feng et al., 2020, Lu et al., 2021).
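
As a hedged illustration of how the two objectives combine during pretraining, the sketch below computes an MLM term and an RTD term with standard PyTorch losses; the generator models, replacement sampling, and hyperparameters of the actual training run are not reproduced here:

import torch
import torch.nn.functional as F

def mlm_loss(lm_logits: torch.Tensor, mlm_labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over masked positions only.
    lm_logits: (batch, seq_len, vocab_size); mlm_labels: token ids at masked
    positions and -100 elsewhere (ignored by the loss)."""
    return F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                           mlm_labels.view(-1), ignore_index=-100)

def rtd_loss(replaced_logits: torch.Tensor, is_original: torch.Tensor) -> torch.Tensor:
    """Binary original-vs-replaced classification over every token.
    replaced_logits: (batch, seq_len) discriminator scores;
    is_original: (batch, seq_len) floats, 1.0 if the token was kept, 0.0 if replaced."""
    return F.binary_cross_entropy_with_logits(replaced_logits, is_original)

# Hypothetical combined objective for one training step:
# total = mlm_loss(lm_logits, mlm_labels) + rtd_loss(replaced_logits, is_original)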

3. Principal Downstream Applications

CodeBERT’s capabilities manifest strongly in:

  • Natural Language Code Search: NL queries and code snippets are co-embedded, and a scoring/ranking head assesses semantic relevance; CodeBERT achieves state-of-the-art MRR across the CodeSearchNet languages (Feng et al., 2020, Lu et al., 2021). A minimal embedding-based sketch follows after this list.
  • Code Documentation Generation: CodeBERT serves as an encoder in transformer-based generation pipelines for synthesizing natural-language documentation from source code, outperforming RoBERTa and other sequence-to-sequence baselines as measured by BLEU-4 and other NLG metrics.
  • Clone and Defect Detection: Within CodeXGLUE, CodeBERT is fine-tuned as a classifier or ranking model for clone classification, defect prediction, cloze tests, and code summarization. Across these tasks, CodeBERT consistently outperforms pure-NL or code-only pre-trained models, with F1 scores in clone detection exceeding 90% and cloze-prediction accuracy above 85% (Lu et al., 2021).
  • Mutation Testing: Leveraging the MLM head, CodeBERT proposes realistic, context-aware token replacements for mutation operators. Studies report that CodeBERT’s mutants semantically resemble 60%+ of real-world faults (substantially more than PiTest or DeepMutation), and fault detection rates are competitive with traditional syntactic mutation tools (Ojdanic et al., 2021, Degiovanni et al., 2022).
  • Bug Repair and Logical Error Correction: When adapted to a sequence-to-sequence setting, CodeBERT (with an added Transformer decoder or via masked filling) repairs Java bugs with up to 72% accuracy on datasets containing duplicates, and logical error correction reaches 74.58% token-level repair accuracy in Python using prompt-based localization and soft prompts (Mashhadi et al., 2021, Xu et al., 10 Oct 2024).
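
For the code-search bullet above, a zero-shot co-embedding sketch is shown below; it uses raw [CLS] vectors and cosine similarity purely for illustration, whereas the cited results come from fine-tuned scoring heads:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def embed(text: str) -> torch.Tensor:
    """Return the [CLS] vector for a query or a code snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0]

query = "read a file line by line"
candidates = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]
q = embed(query)
scores = [torch.cosine_similarity(q, embed(c)).item() for c in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
print(best, scores)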

4. Model Efficiency, Compression, and Practical Constraints

The quadratic complexity of self-attention in CodeBERT raises practical concerns for long code inputs. Research on DietCode shows that data-side pruning (random dropout, frequency filtering, or attention-based knapsack selection) reduces input size by ~40% without significant performance loss (MRR drops from 0.74 to 0.71 for code search), yielding fine-tuning time reductions of 30–50% (Zhang et al., 2022).
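
As an illustrative baseline in the spirit of DietCode's frequency filtering (an assumption for illustration, not the paper's attention-based knapsack selection):

from collections import Counter
from typing import List

def frequency_prune(code_tokens: List[str], budget: int,
                    corpus_counts: Counter) -> List[str]:
    """Drop the globally most frequent tokens (delimiters, ubiquitous keywords)
    first until the sequence fits the length budget."""
    if len(code_tokens) <= budget:
        return code_tokens
    # Positions holding the most common corpus tokens are dropped first.
    ranked = sorted(range(len(code_tokens)),
                    key=lambda i: corpus_counts[code_tokens[i]],
                    reverse=True)
    dropped = set(ranked[: len(code_tokens) - budget])
    return [tok for i, tok in enumerate(code_tokens) if i not in dropped]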

Model compression strategies—knowledge distillation, quantization, and pruning—have disparate impacts (d'Aloisio et al., 18 Dec 2024):

  • Distillation halves model size and accelerates inference (e.g., –40% CPU), but can cause >10% effectiveness loss in complex tasks.
  • Quantization reduces memory footprint by ~50–60% with minimal accuracy drops, although it may increase inference latency due to hardware constraints (a minimal sketch follows at the end of this section).
  • Pruning shows inconsistent latency gains and introduces notable effectiveness degradation.

Hardware architecture and task requirements dictate the optimal compression-choice trade-off.
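
The quantization option above can be prototyped with PyTorch's post-training dynamic quantization; this is a generic sketch rather than the exact configuration evaluated in the cited study:

import io
import torch
from transformers import AutoModelForSequenceClassification

# CodeBERT encoder with a (randomly initialized) 2-way classification head,
# e.g., for defect detection; fine-tuning is assumed to happen before deployment.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

# Post-training dynamic quantization of all Linear layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(m: torch.nn.Module) -> float:
    """Serialized state_dict size, including packed int8 weights."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.0f} MB  ->  dynamic int8: {size_mb(quantized):.0f} MB")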

5. Limitations, Generalization, and Probing Behavior

Despite strong in-distribution results, CodeBERT’s generalizability to out-of-distribution or structurally divergent data is limited:

  • Semantic clone detection F1 falls from ~95% (benchmark) to 49–68% when applied to unseen clone/functionality IDs (Sonnekalb et al., 2022).
  • Logical semantic grounding is weak without fine-tuning; CodeBERT tends to model surface forms and depends strongly on programmer-defined identifiers (variable, method, and invocation names). Extensive anonymization of identifiers drops code search accuracy from 70% to 17% in Java (Zhang et al., 2023, Naik et al., 2022).
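
The identifier-dependence finding can be probed with a simple anonymization transform. The sketch below renames identifiers in Python code via the standard tokenize module (the cited experiments targeted Java, and this crude version also renames builtins and API names):

import io
import keyword
import tokenize

def anonymize_identifiers(source: str) -> str:
    """Replace every non-keyword identifier with VAR_k, a crude probe for how
    much a model relies on programmer-chosen names."""
    mapping = {}
    out_tokens = []
    for tok_type, tok_string, *_ in tokenize.generate_tokens(
            io.StringIO(source).readline):
        if tok_type == tokenize.NAME and not keyword.iskeyword(tok_string):
            mapping.setdefault(tok_string, f"VAR_{len(mapping)}")
            tok_string = mapping[tok_string]
        out_tokens.append((tok_type, tok_string))
    return tokenize.untokenize(out_tokens)

# Prints the snippet with area/width/height replaced by VAR_0/VAR_1/VAR_2
# (whitespace is normalized by untokenize).
print(anonymize_identifiers("def area(width, height):\n    return width * height\n"))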

Representational Similarity Analysis shows that fine-tuning, especially on bimodal (NL–PL) input, boosts semantic alignment. Nonetheless, CodeBERT is prone to overfitting well-represented constructs and shallow cues (Naik et al., 2022).

6. Extensions, Hybridization, and Current Research Directions

CodeBERT underpins derivative frameworks for mutation testing (μBERT (Degiovanni et al., 2022)), Simulink model mutation (BERTiMuS (Zhang et al., 13 Jan 2025)), and code naturalness assessment (CodeBERT-nt (Khanfir et al., 2022)). In code comment generation, fine-tuning and retrieval augmentation lead to improvements on specialized domains such as Bash scripts (Yu et al., 2022).

Hybrid systems integrating CodeBERT with autoregressive LLMs (e.g., GPT-3.5) for code completion fuse CodeBERT's context-aware encoding (F_CB) with generative features (F_GPT) using a weighted fusion layer:

F = a \cdot F_{\mathrm{CB}} + (1-a) \cdot F_{\mathrm{GPT}}

These systems exhibit improved accuracy (F1 up to 0.91), code quality (BLEU), and robustness across noisy or incomplete input scenarios, outperforming stand-alone models (Zhang et al., 10 Sep 2025).
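
A minimal sketch of such a weighted fusion layer is given below; the dimensionality, how F_GPT is extracted, and whether a is learned or fixed are assumptions for illustration:

import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """F = a * F_CB + (1 - a) * F_GPT with a single learnable mixing weight."""
    def __init__(self):
        super().__init__()
        self.logit_a = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, f_cb: torch.Tensor, f_gpt: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.logit_a)              # keeps a in (0, 1)
        return a * f_cb + (1 - a) * f_gpt

fusion = WeightedFusion()
f_cb, f_gpt = torch.randn(1, 768), torch.randn(1, 768)   # assumed equal dims
fused = fusion(f_cb, f_gpt)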

In low-resource languages and vulnerability prediction, CodeBERT's performance degrades with data scarcity (e.g., F1 dropping to 0.25–0.43 in Kotlin, Swift, and Rust), and conventional oversampling/undersampling techniques do not provide consistent remediation. LLMs (e.g., ChatGPT) have shown up to 34.4% better F1 in such scenarios (Le et al., 26 Apr 2024).

7. Interpretability, Syntactic Representation, and Future Directions

Analysis of CodeBERT's Transformer layers reveals that attention weights alone are insufficient to explain model predictions. The scaled transformation norm ‖α f(x)‖ (the attention weight α times the transformed input f(x)) better captures syntactic alignment with abstract syntax trees than α alone, with up to 83.4% agreement in syntactic structure at certain layers (Saad et al., 2023).
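
In this norm-based view, each attention head's output at position i is a weighted sum of transformed value vectors, and a token's contribution is measured by the norm of the whole summand rather than by the weight alone (the notation below is a standard reconstruction, not quoted from the cited paper):

y_i = \sum_{j} \alpha_{i,j}\, f(x_j), \qquad f(x) = \left(x W^{V} + b^{V}\right) W^{O}

\mathrm{contribution}(j \to i) = \left\lVert \alpha_{i,j}\, f(x_j) \right\rVert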

Given CodeBERT’s dependence on surface forms, future work will likely focus on:

  • Enhancing semantic and logical feature extraction, potentially via integrating ASTs, data-flow graphs, or static analyses.
  • Improving cross-lingual robustness and generalization to unseen code patterns.
  • Increasing interpretability via refined attention or representational diagnostics.
  • Enabling resource-efficient deployment through task-specific pruning, quantization, or distillation dynamically adapted to the deployment context.

In summary, CodeBERT combines a robust, flexible pretraining objective with an architecture optimized for bimodal representation, achieving leading performance in code–NL downstream tasks; ongoing research continues to address its generalization, efficiency, and semantic depth.

References (17)