
CharacterBERT: A Character-Aware BERT Encoder

Updated 12 March 2026
  • CharacterBERT is a variant of BERT that represents words through character-level embeddings using a CNN and highway layers for robust, open-vocabulary encoding.
  • The model replaces traditional WordPiece tokenization with direct character inputs, yielding improved resilience to typos and better domain adaptation in specialized texts.
  • Empirical results show CharacterBERT outperforms standard BERT on biomedical and clinical benchmarks, though it incurs additional computational cost during pretraining.

A Character-Aware BERT Encoder, known as CharacterBERT, is a variant of the BERT architecture that eschews WordPiece or subword-level tokenization in favor of representing words directly via their characters. CharacterBERT computes expressive, word-level, open-vocabulary embeddings using a convolutional neural network over character sequences, followed by highway layers and a projection to the BERT embedding dimension. This enables robust handling of out-of-vocabulary words, greater resilience to typographical errors, and improved adaptation to specialized domains such as biomedical and clinical text. The Transformer stack and pretraining recipes are otherwise unchanged from standard BERT, so the model drops into existing downstream pipelines while providing empirical improvements in robustness and domain adaptation (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).

1. Model Architecture and Embedding Pipeline

CharacterBERT replaces BERT’s WordPiece token embedding layer with a character-level module composed of the following stages:

  1. Character Embedding Lookup: Each word token $w$ is represented as up to 50 characters $[c_1, \ldots, c_L]$ drawn from a fixed character vocabulary with $|\mathcal{V}| = 263$ (UTF-8 bytes plus special symbols). Each character $c_i$ is mapped to a 16-dimensional embedding via a lookup matrix $C \in \mathbb{R}^{263 \times 16}$:

$e_i = C[c_i] \in \mathbb{R}^{16}, \quad i = 1, \ldots, L \leq 50$

  2. Character-CNN Encoding: The stack of character embeddings $E = [e_1, \ldots, e_L]$ is passed through 1D convolutions of widths $k = 1$ to $7$, with increasing output channels:

$H^{(k)} = \mathrm{Conv1D}(E; W^{(k)}, b^{(k)}) \in \mathbb{R}^{f_k \times (L-k+1)}$

where

$f_k = \begin{cases} 32 & k = 1, 2 \\ 64 & k = 3 \\ 128 & k = 4 \\ 256 & k = 5 \\ 512 & k = 6 \\ 1024 & k = 7 \end{cases}$

  3. Max-Over-Time Pooling and Concatenation: Each feature map undergoes max pooling over positions:

$\tilde{h}^{(k)}_j = \max_{1 \leq i \leq L-k+1} H^{(k)}_{j,i}, \quad j = 1, \ldots, f_k$

The pooled outputs for all filter widths are concatenated into a single 2048-dimensional vector $h$.

  4. Highway Layers and Down-Projection: Two highway layers mix $h$ with a nonlinearly transformed version of itself, followed by a linear projection to the model dimension $d = 768$:

$z = W_p \, \mathrm{Highway}_2(\mathrm{Highway}_1(h)) + b_p, \quad z \in \mathbb{R}^{768}$

  5. Integration with Transformer Encoder: The word embedding $z_t$ for token $t$ is summed with the positional embedding $P_t$ and segment embedding $S_t$ to give the input to the first Transformer layer:

$X_t = z_t + P_t + S_t$

The subsequent Transformer layers, attention mechanisms, and normalization procedures are those of standard BERT-base: 12 layers, 12 self-attention heads, hidden size 768, intermediate size 3072, and GELU activations, with no modifications.
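The embedding pipeline described above can be sketched as a single PyTorch module. The hyperparameter values (263 characters, 16-dimensional character embeddings, the per-width filter counts, two highway layers, projection to 768) follow the text; class and variable names are illustrative and not taken from the official CharacterBERT implementation.

```python
# Sketch of the Character-CNN embedding module, assuming PyTorch.
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway layer: gated mix of the input and a nonlinear transform of it."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))      # transform gate
        h = torch.relu(self.transform(x))    # candidate activation
        return t * h + (1.0 - t) * x         # carry the remainder through

class CharacterCNN(nn.Module):
    """Word embeddings from characters: embed -> CNN -> max-pool -> highway -> project."""
    # (kernel width, filter count) pairs for k = 1..7; filters sum to 2048
    FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

    def __init__(self, n_chars=263, char_dim=16, out_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, f, kernel_size=k) for k, f in self.FILTERS
        )
        self.highways = nn.Sequential(Highway(2048), Highway(2048))
        self.proj = nn.Linear(2048, out_dim)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_len) integer character indices
        b, n, L = char_ids.shape
        e = self.char_emb(char_ids.view(b * n, L))   # (b*n, L, 16)
        e = e.transpose(1, 2)                        # (b*n, 16, L) for Conv1d
        # max-over-time pooling per filter width, then concatenation to 2048 dims
        pooled = [conv(e).max(dim=-1).values for conv in self.convs]
        h = torch.relu(torch.cat(pooled, dim=-1))    # (b*n, 2048)
        z = self.proj(self.highways(h))              # (b*n, 768)
        return z.view(b, n, -1)
```

The output replaces BERT's WordPiece embedding table; positional and segment embeddings are added on top exactly as in standard BERT.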

Swapping in this "Character-CNN + highway" embedding module is the only deviation from standard BERT (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).

2. Tokenization Strategy and Open-Vocabulary Handling

Whereas BERT relies on a fixed WordPiece vocabulary of approximately 30,000 subword tokens, CharacterBERT performs no subword segmentation. Each token—defined at the word level by a conventional tokenizer—is treated as an atomic string of characters over the character vocabulary. This enables open-vocabulary encoding: any word, including true out-of-vocabulary types and arbitrary strings (so long as their byte-level characters are in $\mathcal{V}$), can be embedded and processed.
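A minimal sketch of this open-vocabulary encoding step: each word becomes a fixed-length row of character IDs, so any string is encodable. The PAD/BOW/EOW codes and the byte offset below are assumptions for illustration, not the official character mapping.

```python
# Assumed special symbols and byte offset; illustrative only.
PAD, BOW, EOW = 0, 1, 2
OFFSET = 3        # byte values are shifted past the special symbols
MAX_LEN = 50      # per-word character budget, as in the text

def encode_word(word: str, max_len: int = MAX_LEN) -> list[int]:
    """Encode one word as [BOW, byte ids..., EOW], padded/truncated to max_len."""
    ids = [BOW] + [b + OFFSET for b in word.encode("utf-8")[: max_len - 2]] + [EOW]
    return ids + [PAD] * (max_len - len(ids))

def encode_sentence(text: str) -> list[list[int]]:
    # Any word -- including typos and unseen terms -- gets a valid encoding.
    return [encode_word(w) for w in text.split()]

rows = encode_sentence("paracetamol overdse")  # the typo is still encodable
assert all(len(r) == MAX_LEN for r in rows)
```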

WordPiece tokenization in BERT can produce semantically questionable splits, particularly for specialized or rare terms, and typographical errors can drastically change the subword sequence, leading to unpredictable embeddings. CharacterBERT avoids such fragmentation entirely: because no subword segmentation is performed, the convolutional character composition maps all lexical variants—including misspellings—into a single consistent embedding space (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).

3. Pretraining Objectives and Optimization

CharacterBERT preserves the two-headed pretraining objectives of BERT with minor adjustments:

  1. Masked Language Modeling (MLM):
    • 15% of tokens are selected at random; within this set, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left intact.
    • Unlike BERT, the MLM prediction is at the word level rather than per subword unit. The MLM head performs a softmax over the top 100,000 most frequent word types from the pretraining corpus.
    • The MLM loss:

    $\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p(w_i \mid \tilde{w})$

    where $M$ indexes the masked positions and $\tilde{w}$ is the corrupted sentence (with masking applied).

  2. Next Sentence Prediction (NSP):

    • Binary classification: given two sentence segments $(A, B)$, predict whether $B$ follows $A$ in the original text, using the pooled [CLS] output.
  3. Optimization and Training Data:
    • General-domain pretraining: Wikipedia (≈6M documents, 2.14B tokens) and OpenWebText (≈1.56M documents, 1.28B tokens).
    • Medical-domain adaptation: MIMIC-III clinical notes (0.505B tokens) and PMC OA abstracts (0.522B tokens).
    • Two-stage training: large batch/short sequence, then smaller batch/long sequence, with learning rates $6 \times 10^{-3}$ and $4 \times 10^{-3}$, LAMB or Adam optimizers, weight decay, and linear warmup (Boukkouri et al., 2020, Cao, 2023).

Unlike typical subword models, CharacterBERT must engineer a fixed top-100k word vocabulary as the target set for the MLM classifier. The model also still requires word-boundary segmentation upstream, both to define masking spans and to interface with task heads.
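The word-level masking scheme above can be sketched as follows. The `vocab` argument stands in for the top-100k word list, and the function name is illustrative.

```python
# Sketch of BERT-style 15% / 80-10-10 masking over word-level tokens.
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, p=0.15, rng=random):
    """Return (corrupted tokens, dict mapping masked position -> original word)."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok                      # MLM predicts the original word
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random replacement
            # remaining 10%: keep the token unchanged
    return corrupted, targets
```

The MLM head then scores each target position against the fixed word vocabulary with a softmax, as described above.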

4. Empirical Performance and Robustness

CharacterBERT's empirical performance demonstrates improvements in both domain-specific and general-language settings:

Task          BERT (medical)   CharacterBERT (medical)   Difference
i2b2          87.9 ± 0.8       88.4 ± 0.5                +0.5
MEDNLI        76.1 ± 1.2       76.7 ± 0.8                +0.6
ChemProt      85.3 ± 0.7       86.8 ± 0.6                +1.5
DDI           80.5 ± 1.1       82.5 ± 0.9                +2.0
ClinicalSTS   0.90 ± 0.02      0.88 ± 0.04               −0.02

CharacterBERT achieves statistically significant gains on ChemProt, DDI, i2b2, and MEDNLI benchmarks, with slightly lower performance on ClinicalSTS, which is attributed to the small and noisy nature of that dataset (Boukkouri et al., 2020). On biomedical tasks (e.g., BLURB benchmark), average F1 increased from 88.2 (BERT) to 90.1 (CharacterBERT) (Cao, 2023). General-language performance is matched or sometimes slightly improved.

Robustness to Misspellings:

Injected character-level noise (up to 40% of tokens altered per sequence) sharply degrades vanilla BERT: at 40% noise, BERT_medical F1 drops to ≈30, while CharacterBERT_medical retains roughly 5 F1 points more, degrading more gracefully under typographical attack. When both training and evaluation data are noised, CharacterBERT keeps its edge, confirming that the robustness is intrinsic to its character-aware design (Boukkouri et al., 2020, Zhuang et al., 2022).
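The kind of character-level noising used in these robustness evaluations can be sketched as below; the specific edit operations (swap, drop, substitute) and their uniform sampling are illustrative of the setup, not the exact experimental protocol.

```python
# Sketch of per-token character-level noise injection.
import random
import string

def noise_token(tok: str, rng: random.Random) -> str:
    """Apply one random character edit; very short tokens are left alone."""
    if len(tok) < 2:
        return tok
    i = rng.randrange(len(tok) - 1)
    op = rng.choice(["swap", "drop", "sub"])
    if op == "swap":   # transpose adjacent characters
        return tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
    if op == "drop":   # delete one character
        return tok[:i] + tok[i + 1:]
    return tok[:i] + rng.choice(string.ascii_lowercase) + tok[i + 1:]  # substitute

def noise_sentence(tokens, rate=0.4, seed=0):
    """Noise roughly `rate` of the tokens (40% matches the heaviest setting above)."""
    rng = random.Random(seed)
    return [noise_token(t, rng) if rng.random() < rate else t for t in tokens]
```

Because CharacterBERT embeds the noised surface form directly, such edits perturb the character-CNN features smoothly instead of triggering a different subword segmentation.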

Dense Retrieval and Typos:

In a dense retrieval setting (e.g., MS MARCO, TREC DL), CharacterBERT-DR with Self-Teaching (ST) reduces performance loss from typos by over 50% relative to standard BERT-retrievers, with MRR@10 on typoed queries increasing from 0.136 (BERT) to 0.263. Recall at 1000 improves by over 20 points, with no loss of accuracy on clean queries (Zhuang et al., 2022).

5. Practical Considerations, Engineering, and Ablations

Compute Considerations:

  • Pretraining time is approximately doubled compared to standard BERT due to the heavier Character-CNN and the lack of embedding/output layer weight sharing.
  • Fine-tuning incurs ≈19% extra computation, while inference is comparable or marginally faster due to the efficient forward-only convolutional pipeline (Boukkouri et al., 2020).

Module Ablation Studies:

  • Removing the highway layers or reducing the number of CNN filters lowers F1 on downstream tasks by 1–2 points, confirming that these architectural elements are necessary for optimal performance (Cao, 2023).
  • Replacing the character input pipeline with a WordPiece lookup recovers exactly BERT’s baseline performance, demonstrating that all performance gains are attributable to the character-aware embedding module.

Recipe for Practitioners:

To construct a CharacterBERT encoder:

  • Instantiate a character vocabulary (e.g., a full byte/ASCII set plus special symbols) and character embeddings (dimension $d_{\mathrm{char}} = 16$–$100$);
  • Build a convolutional feature extractor (widths $\{1, \ldots, 7\}$ with the corresponding filter counts), apply two highway layers, and project to 768 dimensions;
  • Use the standard BERT-base Transformer stack (12 layers, 12 heads) without further modifications;
  • For pretraining: MLM at 15% token mask, NSP as in BERT, with a top-K word softmax output (Boukkouri et al., 2020, Cao, 2023).
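The recipe can be collected into one illustrative hyperparameter set. The values come from this article; the field names are invented for readability, not an official configuration schema.

```python
# Illustrative CharacterBERT hyperparameters gathered from the text.
CHARACTER_BERT_CONFIG = {
    "char_vocab_size": 263,
    "char_embedding_dim": 16,
    "max_word_length": 50,
    "cnn_filters": {1: 32, 2: 32, 3: 64, 4: 128, 5: 256, 6: 512, 7: 1024},
    "num_highway_layers": 2,
    "hidden_size": 768,            # standard BERT-base Transformer stack
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "mlm_mask_prob": 0.15,
    "mlm_vocab_size": 100_000,     # top-K word softmax for the MLM head
}

# Concatenated CNN outputs must match the pre-projection width of 2048.
assert sum(CHARACTER_BERT_CONFIG["cnn_filters"].values()) == 2048
```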

6. Comparison with CharBERT and Related Variants

CharBERT (Ma et al., 2020) and other variants also incorporate character-level information, but through a dual-channel architecture that combines parallel subword- and character-based representations via a "heterogeneous interaction" (HI) module at each Transformer layer. CharBERT introduces an additional noisy language modeling (NLM) pretraining objective with explicit character corruption (drop/add/swap), providing denoising supervision. Gains are reported in both standard and adversarial settings (e.g., SQuAD 2.0 F1: BERT 76.3 → CharBERT 78.6), with smaller degradation under adversarial misspelling attacks (e.g., QNLI accuracy under attack: BERT 90.7 → 63.4 vs. CharBERT 91.7 → 80.1).

A substantive difference is that CharacterBERT entirely replaces BERT’s subword embedding layer, operating solely at the word-as-character sequence granularity, whereas CharBERT fuses both subword and character sequences at multiple network depths. Both methods outperform subword-only models on tasks sensitive to textual perturbations. However, CharBERT’s added BiGRU and fusion modules increase model complexity and parameter count, whereas CharacterBERT remains structurally simple (Ma et al., 2020, Cao, 2023).

7. Limitations and Future Research Directions

CharacterBERT has several practical and theoretical limitations:

  • Training Efficiency: Character-level convolutions and non-shared output layers increase training time (≈2× BERT) and GPU resource demand.
  • Vocabulary Engineering: Requires explicit construction of a large word-vocabulary output layer for MLM prediction, which may be less tractable in highly inflected or agglutinative languages.
  • Persisting Need for Token Boundaries: While all representation is open-vocabulary, a conventional tokenizer is still needed to delineate masking spans during pretraining and to interface with downstream heads.
  • Scope of Robustness: Although robustness to character-level noise is substantially improved, failures under heavy corruption or non-lexical input may remain.

Proposed enhancements include replacing the softmax MLM head with efficient sampling-based objectives (e.g., Noise-Contrastive Estimation), extending the architecture to lighter Transformer variants, generalizing to multilingual settings, and expanding coverage to non-classification tasks such as span tagging and generation (Boukkouri et al., 2020, Cao, 2023).

CharacterBERT illustrates that character-aware, open-vocabulary input encoding is a principled, empirically effective strategy for neural language modeling, especially in domains or applications where WordPiece-style subword vocabularies are fragile or inadequate. Its combination of architectural minimalism and robust lexical generalization is a distinguishing factor in the evolving landscape of Transformer-based encoders.
