CharacterBERT: A Character-Aware BERT Encoder
- CharacterBERT is a variant of BERT that represents words through character-level embeddings using a CNN and highway layers for robust, open-vocabulary encoding.
- The model replaces traditional WordPiece tokenization with direct character inputs, yielding improved resilience to typos and better domain adaptation in specialized texts.
- Empirical results show CharacterBERT outperforms standard BERT on biomedical and clinical benchmarks, though it incurs additional computational cost during pretraining.
A Character-Aware BERT Encoder, known as CharacterBERT, is a variant of the BERT architecture that eschews WordPiece and other subword-level tokenization in favor of representing words directly via their characters. CharacterBERT computes expressive, word-level, open-vocabulary embeddings using a convolutional neural network over character sequences, followed by highway layers and a projection to the BERT embedding dimension. This enables robust handling of out-of-vocabulary words, greater resilience to typographical errors, and improved adaptation to specialized domains such as biomedical and clinical text. The Transformer stack and pretraining recipes are otherwise unchanged from standard BERT, so the model applies directly to existing downstream tasks while providing empirical improvements in robustness and domain adaptation (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).
1. Model Architecture and Embedding Pipeline
CharacterBERT replaces BERT’s WordPiece token embedding layer with a character-level module composed of the following stages:
- Character Embedding Lookup: Each word token is represented as up to 50 byte-level characters from a fixed character vocabulary (UTF-8 bytes plus special symbols). Each character is mapped to a 16-dimensional embedding via a lookup matrix $E \in \mathbb{R}^{|\mathcal{V}_c| \times 16}$, yielding a character-embedding matrix $C \in \mathbb{R}^{50 \times 16}$ per word.
- Character-CNN Encoding: The stack of character embeddings is passed through 1D convolutions of widths $k = 1$ to $7$, with increasing output channels:
$$f_k = \mathrm{Conv1D}_k(C) \in \mathbb{R}^{(50 - k + 1) \times n_k},$$
where $n_k \in \{32, 32, 64, 128, 256, 512, 1024\}$ for $k = 1, \dots, 7$.
- Max-Over-Time Pooling and Concatenation: Each feature map undergoes max pooling over positions:
$$h_k = \max_{t} f_k[t, :].$$
The pooled outputs for all filter widths are concatenated into a 2048-dimensional vector $h = [h_1; \dots; h_7]$, since $\sum_k n_k = 2048$.
- Highway Layers and Down-Projection: Two highway layers mix $h$ with its nonlinearly transformed version, followed by a linear projection to the model dimension $d = 768$. Each highway layer computes
$$g = \sigma(W_g h + b_g), \qquad h \leftarrow g \odot \mathrm{ReLU}(W_H h + b_H) + (1 - g) \odot h,$$
and the final output is projected as $e_w = W_p h + b_p$ with $W_p \in \mathbb{R}^{768 \times 2048}$.
- Integration with Transformer Encoder: The word embedding $e_{w_i}$ for token $w_i$ is summed with positional ($p_i$) and segment ($s_i$) embeddings to give the input to the first Transformer layer:
$$x_i = e_{w_i} + p_i + s_i.$$
The upstream Transformer layers, attention mechanisms, and normalization procedures are standard BERT-base: 12 layers, 12 self-attention heads, hidden size 768, intermediate size 3072, GELU activations, with no modifications.
This model-level replacement with a "CharacterCNN + highway" module is the only deviation from standard BERT (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).
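The pipeline above can be sketched numerically. The following is a minimal NumPy illustration, not the released implementation: random weights stand in for learned parameters, and the per-width filter counts follow the common ELMo-style schedule (an assumption where the text states only the 2048-dimensional total).

```python
import numpy as np

rng = np.random.default_rng(0)

MAX_CHARS, CHAR_DIM, MODEL_DIM = 50, 16, 768
# widths -> output channels; this schedule sums to 2048
FILTERS = {1: 32, 2: 32, 3: 64, 4: 128, 5: 256, 6: 512, 7: 1024}
CHAR_VOCAB = 262  # 256 byte values plus a few special symbols (assumed size)

# Random parameters stand in for learned weights.
E = rng.normal(size=(CHAR_VOCAB, CHAR_DIM)) * 0.1
conv_W = {k: rng.normal(size=(n, k * CHAR_DIM)) * 0.1 for k, n in FILTERS.items()}
H = sum(FILTERS.values())  # 2048
highway = [(rng.normal(size=(H, H)) * 0.01, rng.normal(size=(H, H)) * 0.01)
           for _ in range(2)]
W_proj = rng.normal(size=(MODEL_DIM, H)) * 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def char_cnn_embed(char_ids):
    """Embed one word from its padded character ids (length MAX_CHARS)."""
    C = E[char_ids]  # (50, 16) character embeddings
    pooled = []
    for k, W in conv_W.items():
        # 1D convolution of width k, then max-over-time pooling
        windows = np.stack([C[t:t + k].ravel() for t in range(MAX_CHARS - k + 1)])
        pooled.append(np.tanh(windows @ W.T).max(axis=0))
    h = np.concatenate(pooled)  # (2048,)
    for W_h, W_g in highway:    # two highway layers
        g = sigmoid(h @ W_g)
        h = g * np.maximum(h @ W_h, 0.0) + (1.0 - g) * h
    return W_proj @ h           # down-projection to the BERT hidden size

word = b"penicillin"
char_ids = np.array(list(word) + [0] * (MAX_CHARS - len(word)))
vec = char_cnn_embed(char_ids)
print(vec.shape)  # (768,)
```

The resulting 768-dimensional vector plays exactly the role of a WordPiece embedding row in standard BERT; everything downstream is unchanged.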
2. Tokenization Strategy and Open-Vocabulary Handling
Whereas BERT relies on a fixed WordPiece vocabulary of approximately 30,000 subword tokens, CharacterBERT performs no subword segmentation. Each token—defined at the word level by a conventional tokenizer—is treated as an atomic string of bytes over the character vocabulary. This enables open-vocabulary encoding: any word, including true out-of-vocabulary types and arbitrary strings (so long as their byte-level characters appear in the character vocabulary), can be embedded and processed.
WordPiece tokenization in BERT can cause semantically questionable splitting, particularly for specialized or rare terms, and typographical errors can drastically change the resulting subword sequence, leading to unpredictable embeddings. CharacterBERT avoids such fragmentation: because every word is composed directly from its characters, with no subword bookkeeping, all lexical variants are represented in a single consistent embedding space (Boukkouri et al., 2020, Cao, 2023, Zhuang et al., 2022).
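Converting a word to character ids can be sketched as follows; the id offsets and special symbols here are illustrative, not the released model's exact vocabulary.

```python
# Assumed special symbols: padding, begin-of-word, end-of-word.
PAD, BOW, EOW = 0, 1, 2
MAX_CHARS = 50

def word_to_char_ids(word: str) -> list[int]:
    """Map any word to a fixed-length sequence of byte-level character ids."""
    ids = [BOW] + [3 + b for b in word.encode("utf-8")[:MAX_CHARS - 2]] + [EOW]
    return ids + [PAD] * (MAX_CHARS - len(ids))

# Any string gets an encoding -- there is no subword segmentation, so a typo
# perturbs only a few positions instead of reshaping the whole token sequence.
clean = word_to_char_ids("paracetamol")
typo = word_to_char_ids("paracetamol".replace("c", "k"))
print(sum(a != b for a, b in zip(clean, typo)))  # 1
```

Contrast this with WordPiece, where the same typo can split one in-vocabulary token into several unrelated subwords.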
3. Pretraining Objectives and Optimization
CharacterBERT preserves the two-headed pretraining objectives of BERT with minor adjustments:
- Masked Language Modeling (MLM):
- 15% of tokens are selected randomly; within this set, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left intact.
- Unlike BERT, the MLM prediction is at the word level rather than per subword unit. The MLM head performs a softmax over the top 100,000 most frequent word types from the pretraining corpus.
- The MLM loss:
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p\left(w_i \mid \tilde{X}\right),$$
where $M$ indexes masked positions and $\tilde{X}$ is the corrupted sentence (masking applied).
- Next Sentence Prediction (NSP):
- Binary classification: given two sentence segments $A$ and $B$, predict whether $B$ follows $A$ in the original text, using the pooled [CLS] token output.
- Optimization and Training Data:
- Pretraining for general-domain: Wikipedia (6M documents, 2.14B tokens), OpenWebText (1.56M documents, 1.28B tokens).
- Medical-domain adaptation: MIMIC-III clinical notes (0.505B tokens), PMC OA abstracts (0.522B tokens).
- Two-stage training: large batch/short sequence, then smaller batch/long sequence, with stage-specific learning rates, LAMB or Adam optimizers, weight decay, and linear warmup (Boukkouri et al., 2020, Cao, 2023).
Unlike typical subword models, CharacterBERT must engineer a fixed top-100k target vocabulary for the MLM classifier. The model retains the upstream need for word boundary segmentation to define masking spans and task heads.
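The 80/10/10 corruption scheme described above can be sketched in a few lines; the sentence and miniature vocabulary are illustrative, and note that here each masked position targets a whole word type, not a subword.

```python
import random

random.seed(0)
VOCAB = ["patient", "was", "given", "aspirin", "daily", "dose", "the"]

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masking at the word level: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> random word, 10% kept unchanged."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok  # word-level prediction target (top-100k softmax)
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)
            # else: keep the original token
    return inputs, targets

sentence = ["the", "patient", "was", "given", "aspirin", "daily"]
inp, tgt = mask_tokens(sentence)
```

The only CharacterBERT-specific twist is the target side: the classifier ranges over the top-100k word types rather than a subword vocabulary.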
4. Empirical Performance and Robustness
CharacterBERT's empirical performance demonstrates improvements in both domain-specific and general-language settings:
| Task | BERT_medical | CharacterBERT_medical | Difference |
|---|---|---|---|
| i2b2 | 87.9 ± 0.8 | 88.4 ± 0.5 | +0.5 |
| MEDNLI | 76.1 ± 1.2 | 76.7 ± 0.8 | +0.6 |
| ChemProt | 85.3 ± 0.7 | 86.8 ± 0.6 | +1.5 |
| DDI | 80.5 ± 1.1 | 82.5 ± 0.9 | +2.0 |
| ClinicalSTS | 0.90 ± 0.02 | 0.88 ± 0.04 | –0.02 |
CharacterBERT achieves statistically significant gains on ChemProt, DDI, i2b2, and MEDNLI benchmarks, with slightly lower performance on ClinicalSTS, which is attributed to the small and noisy nature of that dataset (Boukkouri et al., 2020). On biomedical tasks (e.g., BLURB benchmark), average F1 increased from 88.2 (BERT) to 90.1 (CharacterBERT) (Cao, 2023). General-language performance is matched or sometimes slightly improved.
Robustness to Misspellings:
Injected character-level noise (up to 40% of tokens altered per sequence) sharply degrades vanilla BERT, e.g., at 40% noise, BERT_medical F1 drops to ≈30, while CharacterBERT_medical retains ≈5 F1 points higher performance, exhibiting more graceful degradation under typographical attacks. When both training and evaluation data are noised, CharacterBERT continues to show an edge, confirming that its robustness is intrinsic to its character-aware design (Boukkouri et al., 2020, Zhuang et al., 2022).
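A noise-injection procedure of the kind used in these robustness experiments can be sketched as follows; the specific edit operations (swap, drop, substitute) are illustrative choices, not necessarily the exact corruption model of the cited papers.

```python
import random

random.seed(1)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def corrupt(word: str) -> str:
    """Apply one random character-level edit: swap, drop, or substitute."""
    if len(word) < 2:
        return word
    op = random.choice(["swap", "drop", "sub"])
    i = random.randrange(len(word) - 1)
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + random.choice(ALPHABET) + word[i + 1:]

def noise_sentence(tokens, rate=0.4):
    """Corrupt up to `rate` of the tokens, mirroring the 40% noise setting."""
    return [corrupt(t) if random.random() < rate else t for t in tokens]

noisy = noise_sentence(["the", "patient", "received", "antibiotics"])
```

Under such corruption, a subword tokenizer fragments the noisy words into unfamiliar pieces, while the character pipeline sees inputs that differ from the clean ones in only a few positions.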
Dense Retrieval and Typos:
In a dense retrieval setting (e.g., MS MARCO, TREC DL), CharacterBERT-DR with Self-Teaching (ST) reduces performance loss from typos by over 50% relative to standard BERT-retrievers, with MRR@10 on typoed queries increasing from 0.136 (BERT) to 0.263. Recall at 1000 improves by over 20 points, with no loss of accuracy on clean queries (Zhuang et al., 2022).
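The Self-Teaching idea can be sketched as a distribution-matching loss: the typoed query's score distribution over candidate passages is pushed toward the clean query's distribution. This is a minimal NumPy illustration; the scores are stand-ins for query–passage dot products, and the KL direction shown is one natural choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_teaching_kl(clean_scores, typo_scores):
    """KL(clean || typo) over candidate-passage score distributions."""
    p = softmax(np.asarray(clean_scores, dtype=float))
    q = softmax(np.asarray(typo_scores, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical score distributions give (near-)zero loss; the loss grows as the
# typoed query's ranking drifts away from the clean query's ranking.
zero = self_teaching_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
drift = self_teaching_kl([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

Minimizing this term during fine-tuning teaches the encoder to map a typoed query near its clean counterpart in embedding space.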
5. Practical Considerations, Engineering, and Ablations
Compute Considerations:
- Pretraining time is approximately doubled compared to standard BERT due to the heavier Character-CNN and the lack of embedding/output layer weight sharing.
- Fine-tuning incurs ≈19% extra computation, while inference is comparable or marginally faster due to the efficient forward-only convolutional pipeline (Boukkouri et al., 2020).
Module Ablation Studies:
- Removing the highway layer or reducing CNN filter number reduces F1 on downstream tasks by 1–2 points, confirming the necessity of these architectural elements for optimal performance (Cao, 2023).
- Replacing the character input pipeline with a WordPiece lookup recovers exactly BERT’s baseline performance, demonstrating that all performance gains are attributable to the character-aware embedding module.
Recipe for Practitioners:
To construct a CharacterBERT encoder:
- Instantiate a character vocabulary (e.g., byte values plus special symbols) and character embeddings (e.g., 16-dimensional, as in the original);
- Build a convolutional feature extractor (widths $1$–$7$, with per-width filter counts concatenating to a 2048-dimensional vector), apply two highway layers, and project to 768;
- Use the standard BERT-base Transformer stack (12 layers, 12 heads) without further modifications;
- For pretraining: MLM at 15% token mask, NSP as in BERT, with a top-K word softmax output (Boukkouri et al., 2020, Cao, 2023).
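The recipe above can be collected into a single configuration. Values are taken from the text; the per-width filter counts follow the common ELMo-style schedule and are an assumption where the text states only the 2048 total.

```python
CHARACTER_BERT_CONFIG = {
    "max_chars_per_word": 50,
    "char_embedding_dim": 16,
    # (width, n_filters) pairs; counts are an assumed ELMo-style schedule
    "cnn_filters": [(1, 32), (2, 32), (3, 64), (4, 128),
                    (5, 256), (6, 512), (7, 1024)],
    "num_highway_layers": 2,
    "projection_dim": 768,          # BERT-base hidden size
    "num_transformer_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "mlm_mask_rate": 0.15,
    "mlm_output_vocab": 100_000,    # top-K word softmax for the MLM head
}

total_filters = sum(n for _, n in CHARACTER_BERT_CONFIG["cnn_filters"])
```

Everything below `"projection_dim"` is standard BERT-base; only the entries above it are CharacterBERT-specific.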
6. Comparison with Related Character-aware Models
CharBERT (Ma et al., 2020) and other variants also adopt character-level information, but through a dual-channel architecture, combining parallel subword and character-based representations via a “heterogeneous interaction” (HI) module at each Transformer layer. CharBERT introduces an additional Noisy LM (NLM) pretraining objective with explicit character corruption (drop/add/swap), providing denoising supervision. Gains are reported in both standard and adversarial settings (e.g., SQuAD2.0 F1: BERT 76.3→CharBERT 78.6), with smaller performance degradation under adversarial misspelling attacks (e.g., QNLI accuracy: BERT 90.7→63.4, CharBERT 91.7→80.1 under attack).
A substantive difference is that CharacterBERT entirely replaces BERT’s subword embedding layer, operating solely at the word-as-character sequence granularity, whereas CharBERT fuses both subword and character sequences at multiple network depths. Both methods outperform subword-only models on tasks sensitive to textual perturbations. However, CharBERT’s added BiGRU and fusion modules increase model complexity and parameter count, whereas CharacterBERT remains structurally simple (Ma et al., 2020, Cao, 2023).
7. Limitations and Future Research Directions
CharacterBERT has several practical and theoretical limitations:
- Training Efficiency: Character-level convolutions and non-shared output layers increase training time (≈2× BERT) and GPU resource demand.
- Vocabulary Engineering: Requires explicit construction of a large word-vocabulary output layer for MLM prediction, which may be less tractable in highly inflected or agglutinative languages.
- Persisting Need for Token Boundaries: While all representation is open-vocabulary, a conventional tokenizer is still needed to delineate masking spans during pretraining and to interface with downstream heads.
- Scope of Robustness: Although robustness to single character-level noise is substantially improved, catastrophic errors under heavy corruption or non-lexical input may remain.
Proposed enhancements include replacing the softmax MLM head with efficient sampling-based objectives (e.g., Noise-Contrastive Estimation), extending the architecture to lighter Transformer variants, generalizing to multilingual settings, and expanding coverage to non-classification tasks such as span tagging and generation (Boukkouri et al., 2020, Cao, 2023).
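One of the efficiency directions mentioned above, replacing the full 100k-way softmax with a sampling-based objective, can be sketched as follows. This is an illustrative sampled-softmax variant with uniform negatives and toy dimensions (the real head would use the 100k vocabulary and 768-dimensional hidden states); NCE proper differs in its exact scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, DIM, N_NEG = 10_000, 64, 20  # toy sizes for illustration
output_embeddings = rng.normal(size=(VOCAB_SIZE, DIM)) * 0.02

def sampled_softmax_loss(hidden, target_id):
    """Score the true word against N_NEG uniformly sampled negatives instead
    of normalizing over the full vocabulary."""
    neg_ids = rng.integers(0, VOCAB_SIZE, size=N_NEG)
    cand = np.concatenate(([target_id], neg_ids))
    logits = output_embeddings[cand] @ hidden  # (1 + N_NEG,)
    logits -= logits.max()                     # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))

loss = sampled_softmax_loss(rng.normal(size=DIM), target_id=42)
```

Each masked position then costs O(N_NEG) dot products rather than O(|V|), which is the appeal for a word-level head over 100k types.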
CharacterBERT illustrates that character-aware, open-vocabulary input encoding is a principled, empirically effective strategy for neural language modeling, especially in domains or applications where WordPiece-style subword vocabularies are fragile or inadequate. Its combination of architectural minimalism and robust lexical generalization is a distinguishing factor in the evolving landscape of Transformer-based encoders.