CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters (2010.10392v3)

Published 20 Oct 2020 in cs.CL

Abstract: Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.

CharacterBERT: Enhancing BERT for Specialized Domains

The paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters" introduces a novel approach to word-level representation in LLMs by replacing the conventional wordpiece tokenization system of BERT with a Character-CNN module. The purpose of this innovation is to address the shortcomings of predefined wordpiece vocabularies, particularly in specialized domains like medicine, where domain-specific terminology often results in inefficient subword breakdowns.

Motivation and Background

BERT is widely adopted as a foundational architecture for NLP systems, largely because of its bidirectional Transformer encoder. Its reliance on wordpieces, a subword tokenization scheme, works well in general domains but can be limiting in contexts that demand highly specialized vocabularies. The paper acknowledges the growing demand for models tailored to specific domains, such as clinical or biomedical text, where general-purpose wordpiece vocabularies may not adequately capture the specificity of the language used.

The Core Contribution: CharacterBERT

CharacterBERT represents a significant shift from dependency on predefined wordpieces to a character-based approach to token representation. The authors plug a Character-CNN module, akin to the character encoder used in ELMo, into the BERT architecture to generate robust, open-vocabulary, word-level embeddings.

Key Features

  • Character-CNN Module: This module builds a context-independent representation of each word from its sequence of characters, which the Transformer layers then turn into deep contextual embeddings (see the sketch after this list).
  • Absence of Predefined Vocabularies: CharacterBERT eschews a predefined wordpiece vocabulary, reducing conceptual complexity and avoiding bias toward general-domain wordpieces.
  • Increased Robustness: Character-level modeling is more resilient to input noise and misspellings, a common issue in textual data processing.
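
The Character-CNN follows the general design of ELMo's character encoder: character embeddings, several convolutional filters of varying widths, max-pooling over character positions, highway layers, and a projection to BERT's hidden size. Below is a minimal PyTorch sketch of such a module; the embedding size, filter widths, and layer counts are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an ELMo-style Character-CNN word encoder, the kind of
# module CharacterBERT uses in place of BERT's wordpiece embedding lookup.
# Hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn

class CharacterCNN(nn.Module):
    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128), (5, 256)),
                 n_highway=2, output_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # One 1-D convolution per filter width, applied over character positions.
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_out, kernel_size=width) for width, n_out in filters]
        )
        n_filters = sum(n_out for _, n_out in filters)
        # Highway layers non-linearly mix the pooled CNN features.
        self.highway = nn.ModuleList(
            [nn.Linear(n_filters, 2 * n_filters) for _ in range(n_highway)]
        )
        # Project to the Transformer's hidden size (768 for BERT-base).
        self.proj = nn.Linear(n_filters, output_dim)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) integer character codes per word.
        b, s, w = char_ids.shape
        x = self.char_emb(char_ids.reshape(b * s, w))     # (b*s, w, char_dim)
        x = x.transpose(1, 2)                             # (b*s, char_dim, w) for Conv1d
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        h = torch.relu(torch.cat(pooled, dim=-1))         # (b*s, n_filters)
        for layer in self.highway:
            p = layer(h)
            gate = torch.sigmoid(p[:, : h.size(-1)])
            h = gate * torch.relu(p[:, h.size(-1):]) + (1.0 - gate) * h
        return self.proj(h).reshape(b, s, -1)             # one vector per whole word

# Usage sketch: 2 sentences, 5 words each, up to 20 characters per word.
ids = torch.randint(1, 262, (2, 5, 20))
word_embeddings = CharacterCNN()(ids)   # shape: (2, 5, 768)
```

The output is one context-independent vector per word, which replaces the wordpiece embedding lookup and is then contextualized by the standard Transformer layers.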

Evaluation and Results

Through experiments on several medical-domain tasks, including named entity recognition, sentence similarity, and relation classification, CharacterBERT improves over BERT, particularly where wordpiece vocabularies handle domain terminology inefficiently. The authors compare it against general-domain models like BERT and specialized counterparts like BlueBERT, and CharacterBERT performs best on tasks requiring finer-grained domain representations and robustness to misspellings.

Implications and Future Work

The implications of CharacterBERT are twofold: practically, it improves performance on domain-specific NLP tasks; theoretically, it broadens the understanding and application of character-based models within Transformer architectures. Avenues for future research include speeding up pre-training, extending the approach to other languages, and further improving robustness through more advanced character-sequence processing.

This exploration of character-based word representations as an alternative to wordpiece systems in Transformers shows how complex models like BERT can be adapted to specialized contexts, pointing to a potential new direction for handling domain-specific language processing efficiently.

Authors (6)
  1. Hicham El Boukkouri (1 paper)
  2. Olivier Ferret (11 papers)
  3. Thomas Lavergne (5 papers)
  4. Hiroshi Noji (11 papers)
  5. Pierre Zweigenbaum (9 papers)
  6. Junichi Tsujii (3 papers)
Citations (152)