Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (2106.12672v3)

Published 23 Jun 2021 in cs.CL, cs.AI, and cs.LG

Abstract: State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.

Charformer: Enhancements in Character-Level Model Efficiency and Performance

The paper introduces Charformer, a novel architecture that integrates gradient-based subword tokenization (GBST) into a Transformer model, addressing the limitations of existing subword tokenization methods in NLP. The authors emphasize that traditional subword tokenization relies on static, separately trained algorithms, which restricts adaptability in multilingual settings and robustness to noise such as misspellings or non-standard language. Charformer instead learns subword representations from characters end-to-end as part of the model, optimizing both performance and efficiency.

Gradient-based subword tokenization (GBST) is the core innovation of the paper. GBST learns latent subword representations directly from character inputs by enumerating candidate subword blocks and scoring them position-wise with a block scoring network. This bypasses the frequency-based segmentation of traditional tokenizers and lets the model softly select subword candidates for each input, improving adaptability to unseen and low-resource languages. A sketch of this scoring-and-mixing step is shown below.
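
To make the mechanism concrete, the following is a minimal sketch of the block enumeration, scoring, and soft selection described above. It is an illustrative approximation rather than the authors' implementation: the mean-pooled block representations, the single linear layer standing in for the block scoring network, and names such as `GBSTSketch`, `max_block_size`, and `downsample` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GBSTSketch(nn.Module):
    """Illustrative sketch of gradient-based subword tokenization (GBST).

    Candidate subword blocks of sizes 1..max_block_size are mean-pooled,
    scored position-wise, and mixed via a softmax over block sizes.
    """

    def __init__(self, d_model: int, max_block_size: int = 4, downsample: int = 2):
        super().__init__()
        self.max_block_size = max_block_size
        self.downsample = downsample
        self.score = nn.Linear(d_model, 1)  # stand-in for the block scoring network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) character/byte embeddings
        B, L, D = x.shape
        block_reprs, block_scores = [], []
        for b in range(1, self.max_block_size + 1):
            # pad so the sequence length divides evenly into blocks of size b
            pad = (b - L % b) % b
            xp = F.pad(x, (0, 0, 0, pad))
            # non-overlapping mean pooling into candidate blocks of size b
            blocks = xp.view(B, -1, b, D).mean(dim=2)           # (B, L/b, D)
            # upsample back to character resolution by repetition
            up = blocks.repeat_interleave(b, dim=1)[:, :L]      # (B, L, D)
            block_reprs.append(up)
            block_scores.append(self.score(up))                 # (B, L, 1)
        reprs = torch.stack(block_reprs, dim=2)                 # (B, L, M, D)
        scores = torch.softmax(torch.cat(block_scores, dim=-1), dim=-1)  # (B, L, M)
        # position-wise soft selection over block sizes
        latent = (scores.unsqueeze(-1) * reprs).sum(dim=2)      # (B, L, D)
        # downsample before the Transformer stack to shorten the sequence
        latent = F.avg_pool1d(latent.transpose(1, 2), self.downsample).transpose(1, 2)
        return latent
```

In the full model, the downsampled latent subword sequence is what the byte-level Transformer encoder consumes; operating on this shorter sequence rather than on raw characters is the source of the speedups reported in the paper.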

The paper evaluates Charformer on the GLUE benchmark, on multilingual tasks such as TyDiQA and XQuAD, and on tasks involving non-standard English. The results demonstrate Charformer's competitiveness: it performs on par with or exceeds subword-based counterparts such as BERT and T5 in both standard and noisy text settings. Charformer also delivers efficiency gains, training 28%-100% faster than comparable byte-level and subword-level Transformers with fewer parameters, while attaining similar or better robustness and cross-lingual transfer.

An intriguing component of Charformer is its character-level approach, which significantly benefits noise-prone and non-standard datasets. For instance, in toxicity detection tasks on social media data, Charformer outperforms byte and subword-level baselines, showcasing robustness against spelling variations and unconventional language formats. This indicates significant promise for real-world applications where data often deviates from structured language norms.

In the multilingual setting, Charformer also exhibits comparable or superior performance to established multilingual models when pre-trained on mC4 across 101 languages. Its end-to-end learning ability significantly reduces segmentation errors typically seen in languages with complex morphology or those with limited training data, enhancing cross-linguistic generalization.

Looking forward, the implications for AI and natural language processing are substantial. Charformer presents a scalable path toward token-free models, potentially reducing reliance on labor-intensive, language-specific pre-processing. Its adaptability also suggests applications beyond NLP, in settings with similarly structured data or where robust handling of diverse inputs is required.

In summary, Charformer offers a compelling approach to improving the efficiency and performance of character-level modeling, particularly in multilingual and noise-prone settings. It invites further research into refining GBST and exploring its use across domains, and paves the way for more flexible, adaptable token-free models trained fully end-to-end.

Authors (10)
  1. Yi Tay (94 papers)
  2. Vinh Q. Tran (19 papers)
  3. Sebastian Ruder (93 papers)
  4. Jai Gupta (16 papers)
  5. Hyung Won Chung (30 papers)
  6. Dara Bahri (30 papers)
  7. Zhen Qin (105 papers)
  8. Simon Baumgartner (10 papers)
  9. Cong Yu (81 papers)
  10. Donald Metzler (49 papers)
Citations (142)