Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information (2108.00391v1)

Published 1 Aug 2021 in cs.CL

Abstract: Commonly-used transformer LLMs depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and this implies discarding the many domain-specific models that were trained on tokens. In this paper, we present XRayEmb, a method for retrofitting existing token-based models with character-level information. XRayEmb is composed of a character-level "encoder" that computes vector representations of character sequences, and a generative component that decodes from the internal representation to a character sequence. We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures and on both sequence-level and sequence tagging tasks, particularly on non-standard English text.
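The abstract describes XRayEmb as two parts: a character-level encoder that maps a word's character sequence to a vector compatible with the pre-trained model's token-embedding space, and a generative component that decodes characters back from that vector. Below is a minimal sketch of that retrofitting idea, not the authors' implementation; the GRU architecture, dimensions, and all names (CharEncoder, CharDecoder, the injection position) are illustrative assumptions.

```python
# Sketch of the retrofitting idea from the abstract: encode a word's characters
# into a vector, inject it into the sequence of pre-trained token embeddings,
# and reconstruct the characters as an auxiliary generative objective.
# All architectural choices here are assumptions, not the paper's exact method.
import torch
import torch.nn as nn


class CharEncoder(nn.Module):
    """Maps a character sequence to one vector in the token-embedding space."""

    def __init__(self, n_chars: int = 256, char_dim: int = 64, hidden_dim: int = 768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.GRU(char_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, num_chars) -> pooled vector (batch, hidden_dim)
        _, h = self.rnn(self.char_emb(char_ids))
        return h[-1]


class CharDecoder(nn.Module):
    """Generative component: predicts character logits from the pooled vector."""

    def __init__(self, n_chars: int = 256, hidden_dim: int = 768):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)

    def forward(self, pooled: torch.Tensor, length: int) -> torch.Tensor:
        # Condition each decoding step on the pooled vector.
        steps = pooled.unsqueeze(1).expand(-1, length, -1)
        h, _ = self.rnn(steps)
        return self.out(h)  # (batch, length, n_chars)


if __name__ == "__main__":
    torch.manual_seed(0)
    token_embs = torch.randn(1, 5, 768)        # stand-in for pre-trained token embeddings
    char_ids = torch.randint(0, 256, (1, 12))  # characters of one non-standard word
    encoder, decoder = CharEncoder(), CharDecoder()
    word_vec = encoder(char_ids)               # (1, 768)
    token_embs[:, 2, :] = word_vec             # inject at that word's position
    recon_logits = decoder(word_vec, length=12)
    print(token_embs.shape, recon_logits.shape)
```

The resulting embedding sequence would then be fed to the existing token-based transformer (autoregressive or masked), which is the setting the paper evaluates on sequence-level and sequence tagging tasks.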

Authors (4)
  1. Yuval Pinter (41 papers)
  2. Amanda Stent (11 papers)
  3. Mark Dredze (66 papers)
  4. Jacob Eisenstein (73 papers)
Citations (7)