How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? (2406.11477v2)

Published 17 Jun 2024 in cs.CL and cs.AI

Abstract: LLMs have shown remarkable capabilities in many languages beyond English. Yet they require more inference steps when generating non-English text because they rely on English-centric tokenizers and vocabularies, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target-language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness for inference speedup, previous work on vocabulary expansion has focused on high-resource settings, assuming access to a substantial amount of target-language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. Vocabulary expansion in low-resource settings, however, has yet to be explored. In this paper, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks, and models, we establish a set of strategies for performing vocabulary expansion that yields faster inference while maintaining downstream performance competitive with baselines, using only 30K sentences (~0.01 GB of text) from the target language.
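The core recipe the abstract describes — adding target-language tokens to the tokenizer and initializing their embeddings from existing subword embeddings before continual pre-training — can be sketched as follows. This is a minimal illustration using the Hugging Face transformers API, not the authors' exact implementation: the model name and token list are placeholders, and mean-of-subwords initialization is only one of the embedding initialization methods such work considers.

```python
# Minimal sketch of vocabulary expansion with mean-of-subwords embedding
# initialization (illustrative; not the paper's exact recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Hugging Face causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical target-language tokens mined from a small target corpus.
new_tokens = ["こんにちは", "ありがとう"]

# Record how the *original* vocabulary segments each new token, before
# expansion changes the tokenization.
subword_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

in_emb = model.get_input_embeddings().weight.data
out_emb = model.get_output_embeddings().weight.data  # same tensor if weights are tied
with torch.no_grad():
    for token, ids in subword_ids.items():
        new_id = tokenizer.convert_tokens_to_ids(token)
        in_emb[new_id] = in_emb[ids].mean(dim=0)   # mean of old subword embeddings
        out_emb[new_id] = out_emb[ids].mean(dim=0)
```

After initialization, the expanded model would be adapted with continual pre-training on the small target-language corpus; the payoff is that target-language text now tokenizes into fewer pieces, so generation requires fewer inference steps.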

Authors (3)
  1. Atsuki Yamaguchi (11 papers)
  2. Aline Villavicencio (31 papers)
  3. Nikolaos Aletras (72 papers)
Citations (2)