Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation (2310.03477v1)

Published 5 Oct 2023 in cs.CL and cs.AI

Abstract: Training monolingual LLMs for low- and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resource monolingual LLMs to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping substantially improves the initialization of the embedding table for the target language. We conduct experiments converting high-resource models to mid- and low-resource languages, namely Dutch and Frisian. These converted models achieve new state-of-the-art performance on these languages across a wide range of downstream tasks. By significantly reducing the amount of data and time required to train state-of-the-art models, our novel model conversion strategy has the potential to benefit many languages worldwide.
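To make the core idea concrete, below is a minimal sketch of the dictionary-based, one-to-many embedding initialization the abstract describes. It is not the authors' implementation: the toy `dictionary`, the tokenizer interface (a HuggingFace-style `get_vocab()` / `encode()`), and the fallback random initialization are all assumptions made here for illustration.

```python
import numpy as np

def init_target_embeddings(tgt_tokenizer, src_tokenizer, src_embeddings,
                           dictionary):
    """Sketch: initialize a target-language embedding table from a source model.

    tgt_tokenizer / src_tokenizer: objects with a HuggingFace-style
        `get_vocab()` and `encode()` interface (an assumption here).
    src_embeddings: (src_vocab_size, dim) array of source embeddings.
    dictionary: dict mapping target-language words to source-language words.
    """
    dim = src_embeddings.shape[1]
    tgt_vocab = tgt_tokenizer.get_vocab()  # token string -> token id
    # Fallback: small random init for tokens with no dictionary match.
    tgt_embeddings = np.random.normal(scale=0.02, size=(len(tgt_vocab), dim))

    for token, tgt_id in tgt_vocab.items():
        word = token.lstrip("\u2581\u0120")  # strip common subword prefix markers
        translation = dictionary.get(word)
        if translation is None:
            continue  # no dictionary entry: keep the random initialization
        # One-to-many mapping: the translated word may split into several
        # source tokens; average their embeddings as the initialization.
        src_ids = src_tokenizer.encode(translation, add_special_tokens=False)
        if src_ids:
            tgt_embeddings[tgt_id] = src_embeddings[src_ids].mean(axis=0)

    return tgt_embeddings
```

After this initialization, the converted model would still be trained further on target-language data; the point of the mapping is that training starts from semantically meaningful embeddings rather than from scratch.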

Authors (5)
  1. François Remy
  2. Pieter Delobelle
  3. Bettina Berendt
  4. Kris Demuynck
  5. Thomas Demeester