Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies (2202.12312v2)

Published 24 Feb 2022 in cs.CL

Abstract: When we transfer a pretrained LLM to a new language, there are many axes of variation that change at once. To disentangle the impact of different factors like syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of crosslingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. %On the other hand, transferring to a dataset with an unaligned vocabulary is extremely hard to recover from in the low-data regime. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into the factors of cross-lingual transfer that researchers should most focus on when designing language transfer scenarios.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Zhengxuan Wu (37 papers)
  2. Isabel Papadimitriou (13 papers)
  3. Alex Tamkin (29 papers)
Citations (8)

Summary

We haven't generated a summary for this paper yet.