Pretraining Coverage of Low-Resource Languages in Qwen2.5-VL 3B

Determine whether the pretraining corpus of Qwen2.5-VL 3B included the low-resource languages Kyrgyz (ky), Uzbek (uz), Georgian (ka), Lithuanian (lt), Latvian (lv), Bulgarian (bg), Macedonian (mk), and Malagasy (mg), so that multilingual finetuning and evaluation results can be interpreted correctly.

Background

The paper evaluates multilingual translation performance and explicitly distinguishes between high-resource and low-resource languages. For the low-resource set (ky, uz, ka, lt, lv, bg, mk, mg), the authors note uncertainty about whether the Qwen2.5-VL 3B model was exposed to these languages during pretraining.

This uncertainty motivates two finetuning strategies, pure-text and visual-text, which are compared to see how training format affects performance. Establishing pretraining coverage would clarify the model’s prior knowledge and put the observed gains and fertility differences across languages in context.
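As a rough illustration of the fertility measure mentioned above, the sketch below computes the average number of tokens per whitespace-delimited word. This definition of fertility is a common convention and is assumed here; it is not taken from the paper. The function names `fertility` and `toy_tokenize` are hypothetical, and `toy_tokenize` is a stand-in that splits words into two-character chunks; in practice one would pass the model's actual tokenizer. High fertility for a language is at most a weak hint of limited tokenizer support and does not by itself establish whether the language appeared in the pretraining corpus.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word.

    Assumption: fertility is defined as tokens / words, with words
    obtained by splitting on whitespace.
    """
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks each word into pieces of
    at most two characters. Replace with a real subword tokenizer."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


if __name__ == "__main__":
    # Illustrative sample sentences only; not data from the paper.
    samples = {
        "en": ["the model reads text"],
        "lt": ["modelis skaito tekstą"],
    }
    for lang, texts in samples.items():
        print(lang, round(fertility(texts, toy_tokenize), 2))
```

Comparing this number across the eight low-resource languages and a few high-resource ones, using the same tokenizer and comparable text, would show how unevenly the vocabulary covers them, which is one input to interpreting the finetuning results.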

References

Since it is unclear whether Qwen2.5-VL 3B has seen these low-resource languages during pretraining, we finetune it using two methods: (i) pure text finetuning and (ii) visual-text finetuning.

Source: See the Text: From Tokenization to Visual Reading (2510.18840 - Xing et al., 21 Oct 2025), Section 4.3 “Multilingual Translation Evaluation,” Low-resource Languages
