Pretraining Coverage of Low-Resource Languages in Qwen2.5-VL 3B
Determine whether the pretraining corpus of Qwen2.5-VL 3B contains text in the low-resource languages Kyrgyz (ky), Uzbek (uz), Georgian (ka), Lithuanian (lt), Latvian (lv), Bulgarian (bg), Macedonian (mk), and Malagasy (mg), so that multilingual finetuning and evaluation results can be interpreted correctly.
References
Since it is unclear whether Qwen2.5-VL 3B has seen these low-resource languages during pretraining, we finetune it using two methods: (i) pure text finetuning and (ii) visual-text finetuning.
— See the Text: From Tokenization to Visual Reading
(2510.18840 - Xing et al., 21 Oct 2025) in Section 4.3 “Multilingual Translation Evaluation,” Low-resource Languages