Probing BLOOM models on languages not included in pretraining to enable typological analysis

Investigate the probing performance of the 176B-parameter BLOOM model and the 1.7B-parameter BLOOM-1B7 model on languages absent from the ROOTS pretraining corpus, in order to support typological interpretation and to identify which linguistic features are most and least learnable across unseen languages.

Background

The multilingual probing conducted in the paper covers only languages that are both represented in Universal Dependencies and included in the pretraining corpus. Extending probing to languages the models were never trained on would test genuine cross-lingual generalization and reveal typological patterns in which linguistic properties these models capture without direct exposure.

The authors explicitly highlight the value of expanding the set of languages for probing to enable broader typological insights and a deeper understanding of feature learnability across diverse linguistic families.
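
As a rough illustration of how such an extension might be operationalized, the sketch below probes layer-wise representations of a BLOOM checkpoint with a linear classifier on sentences from a language outside the pretraining corpus. The bigscience/bloom-1b7 checkpoint name is real, but the placeholder sentences, the binary feature, and the mean-pooling choice are illustrative assumptions, not the paper's exact probing setup.

# Layer-wise linear probing sketch (assumptions: Hugging Face transformers
# and scikit-learn; the placeholder data stands in for Universal
# Dependencies sentences annotated with a binary morphosyntactic feature).
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")
model = AutoModel.from_pretrained("bigscience/bloom-1b7",
                                  output_hidden_states=True)
model.eval()

# Fill these from a UD treebank of a language absent from ROOTS
# (e.g., sentences labeled for past-tense marking).
sentences = ["...", "..."]   # target-language sentences (placeholders)
labels = [0, 1]              # gold feature values, one per sentence

def embed(text, layer):
    # Mean-pool one layer's hidden states into a sentence vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

# Train a probe per layer; held-out accuracy indicates how linearly
# decodable the feature is at each depth.
for layer in range(model.config.n_layer + 1):
    X = [embed(s, layer) for s in sentences]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer}: accuracy {probe.score(X_te, y_te):.2f}")

Comparing such layer-wise accuracy curves across seen and unseen languages, and across feature types, is one plausible way to identify which linguistic properties remain decodable without direct pretraining exposure.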

References

It should be noted that the following questions remain for further research: ... (2) Multilingual abilities. A separate research interest lies in considering languages that are not explicitly included in the models' pretraining corpus. Expanding the set of languages for probing would allow for typological interpretation and a deeper analysis of the most learnable and hardest-to-learn linguistic features at a considerably broader scope.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (Workshop et al., 2022; arXiv:2211.05100), Section: Evaluation, Subsection: Multilingual Probing, Discussion.