Focused probing of under-resourced Indic and Niger-Congo languages and comparison with high-resource languages

Investigate morpho-syntactic probing results for under-resourced languages of the Indic and Niger-Congo families that were sparsely represented in the ROOTS pretraining corpus, and compare these results with high-resource languages to derive linguistic insights about performance disparities.

Background

The pretraining corpus and the probing analysis cover under-resourced languages, but in smaller amounts than high-resource languages. Understanding how BLOOM and BLOOM-1B7 represent these languages is crucial for equitable multilingual performance.

The authors suggest a dedicated probing analysis of under-resourced Indic and Niger-Congo languages, together with a comparison against high-resource languages, to reveal systematic linguistic and representational differences between the two groups.

References

It should be noted that the following questions remain for further research: 3. Under-resourced language evaluation. The under-resourced languages of the Indic and Niger-Congo families included in the pretraining corpus in smaller shares represent a separate subject for future probing. We also plan to investigate the results of high-resourced and under-resourced languages to reveal possible linguistic insights in these two groups.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv:2211.05100, Workshop et al., 2022), Section: Evaluation, Subsection: Multilingual Probing, Discussion