A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Published 11 Jun 2020 in cs.CL | (2006.06202v2)

Abstract: We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.


Summary

The paper demonstrates that training ELMo on the expansive OSCAR corpus achieves state-of-the-art results in mid-resource languages.
It employs UDPipe 2.0 to evaluate and compare performance, showing monolingual models outperform multilingual alternatives like mBERT.
The research highlights that training for more epochs improves embedding quality without overfitting, even for models trained on the smaller corpora.

The paper "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages" by Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot examines how monolingual contextualized word embeddings trained on the large, diverse OSCAR corpus compare with embeddings trained on Wikipedia. The study focuses on five typologically diverse mid-resource languages: Bulgarian, Catalan, Danish, Finnish, and Indonesian. The authors evaluate the embeddings on part-of-speech (POS) tagging and dependency parsing, where the OSCAR-based models equal or improve on the previous state of the art for all five languages.

Key Contributions and Methodology

The research uses the multilingual OSCAR corpus, which is extracted from Common Crawl through language classification, filtering, and cleaning. The study contrasts this with the more traditional, genre-specific data from Wikipedia. The authors train ELMo models on each corpus, relying on ELMo's ability to produce a context-sensitive vector for each word occurrence rather than a single static vector per word. They then evaluate the embeddings by supplying them as input features to UDPipe 2.0, which performs the POS tagging and dependency parsing.
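As a rough illustration of how a parsing evaluation like this can be scored, the sketch below computes unlabeled and labeled attachment scores (UAS and LAS), two standard dependency parsing metrics. This summary does not say which metrics the authors report, so the metric choice is an assumption, and the function name and data layout are hypothetical rather than part of UDPipe 2.0.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one or more sentences flattened into token lists.

    Each token is a (head_index, dependency_label) pair. This layout is an
    assumption made for the sketch, not a format defined by the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head matches the gold head.
    head_hits = sum(1 for g, p in zip(gold, pred) if g[0] == p[0])
    # LAS: both the head and the dependency label match.
    full_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    total = len(gold)
    return head_hits / total, full_hits / total


if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
    uas, las = attachment_scores(gold, pred)
    print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In this toy example every head is correct but one label is wrong, so the labeled score is lower than the unlabeled one. A real evaluation would read gold and predicted trees from treebank files rather than from hand-written lists.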

Significant Findings

Corpus Quality vs. Size: The authors argue that, despite potential noise, the OSCAR corpus provides a larger and more diverse set of data than Wikipedia, resulting in more effective language models for mid-resource languages. Indeed, the OSCAR-based embeddings outperform the Wikipedia-based embeddings in all evaluated tasks.
Comparison with State-of-the-Art Models: The OSCAR-trained ELMo models achieve superior results compared to multilingual approaches like mBERT, demonstrating that monolingual training on larger, varied datasets can surpass the cross-lingual capabilities of models like mBERT. This finding underscores the idea that a broad data spectrum in a single language can improve contextual word embedding quality even without cross-lingual context.
Evaluation of Noisiness: The study addresses concerns about noise in Common Crawl data by measuring out-of-vocabulary (OOV) rates on samples of the corpora, and finds that, once filtered, OSCAR's noise has only a minimal impact relative to Wikipedia.
Training Epochs and Overfitting: The research examines how the number of training epochs affects model performance and observes no overfitting in the models trained on the smaller corpora. Even with less training data, increasing the number of epochs yields better embeddings for downstream tasks, which indicates the importance of training models long enough.
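The OOV-rate check mentioned under "Evaluation of Noisiness" can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the reference vocabulary, the whitespace-level tokens, and the lowercasing step are all assumptions, and `oov_rate` is a hypothetical helper name.

```python
def oov_rate(tokens, vocabulary, lowercase=True):
    """Fraction of tokens that do not appear in a reference vocabulary.

    `vocabulary` is any set of known word forms, for example the word forms
    of a trusted reference text (a hypothetical choice for this sketch).
    """
    if not tokens:
        return 0.0
    if lowercase:
        # Normalize case on both sides so "The" and "the" count as one form.
        tokens = [t.lower() for t in tokens]
        vocabulary = {v.lower() for v in vocabulary}
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    reference = {"the", "cat", "sat"}
    sample = ["The", "cat", "zzz", "sat"]
    print(oov_rate(sample, reference))
```

Computing this rate on comparable samples from two corpora against the same reference vocabulary gives a coarse proxy for how much non-standard or garbled text each one contains. It does not distinguish noise from legitimate rare words, so it is best read as a comparative signal only.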

Practical and Theoretical Implications

The practical implications of this research are significant for NLP applications in mid-resource languages, which often suffer from limited high-quality text data. The success of the OSCAR-based ELMo models offers a pathway for improving NLP tasks in languages that lack the volume of quality data available for high-resource languages. On the theoretical side, the results highlight the difficulty of training effective multilingual models and suggest that monolingual models trained on large, diverse corpora are a viable alternative for individual languages.

Future Directions

This paper opens several avenues for further investigation. Testing whether similar approaches scale to low-resource languages could extend these gains to many more languages. Additionally, better filtering and noise reduction for Common Crawl-derived datasets may further improve the balance between data quantity and quality. Finally, training more resource-intensive models such as BERT on the same data would allow a more complete comparison and potentially broader applications.

In conclusion, this research contributes valuable insights into developing high-quality monolingual contextualized word embeddings for mid-resource languages, emphasizing the expansive potential of using larger, albeit noisier, datasets like OSCAR over traditional, quality-focused sources like Wikipedia.
