Evaluating Cross-Lingual Word Embeddings: Beyond Bilingual Lexicon Induction
The paper "How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions" by Goran Glavaš et al., provides an in-depth examination of cross-lingual word embeddings (CLEs), highlighting the need for comprehensive evaluation methodologies beyond the predominant task of bilingual lexicon induction (BLI). The authors aim to reassess current CLE models by scrutinizing their performance across both intrinsic and extrinsic tasks, presenting a more holistic view of their capabilities in cross-lingual NLP applications.
Key Findings and Contributions
The research highlights a prevalent problem: CLE models are heavily optimized for BLI, which often comes at the expense of downstream NLP performance. The paper rigorously evaluates supervised and unsupervised CLE models on BLI and three downstream tasks: cross-lingual document classification (CLDC), cross-lingual information retrieval (CLIR), and cross-lingual natural language inference (XNLI). Notably, it shows that BLI performance does not reliably predict downstream performance, an insight that challenges current evaluation practice.
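For concreteness, the core of a BLI evaluation is nearest-neighbour word translation in the shared embedding space. The sketch below scores precision@1 with plain cosine retrieval; the variable names (src_emb, tgt_emb, test_dict) are illustrative assumptions, and this is only one common BLI metric, not necessarily the exact protocol used in the paper.

```python
# Minimal sketch of BLI evaluation as nearest-neighbour retrieval (precision@1).
# Assumes src_emb / tgt_emb are {word: vector} dicts already mapped into a shared
# space, and test_dict is a list of (source_word, gold_translation) pairs.
import numpy as np

def bli_precision_at_1(src_emb, tgt_emb, test_dict):
    tgt_words = list(tgt_emb.keys())
    # Stack and length-normalise target vectors so dot product = cosine similarity.
    T = np.vstack([tgt_emb[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)

    hits, total = 0, 0
    for src_word, gold in test_dict:
        if src_word not in src_emb or gold not in tgt_emb:
            continue  # skip out-of-vocabulary pairs
        v = src_emb[src_word]
        v = v / np.linalg.norm(v)
        best = tgt_words[int(np.argmax(T @ v))]  # nearest neighbour under cosine
        hits += int(best == gold)
        total += 1
    return hits / total if total else 0.0
```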
The paper finds that models such as RCSLS perform strongly on BLI but are markedly less effective on tasks like XNLI and CLIR, exposing a divergence between BLI-specific optimization and broader cross-lingual utility. Conversely, VecMap is consistently robust in both supervised and unsupervised settings across a range of language pairs, including less studied ones, leading the authors to recommend its unsupervised variant as a reliable baseline for future research.
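Projection-based methods such as VecMap rest on an orthogonal mapping between the two monolingual spaces; in the supervised case this reduces to the Procrustes problem solved in closed form via SVD. The sketch below shows that step only, assuming X and Y are row-aligned matrices of embeddings for seed translation pairs (names are illustrative); the full VecMap pipeline additionally applies normalisation, re-weighting, and, in the unsupervised setting, iterative self-learning.

```python
# Minimal sketch of the orthogonal (Procrustes) projection step used by
# projection-based CLE methods such as VecMap (supervised variant).
# X, Y: (n, d) matrices whose i-th rows are the embeddings of the i-th seed
# translation pair in the source and target language, respectively.
import numpy as np

def orthogonal_procrustes(X, Y):
    # W* = argmin over orthogonal W of ||XW - Y||_F, solved via SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # (d, d) orthogonal projection matrix

def project(src_vectors, W):
    # Map source-language vectors into the shared (target) space.
    return src_vectors @ W
```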
Methodological Consistency and Recommendations
The paper advocates for a standardized and rigorously defined evaluation protocol across CLE research. It emphasizes aligning training and evaluation dictionaries across language pairs to ensure fair comparisons, and it critiques the common neglect of statistical significance testing in BLI results. The proposed holistic evaluation framework, encompassing a spectrum of tasks and languages, aims to guide future CLE model development and benchmarking.
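To illustrate the kind of significance testing the authors call for, the sketch below runs a paired bootstrap over the per-query correctness of two models on the same BLI test dictionary. The choice of test and all names here are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative paired bootstrap test for comparing two CLE models on BLI.
# correct_a, correct_b: boolean arrays (one entry per test dictionary pair)
# indicating whether model A / model B retrieved the gold translation.
import numpy as np

def paired_bootstrap_p_value(correct_a, correct_b, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)

    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test pairs with replacement
        delta = correct_a[idx].mean() - correct_b[idx].mean()
        wins += delta <= 0  # resamples where A does not beat B
    return wins / n_resamples  # rough one-sided p-value for "A better than B"
```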
Implications and Future Directions
Practically, these findings imply a need for researchers and developers to consider multiple tasks when designing and evaluating CLEs, moving beyond the myopic focus on BLI. Theoretically, this work challenges researchers to rethink how cross-lingual semantic spaces are evaluated, emphasizing the importance of task diversity to capture the nuanced properties of embedding models.
The research opens several avenues for future exploration, particularly in understanding the nature of projection distortions in non-orthogonal models like RCSLS and their impact on downstream tasks. Additionally, the paper calls for more nuanced diagnostics that analyze the structural and semantic quality of shared embedding spaces, particularly for under-resourced and typologically diverse languages.
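As one hypothetical starting point for such diagnostics (an illustration, not a method proposed in the paper), the deviation of a learned projection matrix from orthogonality can be quantified directly, since orthogonal maps preserve distances and angles while non-orthogonal ones, like those learned by RCSLS, distort the monolingual geometry.

```python
# Illustrative diagnostic: how far a learned projection W deviates from an
# orthogonal map. For orthogonal projections W^T W = I; larger Frobenius-norm
# deviations indicate stronger distortion of the source-space geometry.
import numpy as np

def orthogonality_deviation(W):
    d = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(d), ord="fro")
```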
By elucidating these critical insights and setting new standards for comprehensive CLE evaluation, this paper lays a foundation for continued advancements in multilingual NLP, ultimately aiming for more flexible and application-agnostic cross-lingual strategies.