A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings (1805.06297v2)

Published 16 May 2018 in cs.CL, cs.AI, and cs.LG

Abstract: Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap

A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings

This research paper addresses the problem of unsupervised cross-lingual embedding mappings, proposing a novel method that effectively tackles the limitations of previous adversarial approaches when applied to more challenging linguistic scenarios. The authors emphasize that previous methods often rely on conditions such as comparable corpora or closely related languages, which limit their applicability in real-world settings.

Method Overview

The authors introduce a robust self-learning technique that obviates the need for parallel data or seed dictionaries, relying instead on the structural similarities present in monolingual embedding spaces. The method comprises four main stages; illustrative code sketches for each stage follow the list:

  1. Embedding Normalization: Monolingual embeddings are length-normalized, mean-centered, and length-normalized again, so that dot products behave as cosine similarities and are comparable across languages.
  2. Unsupervised Initialization: The authors develop a technique to align words across languages without supervision, based on the similarity distributions of words. Each word is represented by the sorted vector of its intra-lingual similarities, and words whose sorted distributions agree across languages are paired to form an initial, noisy dictionary that serves as a starting point for further learning.
  3. Self-Learning Iteration: The core of their approach is a self-learning algorithm that iteratively refines an initial weak solution. Key enhancements include stochastic dictionary induction, frequency-based vocabulary cutoffs, and bidirectional induction, leveraging Cross-domain Similarity Local Scaling (CSLS) for improved similarity measures.
  4. Symmetric Re-weighting: The method concludes with a refinement step that symmetrically re-weights the mapped embeddings to further improve alignment quality.
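
The normalization in step 1 is a simple deterministic pipeline. The NumPy sketch below follows the preprocessing described in the paper (unit-length rows, mean centering, unit-length again); the function and variable names are illustrative, not taken from the released vecmap code.

```python
import numpy as np

def normalize_embeddings(X):
    """Step 1: length-normalize, mean-center, then length-normalize again."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length word vectors
    X = X - X.mean(axis=0, keepdims=True)             # center each dimension
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # restore unit length
    return X
```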
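Step 2 can be sketched as follows, continuing the NumPy import above and assuming both embedding matrices are restricted to the same number of most frequent words and already normalized. This is a simplified illustration of the core idea (sorted intra-lingual similarity distributions as language-independent word signatures), not the authors' exact implementation.

```python
def unsupervised_init(X, Z):
    """Step 2: build an initial dictionary from sorted similarity distributions."""
    Mx = np.sort(X @ X.T, axis=1)                    # intra-lingual similarities, sorted per word
    Mz = np.sort(Z @ Z.T, axis=1)
    Mx /= np.linalg.norm(Mx, axis=1, keepdims=True)  # compare distributions on the unit sphere
    Mz /= np.linalg.norm(Mz, axis=1, keepdims=True)
    agreement = Mx @ Mz.T                            # how well each source/target pair's distributions match
    return agreement.argmax(axis=1)                  # noisy initial mapping: source word i -> target word
```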
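Dictionary induction inside the self-learning loop (step 3) scores translation candidates with CSLS rather than plain nearest-neighbor retrieval. A minimal dense computation of the CSLS score, assuming unit-length mapped embeddings XW and ZW, is shown below; the actual system additionally applies stochastic dictionary induction, frequency-based vocabulary cutoffs, and bidirectional induction on top of this score.

```python
def csls(XW, ZW, k=10):
    """Step 3: CSLS(x, z) = 2*cos(x, z) - r_T(x) - r_S(z)."""
    sims = XW @ ZW.T                                    # cosine similarities (rows are unit length)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # avg similarity of each source word's k nearest targets
    r_trg = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # avg similarity of each target word's k nearest sources
    return 2 * sims - r_src[:, None] - r_trg[None, :]
```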
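Step 4 can be approximated with an SVD over the cross-covariance of the dictionary entries, splitting the singular-value re-weighting symmetrically between both languages. The sketch below omits the whitening and de-whitening steps of the full pipeline, so it is a rough illustration under that simplification rather than the complete method.

```python
def symmetric_reweight(X, Z, src_idx, trg_idx):
    """Step 4: map both sides via the SVD of the cross-covariance, re-weighting by sqrt of the singular values."""
    U, s, Vt = np.linalg.svd(X[src_idx].T @ Z[trg_idx])
    XW = (X @ U) * np.sqrt(s)     # mapped, re-weighted source embeddings
    ZW = (Z @ Vt.T) * np.sqrt(s)  # mapped, re-weighted target embeddings
    return XW, ZW
```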

Empirical Evaluation

The paper demonstrates strong performance across several datasets covering a range of linguistic distances, including challenging language pairs such as English-Finnish. The proposed method achieves state-of-the-art results in bilingual lexicon induction, surpassing previous supervised systems while remaining robust to initial conditions and hyperparameter choices.

Implications and Future Work

The research presents significant advancements for unsupervised cross-lingual learning, opening avenues for its application to diverse languages and less conventional corpora. As natural language processing moves towards inclusive multilingual models, integrating such methods can extend coverage to new languages without heavy reliance on costly parallel data or seed dictionaries.

Future directions could involve extending this methodology to multilingual embeddings and incorporating phrase-level context, which would address tasks beyond word-level translation and contribute to more sophisticated applications like unsupervised machine translation.

The researchers release their implementation openly, enabling the community to reproduce, extend, and adapt the method, and thereby promoting advances in cross-lingual representation learning.

Authors (3)
  1. Mikel Artetxe (52 papers)
  2. Gorka Labaka (15 papers)
  3. Eneko Agirre (53 papers)
Citations (574)