
Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring (2012.15715v2)

Published 31 Dec 2020 in cs.CL, cs.AI, and cs.LG

Abstract: Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
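Below is a minimal sketch of the context-anchoring idea as described in the abstract, not the authors' implementation (their method also adds self-learning and iterative restarts, omitted here). All names in it (train_anchored_sgns, seed_dict, tgt_emb) are illustrative assumptions. Target-language embeddings are held fixed; source-language embeddings are learned with skip-gram negative sampling, and whenever a context word has a translation in the seed dictionary, its fixed target-language vector serves as the anchor point.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_anchored_sgns(corpus, src_vocab, tgt_emb, seed_dict,
                        dim=300, window=5, neg=5, lr=0.025, epochs=5):
    """corpus: tokenized source-language sentences (list of lists of str).
    tgt_emb: {target word: fixed np.ndarray of size dim}.
    seed_dict: {source word: target word}, the weak supervision
    (e.g. words spelled identically in both languages)."""
    rng = np.random.default_rng(0)
    # Learned source-side input vectors, random init as in word2vec.
    w_in = {w: (rng.random(dim) - 0.5) / dim for w in src_vocab}
    # Learned source-side output (context) vectors, zero init.
    w_out = {w: np.zeros(dim) for w in src_vocab}
    words = list(src_vocab)

    for _ in range(epochs):
        for sent in corpus:
            for i, center in enumerate(sent):
                if center not in w_in:
                    continue
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j == i or sent[j] not in w_in:
                        continue
                    ctx = sent[j]
                    # Context anchoring: if ctx has a seed-dictionary
                    # translation, the positive example uses the *fixed*
                    # target-language vector, so the center word's
                    # embedding is pulled directly into the target space.
                    anchored = ctx in seed_dict and seed_dict[ctx] in tgt_emb
                    pos = tgt_emb[seed_dict[ctx]] if anchored else w_out[ctx]
                    v_in, grad_in = w_in[center], np.zeros(dim)
                    samples = [(pos, 1.0, anchored)] + [
                        (w_out[words[rng.integers(len(words))]], 0.0, False)
                        for _ in range(neg)]
                    for out, label, fixed in samples:
                        g = lr * (label - sigmoid(v_in @ out))
                        grad_in += g * out
                        if not fixed:  # anchor vectors never move
                            out += g * v_in
                    v_in += grad_in  # in-place update of w_in[center]
    return w_in  # source embeddings, now in the target embedding space
```

Because only the source-side vectors are ever updated, the target space stays intact, which is the key departure from offline mapping; in the paper the seed dictionary is then grown via self-learning and training is restarted, which this sketch leaves out.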

Authors (5)
  1. Aitor Ormazabal (10 papers)
  2. Mikel Artetxe (52 papers)
  3. Aitor Soroa (29 papers)
  4. Gorka Labaka (15 papers)
  5. Eneko Agirre (53 papers)
Citations (11)