Hierarchical Document Encoder for Parallel Corpus Mining (1906.08401v2)

Published 20 Jun 2019 in cs.CL

Abstract: We explore using multilingual document embeddings for nearest neighbor mining of parallel data. Three document-level representations are investigated: (i) document embeddings generated by simply averaging multilingual sentence embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a hierarchical multilingual document encoder (HiDE) that builds on our sentence-level model. The results show that document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest that models trained hierarchically at the document level are more effective on noisy data. Analysis experiments demonstrate that our hierarchical models are very robust to variations in the underlying sentence embedding quality. Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining: 94.9% P@1 for en-fr and 97.3% P@1 for en-es.
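
The simplest of the three representations, (i), is easy to sketch: average a document's multilingual sentence embeddings into one vector, then mine parallel documents by nearest-neighbor search. Below is a minimal illustration of that idea; it assumes sentence embeddings have already been produced by some multilingual sentence encoder, and all names (doc_embedding, mine_parallel_docs) are illustrative rather than code from the paper.

    import numpy as np

    def doc_embedding(sentence_embeddings):
        # sentence_embeddings: (num_sentences, dim) array of multilingual
        # sentence embeddings; the document vector is simply their mean
        # (representation (i) in the abstract).
        return sentence_embeddings.mean(axis=0)

    def mine_parallel_docs(src_docs, tgt_docs):
        # src_docs: (n_src, dim) and tgt_docs: (n_tgt, dim) document
        # embeddings. For each source document, return the index of the
        # nearest target document under cosine similarity; P@1 scores
        # these 1-best matches against the gold alignment.
        src = src_docs / np.linalg.norm(src_docs, axis=1, keepdims=True)
        tgt = tgt_docs / np.linalg.norm(tgt_docs, axis=1, keepdims=True)
        sims = src @ tgt.T  # brute-force scoring; approximate nearest
                            # neighbor search scales better to large corpora
        return sims.argmax(axis=1)

The BoW and HiDE variants replace the unweighted mean with learned document encoders, but the mining step stays the same: nearest-neighbor retrieval in a shared multilingual embedding space.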

Authors (8)
  1. Mandy Guo
  2. Yinfei Yang
  3. Keith Stevens
  4. Daniel Cer
  5. Heming Ge
  6. Yun-Hsuan Sung
  7. Brian Strope
  8. Ray Kurzweil
Citations (23)
