Multilingual Culture-Independent Word Analogy Datasets (1911.10038v2)

Published 22 Nov 2019 in cs.CL

Abstract: In text processing, deep neural networks mostly use word embeddings as input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, we typically use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
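
As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores a word analogy file against pretrained fastText vectors using the standard 3CosAdd formulation via gensim. The file names, the analogy-file format (": category" headers followed by "a b c d" lines, as in the original Google analogy set), and the use of gensim itself are assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch: analogy accuracy of fastText vectors with 3CosAdd (assumed setup).
from gensim.models import KeyedVectors

def evaluate_analogies(vectors_path: str, analogies_path: str) -> float:
    """Return the fraction of analogy questions a:b :: c:? answered correctly."""
    kv = KeyedVectors.load_word2vec_format(vectors_path)  # e.g. a fastText .vec file
    correct, total = 0, 0
    with open(analogies_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(":") or not line.strip():
                continue  # skip category headers and blank lines (assumed format)
            a, b, c, d = line.split()
            if any(w not in kv for w in (a, b, c, d)):
                continue  # skip questions with out-of-vocabulary words
            # 3CosAdd: the answer is the word closest to vec(b) - vec(a) + vec(c)
            predicted = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
            correct += int(predicted == d)
            total += 1
    return correct / total if total else 0.0

# Hypothetical usage with a Slovenian fastText vector file and analogy dataset:
# accuracy = evaluate_analogies("cc.sl.300.vec", "analogies.sl.txt")
```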

Authors (5)
  1. Kristiina Vaik (2 papers)
  2. Jessica Lindström (1 paper)
  3. Milda Dailidėnaitė (1 paper)
  4. Marko Robnik-Šikonja (39 papers)
  5. Matej Ulčar (8 papers)
Citations (14)
