Word Analogy Task in Embedding Evaluation

Updated 7 June 2026

The word analogy task is a foundational evaluation paradigm that uses vector arithmetic to reveal semantic and syntactic relationships in word embeddings.
Key methodologies include vector offset approaches like 3CosMul and supervised relational learning, leveraging cosine similarity and high-dimensional features.
Recent advances address limitations through refined metrics, multilingual extensions, and novel geometries such as hyperbolic spaces for robust analogical reasoning.

The word analogy task is a well-established evaluation paradigm for probing how semantic and syntactic relationships are encoded within word embeddings. Its standard formulation—solving queries of the form “a is to b as c is to ?”—has driven both methodological innovation in distributional semantics and the design of numerous multilingual benchmarks and computational models. The task's centrality arises from its capacity to reveal both linear regularities and fundamental limitations of vector space models, exposing issues of relational generalization, language morphology, and the geometry of underlying embeddings.

1. Formal Definitions and Standard Evaluation Protocols

The canonical word analogy query consists of four words, encoded as $(a, b, c, d)$, with the relationship $a~:~b~::~c~:~d$. In modern embedding-based formulations, the goal is to find $d$ such that the offset between $a$ and $b$ mirrors the offset between $c$ and $d$. The dominant computational approach is the vector offset method (VOM), usually instantiated as:

$\hat{d} = \arg\max_{w\in V \setminus \{a,b,c\}} \cos\left(\mathbf{v}_w,\, \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c\right)$

where $\mathbf{v}_x$ is the embedding of $x$, $V$ is the vocabulary, and $\cos(\cdot,\cdot)$ denotes cosine similarity (Fanaeepour et al., 2018, Ulčar et al., 2019).
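The vector offset rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `cosine` and `solve_analogy_offset` and the 2-dimensional toy embeddings are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy_offset(emb, a, b, c):
    """Return the word w (excluding a, b, c) that maximizes cos(v_w, v_b - v_a + v_c)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings: axis 0 loosely encodes "royalty", axis 1 "gender".
toy_emb = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

# man : woman :: king : ?
prediction = solve_analogy_offset(toy_emb, "man", "woman", "king")
```

Excluding the three query words from the candidate set matters in practice, because the target vector often lies closest to $c$ itself.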

Alternative formulations include 3CosMul (Fanaeepour et al., 2018):

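$\hat{d} = \arg\max_{w\in V \setminus \{a,b,c\}} \frac{\cos(\mathbf{v}_w, \mathbf{v}_b)\,\cos(\mathbf{v}_w, \mathbf{v}_c)}{\cos(\mathbf{v}_w, \mathbf{v}_a) + \varepsilon}$

where $\varepsilon$ is a small constant that prevents division by zero. Combining the three similarities multiplicatively rather than additively sharpens the ranking.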
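A multiplicative scoring rule of this kind can be sketched as follows. The code multiplies the candidate's similarities to $b$ and $c$ and divides by its similarity to $a$ plus a small constant. Two choices here are assumptions rather than anything fixed by the text: the cosines are shifted from $[-1, 1]$ to $[0, 1]$ so that products and ratios stay non-negative, and `eps=1e-3` is an illustrative default. The function names and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy_3cosmul(emb, a, b, c, eps=1e-3):
    """Return the word maximizing sim(w, b) * sim(w, c) / (sim(w, a) + eps)."""
    def shifted_sim(w, x):
        # Assumption: map cosine from [-1, 1] to [0, 1] before multiplying.
        return (cosine(emb[w], emb[x]) + 1.0) / 2.0

    def score(w):
        return shifted_sim(w, b) * shifted_sim(w, c) / (shifted_sim(w, a) + eps)

    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=score)

# Hypothetical toy embeddings, for illustration only.
toy_emb = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

prediction = solve_analogy_3cosmul(toy_emb, "man", "woman", "king")
```

Because the score is a ratio, a candidate that is very similar to $a$ is penalized strongly, whereas in the additive form a single large similarity term can dominate the sum.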

Primary evaluation uses top-1 accuracy (the fraction of queries where $\hat{d} = d$), often extended to top-$k$ accuracy (Hit@$k$), mean reciprocal rank (MRR), and other information-retrieval metrics, especially when multiple correct answers exist or ranking quality is critical (Newman-Griffis et al., 2017).
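The ranking metrics named above can be computed from ranked candidate lists as in the following sketch. It assumes each query comes with a list of candidates sorted from best to worst and a set of acceptable gold answers; the helper names and the example rankings are hypothetical.

```python
def first_gold_rank(ranked, gold):
    """1-based rank of the first gold answer in `ranked`, or None if absent."""
    for position, word in enumerate(ranked, start=1):
        if word in gold:
            return position
    return None

def hit_at_k(rankings, golds, k):
    """Fraction of queries with a gold answer among the top k candidates."""
    ranks = [first_gold_rank(r, g) for r, g in zip(rankings, golds)]
    return sum(1 for rank in ranks if rank is not None and rank <= k) / len(ranks)

def mean_reciprocal_rank(rankings, golds):
    """Mean of 1/rank of the first gold answer; a miss contributes 0."""
    ranks = [first_gold_rank(r, g) for r, g in zip(rankings, golds)]
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)

# Hypothetical system output for three analogy queries.
rankings = [
    ["queen", "princess", "king"],    # gold at rank 1
    ["rome", "paris", "berlin"],      # gold at rank 2
    ["walked", "walks", "walking"],   # gold missing
]
golds = [{"queen"}, {"paris"}, {"ran"}]

top1 = hit_at_k(rankings, golds, 1)
top2 = hit_at_k(rankings, golds, 2)
mrr = mean_reciprocal_rank(rankings, golds)
```

Top-1 accuracy is simply Hit@$k$ with $k = 1$, so a single routine covers both.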

2. Methods for Solving the Analogy Task

2.1 Vector Arithmetic in Euclidean Embedding Spaces

Early and still widely used approaches rely on word2vec-style or GloVe-style embeddings, utilizing their (approximate) linear structure to encode relational regularities (Fanaeepour et al., 2018). The relational vector is operationalized as $\mathbf{v}_b - \mathbf{v}_a$; analogy completion is thus a nearest-neighbor search after vector addition.

Embedding hyperparameter choices (window size, vector dimension, training corpus) can have significant impact: optimal performance is typically achieved with window sizes $2$–$4$, dimensions around $200$ for mixed tasks, and corpora of several million tokens (Fanaeepour et al., 2018). Character and subword-level models (e.g., fastText) further improve coverage, especially in morphologically rich languages (Ulčar et al., 2019, Svoboda et al., 2016).

2.2 Supervised Relational Vector Learning

Beyond naive vector offsets, supervised models learn relation types or relational features from data:

PairClass/High-Level Perception Models: These represent word pairs with high-dimensional vectors derived from millions of pattern-extracted contexts; classification is performed with SVMs, attaining strong performance on SAT-style and synonym/antonym benchmarks (0809.0124, Turney, 2011).
SuperSim: High-dimensional relational features (PPMI, domain-space and function-space cosines) are used in an SVM with a polynomial kernel to measure proportional analogy (Turney, 2013).

These approaches frame analogical reasoning as supervised classification of word pairs or quadruples, often substantially boosting performance in tasks involving nuanced or rare relations.

2.3 Cross-Lingual and Morphologically Rich Adaptations

Aligning semantic spaces across languages typically involves linear transformations using bilingual dictionaries (e.g., Orthogonal Procrustes, CCA), yielding bilingual or multilingual joint spaces that preserve analogical structure (Brychcín et al., 2018). Evaluation on cross-lingual analogy datasets (e.g., English–German, or 9-language sets in (Ulčar et al., 2019)) confirms that much—though not all—relational structure can be maintained under linear mapping, with decreased accuracy as the alignment spans more typologically diverse languages.
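As an illustration of the linear-mapping step, the orthogonal Procrustes solution can be obtained from a single SVD. The sketch below assumes NumPy is available and that the rows of `X` and `Y` are source- and target-language vectors for dictionary-paired words; the function name and the synthetic data are hypothetical and stand in for real bilingual embeddings.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Orthogonal matrix W minimizing ||X @ W - Y||_F.

    Rows of X and Y are embeddings of translation pairs from a bilingual dictionary.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: build a "target space" that is an exact rotation of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                        # source-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # hidden orthogonal map
Y = X @ Q                                           # target-language vectors

W = orthogonal_procrustes(X, Y)

# An orthogonal map preserves cosines, so offsets keep their relative geometry.
offset_src = X[1] - X[0]
offset_mapped = (X[1] @ W) - (X[0] @ W)
```

Because `W` is orthogonal, norms and angles are unchanged by the mapping, which is one reason analogical offsets learned in one language can survive the alignment.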

For languages with rich morphology, both the design of analogy datasets (covering inflections, derivations, and regular/irregular paradigms) and the specific embedding architectures (e.g., subword information in fastText, character-level DIEM in (Trask et al., 2015)) are crucial for high accuracy (Kurniawan, 2019, Svoboda et al., 2016, Trask et al., 2015).

2.4 Hyperbolic and Alternative Geometries

Embeddings in hyperbolic spaces (Poincaré ball, hyperboloid) have been proposed to encode hierarchies and tree-like relations (Leimeister et al., 2018, Saxena et al., 2022). Analogy computation in curved spaces requires geometric operations such as parallel transport and exponential/log maps; in low-dimensional regimes, hyperbolic embeddings can outperform Euclidean, though in higher dimensions and standard benchmarks, Euclidean embeddings remain competitive (Leimeister et al., 2018, Saxena et al., 2022). Empirically, anchoring analogy computation to the Euclidean chart of hyperbolic embeddings is more effective than using intrinsic hyperbolic algebra for current models (Saxena et al., 2022).
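To make the exponential and logarithmic maps concrete, the sketch below implements them at the origin of the Poincaré ball with curvature $-1$, together with the Poincaré distance. The last function shows one simple way to form an analogy target by working in the tangent space at the origin. That construction is an illustrative assumption, not the procedure of the cited papers, and all function names and points are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def exp0(v):
    """Exponential map at the origin of the Poincaré ball (curvature -1)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    return [math.tanh(n) * x / n for x in v]

def log0(x):
    """Logarithmic map at the origin; inverse of exp0 for points inside the ball."""
    n = norm(x)
    if n == 0.0:
        return list(x)
    return [math.atanh(n) * xi / n for xi in x]

def poincare_distance(x, y):
    """Geodesic distance between two points of the Poincaré ball."""
    diff = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    denom = (1.0 - norm(x) ** 2) * (1.0 - norm(y) ** 2)
    return math.acosh(1.0 + 2.0 * diff / denom)

def tangent_space_analogy_target(a, b, c):
    """Assumed construction: apply the offset b - a + c in the tangent space at the origin."""
    ta, tb, tc = log0(a), log0(b), log0(c)
    return exp0([y - x + z for x, y, z in zip(ta, tb, tc)])

# Hypothetical points inside the unit ball.
v = [0.3, -0.4]
x = exp0(v)
a, b, c = exp0([0.1, 0.2]), exp0([0.1, -0.2]), exp0([0.5, 0.2])
target = tangent_space_analogy_target(a, b, c)
```

Candidates would then be ranked by `poincare_distance` to `target` instead of by cosine similarity.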

3. Benchmark Datasets and Task Variants

A wide spectrum of datasets exists for monolingual, multilingual, biomedical, and domain-specific settings:

Dataset	Languages	Focus	Size (#analogies)	Notable Properties
Google Analogy	English	Sem/Syn	19,544	Country-capital, gender, verb inflection
BATS	English	Detailed morph/sem	98,000	10 categories, inflectional+semantic
KaWAT	Indonesian	Morph/Sem	34,000	Indonesian morphology, low OOV
Czech analogies	Czech	Morph/Sem	22,257	Rich morphology, 12+ categories
BMASS	English (Bio)	Biomedical relations	61,250	UMLS-derived, multi-answer
Multilingual set	9 languages	Perf balanced	18,000–20,000	Culturally neutral, cross-lingual
Persian SAT-analogy	Persian	Semantic	5,000	67 classes, formal/colloquial division
Cross-lingual (Brychcín et al.)	6 lang	Jointly mapped	~15–20k/language-pair	Linear alignment, multi-family

Datasets typically define semantic and syntactic categories for fine-grained evaluation, with recent advances placing special emphasis on non-English and morphologically complex languages (Ulčar et al., 2019, Svoboda et al., 2016, Mahmoudi et al., 2021).

4. Extensions, Criticisms, and Nuanced Evaluation

4.1 Limitations of Standard Analogy Tasks

Analogy benchmarks have been criticized for:

Proximity vs. Relational Signal: Most of the information in analogy completion is explained by proximity to the query word, not true analogical inference via relation vectors; the analogy offset term contributes only 1–2 bits of information (Montalvão, 2022).
Single-answer Assumption: Real-world analogies often admit multiple correct answers; restricting to top-1 accuracy misrepresents model performance (Newman-Griffis et al., 2017).
Assumed Sameness of Relation: Pairs may instantiate overlapping or ambiguous relationships, complicating transferability of offsets (Newman-Griffis et al., 2017).

4.2 Soft Accuracy, Entropy, and Information Content

New metrics such as soft accuracy and entropy-based information content provide a more refined assessment, disentangling information due to proximity from that genuinely contributed by analogy (Montalvão, 2022). Reporting mean reciprocal rank (MRR) and MAP is essential when multiple answers are possible (Newman-Griffis et al., 2017).
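When a query has several correct answers, mean average precision rewards systems that rank all of them highly. The following sketch uses the standard information-retrieval definition, in which average precision is divided by the total number of gold answers; the function names and example data are hypothetical.

```python
def average_precision(ranked, gold):
    """Average precision of one ranked list against a set of gold answers."""
    hits = 0
    precision_sum = 0.0
    for position, word in enumerate(ranked, start=1):
        if word in gold:
            hits += 1
            precision_sum += hits / position   # precision at this hit
    return precision_sum / len(gold) if gold else 0.0

def mean_average_precision(rankings, golds):
    """Mean of per-query average precision."""
    scores = [average_precision(r, g) for r, g in zip(rankings, golds)]
    return sum(scores) / len(scores)

# Hypothetical multi-answer queries.
rankings = [
    ["a", "b", "c", "d"],   # gold answers at ranks 1 and 3
    ["x", "y", "z"],        # single gold answer at rank 3
]
golds = [{"a", "c"}, {"z"}]

ap_first = average_precision(rankings[0], golds[0])
map_score = mean_average_precision(rankings, golds)
```

Gold answers that never appear in the ranking still count in the denominator, so missing a valid answer lowers the score.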

4.3 Generalized and One-to-X Analogy Tasks

Generalizations include one-to-X and one-to-none analogy formulations, enabling prediction of all valid targets (or none) for a given query, with cosine-thresholding mechanisms to filter false positives (Kutuzov et al., 2019). These settings better capture the multi-valued nature of real-world relations.
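A thresholded variant of the offset method can return zero, one, or several answers. The sketch below keeps every candidate whose cosine to the offset target reaches a threshold. The threshold value of 0.9, the function names, and the toy embeddings are hypothetical, and the code does not reproduce how the cited work sets its threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_one_to_x(emb, a, b, c, threshold=0.9):
    """Return all words whose cosine to v_b - v_a + v_c is at least `threshold`.

    An empty list means the query is judged to have no valid answer.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    answers = [
        w for w in emb
        if w not in {a, b, c} and cosine(emb[w], target) >= threshold
    ]
    return sorted(answers)

# Hypothetical toy embeddings, for illustration only.
toy_emb = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "empress": [1.1, -0.9],
    "apple": [-1.0, 0.2],
}

several = solve_one_to_x(toy_emb, "man", "woman", "king")    # two valid targets
none_found = solve_one_to_x(toy_emb, "man", "woman", "apple")  # no valid target
```

In practice the threshold would be tuned on held-out data, since a value that is too low admits false positives and one that is too high discards valid targets.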

4.4 Biomedical and Domain-Specific Extensions

Evaluation on domain-specific relations—e.g., drug–gene, protein–protein, or clinical concept analogies—demonstrates that the analogy paradigm extends beyond linguistic regularities to capture factual associations embedded in biomedical texts (Newman-Griffis et al., 2017, Yamagiwa et al., 2024). Averaged relation vectors and pathway-aware grouping improve predictive accuracy in these tasks (Yamagiwa et al., 2024).
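Averaging a relation vector over several known pairs, instead of relying on a single pair, can be sketched as follows. The entity names (`drugA`, `geneA`, and so on), the 2-dimensional vectors, and the function names are hypothetical placeholders, and pathway-aware grouping is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def mean_relation_vector(emb, pairs):
    """Average of v_b - v_a over known (a, b) pairs of one relation."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            total[i] += emb[b][i] - emb[a][i]
    return [t / len(pairs) for t in total]

def predict_with_relation(emb, relation, query):
    """Return the word closest (by cosine) to v_query + relation, excluding the query."""
    target = [q + r for q, r in zip(emb[query], relation)]
    candidates = [w for w in emb if w != query]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical embeddings of drugs and their target genes.
toy_emb = {
    "drugA": [1.0, 0.0], "geneA": [1.0, 1.0],
    "drugB": [2.0, 0.1], "geneB": [2.0, 1.0],
    "drugC": [3.0, 0.0], "geneC": [3.1, 1.0],
}

relation = mean_relation_vector(toy_emb, [("drugA", "geneA"), ("drugB", "geneB")])
prediction = predict_with_relation(toy_emb, relation, "drugC")
```

Averaging reduces the noise carried by any individual pair, which is the intuition behind using it for factual relations with many known instances.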

5. Practical Insights & Recommendations

Hyperparameters: For general-purpose analogy tasks in English and related languages, use a context window size of 2–4, a vector dimension of about 200, and corpora of about 4M tokens to saturate performance (Fanaeepour et al., 2018).
Domain- or language-specific tuning: Morphologically rich, low-resource, or highly inflected languages benefit from explicit subword modeling and tailored analogy datasets with appropriate syntactic/semantic coverage (Ulčar et al., 2019, Svoboda et al., 2016, Mahmoudi et al., 2021, Kurniawan, 2019).
Modeling order: Incorporating word-order or character-level cues (e.g., via WOVe or PENN+DIEM) yields substantial accuracy gains on analogy tasks, especially in syntactic or morphological categories (Ibrahim et al., 2021, Trask et al., 2015).
Evaluation: Always report per-category breakdowns, and where possible, complementary metrics such as Hit@$k$, MRR, MAP, and full error analysis (Ulčar et al., 2019, Newman-Griffis et al., 2017, Montalvão, 2022).
Cross-lingual analogies: Linear mappings suffice for preserving analogical structure across languages when robust bilingual dictionaries and subword-aware embeddings are used (Brychcín et al., 2018).
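The per-category reporting recommended above needs only a small aggregation step. The sketch below assumes each evaluated query is recorded as a (category, predicted, gold) triple; the record format, function name, and example categories are hypothetical.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Top-1 accuracy per category from (category, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}

# Hypothetical evaluation records.
records = [
    ("capital-country", "paris", "paris"),
    ("capital-country", "rome", "madrid"),
    ("verb-past", "walked", "walked"),
]

breakdown = accuracy_by_category(records)
overall = sum(1 for _, p, g in records if p == g) / len(records)
```

Reporting the breakdown alongside the overall figure shows whether a high aggregate score is driven by a few easy categories.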

6. Future Directions

Emerging research emphasizes:

Information-theoretic evaluation: Adoption of entropy-based and soft accuracy measures to reveal the true informational content of analogy completion (Montalvão, 2022).
Richer evaluation paradigms: Designing analogy datasets that control for proximity-based shortcuts, include challenging distractors, and support multi-label and diachronic evaluation (Newman-Griffis et al., 2017, Kutuzov et al., 2019).
Extension to complex relational settings: Application to scientific, medical, and historical knowledge prediction, including diachronic and cross-domain relation discovery (Yamagiwa et al., 2024, Kutuzov et al., 2019).
Manifold-aware and higher-order embeddings: Investigating curved geometries (hyperbolic, spherical), dynamic and context-sensitive offsets, and compositional models to better capture asymmetric and context-dependent relations (Leimeister et al., 2018, Saxena et al., 2022, Trask et al., 2015).

The word analogy task remains a critical instrument for probing relational structure in word embeddings, but state-of-the-art research increasingly calls for more robust, interpretable, and task-aligned evaluation frameworks, especially as embeddings are deployed in diverse, multilingual, and domain-specific applications.