Word Analogy Task Performance
- Word analogy task performance is a measure that evaluates how well vector embeddings capture linguistic, morphological, and semantic relations via vector arithmetic and neural operations.
- It covers diverse methodologies including simple offset arithmetic, hyperbolic geometry, and character-level models, benchmarked on datasets like Google and MSR.
- The analysis highlights practical gains from word order and domain-specific models while outlining open challenges in multilingual and contextual analogy evaluations.
Word analogy task performance quantifies the ability of vector-based representations to encode and recover linguistic, morphological, or semantic regularities via geometric operations. It has evolved into a primary intrinsic evaluation for word embeddings and related models, with methodologies spanning simple vector arithmetic, neural retrieval frameworks, and corpus-based relation classifiers. The following sections survey foundational formulations, datasets, architectural innovations, evaluation protocols, multilingual and domain-specific variants, and empirical insights from the arXiv literature.
1. Mathematical Formulation and Protocols
The canonical word analogy task asks, given a quadruple $(a, b, c, d)$ with $d$ hidden, to predict $d$ such that $a$ relates to $b$ as $c$ relates to $d$. In vector space models, this is typically framed as an offset operation (3CosAdd): compute $v = v_b - v_a + v_c$, then return $d = \arg\max_{d'} \cos(v_{d'}, v)$ over the vocabulary, excluding the query words (Wang et al., 2019). Variants include 3CosMul scoring (Levy & Goldberg, 2014), which normalizes relation strength by combining similarities multiplicatively: $d = \arg\max_{d'} \frac{\cos(v_{d'}, v_b)\,\cos(v_{d'}, v_c)}{\cos(v_{d'}, v_a) + \varepsilon}$, with $\varepsilon$ added for numerical stability.
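A minimal NumPy sketch of both scoring rules, assuming a hypothetical `emb` dictionary of unit-normalized word vectors:

```python
# Minimal sketch of 3CosAdd / 3CosMul with NumPy. `emb` is a hypothetical
# dict mapping words to unit-normalized vectors; query words are excluded
# from the candidate set, as is standard.
import numpy as np

def solve_analogy(emb, a, b, c, method="3cosadd", eps=1e-3):
    words = [w for w in emb if w not in (a, b, c)]
    M = np.stack([emb[w] for w in words])          # (V, d), unit-norm rows
    va, vb, vc = emb[a], emb[b], emb[c]
    if method == "3cosadd":
        # argmax_d  cos(d, b) - cos(d, a) + cos(d, c)
        scores = M @ (vb - va + vc)
    else:
        # 3CosMul: argmax_d  cos(d, b) * cos(d, c) / (cos(d, a) + eps),
        # with cosines shifted to [0, 1] so all terms are non-negative
        ca, cb, cc = (M @ va + 1) / 2, (M @ vb + 1) / 2, (M @ vc + 1) / 2
        scores = cb * cc / (ca + eps)
    return words[int(np.argmax(scores))]
```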
Neural models extend this by directly learning mappings $f : (v_a, v_b, v_c) \mapsto v_d$, minimizing normalized MSE or cross-entropy loss over candidate solutions (Marquer et al., 2023). Supervised approaches (e.g., PairClass (0809.0124, Turney, 2011)) encode relation patterns in sparse high-dimensional vectors and feed them to SVMs with RBF kernels, framing analogy as relation classification.
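A hedged sketch of the regression-style formulation, in the spirit of ANNr; the MLP architecture, layer sizes, and training loop below are illustrative assumptions, not the published design:

```python
# Learn f(a, b, c) -> d over embedding triples, then answer analogies by
# matching the prediction to the vocabulary via nearest-neighbor retrieval.
import torch
import torch.nn as nn

class AnalogyRegressor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, 2 * dim),
            nn.ReLU(),
            nn.Linear(2 * dim, dim),
        )

    def forward(self, va, vb, vc):
        # Predict the embedding of d from the concatenated triple (a, b, c)
        return self.net(torch.cat([va, vb, vc], dim=-1))

model = AnalogyRegressor(dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(va, vb, vc, vd):
    # One gradient step on a batch of embedded quadruples
    optimizer.zero_grad()
    loss = loss_fn(model(va, vb, vc), vd)
    loss.backward()
    optimizer.step()
    return loss.item()
```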
For non-Euclidean geometry, hyperbolic and Poincaré ball models reformulate analogy as parallel transport and exponential map operations:
- Compute the tangent vector $v = \log_a(b)$ at point $a$.
- Parallel-transport $v$ from $a$ to $c$: $v' = P_{a \to c}(v)$.
- Exponentiate at $c$ to get $d^* = \exp_c(v')$; return $d$ as the nearest neighbor to $d^*$ under the hyperbolic distance (Leimeister et al., 2018, Saxena et al., 2022).
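A sketch of these three steps in the Poincaré ball using Möbius addition and the standard gyrovector parallel-transport formula; edge cases (zero-norm tangent vectors, points near the boundary) are ignored for brevity:

```python
# Log-map -> parallel-transport -> exp-map in the Poincaré ball, built
# from Möbius addition and gyration (standard gyrovector-space formulas).
import numpy as np

def mobius_add(x, y):
    xy, xx, yy = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + yy) * x + (1 - xx) * y) / (1 + 2 * xy + xx * yy)

def lam(x):
    # Conformal factor lambda_x = 2 / (1 - ||x||^2)
    return 2.0 / (1.0 - x @ x)

def logmap(a, b):
    # Tangent vector at a pointing toward b
    u = mobius_add(-a, b)
    n = np.linalg.norm(u)
    return (2.0 / lam(a)) * np.arctanh(n) * u / n

def expmap(c, v):
    # Follow the geodesic from c with initial velocity v
    n = np.linalg.norm(v)
    return mobius_add(c, np.tanh(lam(c) * n / 2.0) * v / n)

def transport(a, c, v):
    # Parallel transport T_a -> T_c: (lam(a)/lam(c)) * gyr[c, -a] v;
    # the gyration is computed from Möbius additions and is linear in v.
    gyr = mobius_add(-mobius_add(c, -a), mobius_add(c, mobius_add(-a, v)))
    return (lam(a) / lam(c)) * gyr

def hyperbolic_analogy(a, b, c):
    # d* = exp_c(P_{a->c}(log_a(b))); the answer is the vocabulary item
    # nearest to d* under the Poincaré distance.
    return expmap(c, transport(a, c, logmap(a, b)))
```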
2. Datasets and Cultural/Linguistic Dimensions
Standard evaluation sets include:
- Google Analogy Dataset: 19,544 analogies—8,869 semantic (capitals, currencies, family), 10,675 syntactic (inflections, tense, plurals, comparatives) (Wang et al., 2019, Trask et al., 2015).
- MSR Syntactic: 8,000 pure syntactic analogies (Wang et al., 2019, Ulčar et al., 2019).
- BATS: 2,880 English analogies emphasizing diverse relation types (Wijesiriwardene et al., 2023).
- SAT, TOEFL, ESL: Smaller sets designed for relation classification and synonym/antonym detection (0809.0124, Turney, 2011).
Multilingual and culturally neutral datasets (e.g., Ulčar et al., 2019, Brychcín et al., 2018) provide analogies for nine languages across both semantic and morphological classes. Cross-lingual benchmarks use linear mappings (Orthogonal Procrustes, CCA) to transfer analogical relations across languages, attaining Acc@1 up to 43.1% in bilingual and 38.2% in six-lingual hubs (Brychcín et al., 2018, Saxena et al., 2022).
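For the linear-mapping route, a minimal Orthogonal Procrustes sketch; the seed matrices `X` and `Y` of row-aligned translation-pair vectors are assumed inputs:

```python
# Orthogonal Procrustes alignment of two embedding spaces from a seed
# dictionary; `X` (source) and `Y` (target) are hypothetical (n, d)
# matrices whose i-th rows embed the two sides of the i-th seed pair.
import numpy as np

def orthogonal_procrustes(X, Y):
    # W = argmin ||XW - Y||_F over orthogonal W, via SVD of X^T Y
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Transfer: map a source-language query into the target space with
# v_target = v_source @ W, then solve the analogy by offset arithmetic
# in the target space.
```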
In domain-specific contexts, biomedical variants (BMASS (Newman-Griffis et al., 2017), drug–gene (Yamagiwa et al., 3 Jun 2024)) adapt the core protocol for domain ontologies and entity relations. Indonesian, Japanese, and morphological datasets extend coverage to lower-resource languages and morphological systems (Kurniawan, 2019, Marquer et al., 2023).
3. Model Architectures and Embedding Innovations
Widely used models include:
- SGNS (Skip-Gram with Negative Sampling), CBOW, FastText (subword n-grams), and GloVe (global co-occurrence).
- Partitioned and morphology-aware models: PENN/DIEM partition embeddings by context position and word-character structure, respectively, yielding syntactic analogy accuracy of 88.29% and a 58% error reduction over GloVe (Trask et al., 2015).
- WOVe (Word Order Vector Embedding) concatenates per-position GloVe vectors, achieving a mean 36.34% improvement in analogy accuracy over vanilla GloVe (Ibrahim et al., 2021); the construction is illustrated in the sketch after this list.
- Deep learning frameworks for morphological analogies combine CNN or autoencoder-based embeddings with analogy-aware retrieval or generative models (ANNc, ANNr), outperforming symbolic baselines in highly inflectional languages (Marquer et al., 2023).
- Cross-lingual models project semantic spaces via linear transformations or operate in hyperbolic geometry (Poincaré ball), capturing hierarchical relations and facilitating low-dimensional transfer (Brychcín et al., 2018, Saxena et al., 2022).
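A toy sketch of the per-position concatenation behind WOVe-style representations; the `pos_emb` lookup keyed by (position, word) and the symmetric window of positions are assumptions for illustration, not the published implementation:

```python
# An order-aware word vector built by concatenating vectors trained
# separately for each context position.
import numpy as np

POSITIONS = (-2, -1, 1, 2)  # assumed symmetric context window

def order_aware_vector(pos_emb, word):
    # Each sub-vector captures the word's behavior at one offset from the
    # target, so order-sensitive (syntactic) regularities stay separable.
    return np.concatenate([pos_emb[(p, word)] for p in POSITIONS])
```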
Sub-word and character-based approaches, including BiLSTM encoders over character sequences, capture syntactic analogies effectively, but, as shown in (Stratos, 2017), they cannot match distributional word-level models on semantic analogies.
4. Evaluation Metrics and Methodological Insights
Performance metrics include:
- Accuracy@1 (fraction of exact top-ranked matches) and its k-nearest generalization Accuracy@k; MRR (Mean Reciprocal Rank); MAP (Mean Average Precision) for multi-answer tasks (Newman-Griffis et al., 2017); see the sketch after this list.
- In LLM evaluations (ANALOGICAL benchmark (Wijesiriwardene et al., 2023)), mean normalized Mahalanobis, Euclidean, and cosine distances are reported over analogical pairs; lower values correspond to better clustering of related words.
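A minimal sketch of these ranking metrics; `ranking` is a score-ordered candidate list and `gold` the (possibly multi-element) set of correct answers, both hypothetical names:

```python
# `ranking`: candidates sorted by decreasing score; `gold`: set of correct
# answers (singleton for Google/MSR, larger for multi-answer benchmarks).
def accuracy_at_k(ranking, gold, k=1):
    return float(any(w in gold for w in ranking[:k]))

def reciprocal_rank(ranking, gold):
    for i, w in enumerate(ranking, start=1):
        if w in gold:
            return 1.0 / i
    return 0.0

def average_precision(ranking, gold):
    hits, ap = 0, 0.0
    for i, w in enumerate(ranking, start=1):
        if w in gold:
            hits += 1
            ap += hits / i
    return ap / max(len(gold), 1)

# MRR and MAP are the means of reciprocal_rank and average_precision
# over all evaluation queries.
```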
Critical methodological insights from biomedical analogy completion (Newman-Griffis et al., 2017):
- Standard accuracy assumes a single correct answer and a unique relation, often violated in multi-faceted or ontology-rich domains.
- MAP/MRR provide richer views for multiple correct answers.
- Averaging offsets over multiple example pairs can strengthen the relation signal (see the sketch after this list).
- Topological or geometric biases in embedding spaces (e.g., hyperbolic hierarchy) manifest as performance deviations by dimension (Leimeister et al., 2018, Saxena et al., 2022).
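The offset-averaging heuristic is short in practice; a hedged sketch, reusing the hypothetical `emb` dictionary from earlier:

```python
# Estimate a relation vector from several example pairs that share the
# relation, then reuse it for new queries.
import numpy as np

def relation_vector(emb, pairs):
    # pairs: [(a1, b1), (a2, b2), ...] instantiating the same relation
    return np.mean([emb[b] - emb[a] for a, b in pairs], axis=0)

# Retrieval: the answer for query word c is the nearest neighbor of
# emb[c] + relation_vector(emb, example_pairs), excluding c itself.
```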
5. Empirical Results and Comparative Analysis
Classical skip-gram models achieve 73–78% accuracy on semantic analogies (Google) and 67% on syntactic (MSR); FastText exhibits strong syntactic performance but lags in semantic classes (Wang et al., 2019). Multilingual FastText embeddings achieve up to 95% in “capitals & countries” (English) but only 28% (Slovene), with accuracy@5 mitigating cross-lingual limitations (Ulčar et al., 2019).
Morphological analogy solvers leveraging character-level neural embeddings reach top-1 accuracies up to 98% in high-resource languages, dramatically outperforming symbolic algorithms (Alea, Kolmo), which fail on irregular forms (Marquer et al., 2023). Biomedically trained embeddings (BioConceptVec) enable drug–gene prediction via analogy arithmetic, achieving top-1 accuracy of ∼30% and top-10 of ∼70%, on par with general linguistic benchmarks (Yamagiwa et al., 3 Jun 2024).
Word order–aware models (WOVe, PENN) yield 25–57% relative improvements in analogy accuracy, especially for syntactic patterns, demonstrating the advantage of encoding context positions (Ibrahim et al., 2021, Trask et al., 2015).
LLMs (BERT, XLNet, RoBERTa, etc.) cluster synonym/hypernym pairs but perform poorly on proportional analogies (mean Mahalanobis distance ≈ 1.0 on Google and SAT), surfacing a gap in capturing implicit offset-based relations (Wijesiriwardene et al., 2023).
6. Analysis by Linguistic and Domain Factors
Performance by linguistic category and language type shows:
- Morphologically rich languages (Finnish, Slovene, Croatian) suffer lower analogy accuracy in FastText, mainly due to homonymy and inflectional complexity; k-nearest approaches partly alleviate this (Ulčar et al., 2019).
- Syntactic analogies are strongly supported by character/sub-word models, as BiLSTM+ReLU captures most productive morphological processes “for free” (Stratos, 2017).
- Semantic analogies require distributional modeling of word co-occurrence; sub-word models cannot recover this structure from character composition alone (Stratos, 2017).
- Cross-lingual analogies using linear and hyperbolic mappings perform best in low dimensions or with highly regular relations; semantic category transfer (state-currency) remains challenging across language families (Brychcín et al., 2018, Saxena et al., 2022).
Biomedical and morphological benchmarks reveal that multi-answer evaluation and offset averaging clarify model limitations in domain-specific relations (Newman-Griffis et al., 2017, Marquer et al., 2023).
7. Recommendations, Open Issues, and Future Directions
Advised methodological practices include:
- Report MAP/MRR and multi-answer metrics, particularly in ontology-rich or multi-relation datasets (Newman-Griffis et al., 2017).
- For morphological analogies, prefer character-level CNNs or autoencoders with neural retrieval/generation; augment training quadruples via analogical axioms (Marquer et al., 2023).
- For multilingual analogy transfer, apply CCA for bilingual spaces and Orthogonal Procrustes for large multi-language hubs, but control dictionary size to avoid degradation (Brychcín et al., 2018).
- Integrate explicit context/word-order encoding to strengthen syntactic analogy retrieval (Ibrahim et al., 2021, Trask et al., 2015).
- For domain adaptation (biomedical, legal, technical), derive path-wise or year-wise relation vectors to enhance zero-shot prediction (Yamagiwa et al., 3 Jun 2024).
- Examine embedding substructure: partitioned models reveal that distinct context positions contribute heterogeneously to analogy categories (Trask et al., 2015).
Open research directions concern:
- Extending analogy protocols to contextual or sentence-level analogies for LLMs (Wijesiriwardene et al., 2023, Ulčar et al., 2019).
- Formulating true hyperbolic analogy solvers using parallel transport and Möbius operations, closing the gap between geometry and evaluation (Leimeister et al., 2018, Saxena et al., 2022).
- Balancing semantic and syntactic challenge across cultures and languages, expanding culturally neutral and multilingual datasets (Ulčar et al., 2019).
- Exploring richer hybridization of corpus-based patterns and lexicon knowledge to bootstrap low-resource languages and rare relations (Turney, 2011, 0809.0124).
In summary, word analogy task performance is a powerful lens for probing the semantic, syntactic, and relational capacities of vector-based language representations. The ongoing evolution of datasets, modeling techniques, and evaluation protocols continues to shape both foundational linguistic understanding and domain-specific applications.