Cross-Lingual Lexical Neighborhoods

Updated 24 January 2026

Cross-Lingual Lexical Neighborhoods are defined as sets of words in one language that are semantically aligned with a word from another language within a shared embedding space.
They are computed using methodologies such as cosine similarity, bilingual dictionary alignment, and transformer fine-tuning to support applications like translation and topic modeling.
Practical applications include bilingual lexicon induction, semantic similarity evaluation, and cross-lingual topic modeling, driving advances in multilingual NLP.

A cross-lingual lexical neighborhood is the set of word types or concepts in one language that are semantically or distributionally nearest to a given word in another language within a shared representational space. This construct is central to multilingual NLP, lexical semantics, cross-lingual alignment, and tasks including translation, topic modeling, lexicon induction, and semantic similarity evaluation. Cross-lingual lexical neighborhoods can be defined mathematically, inferred from embedding spaces, or constructed empirically using translation dictionaries and multilingual resources, with increasing sophistication from simple static models to context-sensitive and domain-specific metrics.

1. Formal Definitions and Principal Constructions

At the core, a cross-lingual lexical neighborhood for a word $w$ in language $L_1$ comprises its top-$k$ most semantically similar words $N_C(w)$ in another language $L_2$, or equivalently, the set of $L_2$ words mapping closest to $w$ under a predefined similarity metric in a joint or aligned embedding space. For word embeddings $e_w \in \mathbb{R}^d$ and vocabulary $V^{(l)}$:

Cosine-based neighborhood:

$N_C(w) = \operatorname{arg\,max}_{S \subset V^{(L_2)}, |S|=k} \sum_{v \in S} \text{cos}(e_w, e_v)$

as formalized in GloCTM (Phat et al., 17 Jan 2026) and Multi-SimLex (Vulić et al., 2020).

Augmented neighborhoods in topic models: The full cross-lingual lexical neighborhood includes intra-lingual neighbors $N_I(w)$ from $V^{(L_1)}$ and cross-lingual neighbors $N_C(w)$ from $V^{(L_2)}$, i.e., $N(w) = N_I(w) \cup N_C(w)$ (Phat et al., 17 Jan 2026).
Polysemy-based semantic proximity: In cognitive typology and lexicography, the neighborhood can be induced by counting the number of polysemous links (shared word-forms) that connect a pair of concepts $c_i$, $c_j$ across a stratified language sample, with neighborhood relations captured in a weighted undirected graph $G$ (Youn et al., 2015).
Contextualized variants: With contextualized LLMs, neighborhoods can be defined using pooled representations or point cloud distances over contextualized embeddings (see SNC-STATIC, SNC-AVE, SNC-CLOUD metrics) (Karidi et al., 2024).
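The cosine-based and augmented definitions above can be sketched in a few lines. Because the objective sums independent per-word cosine scores, the arg max over subsets of size $k$ is simply the $k$ highest-scoring words. This is a minimal illustration using only the standard library; the function names and the dictionary-of-lists embedding format are assumptions made for the example, not part of any cited system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_neighbors(query_vec, vocab_vecs, k):
    """Return the k words of vocab_vecs closest to query_vec by cosine."""
    ranked = sorted(vocab_vecs,
                    key=lambda w: cosine(query_vec, vocab_vecs[w]),
                    reverse=True)
    return ranked[:k]

def augmented_neighborhood(word, src_vecs, tgt_vecs, k):
    """Return (intra-lingual, cross-lingual) neighbor lists for a source word.

    Both vocabularies are assumed to live in one shared embedding space.
    """
    query = src_vecs[word]
    # Ask for k + 1 intra-lingual candidates so the word itself can be dropped.
    intra = [w for w in top_k_neighbors(query, src_vecs, k + 1) if w != word][:k]
    cross = top_k_neighbors(query, tgt_vecs, k)
    return intra, cross

# Toy 2-d vectors, invented for illustration only.
en = {"love": [1.0, 0.1], "passion": [0.9, 0.3], "church": [0.0, 1.0]}
fr = {"amour": [1.0, 0.12], "tendresse": [0.8, 0.4], "église": [0.05, 1.0]}
intra, cross = augmented_neighborhood("love", en, fr, k=2)
```

A full sort is used here for clarity; over a realistic vocabulary, a partial selection or an approximate nearest-neighbor index would be the usual choice.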

2. Methodologies for Inducing and Leveraging Cross-Lingual Lexical Neighborhoods

A variety of strategies have been developed for constructing and exploiting cross-lingual lexical neighborhoods, contingent on supervision, resource availability, and target task:

Polyglot Embeddings: Training a single skip-gram model on a mixed-language corpus yields a shared space for all participating languages. Neighborhoods are computed via constrained nearest-neighbor search across language boundaries, often with back-translation filtering to avoid hubness (KhudaBukhsh et al., 2020).
Alignment via Bilingual Dictionaries: Mapping monolingual embedding spaces using Procrustes/orthogonal transformations with seed dictionaries, followed by nearest-neighbor queries under cosine or CSLS, is widespread. More robust approaches build the cross-lingual space directly with context anchoring, which uses translated contexts as anchor points and iterative self-learning (dictionary-induced context replacement in skip-gram) (Ormazabal et al., 2020).
Specialization of Multilingual Transformers: Contrastive fine-tuning using large (possibly BabelNet-derived) synonym pairs, applied to MMTs (e.g., mBERT, XLM-R), refines the "latent" structure to produce high-quality, type-level cross-lingual neighborhoods. This can be accomplished via full fine-tuning or parameter-efficient adapters (Green et al., 2022).
Contextualized Probing and Neighborhood Comparison: For word-level or domain-level semantic alignment, local neighborhoods are compared between translation pairs, using cosine similarity over static or contextualized representations, and alignment scores are computed as neighborhood overlap ratios or correlation of distance profiles (SNC metrics) (Karidi et al., 2024). These approaches allow granular evaluation of semantic alignment, capturing local or domain-specific divergences.
Fine-tuning neural LMs for low-resource languages (LRLs): Parameter-efficient fine-tuning (e.g., LoRA adapters) targeted at layers with naturally high cross-lingual similarity can propagate alignment to final output layers in LLMs (Targeted Lexical Injection) (Ngugi, 18 Jun 2025).
Graph-theoretic Analysis of Polysemy: Neighborhoods derived from polysemy networks across many languages expose universal and language-specific conceptual structures, validated via strong clustering and stability across cultural and environmental strata (Youn et al., 2015).
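Two of the retrieval ideas named above, CSLS scoring and filtering of candidate pairs, can be sketched as follows. CSLS discounts the cosine score of a pair by the mean similarity of each word to its $k$ nearest neighbors in the other language, which penalizes hub words. The filter shown here keeps only mutual best matches; treating back-translation filtering as a mutual-nearest-neighbor check is an assumption of this sketch, and the cited systems may implement it differently. All function names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_topk_sim(vec, others, k):
    """Mean cosine between vec and its k most similar vectors in others."""
    sims = sorted((cosine(vec, o) for o in others), reverse=True)[:k]
    return sum(sims) / len(sims)

def csls(x, y, src_vecs, tgt_vecs, k=2):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y)."""
    r_t = mean_topk_sim(x, tgt_vecs.values(), k)  # density around x on the target side
    r_s = mean_topk_sim(y, src_vecs.values(), k)  # density around y on the source side
    return 2 * cosine(x, y) - r_t - r_s

def mutual_pairs(src_vecs, tgt_vecs, k=2):
    """Keep (source, target) pairs that are each other's best CSLS match."""
    score = {(s, t): csls(src_vecs[s], tgt_vecs[t], src_vecs, tgt_vecs, k)
             for s in src_vecs for t in tgt_vecs}
    best_t = {s: max(tgt_vecs, key=lambda t: score[(s, t)]) for s in src_vecs}
    best_s = {t: max(src_vecs, key=lambda s: score[(s, t)]) for t in tgt_vecs}
    return sorted((s, t) for s, t in best_t.items() if best_s[t] == s)
```

The embeddings passed in are assumed to be already mapped into a common space, for example by an orthogonal transformation learned from a seed dictionary.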

3. Empirical and Algorithmic Workflows

Key algorithmic components for building cross-lingual lexical neighborhoods are summarized in the table below.

Step	Methodologies	Key Details / Models
Vocabulary embedding	Polyglot skip-gram, static CLWE, MMTs	FastText, mBERT, XLM-R, LSTM seq2seq shared encoders
Similarity metric	Cosine, CSLS, contextualized dot-products	Joint space: $\cos(e_w, e_v)$
Neighborhood retrieval	Top-$k$, threshold $\tau$, back-translation	Pseudocode as in (Phat et al., 17 Jan 2026, KhudaBukhsh et al., 2020)
Supervision	None / weak / full seed lexicon	Context anchoring, self-learning, contrastive BabelNet
Validation	BLI P@k, MRR, Spearman’s $\rho$ (XLSIM)	Multi-SimLex, domain-level scores, Tatoeba, kinship gaps
Augmentation of document BoW	Polyglot augmentation (GloCTM)	Enrich input BoW with cross-lingual neighbors
Losses for alignment	InfoNCE, KL-div, triplet margin, CKA loss	Local-global VAE, semantic grounding, LLM finetuning
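Of the alignment losses listed in the table, InfoNCE is the simplest to illustrate: it is a softmax cross-entropy in which a translation or synonym pair is the positive and other words act as negatives. The sketch below uses cosine similarity and a temperature of 0.1, both of which are assumptions chosen for the example rather than settings reported by the cited papers.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

The loss approaches zero when the anchor is far closer to its positive than to every negative, and grows when a negative is the nearer neighbor.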

The polyglot augmentation procedure of (Phat et al., 17 Jan 2026) performs document-level augmentation by adding both the intra-lingual and the cross-lingual neighbors of each document word to an extended bag-of-words.
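A minimal sketch of this kind of bag-of-words augmentation follows. It is not the pseudocode from Phat et al.: the function name `augment_bow`, the precomputed neighbor dictionaries, and the choice to give each neighbor the count of the word that triggered it (scaled by `weight`) are all illustrative assumptions.

```python
from collections import Counter

def augment_bow(doc_tokens, intra_neighbors, cross_neighbors, weight=1):
    """Extend a document's bag-of-words with neighbors of its words.

    intra_neighbors / cross_neighbors map a word to a list of neighbor words
    (same-language and other-language respectively).
    """
    bow = Counter(doc_tokens)
    augmented = Counter(bow)  # start from a copy of the original counts
    for word, count in bow.items():
        neighbors = intra_neighbors.get(word, []) + cross_neighbors.get(word, [])
        for neighbor in neighbors:
            # Hypothetical weighting: a neighbor inherits the source word's count.
            augmented[neighbor] += weight * count
    return augmented

# Toy example with invented neighbor lists.
doc = ["love", "love", "church"]
intra = {"love": ["passion"]}
cross = {"love": ["amour"], "church": ["église"]}
extended = augment_bow(doc, intra, cross)
```

Iterating over the original counts rather than the growing `augmented` counter ensures that neighbors are added only for words actually present in the document, not for neighbors of neighbors.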

4. Evaluation, Metrics, and Empirical Results

Evaluation of cross-lingual lexical neighborhoods is carried out at multiple granularities:

Type-level lexicon induction (BLI): The standard metrics are precision@k (P@k) and MRR over gold bilingual dictionaries. Specializing MMTs with cross-lingual synonym pairs increases BLI MRR from 14.5 to 20.9 and Multi-SimLex $\rho$ from 0.103 to 0.258 for mBERT (Green et al., 2022).
Semantic similarity (XLSIM): Spearman’s $\rho$ between gold and model-predicted word-pair similarities; improvements via context anchoring and contrastive specialization (e.g., Multi-SimLex $\rho$ up to 0.57 with cross-lingual sentence encoders (SEs) after contrastive tuning) (Vulić et al., 2022, Green et al., 2022).
Neighborhood overlap/structure: Direct analysis of the overlap between translated neighbor sets, as well as Pearson correlation of distance profiles in local neighborhoods (NO, SNC-STATIC, SNC-AVE, SNC-CLOUD) (Karidi et al., 2024).
Domain-level neighborhood coherence: Aggregated alignment within semantic fields demonstrates strong cross-lingual agreement in structured domains (e.g., kinship, quantity, time), and divergence in loosely organized domains (motion, technology) (Karidi et al., 2024).
Universality and structure: Polysemy networks reveal universal conceptual clustering and heavy-tailed neighborhood size distributions robust to geography and environment (Youn et al., 2015).

Concrete neighbor lists (e.g., “amour” $\to$ love, fondness, passion in m-BERT+abtt; “kyrka” $\to$ church, cathedral, chapel in VecMap+sl) illustrate the semantic tightness achievable under different alignment strategies (Vulić et al., 2020).

5. Practical Considerations, Limitations, and Domain Factors

Embedding choice and script effects: Shared-script pairs and subword-sharing boost alignment; static embeddings are sensitive to OOVs, whereas contextualized methods avoid such pathologies (Vulić et al., 2020).
Low-resource and noisy conditions: Polyglot embeddings and context-anchored methods deliver robust cross-lingual neighborhoods even with minimal resources (a few hundred parallel sentences or code-mixed social media) (KhudaBukhsh et al., 2020, Wada et al., 2020).
Parameter-efficient adaptation: Tuning only adapters at high-alignment layers (e.g., TLI on LLMs at layer 2) achieves statistically significant improvements in lexical similarity and generalization to unseen pairs (Ngugi, 18 Jun 2025).
Limitations: Bilingual (pairwise) tuning scales quadratically, requiring $O(n^2)$ models for $n$ languages (contrastive SE tuning). Extremely low-resource settings with $<$1k seed pairs remain a barrier in some approaches. Global alignment can mask fine-grained local divergence. Polysemy-based network methods capture only basic vocabulary.
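To make the cost of pairwise tuning concrete, assume one model per unordered language pair (an assumption of this illustration; direction-specific models would double the count). The number of models for $n$ languages is then

$\binom{n}{2} = \frac{n(n-1)}{2}$

so $n = 12$ already requires $66$ models and $n = 50$ requires $1225$.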

6. Applications and Broader Implications

Cross-lingual lexical neighborhoods underpin:

Cross-lingual topic modeling: Polyglot augmentation of BoW with cross-lingual neighbors (as in GloCTM) enables joint topic models to produce structurally synchronized, language-agnostic topics with improved coherence and semantic alignment (Phat et al., 17 Jan 2026).
Bilingual lexicon induction and translation: Direct neighbor retrieval in joint embedding spaces yields high-coverage lexicons suitable for MT, with competitive or superior P@k to traditional count- or alignment-based systems, especially for under-resourced scenarios (Wada et al., 2020, KhudaBukhsh et al., 2020).
Semantic typology and language comparison: The analysis of polysemy networks and local neighborhood metrics exposes domains of universal semantic clustering versus local cultural divergence, informing cognitive and typological theories (Youn et al., 2015, Karidi et al., 2024).
Model diagnostics and lexical evaluation: Domain-wise and word-wise alignment scores highlight areas of strong and weak cross-lingual agreement, supporting more nuanced error analyses and targeted adaptation (e.g., domain drift or cultural specificity) (Karidi et al., 2024).

7. Conclusions and Research Directions

Cross-lingual lexical neighborhoods provide a foundational abstraction bridging distributional semantics, typology, and multilingual NLP. Recent advances, including context-anchored embeddings, robust polyglot modeling in noisy or low-resource settings, and transformer specialization with large-scale lexical constraints, demonstrate that high-quality, interpretable, type-level transferable neighborhoods are achievable under resource-lean and heterogeneous conditions (Vulić et al., 2022, Green et al., 2022, Ormazabal et al., 2020, Ngugi, 18 Jun 2025). Open challenges include extension to auto-regressive LMs, nuanced handling of polysemy and sense clustering, compositional scaling to full phrase/sentence alignment, and grounding neighborhood evaluations in downstream, real-world cross-lingual transfer performance (Karidi et al., 2024).