PolyGloss: Multilingual Gloss Corpus & Neural Models

Updated 23 January 2026
  • PolyGloss is a multilingual sense-annotated gloss corpus and neural model suite, covering 263 languages and enabling high-precision semantic annotation.
  • It integrates cross-lingual data aggregation, NASARI refinement, and neural segmentation techniques to achieve state-of-the-art performance in gloss prediction and language documentation.
  • The resource significantly advances computational language documentation and semantic NLP, facilitating low-resource language processing with rapid adaptation and multi-source knowledge integration.

PolyGloss encompasses both a large-scale multilingual sense-annotated gloss corpus and a suite of advanced neural models for interlinear glossing, segmentation, and language documentation. Across its multiple instantiations, PolyGloss addresses standardized sense annotation for definitional text in hundreds of languages, automated prediction of morpheme-level glosses and boundaries in low-resource settings, and the integration of diverse external knowledge sources to enhance gloss prediction accuracy. Its datasets and models are foundational for semantic NLP, lexicography, and machine-assisted field linguistics, offering both broad coverage and state-of-the-art performance across numerous evaluation regimes (Collados et al., 2016, Ginn et al., 16 Jan 2026, Yang et al., 2024).

1. Multilingual Disambiguated Gloss Corpus

PolyGloss, as introduced in "A Large-Scale Multilingual Disambiguation of Glosses" (Collados et al., 2016), is a globally comprehensive corpus of sense-annotated textual definitions (glosses) spanning 263 languages. Definitions originate from five major lexical and encyclopedic sources: WordNet, Wiktionary, Wikidata, OmegaWiki, and Wikipedia. Each definition is annotated with BabelNet synset identifiers, covering both concepts and named entities, with a dual focus on (i) maximizing coverage (38.8M unique glosses, 8.7M BabelNet synsets) and (ii) ensuring high annotation precision.

The pipeline for constructing PolyGloss involves:

  • Cross-lingual Data Aggregation: For each BabelNet synset, all glosses across resources and languages are concatenated to create a multilingual context block.
  • Text Preprocessing: Tokenization is applied via polyglot tokenizers (>165 languages), and PoS tagging uses the Stanford tagger for 30 languages.
  • Disambiguation: Babelfy 1.0 is used for high-coverage sense tagging (bfScore), falling back to Most Common Sense if bfScore < 0.7.
  • Refinement: NASARI vector semantic similarity further refines annotations below confidence thresholds, yielding a high-precision subset (coverage ≈65%, precision >90%).
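The confidence cascade in the pipeline above (Babelfy score, Most Common Sense fallback, NASARI refinement) can be sketched as follows. Only the bfScore < 0.7 fallback threshold is stated above; the NASARI cutoff and function names here are illustrative assumptions, not the published implementation.

```python
# Sketch of the disambiguation cascade described above.
BF_THRESHOLD = 0.7       # stated: fall back to Most Common Sense below this
NASARI_THRESHOLD = 0.5   # assumed: refinement cutoff for the high-precision subset

def disambiguate(bf_score, nasari_score=None):
    """Return (annotation_source, keep_in_high_precision) for one gloss tag."""
    source = "babelfy" if bf_score >= BF_THRESHOLD else "most_common_sense"
    # NASARI semantic similarity refines annotations: only tags whose NASARI
    # score clears the cutoff survive into the high-precision corpus variant.
    keep = nasari_score is not None and nasari_score >= NASARI_THRESHOLD
    return source, keep
```

The two return values mirror the corpus's two distributions: every tag lands in the Complete variant, while only the refined subset reaches the High-Precision one.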

Intrinsic evaluation demonstrates that context-rich multilingual disambiguation outperforms isolated gloss tagging (e.g., English: 84.3% vs. 81.2% pre-refinement; 95.1% post-refinement at 64.8% coverage).

2. Data Format, Access, and Integration

PolyGloss is distributed as language-partitioned, resource-specific UTF-8 XML files. Each <definition> contains the text, metadata, and one or more <annotation> elements specifying BabelNet synsets, annotation source, and disambiguation scores (bfScore, coherenceScore, nasariScore). Two corpus variants are available:

  • Complete: Maximizes annotation coverage.
  • High-Precision: Retains only the most semantically coherent tags via NASARI refinement.
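Given the element names described above, a minimal extraction pass with the standard library might look like this. The element and attribute names (`definition`, `annotation`, `bfScore`, …) follow the description in this section, but the exact schema of the released files may differ.

```python
import xml.etree.ElementTree as ET

# A toy <definition> fragment following the structure described above;
# the real corpus files are language-partitioned and resource-specific.
SAMPLE = """
<definition id="d1" language="EN" source="WordNet">
  <text>a domesticated carnivorous mammal</text>
  <annotation synset="bn:00015267n" source="BABELFY"
              bfScore="0.91" nasariScore="0.78">dog</annotation>
</definition>
"""

def extract_annotations(xml_text):
    """Return the gloss text and a list of (synset, bfScore) pairs."""
    root = ET.fromstring(xml_text)
    gloss = root.findtext("text")
    anns = [(a.get("synset"), float(a.get("bfScore")))
            for a in root.findall("annotation")]
    return gloss, anns
```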

Usage is facilitated by standard XML parsing tools. There is currently no API; users download and process files directly from http://lcl.uniroma1.it/disambiguated-glosses.

The structure supports seamless integration into NLP pipelines. In Open Information Extraction (DefIE), for example, PolyGloss annotations supplant Babelfy’s on-the-fly tagging, reducing error propagation. For sense clustering, NASARI vectors are recomputed over the enriched taxonomy to yield more granular and accurate sense groups.

3. Neural Models for Joint Segmentation and Glossing

PolyGloss further designates a family of multilingual neural models for joint prediction of morpheme segmentation and interlinear glossing from raw text, as introduced in "Massively Multilingual Joint Segmentation and Glossing" (Ginn et al., 16 Jan 2026). Model architecture is centered on ByT5-base, a byte-level encoder–decoder Transformer (12 layers each, 220M parameters), selected for robustness to cross-lingual orthography and exact character-level segmentation.

Key features of the modeling pipeline include:

  • Pretraining: An extended corpus of 353,266 interlinear glossed text (IGT) examples across 2,077 languages, drawn from GlossLM, the Fieldwork dataset, and IMTVault.
  • Joint Objective: Combined segmentation and glossing objectives, formulated as cross-entropy over both segmentation (L_seg) and glossing (L_gloss) with equal weights; no explicit alignment loss during pretraining.
  • Format Investigation: Assessment of multitask, concatenated, and interleaved output formats, with the interleaved format providing perfect alignment (Align = 1.0).
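The interleaved format's perfect alignment follows by construction: each morpheme is emitted immediately followed by its gloss, so parsing recovers an exact 1:1 pairing. A sketch, with delimiter choices that are assumptions rather than the paper's exact serialization:

```python
# Interleave morpheme segments with their glosses into one target string,
# then parse it back; the pairing is 1:1 by construction (Align = 1.0).
MORPH_SEP, PAIR_SEP = "#", " "   # assumed delimiters, not the paper's exact ones

def to_interleaved(morphemes, glosses):
    assert len(morphemes) == len(glosses)
    return PAIR_SEP.join(f"{m}{MORPH_SEP}{g}" for m, g in zip(morphemes, glosses))

def from_interleaved(target):
    pairs = [p.split(MORPH_SEP, 1) for p in target.split(PAIR_SEP)]
    return [m for m, _ in pairs], [g for _, g in pairs]
```

By contrast, concatenated formats (all segments, then all glosses) can emit mismatched sequence lengths, which is why their alignment scores fall below 1.0.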

Empirical results show PolyGloss sets new state-of-the-art benchmarks:

Model | MER ↓ | Seg F₁ ↑ | Align ↑
PolyGloss (interleaved) | 0.234 | 0.862 | 1.000
GlossLM | 0.639 | — | 0.984
Open-source LLM (Qwen 0.6B) | 0.839 | 0.167 | 0.661

MER (Morpheme Error Rate) is halved versus GlossLM, while segmentation F₁ is competitive with specialized monolingual models. LoRA-style low-rank adaptation enables rapid transfer to new languages with minimal data.
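MER here is a morpheme-level error rate. Under the standard definition (Levenshtein edit distance between the predicted and gold morpheme-gloss sequences, normalized by gold length — an assumption about the exact metric used), it can be computed as:

```python
def morpheme_error_rate(pred, gold):
    """Levenshtein distance over morpheme-gloss sequences / gold length."""
    # Standard dynamic-programming edit distance over token sequences.
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(n, 1)
```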

4. Multi-Source Knowledge Integration for Low-Resource Glossing

In low-resource settings, PolyGloss refers to a multi-source glossing architecture combining neural encoders, translation pairs, dictionaries, and LLM-driven post-processing (Yang et al., 2024). The system is formalized as a model f_θ mapping raw character sequences and auxiliary signals (sentence/token translations, dictionary entries, in-context LLM features) to morpheme-level gloss outputs.

Principal components:

  • Character-level BiLSTM encoder processes input orthography.
  • Unsupervised morpheme segmentation via a forward–backward algorithm.
  • Gloss decoder: LSTM operating at the character level (for lexical morphemes) and morpheme-label level for grammatical affixes.
  • Cross-attention over translation representations (T5-large, BERT, or char-BiLSTM), aiding in stem disambiguation.
  • LLM-based post-correction: Prompts (GPT-4, LLaMA-3) with similarity-selected examples and dictionaries further improve infrequent stem accuracy.
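The LLM post-correction step selects in-context examples by similarity to the input sentence. A stdlib sketch using `difflib` as the similarity function — the actual retrieval metric, prompt wording, and dictionary format are assumptions:

```python
import difflib

def select_examples(query, pool, k=2):
    """Pick the k glossed training examples most similar to the query sentence."""
    return sorted(pool,
                  key=lambda ex: difflib.SequenceMatcher(
                      None, query, ex["source"]).ratio(),
                  reverse=True)[:k]

def build_prompt(query, pool, dictionary):
    """Assemble a correction prompt from retrieved examples and a dictionary."""
    lines = ["Correct the gloss for the sentence below."]
    for ex in select_examples(query, pool):
        lines.append(f"Sentence: {ex['source']}\nGloss: {ex['gloss']}")
    lines.append("Dictionary entries: " +
                 "; ".join(f"{w} = {g}" for w, g in dictionary.items()))
    lines.append(f"Sentence: {query}\nGloss:")
    return "\n".join(lines)
```

Retrieval-plus-dictionary prompting of this kind is what lets the LLM pass recover infrequent stems the neural decoder misses.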

In SIGMORPHON 2023 tasks, T5+Attn+Chr (PolyGloss) achieves average word-level accuracy gains of +4.0 pp in full-data and +8.6 pp in ultra-low-resource settings over strong shared-task baselines. On the extremely low-resource Gitksan, LLM post-correction and dictionary integration further elevate performance to 31.3% word-level accuracy.

5. Evaluation, Benchmarks, and Use Cases

Across PolyGloss resources and modeling paradigms, evaluation is multifaceted:

  • Sense annotation corpus (intrinsic): Precision reaches >90% in high-precision mode, though at a coverage cost (≈65% of annotations retained in this variant).
  • Extrinsic tasks: Incorporation into DefIE increases both extraction yield and precision, while sense clustering experiments show NASARI+PolyGloss outperforms vanilla NASARI by 5.6 pp accuracy and 5.9 pp F1 on SemEval-925.
  • Neural glossing models: On nine evaluation languages, PolyGloss attains state-of-the-art or near state-of-the-art on morpheme error rate (MER), segmentation F₁, and alignment. Multilingual transfer is notably beneficial for typologically low-resource languages.

Evaluation Type | Result/Statistic
High-precision sense annotation | Precision >90%, coverage ≈65%
DefIE triples (relations/precision) | 184 (+13 vs. baseline), 87.2%
NASARI clustering (SemEval accuracy) | 89.1% (vs. 85.7% default NASARI)
PolyGloss model (MER / Seg F₁ / Align) | 0.234 / 0.862 / 1.000

6. Limitations and Areas for Development

Identified limitations of PolyGloss in both its corpus and neural modeling instantiations include:

  • Coverage trade-off: High-precision annotation mode discards ≈35% of possible tags, especially for non-nouns (verbs, adjectives, adverbs).
  • Resource dependence: Corpus coverage is bounded by BabelNet’s lexicon and synset expansion; very low-resource languages may be poorly represented.
  • Neural models: Performance gains from additional knowledge sources (e.g., dictionaries, translations) plateau as more IGT data becomes available, and Spanish translations prove less useful than English ones for LLM prompting, possibly reflecting bias in LLM pretraining.
  • Engineering constraints: PolyGloss XML corpora lack an HTTP API for fine-grained access. In neural models, character-level LSTM decoders are used exclusively, leaving Transformer-based decoder variants underexplored.

Future work is proposed in directions such as incremental corpus updates as BabelNet expands, integration of longer contexts (full-article paragraphs), neural contextual encoders (BERT, Transformer), and API-based access. Low-rank adaptation techniques, already validated for rapid language transfer, remain a promising means for model extensibility.

7. Context and Significance in Semantic NLP and Language Documentation

PolyGloss provides standardized, multilingual, and sense-annotated gloss data essential for semantic linking, knowledge extraction, and lexicography at scale (Collados et al., 2016). The neural PolyGloss models advance the state-of-the-art in computational language documentation, both in terms of metric performance and adaptability to the realities of field linguistics—especially for languages with limited annotated resources (Ginn et al., 16 Jan 2026, Yang et al., 2024). Through modular integration of external linguistic expertise, large-scale register-specific corpora, and parameter-efficient adaptation protocols, PolyGloss establishes itself as a pivotal toolset for both large-scale semantic NLP applications and the ongoing documentation of the world’s linguistic diversity.
