Universal Dependencies Dataset

Updated 2 May 2026
  • The Universal Dependencies dataset is a unified syntactic annotation framework featuring universal POS tags, dependency relations, and multilingual treebanks.
  • It supports multilingual natural language processing by standardizing annotations across over 90 languages and diverse text and speech domains.
  • Annotation workflows combine manual efforts, conversion techniques, and rigorous quality control to ensure high accuracy and consistency.

Universal Dependencies (UD) Dataset

Universal Dependencies (UD) is a cross-linguistically consistent framework for syntactic annotation, designed to provide a high-quality and extensible set of human- and machine-annotated treebanks for a wide variety of languages and registers. UD aims to facilitate multilingual NLP, syntactic typology, cross-lingual transfer, and both theoretical and practical linguistic research through unified part-of-speech (POS) tagging, morphological feature annotation, and universal dependency relations.

1. Design Principles and Annotation Schema

The UD annotation schema is organized around three primary elements: universal POS tags (UPOS), universal morphological features, and a defined set of dependency relations. These elements are specified for both written and spoken language, and annotation guidelines require consistent treatment of sentence segmentation, tokenization, and relation assignment even in the presence of language-specific or genre-specific phenomena.

  • Universal POS tags: The standard UD set comprises 17 categories: NOUN, VERB, ADP, ADV, DET, PRON, ADJ, PROPN, AUX, CCONJ, SCONJ, NUM, PART, INTJ, SYM, X, and PUNCT. All annotated corpora map their native POS inventories to this set (Sriwirote et al., 2024, Rasooli et al., 2020, Braggaar et al., 2021).
  • Dependency relations: UD uses a defined set of 37 universal relations (e.g., nsubj, obj, iobj, obl, amod, advmod, case, aux, cop, conj, cc, det, compound, flat, fixed, mark, root, parataxis, appos, dislocated, discourse, reparandum, goeswith, orphan), with subtype extensions for language- or genre-specific phenomena (Sriwirote et al., 2024, Blaschke et al., 2024, Braggaar et al., 2021, Davidson et al., 2019).
  • Morphological features: Feature inventories are extended on a language-by-language basis, but only overtly marked features are annotated, and all features comply with the universal inventory (e.g., Case, Number, Gender, Tense, Person, Aspect, Politeness) (Raj et al., 2022).

Annotation proceeds in CoNLL-U format, encoding token-level fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), supporting integration with the complete UD ecosystem.
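
To make the ten-field layout concrete, the following is a minimal sketch of reading one CoNLL-U token line in Python. The `Token` class, the helper names, and the example line are illustrative assumptions rather than part of any UD tool; the sketch assumes tab-separated columns and the `Name=Value|Name=Value` encoding of FEATS, with `_` marking an empty field.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # One field per CoNLL-U column, in file order.
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: dict
    head: str
    deprel: str
    deps: str
    misc: str


def parse_feats(raw):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_token_line(line):
    """Split one tab-separated token line into a Token (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    cols[5] = parse_feats(cols[5])
    return Token(*cols)


# Invented example line, for illustration only.
example = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(example)
print(token.form, token.upos, token.feats, token.head, token.deprel)
```

ID and HEAD are kept as strings here because CoNLL-U also allows range IDs for multiword tokens and decimal IDs for empty nodes; a fuller reader would handle those cases explicitly.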

2. Treebank Coverage and Diversity

The UD project encompasses a vast range of languages, registers, and linguistic phenomena.

  • Language breadth: UD covers more than 90 languages, each documented by at least one treebank. Datasets such as the Persian UD treebank (29,107 sentences, 509,000 tokens) (Rasooli et al., 2020), Thai UD treebank (3,627 sentences, ≈90,000 tokens) (Sriwirote et al., 2024), and datasets for low-resource languages like Magahi (945 sentences) and Braj (500 sentences) (Raj et al., 2022) exemplify this range.
  • Register and domain: UD includes resources for formal texts (news, Wikipedia, academic writing (Sriwirote et al., 2024)), conversational and spoken language (CHILDES—child-caregiver interactions, ConvBank—human-machine dialogue, radio broadcast speech (Yang et al., 28 Apr 2025, Davidson et al., 2019, Braggaar et al., 2021)), dialectal and non-standard varieties (Bavarian (Blaschke et al., 2024), code-switched Frisian–Dutch (Braggaar et al., 2021)), and learner corpora (ESL TLE (Berzak et al., 2016)).
  • Synthetic and typological extension: The Galactic Dependencies resource synthesizes more than 50,000 artificial treebanks by reordering dependents from real UD sources to simulate typological diversity, enabling robust analysis of low-resource and typologically distant languages (Wang et al., 2017). Typology datasets built from UD derive continuous-valued word-order metrics across dozens of languages (Baylor et al., 2024).

| Dataset/Language | Sentences | Tokens | Domain | Special feature |
|---|---|---|---|---|
| Thai UD (Sriwirote et al., 2024) | 3,627 | ~90,000 | text | Covers Wikipedia, news, essays |
| Persian UD (Rasooli et al., 2020) | 29,107 | 509,000 | text | Conversion from PerDT, high UD compatibility |
| CHILDES-English (Yang et al., 28 Apr 2025) | 48,183 (gold) | 236,941 | speech | Children/caregivers; +1M silver |
| MaiBaam Bavarian (Blaschke et al., 2024) | 1,070 | 15,023 | text/speech | Multi-dialect, five genres |
| Magahi (Raj et al., 2022) | 945 | 13,343 | text | Low-resource, Indo-Aryan |

These figures illustrate the cross-linguistic and cross-domain breadth of the UD corpus, an essential property for multilingual parsing and typological generalization.
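
As an illustration of how a continuous-valued word-order metric can be read off dependency annotations, the sketch below computes the share of dependents with a given relation that precede their head. This is a simplified stand-in, not necessarily the metric defined by Baylor et al. (2024); the function name, the `(id, head, deprel)` tuple layout, and the two toy sentences are assumptions made for the example.

```python
def dependent_first_ratio(sentences, deprel):
    """Fraction of `deprel` dependents that occur before their head.

    `sentences` is a list of sentences; each sentence is a list of
    (token_id, head_id, deprel) tuples with 1-based IDs and head 0 for root.
    Returns None when the relation never occurs.
    """
    before = 0
    total = 0
    for sentence in sentences:
        for token_id, head_id, rel in sentence:
            if rel != deprel or head_id == 0:
                continue
            total += 1
            if token_id < head_id:
                before += 1
    return before / total if total else None


# Toy data: one adjective-noun order and one noun-adjective order.
toy = [
    [(1, 3, "det"), (2, 3, "amod"), (3, 4, "nsubj"), (4, 0, "root")],
    [(1, 0, "root"), (2, 1, "amod")],
]
print(dependent_first_ratio(toy, "amod"))   # 0.5
print(dependent_first_ratio(toy, "nsubj"))  # 1.0
```

A value near 1.0 indicates a consistently dependent-first order for that relation, a value near 0.0 a consistently head-first order, and intermediate values a mixed pattern.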

3. Annotation Workflows and Quality Control

UD treebanks are produced by manual annotation (from scratch or atop existing resources), by conversion from pre-existing dependency or constituency corpora, or by a hybrid of the two.

  • Manual annotation: Involves iterative, multi-rater workflows. Annotators label sentence batches, measure inter-annotator agreement (IAA) (metrics: POS κ, LAS, UAS), and adjudicate disagreements, refining the guidelines as needed (Braggaar et al., 2021, Sriwirote et al., 2024). For example, in the spoken Frisian–Dutch dataset, three rounds of annotation on 150 utterances increased POS accuracy from 69.5% to 89.7%, UAS from 72.3% to 80.1%, and LAS from 60.9% to 71.4%, with improvement statistically validated (McNemar test, p < 0.01) (Braggaar et al., 2021).
  • Conversion: Gold-standard resources (Stanford Dependencies, Penn Treebank, PerDT, Alpino, LassySmall) are mapped to UD through deterministic rules and/or machine learning, with corpus-specific corrections for function word mapping, non-projectivity, compound/flat, and relation assignment (see (Rasooli et al., 2020, Peng et al., 2019, Blaschke et al., 2024)).
  • Quality assessment: Inter-annotator agreement is explicitly reported using Cohen’s κ (UPOS κ=0.92, relations κ=0.84 in the Thai UD treebank, TUD) (Sriwirote et al., 2024). LAS and UAS are the standard metrics for measuring parser- or annotator-level agreement:

$$\mathrm{UAS} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat h_i = h_i), \quad \mathrm{LAS} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat h_i = h_i \wedge \hat\ell_i = \ell_i)$$

where $N$ excludes punctuation roots (Braggaar et al., 2021).

  • Genre adaptation: Dedicated schemes such as SCUD for spoken dialog (Davidson et al., 2019) or explicit adaptation for code-switching hinge on both schema extension and robust workflow design.
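
The agreement metrics named above (UAS, LAS, and Cohen's κ) can be sketched in a few lines of Python. The function names, the `(head, deprel)` tuple layout, and the toy annotations are assumptions for illustration; in particular, which tokens are left out of $N$ is exposed here as an `ignore_deprels` parameter, and the exact exclusion policy of any given paper may differ.

```python
from collections import Counter


def attachment_scores(gold, pred, ignore_deprels=frozenset()):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs.

    Tokens whose gold relation is in `ignore_deprels` are left out of N.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    n = uas_hits = las_hits = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if g_rel in ignore_deprels:
            continue
        n += 1
        if g_head == p_head:
            uas_hits += 1          # head matches
            if g_rel == p_rel:
                las_hits += 1      # head and label both match
    if n == 0:
        raise ValueError("no tokens left to score")
    return uas_hits / n, las_hits / n


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy annotations, invented for illustration.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
uas_np, las_np = attachment_scores(gold, pred, ignore_deprels={"punct"})
kappa = cohen_kappa(["NOUN", "VERB", "NOUN", "ADJ"],
                    ["NOUN", "VERB", "ADJ", "ADJ"])
print(uas, las, uas_np, las_np, kappa)
```

On the toy data, excluding punctuation raises UAS from 0.75 to 1.0, which shows why the choice of $N$ should be reported alongside the scores.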

4. Special Challenges and Solutions in UD Annotation

UD annotation must address diverse issues arising from language- and domain-specific phenomena:

  • Spoken and disfluent speech: Discourse markers, fillers, false starts, and repairs are coded using relations such as discourse, reparandum, flat, and goeswith; placeholder nodes mark ellipsis (Davidson et al., 2019). In code-switched utterances, cross-language coreference and disfluent segmentation are systematically treated via UD-native relations (expl, dislocated, discourse) (Braggaar et al., 2021).
  • Non-canonical varieties and learner data: Treebanks for learner or dialectal data (e.g., TLE, MaiBaam) adhere to a literal surface annotation principle—errors are annotated as realized, not corrected, unless a token is so malformed as to lose syntactic plausibility (Berzak et al., 2016). Dialect annotation calls for both customized guidelines and flexible tokenization, favoring preservation of orthographic and morphological idiosyncrasies (Blaschke et al., 2024).
  • Low-resource and code-switched languages: Small datasets require iterative annotation, leveraging pre-existing monolingual UD treebanks for guideline drafting but remaining open to adaptation (Braggaar et al., 2021, Raj et al., 2022). Cross-lingual coreference is handled using expl/ref rather than new relation labels.
  • Multilayer annotation and conversion: Incorporating entity, coreference, and disfluency layers improves the accuracy of conversion to UD (from ≈98% to >99.5% accuracy in GUM), ensuring correct assignment of complex relations (flat, compound, dislocated, reparandum) (Peng et al., 2019).

5. Downstream Applications and Impact

UD datasets underpin a broad spectrum of computational linguistics and NLP research:

  • Supervised and cross-lingual parsing: UD treebanks are the de facto standard for training, evaluating, and benchmarking dependency parsers, particularly for low-resource and cross-lingual transfer scenarios (Sriwirote et al., 2024, Wang et al., 2017, Rasooli et al., 2020).
  • Typology and linguistic research: Large-scale typological investigations, including gradient word-order typology, are enabled directly from UD annotations via continuous-valued feature extraction (Baylor et al., 2024).
  • Child and learner language modeling: Datasets like UD-English-CHILDES and TLE support research in child language acquisition, developmental syntax, L2 parsing, and syntax-aware grammatical error correction (Yang et al., 28 Apr 2025, Berzak et al., 2016).
  • Automated conversion and benchmarking: Comparisons of rule-based and neural approaches to enhanced dependency conversion (e.g., for coordinate propagation) demonstrate the utility of machine learning over fixed heuristics for higher-recall, semantically faithful UD graphs (Grünewald et al., 2021).
  • Synthetic data augmentation: Resources such as Galactic Dependencies augment data for supervised grammar induction, transfer learning, and typological gap-filling, providing treebanks for thousands of “unearthly” languages (Wang et al., 2017).

6. Licensing, Format, and Reuse Practices

All official UD treebanks are released under open licenses (typically CC BY-SA 4.0), are distributed in UTF-8 encoded CoNLL-U format, and include scripts for manipulation, parsing, and basic statistics (Sriwirote et al., 2024, Raj et al., 2022, Yang et al., 28 Apr 2025). Repository URLs and accompanying documentation are standard. Researchers are encouraged to cite datasets appropriately, contribute new resources, and validate annotations using established guidelines and validators.

Consistent structure and robust licensing have enabled broad adoption, repeatable experiments, and the integration of UD resources into most mainstream NLP libraries (UDPipe, Stanza, spaCy, etc.).
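
Before running a full validator, a few structural checks on a CoNLL-U sentence can catch common mistakes. The sketch below is a simplified stand-in and not the project's official validator; the function name, the specific checks, and the toy sentences are assumptions chosen for illustration. It only inspects basic word lines (integer IDs) and skips comments, multiword-token ranges, and empty nodes.

```python
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def check_sentence(lines):
    """Return a list of problems found in one sentence's CoNLL-U lines."""
    problems = []
    words = []
    for line in lines:
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"expected 10 columns, got {len(cols)}")
            continue
        if cols[0].isdigit():             # basic word line only
            words.append(cols)
    ids = [int(cols[0]) for cols in words]
    if ids != list(range(1, len(ids) + 1)):
        problems.append("word IDs are not consecutive from 1")
    roots = 0
    for cols in words:
        tok_id, upos, head = cols[0], cols[3], cols[6]
        if upos not in UPOS_TAGS:
            problems.append(f"token {tok_id}: unknown UPOS {upos!r}")
        if not head.isdigit() or int(head) > len(words):
            problems.append(f"token {tok_id}: HEAD {head!r} out of range")
        elif int(head) == int(tok_id):
            problems.append(f"token {tok_id}: HEAD points to itself")
        elif int(head) == 0:
            roots += 1
    if roots != 1:
        problems.append(f"expected exactly one root, found {roots}")
    return problems


# Toy sentences, invented for illustration.
good = [
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
]
bad = [
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t0\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
]
print(check_sentence(good))
print(check_sentence(bad))
```

These checks cover only the file structure; the annotation guidelines impose many further constraints (for example, on which relations may attach to which tags) that a dedicated validator is needed to enforce.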

7. Recommendations and Best Practices

The Universal Dependencies dataset ecosystem, by harmonizing annotation across over 90 languages and diverse genres, provides a foundational standard for syntactic parsing, typology, and computational linguistics, and continues to expand via community-driven multilingual and multi-domain resource creation.
