Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection (2004.10643v1)

Published 22 Apr 2020 in cs.CL

Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

Summary

The paper introduces refined annotation guidelines and enhanced dependency representations that unify multilingual syntactic analysis.
The paper improves tokenization and morphological feature specification to ensure consistent syntactic annotations across diverse languages.
The paper expands the treebank resources and implements enhanced dependencies to boost semantic interpretation and cross-lingual applications.

Universal Dependencies v2: An Overview of Multilingual Treebank Collection Enhancements

The paper "Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection" by Joakim Nivre et al. describes the changes from Universal Dependencies version 1 to version 2 (UD v2). The Universal Dependencies (UD) project aims to establish a cross-linguistically consistent treebank annotation scheme that supports research on parsing and cross-lingual learning across many languages.

Key Contributions

The core contributions of UD v2 are refined annotation guidelines, the introduction of enhanced dependency representations, and an expanded multilingual treebank collection.

Annotation Scheme: UD v2 builds its syntactic structure primarily on dependency relations between content words. Adjustments to tokenization and morphological annotation support this, keeping the annotation uniform while allowing language-specific adaptations where necessary. UD v2 also introduces enhanced representations that capture implicit syntactic relations, which benefits downstream tasks in natural language understanding.
Morphological and Syntactic Annotation: UD v2 keeps the inventory of universal part-of-speech tags but refines the use of some of them. The AUX tag, for example, now also covers copula verbs and particles that mark tense, aspect, mood, and evidentiality (TAME). Syntactic annotation prioritizes predicate-argument structure and attaches dependents to content words rather than function words, which makes syntactic representations more consistent across languages.
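To make the content-word-centered analysis concrete, the sketch below reads one sentence in CoNLL-U, the tab-separated ten-column format in which UD treebanks are distributed. The sentence, its hand-written annotation, and the names `Word`, `parse_feats`, and `parse_sentence` are illustrative assumptions rather than material from the paper. The point to notice is that the copula "is" is tagged AUX and attaches to the nominal predicate "teacher", which is the root of the sentence.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: int
    deprel: str


# Hypothetical example sentence, annotated by hand for illustration only.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = She is a teacher",
    "1\tShe\tshe\tPRON\t_\tCase=Nom|Number=Sing|Person=3\t4\tnsubj\t_\t_",
    "2\tis\tbe\tAUX\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t4\tcop\t_\t_",
    "3\ta\ta\tDET\t_\tDefinite=Ind|PronType=Art\t4\tdet\t_\t_",
    "4\tteacher\tteacher\tNOUN\t_\tNumber=Sing\t0\troot\t_\t_",
])


def parse_feats(column: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if column == "_":
        return {}
    return dict(pair.split("=", 1) for pair in column.split("|"))


def parse_sentence(conllu: str) -> List[Word]:
    """Parse the basic dependency tree of a single CoNLL-U sentence."""
    words = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (5.1)
        words.append(Word(int(cols[0]), cols[1], cols[2], cols[3],
                          parse_feats(cols[5]), int(cols[6]), cols[7]))
    return words


if __name__ == "__main__":
    words = parse_sentence(SAMPLE)
    forms = {w.id: w.form for w in words}
    forms[0] = "ROOT"  # head 0 marks the root of the tree
    for w in words:
        print(f"{w.form} ({w.upos}) --{w.deprel}--> {forms[w.head]}")
```

Running the script prints one line per word with its tag, relation, and head. Keeping the features as a dictionary makes it easy to query individual morphological attributes such as `Case` or `Number`.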

Major Changes in UD v2

The main changes in UD v2 relative to UD v1 include:

Tokenization and Word Segmentation: UD v2 relaxes the restriction on word-internal spaces. This accommodates writing systems such as that of Vietnamese, where spaces separate syllables rather than words, so a multi-syllable word no longer has to be split into separate units that would distort the syntactic representation.
Morphological Features: UD v2 expands and refines its set of universal morphological features to better represent linguistic diversity and aligns more closely with the UniMorph project.
Syntactic Relations: UD v2 introduces the obl relation for oblique nominals at the clause level, which keeps modifiers of predicates distinct from modifiers of nominals. It also adds new relations such as clf for classifiers and revises the treatment of coordination.
Enhanced Dependencies: UD v2 also outlines optional enhanced dependencies designed to improve the semantic interpretation of syntactic structures. These include null nodes for elided predicates, propagation of conjuncts, explicit representation of control and raising, and enriched case information.
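As an illustration of the enhanced layer, the sketch below extracts enhanced dependency edges from the DEPS column (the ninth column) of a CoNLL-U sentence, where each entry is written as `head:relation` and multiple entries are separated by `|`. The sentence and its annotation are hand-written assumptions, and `parse_deps` and `enhanced_edges` are hypothetical helper names. The example shows conjunct propagation: the subject "Mary" is linked to both coordinated verbs, so the enhanced layer is a graph rather than a tree.

```python
from typing import List, Tuple

# Hypothetical example sentence, annotated by hand for illustration only.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = Mary reads and writes",
    "1\tMary\tMary\tPROPN\t_\tNumber=Sing\t2\tnsubj\t2:nsubj|4:nsubj\t_",
    "2\treads\tread\tVERB\t_\tTense=Pres\t0\troot\t0:root\t_",
    "3\tand\tand\tCCONJ\t_\t_\t4\tcc\t4:cc\t_",
    "4\twrites\twrite\tVERB\t_\tTense=Pres\t2\tconj\t2:conj:and\t_",
])


def parse_deps(column: str) -> List[Tuple[str, str]]:
    """Split a DEPS value into (head, relation) pairs; '_' means empty."""
    if column == "_":
        return []
    pairs = []
    for item in column.split("|"):
        # Split on the first colon only: relations such as 'conj:and'
        # contain colons themselves.
        head, relation = item.split(":", 1)
        pairs.append((head, relation))
    return pairs


def enhanced_edges(conllu: str) -> List[Tuple[str, str, str]]:
    """Return (head_id, relation, dependent_id) triples from the DEPS column."""
    edges = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if "-" in cols[0]:
            continue  # multiword-token ranges carry no dependencies
        # IDs stay strings because empty nodes use decimal IDs such as '5.1'.
        for head, relation in parse_deps(cols[8]):
            edges.append((head, relation, cols[0]))
    return edges


if __name__ == "__main__":
    for head, relation, dependent in enhanced_edges(SAMPLE):
        print(f"{head} --{relation}--> {dependent}")
```

Node IDs are kept as strings on purpose, so that the same function would also accept the null nodes used for elided predicates without any conversion step.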

Implications and Future Directions

UD v2 has implications for both practical NLP applications and theoretical linguistic research. Its standardized yet adaptable framework supports the development of multilingual parsers and comparative typological research.

Looking ahead, ongoing efforts aim to broaden the linguistic diversity of the UD treebanks and to increase the volume of data for languages that already have treebanks. Maintaining annotation consistency across languages while the collection grows remains a central challenge, as the project seeks to capture the typological diversity of the world's languages.

In conclusion, UD v2 constitutes a vital step in scaling multilingual NLP, paving the way for continued advancements in cross-linguistic research and applications.
