- The paper introduces a shared task that benchmarks universal morphological reinflection across 52 typologically diverse languages.
- The paper employs neural encoder-decoder architectures and data augmentation to address performance across high, medium, and low-resource scenarios.
- The paper highlights the potential of leveraging unlabeled corpora and ensemble methods to improve morphological prediction, especially for morphologically complex and low-resource languages.
Overview of CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
The CoNLL-SIGMORPHON 2017 Shared Task focused on the challenges of morphological generation across 52 typologically diverse languages. It comprised two sub-tasks. Sub-task 1 required systems to predict a specific inflected form of a given lemma from a bundle of morphological features. Sub-task 2 extended this to morphological paradigm cell filling, where systems had to complete entire inflectional paradigms given partial data. Both sub-tasks were run with varying amounts of training data, categorized as high-, medium-, and low-resource scenarios.
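For sub-task 1, training data consists of (lemma, inflected form, feature bundle) triples. The sketch below illustrates this shape with tab-separated lines and UniMorph-style feature tags; the sample Spanish lines are illustrative, not taken from a released data file.

```python
# Illustrative sub-task 1 triples: lemma TAB inflected form TAB features.
sample = """\
hablar\thablará\tV;IND;FUT;3;SG
hablar\thablamos\tV;IND;PRS;1;PL
"""

def parse_triples(text):
    """Parse TSV lines into (lemma, form, feature-list) triples."""
    triples = []
    for line in text.strip().splitlines():
        lemma, form, msd = line.split("\t")
        triples.append((lemma, form, msd.split(";")))
    return triples

for lemma, form, feats in parse_triples(sample):
    print(lemma, "->", form, feats)
```

A system's job in sub-task 1 is then to map the lemma plus the feature list to the target form; in sub-task 2, several such cells of one lemma's paradigm are given and the remaining cells must be predicted.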
Morphological Modeling and Its Impact
Morphology is a crucial aspect of language processing as it interacts with syntax and phonology. Historically, explicit morphological models have been less prioritized compared to other linguistic structures in human language technology (HLT). However, recent advances, particularly those leveraging machine learning methodologies, have renewed interest in computational morphological modeling. Robust morphological generation systems can significantly benefit several HLT tasks, including machine translation and speech recognition.
Task Implementation
The shared task emphasized reliable morphological generation under diverse conditions. The dataset included languages from multiple linguistic families, showcasing various morphological characteristics. For each language, datasets with sparse and complete paradigms were provided. In sub-task 1, sparse datasets containing individual forms were used, while sub-task 2 employed complete paradigms and required systems to predict the missing forms.
System Approaches and Results
Nearly all submissions employed neural approaches, specifically encoder-decoder architectures built on recurrent neural networks (RNNs). These models performed strongly in high-resource scenarios. Where training data was sparse, however, neural systems required additional techniques to achieve reasonable accuracy. Data augmentation strategies, including self-training with synthetic data and leveraging unlabeled corpora, were crucial for improving system performance in low-resource settings.
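One common flavor of synthetic-data augmentation replaces the material shared between a lemma and its inflected form with a random character string, producing new pairs that teach the model to copy stems and apply affixes. A minimal sketch, where the stem heuristic (longest common substring), the minimum stem length, and the alphabet are all assumptions for illustration:

```python
import random

def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b (simple DP)."""
    best, end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, end = dp[i][j], i
    return a[end - best:end]

def hallucinate(lemma, form, alphabet="abcdefghijklmnopqrstuvwxyz", rng=random):
    """Create a synthetic training pair by swapping the shared stem
    for a random string of the same length (assumed heuristic)."""
    stem = longest_common_substring(lemma, form)
    if len(stem) < 3:  # too little shared material to substitute safely
        return lemma, form
    fake = "".join(rng.choice(alphabet) for _ in range(len(stem)))
    return lemma.replace(stem, fake, 1), form.replace(stem, fake, 1)
```

Pairs generated this way carry no real lexical content, but they expose the model to many more instances of the affixation pattern itself, which is what matters in low-resource conditions.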
The results indicate that systems can perform well even with small datasets if appropriate inductive biases and additional data sources are exploited. Substantial disagreement was observed between the predictions of different systems, indicating room for improvement through ensemble methods or better generalization techniques.
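When systems disagree on many instances, even a simple per-instance majority vote over their outputs can recover accuracy. A minimal sketch of such a combiner; the tie-breaking policy (prefer the earlier-listed, presumably stronger system) is an assumption, not a rule from the task:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one instance's predicted forms, one per system.

    Ties are broken in favor of the earlier-listed system
    (an assumed policy, e.g. ordering systems by dev accuracy).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for p in predictions:  # preserve system order on ties
        if counts[p] == top:
            return p

print(majority_vote(["hablará", "hablara", "hablará"]))  # prints hablará
```

The same idea extends to weighted voting or to oracle analyses that bound how much an ideal ensemble could gain from the observed disagreement.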
Implications and Future Directions
This shared task marks a significant step in computational morphological research, underscoring the capabilities and limitations of current neural models in morphology. The task highlighted the necessity to explore external unlabeled corpora and cross-linguistic information for system enhancement. Future research could advance active learning for morphology, aiming to identify crucial training examples for efficient paradigm learning. Additionally, expanding shared tasks to other morphological areas, such as segmentation and tagging, could reinforce the integration between computational methods and theoretical linguistic insights.
Conclusion
The outcomes of the CoNLL-SIGMORPHON 2017 Shared Task provide a comprehensive benchmarking framework for future studies. By releasing the datasets, the task enables ongoing research into morphological learning and transduction. Although the submitted systems performed impressively, the results point to substantial room for further progress in handling morphological complexity across languages.