The SIGMORPHON 2022 Shared Task on Morpheme Segmentation (2206.07615v1)
Abstract: The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams. The best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies of any kind, we released all system predictions, the evaluation script, and all gold standard datasets.
- Khuyagbaatar Batsuren
- Gábor Bella
- Aryaman Arora
- Viktor Martinović
- Kyle Gorman
- Zdeněk Žabokrtský
- Amarsanaa Ganbold
- Šárka Dohnalová
- Magda Ševčíková
- Kateřina Pelegrinová
- Fausto Giunchiglia
- Ryan Cotterell
- Ekaterina Vylomova