GRDD: A Dataset for Greek Dialectal NLP (2308.00802v4)
Abstract: In this paper, we present a dataset for the computational study of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification, experimenting with traditional ML algorithms as well as simple DL architectures. The results show very good performance on the task, which suggests that the dialects in question have characteristics distinct enough for even simple ML models to perform well. Error analysis of the top-performing algorithms shows that in a number of cases the errors are due to insufficient dataset cleaning.
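To make the dialect identification setup concrete, the following is a minimal sketch of one kind of traditional ML baseline: a multinomial Naive Bayes classifier over character trigrams. It is not the authors' implementation; the abstract does not say which algorithms or features were used. The class name `NgramNaiveBayes`, the helper `char_ngrams`, the labels `variety_A` and `variety_B`, and the training strings are all hypothetical placeholders, and the strings are synthetic rather than real dialect samples.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder strings, NOT real dialect samples (assumption for illustration).
train_texts = ["zaza zazu zaz", "zuza zaza", "kiko kiki kok", "koki kiko"]
train_labels = ["variety_A", "variety_A", "variety_B", "variety_B"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("zazu zaza"))  # variety_A
print(model.predict("kiki koki"))  # variety_B
```

Character n-grams are a common feature choice for closely related varieties because they capture spelling and morphological cues without needing a tokenizer. On an imbalanced dataset such as the one described, the class prior term would favor the larger classes, so per-class evaluation would matter more than overall accuracy.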
- Stergios Chatzikyriakidis
- Chatrine Qwaider
- Ilias Kolokousis
- Christina Koula
- Dimitris Papadakis
- Efthymia Sakellariou