ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus (2403.18182v1)
Abstract: We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching, and orthography of both languages. We further enrich the corpus with two layers of annotations: (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.