RoDia: A New Dataset for Romanian Dialect Identification from Speech (2309.03378v3)
Published 6 Sep 2023 in cs.CL, cs.SD, and eess.AS
Abstract: We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
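The abstract reports both a macro and a micro F1 score. As a minimal sketch of how the two differ in a single-label, five-class setting, the snippet below computes both from scratch. The class names `A` to `E` and the toy labels are hypothetical placeholders, not RoDia data, and the function `f1_scores` is an illustrative helper rather than the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels for five placeholder classes (not RoDia data).
y_true = ["A", "A", "A", "A", "B", "B", "C", "D", "E", "E"]
y_pred = ["A", "A", "A", "B", "B", "A", "C", "E", "E", "E"]

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.2f}, micro F1 = {micro:.2f}")
```

On this toy data the macro F1 is 0.61 and the micro F1 is 0.70. The macro score is lower because class `D` is never predicted correctly and counts as much as the frequent class `A`. In the single-label case the micro F1 equals plain accuracy.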